title: Acceleration of LLM - Matrix Multiplication
tags: [LLM, acceleration, Matrix multiplication, torchview]
After reading "Manual Autograd" on unsloth's blog, I tried to parse a model myself and look for more places where matrix multiplication could be optimized.
torchview is a great tool for this kind of inspection.
I want to show what torchview can do after trying it.
Showing nodes and related information:

```python
from transformers import AutoModel, AutoTokenizer
from torchview import draw_graph

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world!", return_tensors="pt")

# Trace the model on this concrete input and save the rendered graph as 'output'
model_graph = draw_graph(model, input_data=inputs,
                         save_graph=True,
                         filename='output')

# edge_list holds pairs of connected nodes (tensor / module / function) in the traced graph
print(len(model_graph.edge_list))
for a, b in model_graph.edge_list:
    print(a, b, type(a), type(b))
```
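With the edge list in hand, we can get a rough picture of which operations dominate the graph. The snippet below is only a sketch: it assumes torchview's node objects expose a `name` attribute, which may vary by version.

```python
from collections import Counter

# Deduplicate nodes (each node can appear in several edges), then count operation names
unique_nodes = {id(n): n for a, b in model_graph.edge_list for n in (a, b)}
op_counts = Counter(getattr(n, "name", type(n).__name__) for n in unique_nodes.values())

for name, count in op_counts.most_common(15):
    print(name, count)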
Attention scores: a general model has many softmax and activation functions interleaved between the matrix multiplications, and the only chain of three consecutive matrix multiplications is the attention score $(X W_q)(X W_k)^\top$, where $X$ is the layer input. However, this chain cannot be optimized by re-associating the products, because $d_{\text{input}}$ and $d_{\text{hidden}}$ are roughly the same size, so no multiplication order is meaningfully cheaper, as the rough comparison below shows.
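Here is a back-of-the-envelope multiply count for the two possible orderings, using BERT-base-like sizes that I picked for illustration (the numbers are assumptions, not measurements):

```python
# Rough multiply counts for the two ways to associate the attention-score chain.
# A (p x q) @ (q x r) product costs about p*q*r multiplies.
n, d_in, d_hid = 512, 768, 768   # sequence length, input dim, hidden dim (assumed)

# (X @ W_q) @ (X @ W_k).T : two projections, then an (n x d_hid) @ (d_hid x n) product
split_order = 2 * n * d_in * d_hid + n * n * d_hid

# X @ (W_q @ W_k.T) @ X.T : pre-merge the weights, then two products involving X
merged_order = d_in * d_hid * d_in + n * d_in * d_in + n * n * d_in

print(f"(X W_q)(X W_k)^T : {split_order:,} multiplies")
print(f"X (W_q W_k^T) X^T: {merged_order:,} multiplies")
# With d_in ~ d_hid the merged order is not cheaper (here it is even slightly worse),
# which is why re-associating this particular chain does not help.
```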
Parsing a specific module: torchview cannot parse an arbitrary module on its own so far; there are many special cases inside modules, such as LlamaAttention. But if we provide concrete input data, execution follows one specific path, and torchview seems to work exactly this way, which is why input data or an input size is required. I haven't researched the internals much further.
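As a small sketch of that behaviour (this is my own assumption about feeding a submodule directly, not something I have verified across transformers versions), a single sub-block can be traced by handing it a tensor of the shape it expects:

```python
import torch
from torchview import draw_graph

# Hypothetical example: trace only the self-attention block of the first BERT layer
# by giving it a hidden-states tensor of shape (batch, seq_len, hidden_size).
attention = model.encoder.layer[0].attention
hidden_states = torch.randn(1, 16, model.config.hidden_size)

attn_graph = draw_graph(attention, input_data=hidden_states,
                        save_graph=True, filename='attention_only')
print(len(attn_graph.edge_list))
```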
Optimization of matrix multiplication can still be applied in other modules (a hypothetical example is sketched below).
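For illustration only (my own hypothetical example, not necessarily the module the note above had in mind): the multiplication order starts to matter as soon as the factor dimensions differ a lot, e.g. a low-rank factorization $W \approx A B$ with $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$ and $r \ll d$:

```python
# Multiply counts for the two orderings of x @ A @ B with a low rank r (assumed sizes).
n, d, r = 512, 768, 16

build_W_first = d * r * d + n * d * d     # x @ (A @ B): materialize a d x d matrix first
stay_low_rank = n * d * r + n * r * d     # (x @ A) @ B: stay in the rank-r space

print(f"x @ (A @ B): {build_W_first:,} multiplies")
print(f"(x @ A) @ B: {stay_low_rank:,} multiplies")   # roughly 25x fewer here
```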
Failing at this shows that I always think too much but read too little. A simple idea does not work in most situations.