There are two main approaches to accelerating LLMs, plus some other tricks
Low-rank decomposition of large matrices during fine-tuning
information
reference
SVD decomposition of the large QKV projection matrices to reduce required memory
low-rank projection with a novel method named FAVOR
Project the KV cache from a high dimension into a low dimension, which greatly reduces memory usage.
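A minimal sketch of the low-rank idea above: use truncated SVD to factor a large projection matrix into two thin factors, trading some accuracy for memory. All sizes and names here are illustrative, not from any specific paper.

```python
import numpy as np

# Sketch: approximate a large projection matrix with a truncated SVD.
# Keeping only the top-r singular values means storing two thin factors
# instead of the full matrix. (Shapes and names are illustrative.)
rng = np.random.default_rng(1)
d_model, d_head, r = 256, 256, 32

W_q = rng.standard_normal((d_model, d_head))
U, S, Vt = np.linalg.svd(W_q, full_matrices=False)

# Rank-r factors: store (d_model x r) and (r x d_head) floats
# instead of d_model x d_head.
left = U[:, :r] * S[:r]
right = Vt[:r, :]
W_q_approx = left @ right

orig_mem = W_q.size
lowrank_mem = left.size + right.size
err = np.linalg.norm(W_q - W_q_approx) / np.linalg.norm(W_q)
print(orig_mem, lowrank_mem)  # memory: full vs. factored
```

The same factor-and-reconstruct pattern underlies both low-rank weight compression and low-rank KV-cache projection: only the thin factors are stored, and the full matrix (or K/V) is reconstructed on the fly.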
Matrix multiplication by blocks
attention calculation with blocks
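The blocked computation above can be sketched as a tiled matrix multiply that accumulates small sub-blocks, so only small tiles need to live in fast memory at once; blocked attention builds on the same pattern. The block size here is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch of blocked (tiled) matrix multiplication: C is accumulated
# tile by tile; NumPy slicing handles ragged edge blocks automatically.
def blocked_matmul(A, B, bs=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, bs):          # tile over rows of A
        for j in range(0, m, bs):      # tile over columns of B
            for p in range(0, k, bs):  # accumulate over the shared dim
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 300))
B = rng.standard_normal((300, 150))
assert np.allclose(blocked_matmul(A, B), A @ B)  # identical to plain matmul
```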
FlashDecoding++: Faster Large Language Model Inference on GPUs, three parts
Softmax with blocks and a unified maximum value: each block's softmax result can be used directly, so merging is unnecessary. An optimization over FlashAttention.
Flat GEMM (small batch size during inference) optimization with double buffering. [didn't understand]
Heuristic dataflow with hardware resource adaptation: choose different optimization methods for different M values (batch size and sequence length). [didn't fully understand]
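A toy sketch of the unified-maximum idea, relying on softmax's invariance to a constant shift: subtracting one fixed value phi from every block lets the partial exponential sums be added directly, with no per-block rescaling or merging. The phi used here is an arbitrary illustrative choice; in practice it must be picked so the exponentials do not overflow.

```python
import numpy as np

# Sketch of block softmax with a unified maximum: instead of tracking a
# running max and rescaling previous blocks (as FlashAttention does),
# subtract one fixed value phi everywhere, so partial sums add directly.
def softmax_unified_max(scores, phi, bs=32):
    num = np.empty_like(scores)
    denom = 0.0
    for i in range(0, scores.size, bs):
        e = np.exp(scores[i:i+bs] - phi)  # no per-block rescaling needed
        num[i:i+bs] = e
        denom += e.sum()                  # partial sums merge by plain addition
    return num / denom

rng = np.random.default_rng(4)
s = rng.standard_normal(100) * 3.0
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum()  # standard stable softmax
out = softmax_unified_max(s, phi=s.mean())
assert np.allclose(out, ref)  # shift invariance makes the result exact
```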
reference
No code(2024.11)
Main idea:
Avoid storing the final large logits matrix by computing it in blocks, which saves a lot of memory when the vocabulary is large.
The softmax matrix is sparse: when all values in a block are smaller than the precision of the data type, their computation is unnecessary.
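A sketch of the blockwise idea above: compute the cross-entropy loss over a large vocabulary in chunks, so the full (tokens x vocab) logits matrix is never materialized. Function and variable names are my own, not from the paper.

```python
import numpy as np

# Blockwise cross-entropy: sweep the vocabulary in chunks, keeping only a
# running max m and running exp-sum z per row (online log-sum-exp), plus
# the target logit. Peak memory is one (n x bs) block, not (n x V).
def blockwise_ce(h, W_lm, targets, bs=1000):
    # h: (n, d) hidden states, W_lm: (d, V) LM head, targets: (n,)
    n, V = h.shape[0], W_lm.shape[1]
    m = np.full(n, -np.inf)   # running max per row
    z = np.zeros(n)           # running sum of exp per row
    tgt = np.zeros(n)         # target logits
    for j in range(0, V, bs):
        logits = h @ W_lm[:, j:j+bs]        # only an (n x bs) block exists
        new_m = np.maximum(m, logits.max(axis=1))
        z = z * np.exp(m - new_m) + np.exp(logits - new_m[:, None]).sum(axis=1)
        m = new_m
        in_blk = (targets >= j) & (targets < j + bs)
        tgt[in_blk] = logits[in_blk, targets[in_blk] - j]
    return (np.log(z) + m - tgt).mean()     # mean of logsumexp - target logit

# Check against the naive full-logits computation.
rng = np.random.default_rng(5)
h = rng.standard_normal((8, 32))
W = rng.standard_normal((32, 5000))
t = rng.integers(0, 5000, 8)
full = h @ W
ref = (np.log(np.exp(full - full.max(1, keepdims=True)).sum(1)) +
       full.max(1) - full[np.arange(8), t]).mean()
assert np.allclose(blockwise_ce(h, W, t), ref)
```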
Output top-k predictions for multiple next positions in parallel by adding LM heads for the next several positions, which reduces inference latency.
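A toy sketch of the parallel multi-head prediction idea: extra LM heads each predict the token k steps ahead from the same hidden state, so one forward pass proposes several future tokens. The head weights here are random placeholders standing in for trained ones.

```python
import numpy as np

# One extra LM head per future position; each proposes top-k candidates
# from the same final hidden state in a single pass. (Illustrative only.)
rng = np.random.default_rng(6)
d_model, vocab, n_heads, k = 128, 1000, 3, 5

h = rng.standard_normal(d_model)  # last hidden state of the current step
heads = rng.standard_normal((n_heads, d_model, vocab)) / np.sqrt(d_model)

proposals = []
for i in range(n_heads):                 # head i predicts position t+1+i
    logits = h @ heads[i]
    topk = np.argsort(logits)[-k:][::-1] # top-k candidate token ids
    proposals.append(topk.tolist())

print(len(proposals))  # one candidate list per future position
```

The proposed tokens would then be verified in one batched forward pass, keeping only the longest accepted prefix.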
Compress the KV cache for long-sequence tasks
A brief introduction to and comparison of different hardware acceleration methods in terms of efficiency and performance
1) Lossless Acceleration of Large Language Models: copy text from a reference document during inference, since many sentences are identical between the reference and the output, to accelerate inference.
2) SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention: select different expert matrices for every attention head based on the input content, to reduce computation and memory usage. (published 2024)
3) DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation:
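The reference-copying idea above (item 1) can be sketched as a simple n-gram match: when the last n generated tokens also appear in the reference text, copy the tokens that follow the match as a cheap draft for the model to verify. Function name and parameters are illustrative.

```python
# Sketch of reference-based drafting: find the most recent n-gram of the
# generated text inside the reference, then copy up to max_copy tokens
# that follow it as draft tokens (to be verified by the model).
def draft_from_reference(generated, reference, n=2, max_copy=4):
    if len(generated) < n:
        return []
    key = generated[-n:]
    for i in range(len(reference) - n + 1):
        if reference[i:i+n] == key:
            return reference[i+n:i+n+max_copy]
    return []

ref_doc = "the cat sat on the mat and the cat ran".split()
gen = "we saw that the cat".split()
print(draft_from_reference(gen, ref_doc))  # ['sat', 'on', 'the', 'mat']
```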
Quantization
Optimizer
RNN
Trick
Long sequence
2:4 sparsity
Pruning
Cache
Trade-off
PE