There are two main methods to accelerate LLMs, plus some other tricky methods
papers already read: 13
Low-rank decomposition of large matrices during fine-tuning
information
reference
SVD of the large QKV projection matrices to reduce the required memory (see the sketch after this list)
Low-rank projection with a novel method named FAVOR
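Below is a minimal sketch (my own illustration, not any paper's code) of the SVD idea: factor a large projection weight into two low-rank matrices, so the projection becomes two skinny matmuls. The rank `r` and the matrix sizes are illustrative assumptions.

```python
# Minimal sketch: truncated-SVD low-rank factorization of a projection weight.
import torch

def low_rank_factorize(W: torch.Tensor, r: int):
    """Return A (d_out x r) and B (r x d_in) with A @ B ~ W."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]          # absorb singular values into A
    B = Vh[:r, :]
    return A, B

d_model, r = 1024, 64
W_q = torch.randn(d_model, d_model)   # stand-in for a "large QKV projection"
A, B = low_rank_factorize(W_q, r)

x = torch.randn(d_model)
full = W_q @ x                        # original projection
approx = A @ (B @ x)                  # two skinny matmuls instead of one big one
print(torch.norm(full - approx) / torch.norm(full))   # relative error of rank-r approx
# Parameters drop from d^2 to 2*d*r (here: 1,048,576 -> 131,072).
```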
Matrix multiplication by blocks
Attention calculation performed block by block (see the sketch below)
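A minimal sketch of block-tiled computation, written as plain PyTorch rather than a real GPU kernel: the score matrix Q Kᵀ is computed one tile at a time, so only small blocks of Q and K need to be resident at once. The block size is an arbitrary choice.

```python
# Minimal sketch: compute S = Q @ K^T tile by tile.
import torch

def blocked_qkT(Q: torch.Tensor, K: torch.Tensor, block: int = 128) -> torch.Tensor:
    n = Q.shape[0]
    S = torch.empty(n, K.shape[0])
    for i in range(0, n, block):
        for j in range(0, K.shape[0], block):
            # each tile is a small (block x block) matmul
            S[i:i + block, j:j + block] = Q[i:i + block] @ K[j:j + block].T
    return S

Q = torch.randn(512, 64)
K = torch.randn(512, 64)
assert torch.allclose(blocked_qkT(Q, K), Q @ K.T, atol=1e-4)
# FlashAttention-style kernels go further: they also apply softmax and the
# V-matmul per block, with an online (running max / running sum) correction.
```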
FlashDecoding++: Faster Large Language Model Inference on GPUs, which consists of three parts:
Softmax with blocks and a unified maximum value: the result of each block's softmax can be used directly, so no merging step is needed. An optimization over FlashAttention (see the sketch after this list).
Flat GEMM (small batch size during inference) optimization with double buffering. [not fully understood]
Heuristic dataflow with hardware resource adaptation: choose different optimization methods for different values of M (batch size × sequence length). [not fully understood]
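A minimal sketch of how I understand the unified-max trick (not the FlashDecoding++ kernel): every block subtracts the same fixed constant `PHI` instead of its own running max, so the per-block exponentials and partial sums combine directly with no rescaling or merging step. `PHI` and the block size are assumptions.

```python
# Minimal sketch: blockwise softmax with a shared (unified) max value.
import torch

PHI = 6.0   # assumed global upper bound on the scores (a tuning choice)

def softmax_unified_max(scores: torch.Tensor, block: int = 4) -> torch.Tensor:
    exps, total = [], 0.0
    for i in range(0, scores.numel(), block):
        e = torch.exp(scores[i:i + block] - PHI)   # same constant for every block
        exps.append(e)
        total = total + e.sum()                    # partial sums just add up
    return torch.cat(exps) / total

x = torch.randn(16)
print(torch.allclose(softmax_unified_max(x), torch.softmax(x, dim=0), atol=1e-6))
# With a per-block running max (FlashAttention), earlier partial results must be
# rescaled whenever a larger max appears; a shared PHI removes that step, with a
# fallback needed if scores fall far outside the expected range.
```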
reference
Main idea:
The softmax matrix is sparse: when all values in a region are smaller than the precision of the data type, computing them is unnecessary (see the sketch below).
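A minimal sketch of the sparsity argument, using fp16 as the assumed data type: once the row maximum is subtracted, any block whose largest score is below the log of the smallest representable fp16 value would exponentiate to zero anyway, so it can be skipped.

```python
# Minimal sketch: skip softmax blocks that would underflow in fp16.
import math
import torch

# smallest positive normal fp16 value is ~6.1e-5, so exp(x) rounds to 0
# in fp16 once x < log(6.1e-5) ~ -9.7
THRESHOLD = math.log(6.1e-5)

def sparse_row_softmax(scores: torch.Tensor, block: int = 2):
    m = scores.max()
    out = torch.zeros_like(scores)
    kept = []
    for i in range(0, scores.numel(), block):
        blk = scores[i:i + block] - m
        if blk.max() < THRESHOLD:
            continue                      # whole block is numerically zero
        out[i:i + block] = torch.exp(blk)
        kept.append(i)
    return out / out.sum(), kept

scores = torch.tensor([0.5, 0.2, -30., -40., 1.0, 0.8, -25., -50.])
probs, kept = sparse_row_softmax(scores)
print(kept)      # the large-negative blocks were never exponentiated
print(probs)
```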
1) Medusa
Outputs top-k predictions for multiple future positions in parallel by adding extra LM heads for the next several positions, which reduces inference latency (see the sketch below).
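A minimal sketch of the Medusa-style head layout (not the official implementation): extra LM heads attached to the same final hidden state each propose top-k candidates for a later position, so several future tokens are drafted in one forward pass. All sizes here are illustrative.

```python
# Minimal sketch: extra LM heads drafting several future positions at once.
import torch
import torch.nn as nn

d_model, vocab, n_extra_heads, topk = 512, 32000, 3, 5

base_head = nn.Linear(d_model, vocab)    # normal next-token head
medusa_heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_extra_heads)])

h = torch.randn(1, d_model)              # last hidden state of the current step

next_token = base_head(h).argmax(dim=-1)                          # position t+1
drafts = [head(h).topk(topk, dim=-1).indices for head in medusa_heads]
# drafts[k] holds top-k candidates for position t+2+k; the base model then
# verifies these candidate continuations in a single batched forward pass.
print(next_token.shape, [d.shape for d in drafts])
```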
2) SnapKV
Compresses the KV cache for long-sequence tasks (see the sketch below).
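A minimal sketch in the spirit of SnapKV (not the paper's code): the last few queries act as an observation window, keys that receive little attention from them are dropped, and the most recent window is always kept. The window size and budget are assumptions.

```python
# Minimal sketch: keep only the KV positions that the recent queries attend to.
import torch

def compress_kv(K, V, Q, window=32, budget=256):
    # K, V, Q: (seq, d); use the last `window` queries as observers.
    scores = Q[-window:] @ K.T / K.shape[-1] ** 0.5           # (window, seq)
    attn = torch.softmax(scores, dim=-1)
    importance = attn[:, :-window].sum(dim=0)                 # votes per old key
    keep = importance.topk(min(budget, importance.numel())).indices.sort().values
    idx = torch.cat([keep, torch.arange(K.shape[0] - window, K.shape[0])])
    return K[idx], V[idx]

seq, d = 4096, 64
K, V, Q = torch.randn(seq, d), torch.randn(seq, d), torch.randn(seq, d)
K_c, V_c = compress_kv(K, V, Q)
print(K.shape, "->", K_c.shape)    # e.g. (4096, 64) -> (288, 64)
```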
1) triton
2) Hardware Acceleration of LLMs: A comprehensive survey and comparison
Briefly introduces and compares different hardware acceleration methods in terms of efficiency and performance.
1) Inference with Reference
Lossless Acceleration of Large Language Models: copies text spans from a reference document into the decoding process, since the reference and the output share many identical sentences, which accelerates inference (a sketch follows).
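A minimal sketch of the copy-from-reference idea (my own illustration, not the paper's implementation): when the last few generated tokens match a span in the reference text, the tokens following that span are proposed as a draft for the model to verify, much like speculative decoding.

```python
# Minimal sketch: propose draft tokens by matching the current suffix in a reference.
def propose_from_reference(generated, reference, ngram=4, copy_len=8):
    """Return draft tokens copied from `reference`, or [] if no match."""
    if len(generated) < ngram:
        return []
    key = generated[-ngram:]
    for i in range(len(reference) - ngram):
        if reference[i:i + ngram] == key:
            return reference[i + ngram:i + ngram + copy_len]
    return []

reference = [5, 9, 9, 3, 7, 8, 2, 2, 4, 6, 1, 0]     # token ids of the reference text
generated = [11, 3, 7, 8, 2]                         # current model output
print(propose_from_reference(generated, reference))  # -> [2, 4, 6, 1, 0]
```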
2) SwitchHead
Accelerating Transformers with Mixture-of-Experts Attention: selects different expert matrices for each attention head based on the input content, reducing computation and memory usage (see the sketch below).
+ published: 2024
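A minimal sketch of per-head expert selection in the spirit of SwitchHead (not the paper's code): a small router looks at the token and picks which expert value projection each head uses, so only the chosen expert's weights are applied for that token. The dimensions, expert count, and argmax routing are assumptions.

```python
# Minimal sketch: per-head expert selection for the value projection.
import torch
import torch.nn as nn

d_model, n_heads, d_head, n_experts = 512, 8, 64, 4

router = nn.Linear(d_model, n_heads * n_experts)
# expert value projections: (heads, experts, d_model, d_head)
W_v = nn.Parameter(torch.randn(n_heads, n_experts, d_model, d_head) * 0.02)

x = torch.randn(d_model)                                    # one token
choice = router(x).view(n_heads, n_experts).argmax(dim=-1)  # expert id per head

values = torch.stack([x @ W_v[h, choice[h]] for h in range(n_heads)])
print(choice.tolist(), values.shape)   # one (d_head,) value vector per head
# Only 1/n_experts of the value-projection weights are touched per token,
# which is where the compute and memory savings come from.
```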
3) DropBP:
Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation: skips the backward pass for some layers during fine-tuning to save computation (a sketch follows).
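A minimal sketch of the general idea of dropping backward propagation (my own illustration, not DropBP's exact algorithm): a residual block keeps its forward output exact, but with some probability its branch is computed without building the autograd graph, so the backward pass for that branch is skipped while gradients still flow through the residual path.

```python
# Minimal sketch: randomly skip the backward pass of a residual branch.
import torch
import torch.nn as nn

class DropBPBlock(nn.Module):
    def __init__(self, d, p_drop=0.5):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.p_drop = p_drop

    def forward(self, x):
        drop = self.training and torch.rand(()) < self.p_drop
        if drop:
            with torch.no_grad():
                y = self.ff(x)      # no graph built: this branch's backward is skipped
        else:
            y = self.ff(x)
        return x + y                # gradients always flow via the residual path

block = DropBPBlock(64).train()
x = torch.randn(8, 64, requires_grad=True)
block(x).sum().backward()           # some steps avoid the FFN's backward cost
```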
Quantization
Optimizer
RNN
Trick
Long sequence
2:4 sparsity
Pruning
Cache
Trade-off
PE