
Summary

Methods

There are two main categories of methods to accelerate LLMs (low-rank and block-based), plus some additional tricks.

Papers read so far: 13

Reference

Categories

Low-rank

LoRA

Low-rank decomposition of the large weight-update matrices during fine-tuning, so only small adapter matrices are trained.
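A minimal PyTorch sketch of the idea (not the official `peft` implementation): the pretrained weight is frozen and a trainable low-rank product `B @ A` is added on top; `rank` and `alpha` below are illustrative values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                   # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))   # zero init: the update starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T, with only A and B receiving gradients
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(1024, 1024, rank=8)
y = layer(torch.randn(2, 16, 1024))    # trains ~2 * 8 * 1024 parameters instead of 1024 * 1024
```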

information

reference

Linformer

Low-rank projection of the key and value matrices along the sequence dimension, reducing the memory required by attention.
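A rough single-head sketch of that projection (the shapes and the projection matrices `E`, `Fp` are illustrative, not the paper's configuration): keys and values are projected from sequence length `n` down to `kp`, so the score matrix is `(n, kp)` instead of `(n, n)`.

```python
import torch
import torch.nn.functional as F

def linformer_attention(q, k, v, E, Fp):
    # q, k, v: (n, d); E, Fp: (kp, n) learned projections along the sequence dimension
    k_low = E @ k                                     # (kp, d)
    v_low = Fp @ v                                    # (kp, d)
    scores = q @ k_low.t() / q.shape[-1] ** 0.5       # (n, kp): linear in n, not quadratic
    return F.softmax(scores, dim=-1) @ v_low          # (n, d)

n, d, kp = 4096, 64, 256
q, k, v = (torch.randn(n, d) for _ in range(3))
E = torch.randn(kp, n) / n ** 0.5
Fp = torch.randn(kp, n) / n ** 0.5
out = linformer_attention(q, k, v, E, Fp)             # (4096, 64)
```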

Performers

Approximates softmax attention with a low-rank random-feature projection, via a method named FAVOR+.
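A simplified sketch of the random-feature idea behind FAVOR+ (positive features only, no orthogonal features, single head; the feature count `m` is an arbitrary choice here): the softmax kernel is approximated so attention can be computed in time and memory linear in the sequence length.

```python
import torch

def positive_features(x, w):
    # phi(x) = exp(w x^T - ||x||^2 / 2) / sqrt(m) gives an unbiased estimate of the softmax kernel
    return torch.exp(x @ w.t() - (x ** 2).sum(-1, keepdim=True) / 2) / w.shape[0] ** 0.5

def performer_attention(q, k, v, w):
    scale = q.shape[-1] ** -0.25                       # split the usual 1/sqrt(d) between q and k
    q_f = positive_features(q * scale, w)              # (n, m)
    k_f = positive_features(k * scale, w)              # (n, m)
    kv = k_f.t() @ v                                   # (m, d): computed once, linear in n
    norm = q_f @ k_f.sum(dim=0, keepdim=True).t()      # (n, 1) row normalizers
    return (q_f @ kv) / (norm + 1e-6)                  # (n, d)

n, d, m = 2048, 64, 256
q, k, v = (torch.randn(n, d) for _ in range(3))
w = torch.randn(m, d)                                  # random Gaussian projections
out = performer_attention(q, k, v, w)                  # approximate softmax attention output
```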

Block

FlashAttention

Computes attention in tiles (blocks) so the full attention matrix is never materialized, reducing slow reads and writes to GPU HBM (see the sketch below).

Self-attention Does Not Need O(n²) Memory

Computes attention block by block with a running softmax, so memory grows sub-quadratically with sequence length.
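Both block-based entries above share the same core trick: walk over the keys and values in blocks while keeping running softmax statistics, so the `(n, n)` score matrix never exists in memory. A minimal, non-fused sketch of that online-softmax accumulation, with an arbitrary block size:

```python
import torch

def blockwise_attention(q, k, v, block=256):
    # q, k, v: (n, d); output matches softmax(q k^T / sqrt(d)) v without an (n, n) matrix
    n, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for s in range(0, n, block):
        k_b, v_b = k[s:s + block], v[s:s + block]
        scores = q @ k_b.t() / d ** 0.5                            # (n, block)
        new_max = torch.maximum(row_max, scores.max(-1, keepdim=True).values)
        rescale = torch.exp(row_max - new_max)                     # correct previously accumulated blocks
        p = torch.exp(scores - new_max)
        out = out * rescale + p @ v_b
        row_sum = row_sum * rescale + p.sum(-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax(q @ k.t() / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```

FlashAttention additionally tiles over the queries and fuses the whole loop into one kernel so the intermediate blocks stay in on-chip SRAM.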

FlashDecoding++

FlashDecoding++: Faster Large Language Model Inference on GPUs. Three parts: (1) asynchronized softmax with a unified max value, (2) flat GEMM optimization with double buffering, (3) heuristic dataflow with hardware resource adaptation.
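A toy illustration of part (1): each block computes its partial exponentials against a fixed, pre-chosen value `phi` instead of the true row max, so partial results can be merged without the usual rescaling and synchronization. The value of `phi` here is an assumption; the paper derives it from the statistics of the model.

```python
import torch

def partial_softmax(scores_block, phi):
    # un-normalized numerators and the partial denominator for one block, no row max needed
    e = torch.exp(scores_block - phi)
    return e, e.sum(dim=-1, keepdim=True)

scores = torch.randn(4, 1024) * 3.0
phi = 6.0                                  # unified max value (assumed; chosen so exp() cannot overflow)
num, den = zip(*(partial_softmax(b, phi) for b in scores.split(256, dim=-1)))
softmax_unified = torch.cat(num, dim=-1) / sum(den)        # blocks merged with a plain sum
assert torch.allclose(softmax_unified, torch.softmax(scores, dim=-1), atol=1e-5)
```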

Cut cross entropy

Main idea: compute the cross-entropy loss directly from the hidden states and the classifier weights, without ever materializing the full logit matrix over the vocabulary.

![Performance](/images/2025/0115-01.png){: width="600"}
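A memory-oriented sketch of that idea (chunking over the vocabulary stands in for the fused GPU kernel of the paper; sizes are illustrative): the loss only needs the correct-token logit and a log-sum-exp over the vocabulary, neither of which requires the full `(tokens, vocab)` logit matrix at once.

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(h, W, targets, chunk=8192):
    # h: (t, d) hidden states, W: (vocab, d) classifier weight, targets: (t,)
    correct_logit = (h * W[targets]).sum(-1)                       # logits of the target tokens only
    lse = torch.full_like(correct_logit, float("-inf"))
    for s in range(0, W.shape[0], chunk):
        logits = h @ W[s:s + chunk].t()                            # only a (t, chunk) slice exists at a time
        lse = torch.logaddexp(lse, torch.logsumexp(logits, dim=-1))
    return (lse - correct_logit).mean()

t, d, vocab = 512, 256, 32000
h, W = torch.randn(t, d), torch.randn(vocab, d) * 0.02
targets = torch.randint(vocab, (t,))
assert torch.allclose(chunked_cross_entropy(h, W, targets),
                      F.cross_entropy(h @ W.t(), targets), atol=1e-4)
```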

Basic

Parallelization

1) Medusa

Outputs top-k predictions for multiple future positions in parallel by adding extra LM heads for the next several positions, which reduces inference latency.

*(figure: scalability)*
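A minimal sketch of the extra heads (the residual-MLP head shape loosely follows the paper; the tree-based candidate verification step is omitted, and all sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaHead(nn.Module):
    """One extra decoding head: small residual block followed by a vocabulary projection."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.lm = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h):
        return self.lm(h + F.silu(self.proj(h)))

hidden, vocab, n_heads, topk = 1024, 32000, 4, 5
heads = nn.ModuleList([MedusaHead(hidden, vocab) for _ in range(n_heads)])
h_last = torch.randn(1, hidden)                       # hidden state of the last generated token
# Head i proposes top-k candidates for position t+1+i, all from a single forward pass
candidates = [head(h_last).topk(topk, dim=-1).indices for head in heads]
```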

2) SnapKV

Compresses the KV cache for long-sequence tasks by keeping only the important key/value positions.
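A rough single-head sketch of the selection (window size, kept length, and the max-pool used for clustering are simplifications of the paper's recipe): prefix positions are scored by the attention they receive from an observation window at the end of the prompt, and only the top-scoring positions plus the window are kept.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(k, v, q, window=32, keep=256):
    # k, v: (n, d) cached keys/values for one head; q: (n, d) prompt queries
    q_obs = q[-window:]                                            # observation window queries
    scores = torch.softmax(q_obs @ k[:-window].t() / k.shape[-1] ** 0.5, dim=-1)
    votes = scores.sum(0)                                          # importance of each prefix position
    votes = F.max_pool1d(votes[None, None], 7, stride=1, padding=3)[0, 0]  # cluster neighbors
    idx = votes.topk(min(keep, votes.numel())).indices.sort().values
    keep_idx = torch.cat([idx, torch.arange(k.shape[0] - window, k.shape[0])])
    return k[keep_idx], v[keep_idx]

n, d = 8192, 128
k, v, q = (torch.randn(n, d) for _ in range(3))
k_small, v_small = snapkv_compress(k, v, q)                        # (256 + 32, 128) instead of (8192, 128)
```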

Infrastructure

1) Triton: a Python-embedded language and compiler for writing custom GPU kernels (see the kernel sketch after this list)

2) Hardware Acceleration of LLMs: A comprehensive survey and comparison

Briefly introduces and compares different hardware acceleration approaches for LLMs in terms of efficiency and performance.
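For reference, the standard Triton element-wise kernel pattern looks like the following (requires a CUDA GPU with the `triton` package installed; the block size is arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                            # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                            # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```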

Trick

1) Inference with Reference

Lossless Acceleration of Large Language Models: copies text spans from a reference document during decoding, because the output often contains many sentences identical to the reference, which accelerates inference.
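A toy, model-free sketch of the copy-then-verify loop (the `verify` stub stands in for a real LLM forward pass; match and copy lengths are arbitrary):

```python
from typing import List

def propose_from_reference(output: List[int], reference: List[int],
                           match_len: int = 4, copy_len: int = 8) -> List[int]:
    """Return up to `copy_len` draft tokens copied from the reference after a tail match."""
    tail = output[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == tail:
            return reference[i + match_len:i + match_len + copy_len]
    return []

def verify(draft: List[int], model_tokens: List[int]) -> List[int]:
    """Accept the longest prefix of the draft that the model would have produced anyway."""
    accepted = []
    for d, m in zip(draft, model_tokens):
        if d != m:
            break
        accepted.append(d)
    return accepted

reference = [5, 9, 2, 7, 7, 1, 3, 8, 8, 4, 6, 0]
output = [1, 5, 9, 2, 7]
draft = propose_from_reference(output, reference, match_len=3)
print(draft)                                   # [7, 1, 3, 8, 8, 4, 6, 0] copied from the reference
model_tokens = [7, 1, 3, 9]                    # pretend these are the model's own next tokens
print(verify(draft, model_tokens))             # [7, 1, 3] accepted in a single verification step
```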

2) SwitchHead

Accelerating Transformers with Mixture-of-Experts Attention: selects different expert projection matrices for each attention head based on the input content, reducing computation and memory usage.

+ published: 2024
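A rough sketch for a single head's value projection (expert count, top-k, and the sigmoid routing are illustrative simplifications): a small router picks a few expert projection matrices per token and mixes their outputs.

```python
import torch
import torch.nn as nn

class MoEValueProjection(nn.Module):
    """Value projection for one attention head, chosen per token from a set of experts."""
    def __init__(self, d_model, d_head, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, d_model, d_head) * 0.02)
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (n, d_model); each token uses only its top-k expert projections
        gate = torch.sigmoid(self.router(x))                  # (n, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)           # (n, top_k)
        chosen = self.experts[idx]                             # (n, top_k, d_model, d_head)
        v = torch.einsum("nd,nkde->nke", x, chosen)            # project with each chosen expert
        return (weights.unsqueeze(-1) * v).sum(1)              # (n, d_head)

proj = MoEValueProjection(d_model=512, d_head=64)
values = proj(torch.randn(16, 512))                            # per-token expert-mixed values
```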

3) DropBP:

Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation: randomly skips the backward pass through some layers while keeping the forward pass exact, with per-layer drop rates chosen by sensitivity, to reduce fine-tuning compute.

*(figure: scalability)*
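A minimal sketch of the mechanism for one residual block, assuming a fixed drop probability (the paper instead assigns per-layer rates based on sensitivity): the forward result is unchanged, but with probability `p` no gradients flow back through the block.

```python
import torch
import torch.nn as nn

class DropBPBlock(nn.Module):
    """Residual block whose backward pass is randomly skipped during training."""
    def __init__(self, block: nn.Module, drop_prob: float = 0.5):
        super().__init__()
        self.block, self.drop_prob = block, drop_prob

    def forward(self, x):
        h = self.block(x)
        if self.training and torch.rand(()) < self.drop_prob:
            h = h.detach()          # forward output unchanged, but no gradients flow through the block
        return x + h                # the residual path still carries gradients to earlier layers

layer = DropBPBlock(nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)))
out = layer(torch.randn(8, 64))
out.sum().backward()                # sometimes the inner block's parameters receive no gradient
```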

To Read

Quantization

Optimizer

RNN

Trick

Long sequence

2:4 (structured sparsity)

Pruning

cache

trade-off

PE