---
title: Outline of LLM acceleration
tags: [LLM, fine-tune, acceleration]
math: true
comments: true
pin: true
---
## Summary

### Methods

There are two main families of methods for accelerating LLMs, low-rank approximation and block-wise computation, plus a handful of additional tricks.

Papers read so far: 12


## Categories

### Low-rank

#### LoRA

Fine-tunes by learning a low-rank update to each large weight matrix while the pretrained weights stay frozen.
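
A minimal sketch of the idea in PyTorch (class and parameter names are mine, not from the LoRA codebase): the frozen weight $W$ is untouched, and only the low-rank factors $A$ and $B$ are trained, so the update $\Delta W = BA$ has rank $r \ll \min(d_\text{in}, d_\text{out})$.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)              # pretrained weight stays frozen
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero init => Delta W = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```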


#### Linformer

Adds learned projections that compress the length-$n$ key and value matrices down to a fixed size $k$ (motivated by an SVD analysis showing the attention matrix is approximately low-rank), reducing attention time and memory from $O(n^2)$ to $O(nk)$.
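
A rough single-head sketch in NumPy (the $E$ and $F$ projections are from the paper; the function name and shapes are my own framing):

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) learned projections along the sequence axis."""
    K_proj = E @ K                                   # (k, d): n keys compressed to k
    V_proj = F @ V                                   # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k) instead of (n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V_proj                          # (n, d)
```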

#### Performer

Approximates softmax attention with positive random features via a method named FAVOR+ (Fast Attention Via positive Orthogonal Random features), giving time and memory linear in sequence length.
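
A toy non-causal, single-head sketch of the random-feature trick, assuming plain Gaussian features (the actual FAVOR+ uses orthogonal random features and periodic redrawing):

```python
import numpy as np

def favor_attention(Q, K, V, m=256, seed=0):
    """Linear-time approximation of softmax attention with positive random features."""
    d = Q.shape[-1]
    W = np.random.default_rng(seed).standard_normal((m, d))
    def phi(X):
        Xs = X / d ** 0.25                  # absorb softmax's 1/sqrt(d) scaling
        return np.exp(Xs @ W.T - (Xs ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)
    Qp, Kp = phi(Q), phi(K)                 # (n, m) feature maps
    num = Qp @ (Kp.T @ V)                   # O(n m d) instead of O(n^2 d)
    den = Qp @ Kp.sum(axis=0)               # softmax normalizer, shape (n,)
    return num / den[:, None]
```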

### Block

#### FlashAttention

Tiles the attention computation into blocks that fit in on-chip SRAM and fuses everything into one kernel, so the full $n \times n$ attention matrix is never materialized in GPU memory.

#### Self-attention Does Not Need $O(n^2)$ Memory

Computes attention block by block with running accumulators (online softmax), so memory no longer grows quadratically with sequence length.
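
A NumPy sketch of the idea shared by this paper and FlashAttention, blockwise accumulation with an online softmax (FlashAttention additionally fuses this loop into a single GPU kernel; all names here are mine):

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Exact attention, but the (n, n) score matrix is never materialized."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                    # running row-wise max
    l = np.zeros(n)                            # running softmax denominator
    for s in range(0, n, block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        scores = Q @ Kb.T / np.sqrt(d)         # (n, b) scores for this block only
        m_new = np.maximum(m, scores.max(-1))
        p = np.exp(scores - m_new[:, None])
        fix = np.exp(m - m_new)                # rescale earlier partial results
        l = l * fix + p.sum(-1)
        out = out * fix[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]
```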

#### FlashDecoding++

FlashDecoding++: Faster Large Language Model Inference on GPUs. Three parts: asynchronized softmax with a unified max value, flat-GEMM optimization with double buffering, and heuristic dataflow with hardware-resource adaptation.
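
A tiny sketch of the first part, assuming an illustrative fixed max value `phi` (the paper picks it from profiled score statistics and falls back to recomputation on overflow):

```python
import numpy as np

def unified_max_softmax(scores, phi=10.0):
    """Softmax using a fixed value phi instead of the true row max.

    Because phi is known in advance, partial sums over different KV chunks
    can be computed independently, with no cross-chunk max synchronization.
    """
    e = np.exp(scores - phi)
    return e / e.sum(-1, keepdims=True)
```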

### Basic

### Parallelization

#### New Solutions on LLM Acceleration, Optimization, and Application

1) Medusa: adds an extra LM head for each of the next several positions, so the model proposes top-k candidates for those positions in parallel, which reduces inference latency (a sketch follows after this list).

(figure: scalability)

2) SnapKV: compresses the KV cache for long-sequence tasks by keeping only the past positions that recent queries attend to most (see the sketch below).
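
A minimal sketch of the Medusa-head idea (module and names are mine, not the released Medusa code, which adds residual blocks before each head):

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """K extra LM heads that each guess one of the next K positions."""
    def __init__(self, d_model, vocab_size, k_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k_heads))

    def forward(self, h_last, topk=5):
        # h_last: (batch, d_model), hidden state at the last generated position.
        # Returns top-k candidate token ids for each of the next k_heads positions;
        # candidates are then verified in one forward pass (tree attention in the paper).
        return [head(h_last).topk(topk).indices for head in self.heads]
```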
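
And a rough sketch of attention-guided KV-cache compression in the spirit of SnapKV (the real method scores positions with a recent observation window plus pooling; this selection rule is simplified and all names are mine):

```python
import torch

def compress_kv(keys, values, attn_weights, keep=512):
    """keys/values: (n, d); attn_weights: (n_obs, n) attention from recent queries.

    Keep only the `keep` past positions that recent queries attend to the most.
    """
    scores = attn_weights.sum(dim=0)                 # total attention each position received
    keep = min(keep, scores.numel())
    idx = scores.topk(keep).indices.sort().values    # preserve original ordering
    return keys[idx], values[idx]
```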

### Infrastructure

#### Triton

A Python-embedded DSL for writing custom GPU kernels, widely used to implement fused kernels for LLM workloads.
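
For flavor, the canonical vector-add kernel from the Triton tutorials looks roughly like this (the block size here is arbitrary):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```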

#### Hardware Acceleration of LLMs: A comprehensive survey and comparison

Briefly introduces and compares different hardware acceleration approaches in terms of efficiency and performance.

### Trick

(figure: scalability)

## To Read

+ RNN
+ Trick
+ Long sequence
+ 2:4 (structured sparsity)
+ Pruning
+ cache
+ trade-off
+ PE