Quantization
Compressing model weights to lower precision (4-bit, 8-bit) so they're cheaper to run.
Quantization shrinks a model by representing its weights with fewer bits: 16-bit becomes 8-bit, 8-bit becomes 4-bit, sometimes even 2-bit. The model becomes 2-8x smaller and faster, with surprisingly small quality loss when done well.
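The core trick is storing each weight as a small integer plus a shared scale factor. Here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy (function names are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0  # map the largest weight onto the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.05  # toy weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to the weights
```

Real methods refine this basic recipe: per-channel or per-group scales, non-uniform bins, and calibration against activations to decide which weights matter most.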
Quantization is the reason a 70B-parameter Llama can run on a laptop or phone today. Methods like GPTQ, AWQ, and bitsandbytes have made quality-preserving quantization mainstream.
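In practice you rarely quantize by hand. As a sketch, at the time of writing the Hugging Face transformers library can load a model in 4-bit NF4 via its bitsandbytes integration (the model ID below is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example model; substitute your own

# Quantize weights to 4-bit NF4 at load time; compute runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

A 7B model that needs roughly 14 GB in 16-bit fits in around 4-5 GB this way, which is what puts it within reach of consumer GPUs and laptops.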
For production self-hosting, quantization is one of the most effective optimizations. For frontier model APIs you usually don't see it directly, but providers absolutely use it under the hood.