Quantization
Compressing model weights to lower precision (4-bit, 8-bit) so they're cheaper to run.
Quantization shrinks a model by representing its weights with fewer bits: 16-bit becomes 8-bit, 8-bit becomes 4-bit, sometimes even 2-bit. The model becomes 2-8x smaller and faster, with surprisingly small quality loss when done well.
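The core trick is storing each weight as a small integer plus a shared scale factor. Here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy (function names are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0  # map the largest weight onto the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32) * 0.05  # toy weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to the weights
```

Real methods refine this basic recipe: per-channel or per-group scales, non-uniform bins, and calibration against activations to decide which weights matter most.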
Quantization is the reason a 70B-parameter Llama can run on a laptop or phone today. Methods like GPTQ, AWQ, and bitsandbytes have made quality-preserving quantization mainstream.
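In practice you rarely quantize by hand. As a sketch, at the time of writing the Hugging Face transformers library can load a model in 4-bit NF4 via its bitsandbytes integration (the model ID below is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example model; substitute your own

# Quantize weights to 4-bit NF4 at load time; compute runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

A 7B model that needs roughly 14 GB in 16-bit fits in around 4-5 GB this way, which is what puts it within reach of consumer GPUs and laptops.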
For production self-hosting, quantization is one of the most effective optimizations. For frontier model APIs you usually don't see it directly, but providers absolutely use it under the hood.