
What is LLM Quantization and How Do You Use It?

May 30, 2025 • 4 min read


Introduction

As parameter counts climb, large language models are becoming too large to store in memory on ordinary hardware. Data centers are built to run these full-sized LLMs with hundreds of billions of parameters. As local AI and running models on personal hardware become more common, quantized LLMs have become a staple for running models locally.

What is Quantization in LLMs

Quantization reduces model size by lowering the numerical precision of an LLM's weights. Using this technique, models maintain most of their performance while significantly reducing memory and computational requirements.

Most LLMs are neural networks whose weights are stored in FP32 (32-bit floating-point) format. Representing those values with fewer bits, for example 8 bits, saves memory across the board: a 7-billion-parameter model takes roughly 28 GB in FP32 but only about 7 GB in INT8.

During quantization, higher-precision numbers (like FP32) are converted to lower-precision formats (like INT8 or FP8). This process dramatically reduces model size. While this sacrifices some precision, the impact on performance is minimal, resulting in a faster and more memory-efficient model.
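
To make this concrete, here is a minimal NumPy sketch of the core idea: pick a scale that maps the largest weight magnitude onto the INT8 range, round, and store 8-bit integers instead of 32-bit floats. Real quantization pipelines work per-channel or per-block and fuse these steps into optimized kernels, so treat this purely as an illustration.

```python
import numpy as np

# Mock FP32 weight matrix standing in for one layer of an LLM.
weights = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric quantization: map the largest magnitude onto the INT8 range [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)   # FP32 -> INT8
dequantized = q_weights.astype(np.float32) * scale      # back to FP32 for comparison

print(f"FP32 size: {weights.nbytes / 1e6:.1f} MB")       # ~67 MB
print(f"INT8 size: {q_weights.nbytes / 1e6:.1f} MB")     # ~17 MB, a 4x reduction
print(f"Mean absolute error: {np.abs(weights - dequantized).mean():.5f}")
```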

Understanding Quantization Types (e.g., Q4_0, Q4_K_M)

Quantization types describe how a model’s weights and activations are compressed to reduce size and improve performance, particularly on consumer-grade GPUs. Each quantization scheme has its own trade-offs between speed, memory usage, and output quality.

If you're unsure which quantization type to choose, start with Q4_K_M for most use cases: it offers a strong balance between performance and quality on modern GPUs.

Here's how to read the components of common quantization labels:

  • The prefix number refers to the number of bits used per weight. Lower-bit quantization saves memory and boosts speed, but hurts accuracy if pushed too far.
    • Q4 - 4-bit quantization (high compression)
    • Q5 - 5-bit (slightly better accuracy)
    • Q8 - 8-bit (near full precision, but larger)
  • The suffix identifies the specific quantization scheme used (a simplified sketch of the block-wise idea follows the examples below).
    • _0 - original baseline method with one uniform scale per block of weights
    • _K - the newer k-quant family, which quantizes weights in groups (blocks) with their own scales for higher accuracy than _0
    • _S - "small" variant of a k-quant: keeps fewer tensors at higher precision, giving a smaller file
    • _M - "medium" variant: keeps more tensors at higher precision, further improving accuracy

These suffixes appear in combination. For example:

  • Q4_K_S - 4-bit k-quant, small variant (smallest file, slightly lower accuracy)
  • Q5_K_M - 5-bit k-quant, medium variant (better accuracy at a modest size cost)
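
As a rough illustration of what the _K-style block-wise schemes do, the sketch below quantizes weights in small blocks, each with its own scale and minimum, down to 4-bit codes. The block size and storage format here are simplified assumptions; actual k-quants use super-blocks and pack two 4-bit codes per byte.

```python
import numpy as np

def quantize_blockwise_4bit(weights: np.ndarray, block_size: int = 32):
    """Quantize a flat FP32 array to 4-bit codes (0..15), one scale and min per block."""
    blocks = weights.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0  # 4 bits -> 16 levels
    codes = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return codes, scales, mins

def dequantize_blockwise_4bit(codes, scales, mins):
    """Reconstruct approximate FP32 weights from the per-block codes, scales, and mins."""
    return (codes.astype(np.float32) * scales + mins).reshape(-1)

w = np.random.randn(32_768).astype(np.float32)
codes, scales, mins = quantize_blockwise_4bit(w)
w_hat = dequantize_blockwise_4bit(codes, scales, mins)
print(f"Mean absolute error: {np.abs(w - w_hat).mean():.4f}")
```

Because each block gets its own scale, outliers in one block no longer stretch the quantization grid for the whole tensor, which is why the _K schemes hold up better than the older _0 method at the same bit width.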

| Type | Speed | Accuracy | Best For |
| --- | --- | --- | --- |
| Q4_0 | ⭐⭐⭐⭐ | ⭐⭐ | Maximum speed, limited accuracy |
| Q4_K_M | ⭐⭐⭐ | ⭐⭐⭐ | Balanced choice for general use |
| Q5_K_M | ⭐⭐ | ⭐⭐⭐⭐ | Higher quality with a moderate speed tradeoff |
| Q8_0 | ⭐ | ⭐⭐⭐⭐⭐ | Highest accuracy, requires more VRAM; closest to the full model |

How to Run a Quantized LLM?

Many resources are available for running open-source quantized LLMs. On Hugging Face, you can search for these models by filtering for 8-bit and 4-bit precision. These models are ready to download and use in a Python notebook for testing, or to run in your own code at home.
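
For instance, here is a hedged sketch of loading a model in 4-bit with the Hugging Face transformers library and bitsandbytes (this requires a CUDA GPU and the bitsandbytes package). The model ID is just a placeholder; swap in any causal LM you have access to. Pre-quantized GPTQ or AWQ checkpoints from the Hub can typically be loaded with from_pretrained directly, given the right backend packages.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; use any causal LM you have access to

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common default for LLMs
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision dtype for the matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)

prompt = "Explain LLM quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```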

Quantized models are also accessible through Ollama. Simply find your preferred model and check whether quantized versions are available. For example, Gemma3 offers both Q8 and Q4 models, and DeepSeek R1 provides Q8 and Q4 versions along with distilled quantized variants for even more memory savings.
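
If you prefer to drive Ollama from Python, the official ollama client (pip install ollama) can pull and chat with a quantized model against a locally running Ollama server. The tag below is an assumption for illustration; check each model's page on the Ollama library for the exact quantized tags it ships.

```python
import ollama  # pip install ollama; assumes the Ollama server is running locally

# Hypothetical quantized tag for illustration; check the model's page on the
# Ollama library for the exact tags available (e.g., a q4_K_M or q8_0 variant).
model_tag = "gemma3:12b-it-q4_K_M"

ollama.pull(model_tag)  # downloads the quantized weights if not already present

response = ollama.chat(
    model=model_tag,
    messages=[{"role": "user", "content": "In one sentence, what does Q4_K_M mean?"}],
)
print(response["message"]["content"])
```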

Read our blog on How to Set Up DeepSeek R1 as a Local LLM to learn how to deploy your own LLM at home. If you're not into DeepSeek, pull any model you like from Ollama.

Conclusion

Think of quantization like lowering an image's resolution: the image loses some detail and looks more pixelated up close, but from a distance it retains its most important features. Quantization is a breakthrough in making LLMs more accessible and practical. Here are the key benefits:

  • Significantly reduced memory requirements
  • Compatibility with lower-powered systems
  • Faster inference speeds

While quantization offers clear advantages in resource efficiency, organizations should carefully evaluate its impact on model accuracy through testing. When implemented correctly, this technique effectively balances accessibility and performance, making it valuable for deploying LLMs in environments with limited resources.

The trade-off between model accuracy and efficiency has made quantization a crucial optimization strategy for deploying and democratizing open-source LLMs locally. It lets us run these large language models on hardware across the performance spectrum, from laptops for proof-of-concept work to multi-GPU servers in production.


Tags: deep learning, ai, llm, local llm


