
Quantizing LLMs to Run on Smaller Systems with Llama.cpp


This podcast turned into a journey down a rabbit hole exploring Llama.cpp and the tradeoffs of quantizing LLMs to fit onto smaller systems, such as laptops, for inference.

“In the beginning…”

Yulia started by reminding us what it was like to run inference on LLMs before Llama.cpp came into being – you needed to rent a lot of expensive hardware.  One of our audience participants checked the AWS spot price for H100 GPU inference, which was $12.32 / GPU / hour.


“Then something happened…”

And that happening was the leaking of Meta’s LLaMA LLM via torrent.


Which sparked the creation of Llama.cpp barely a week later:


Finally, a key commit from Justine Tunney just a few weeks later made Llama.cpp suitable for bringing LLM inference to commodity hardware.


This compressed timeline shows how rapid innovation is possible in open source communities.  It also transformed what might originally have been seen as a tragedy, the leaking of Meta’s model, into an opportunity.  Suddenly Meta was seeing the benefits of code optimization from the open source community that it did not need to build in-house.

Yulia discussed the flexibility of Llama.cpp, such as its ability to support different hardware and different transformer models.

A Quick Dive into Quantization

We then switched to discussing quantization, which essentially reduces the bit width of the weights in an LLM in order to shrink its size, trading precision for performance.  We discussed different techniques and also compared quantization to “pruning”: pruning eliminates parameters that are not important to an application, rather than reducing their precision.
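
As a rough illustration of what quantization does mechanically, here is a toy Python sketch of symmetric per-tensor quantization to a chosen bit width, followed by dequantization so the rounding error is visible.  It is not Llama.cpp’s actual scheme (formats such as Q4_K quantize in small blocks, each with its own scale); it only shows the core idea of trading precision for fewer bits.

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Toy symmetric per-tensor quantization to `bits`, then back to float32.

    Illustration only; llama.cpp's real formats (e.g. Q4_K) quantize in
    small blocks, each with its own scale.
    """
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for 4 bits
    scale = np.abs(weights).max() / qmax      # map the largest weight to qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)  # integer codes
    return (q * scale).astype(np.float32)     # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # stand-in weight tensor
for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean absolute rounding error: {err:.6f}")
```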

The bottom line is that reducing precision is a tradeoff, but it may still work out well depending on the application.  Yulia shared a graph of LLM scores on the HellaSwag benchmark vs. bit width, which shows that scores only decline steeply below 4 bits.
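
To make the size side of that tradeoff concrete, here is a back-of-the-envelope sketch of how a model’s weight footprint shrinks with bit width; the 7B parameter count is just an illustrative example, and real GGUF files also store per-block scales and metadata.

```python
# Back-of-the-envelope weight memory for a 7B-parameter model
# (illustrative only; real GGUF files add per-block scales and metadata).
params = 7e9
for bits in (16, 8, 4, 2):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gigabytes:.2f} GB")
# Roughly: 14 GB at 16-bit, 7 GB at 8-bit, 3.5 GB at 4-bit, 1.75 GB at 2-bit.
```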


We geeked out about 1-bit quantization and then agreed we need to do further deep dives into Llama.cpp and quantization in future episodes. Yulia shared an example of a pull request against Llama.cpp pushing toward ever tighter quantization.
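
For a sense of how extreme 1-bit quantization is, here is a toy sign-plus-scale binarization in the same spirit as the earlier sketch; it is just an illustration, not the scheme from the pull request.

```python
import numpy as np

def binarize(weights: np.ndarray) -> np.ndarray:
    """Toy 1-bit quantization: keep only the sign of each weight plus a
    single per-tensor scale (the mean magnitude); not the format from
    the pull request, just an illustration of how little survives."""
    scale = np.abs(weights).mean()
    return (np.sign(weights) * scale).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
print(f"1-bit mean absolute error: {np.abs(w - binarize(w)).mean():.6f}")
```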


If this topic interests you, watch the replay video, as this recap is merely a summary.  I also recommend subscribing to the AIFoundry.org calendar to keep abreast of upcoming live podcasts and community meetings.