
How Quantization Improves Efficiency of Large Neural Networks & LLMs

Written by Greg Chase | Aug 19, 2024 12:00:00 PM

In this episode, Jan Akhremchik, an AI Researcher with AIFoundry, talks with me about different ways to improve the efficiency of neural networks and LLMs. Machine learning models have always been constrained by their size, but something exciting happened in 2017: Google Research released the paper “Attention Is All You Need,” which described the Transformer architecture they used to dramatically improve text-translation results.

The Transformer architecture allowed neural networks to capture relationships across much longer spans of data, which made larger models more beneficial. The AI community applied this approach to large language models (LLMs), and suddenly we saw an exponential increase in their size, because the dramatic improvement in the quality of results was worth the higher expense of training these models. However, even with that improvement in quality, we are now dealing with far more expensive compute and storage costs to train these models and run inference on them.

Jan then briefly describes three techniques that improve the efficiency of model inference: pruning, knowledge distillation, and quantization.

With pruning, we remove weights and connections that contribute little to the problems we are looking to solve. With knowledge distillation, we use the larger model as a teacher to train a smaller model that is specialized for our use case; incidentally, knowledge distillation is one recommended use case for the recently released Llama 3.1 model. Finally, with quantization, we reduce the precision of the weights, for example by converting 32-bit floating-point numbers into 8-bit integers, to shrink the overall size of the model.
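
To make the teacher/student idea concrete, here is a minimal sketch of one knowledge-distillation training step in PyTorch. The tiny linear "teacher" and "student", the toy batch, the temperature T, and the blend weight alpha are all hypothetical stand-ins for illustration, not details from the episode.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(32, 10)            # stand-in for the large pretrained model
student = nn.Linear(32, 10)            # smaller model we actually want to deploy
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

inputs = torch.randn(8, 32)            # a toy batch
labels = torch.randint(0, 10, (8,))    # ground-truth classes
T, alpha = 2.0, 0.5                    # temperature and loss blend weight

with torch.no_grad():
    teacher_logits = teacher(inputs)   # the teacher provides soft targets
student_logits = student(inputs)

# Soft loss: push the student toward the teacher's temperature-smoothed distribution.
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# Hard loss: the student still learns from the ground-truth labels.
hard_loss = F.cross_entropy(student_logits, labels)

loss = alpha * soft_loss + (1 - alpha) * hard_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```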

Jan then dives deeper into how quantization works. Long story short, we map the values of the original model's weights into a smaller range (such as 8-bit integers), keeping the same proportions so that we maintain an approximation of the original relationships between values.
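
To make that mapping concrete, here is a minimal sketch of symmetric 8-bit quantization in NumPy. The random tensor is just a stand-in for one layer's real weights, and choosing a single scale factor from the observed range is only one of several possible schemes.

```python
import numpy as np

# A random tensor standing in for one layer's fp32 weights.
weights = np.random.randn(4, 4).astype(np.float32)

# One scale factor maps the observed float range onto the int8 range [-127, 127].
scale = np.abs(weights).max() / 127.0

# Quantize: same relative proportions, much lower resolution (1 byte per value).
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize: recover an approximation of the original values.
approx = q_weights.astype(np.float32) * scale
print("max absolute error:", np.abs(weights - approx).max())
```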

Jan then explains that there are two different ways to apply quantization: after training a model, or as part of the training process. With post-training quantization, you take a model as-is and then quantize it to fit your size requirements. With quantization-aware training, quantization is factored into training itself, so the model learns to perform well at the reduced precision for your use case and size requirements. The tradeoff is that post-training quantization is easier and potentially faster, such as when you start from a pre-existing third-party model like Llama 3.1. However, quantization-aware training generally produces a better resulting model, since the whole process takes your use case and size requirements into account in advance.
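
As a rough illustration of where quantization happens in each workflow, here is a minimal NumPy sketch. The toy weights, data, learning rate, and loop count are hypothetical placeholders, not anything discussed in the episode.

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto the int8 grid and return the scale used."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def fake_quantize(w):
    """Round-trip through int8 so training sees the quantization error."""
    q, scale = quantize_int8(w)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4)).astype(np.float32)
target = rng.standard_normal((16, 4)).astype(np.float32)

# Post-training quantization: take finished (or third-party) weights
# and quantize them once at the end.
pretrained = rng.standard_normal((4, 4)).astype(np.float32)
ptq_weights, ptq_scale = quantize_int8(pretrained)

# Quantization-aware training: simulate the int8 grid inside the forward
# pass on every step, so the weights adapt to the reduced precision.
weights = rng.standard_normal((4, 4)).astype(np.float32)
for _ in range(200):
    y_hat = x @ fake_quantize(weights)          # forward pass on the int8 grid
    grad = x.T @ (y_hat - target) / len(x)      # mean-squared-error gradient
    weights -= 0.1 * grad                       # update the full-precision copy
qat_weights, qat_scale = quantize_int8(weights) # final conversion after training
```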

We then discuss prerequisites, starting points, and business requirements that shape how you select and fit your models. These include whether you have your own data, whether you have access to machine learning talent, and how much budget and time you have. For example, if time to market is paramount, you might use an AI service such as OpenAI or Anthropic, or start with a foundation model such as Llama 3.1. However, if you have access to machine learning talent, sufficient data, and enough time and money, you might train your own neural networks or LLMs that are optimal for your use case.