Halloween has passed, but I am still somewhat in shock every day as I do my work and try to figure out how the AI world operates at different levels.
And continuing chapter one: what about hardware optimization, and, more precisely, decreasing compute costs through it? In short, our team at AIFoundry.org and part of the community are working on two related potential solutions here: quantization and, specifically, matrix multiplication on quantized data.
When working with quantized models, matrix multiplication is especially critical. Quantization is a process in which the weights of a neural network are compressed into smaller data types (such as 8-bit or even 4- or 2-bit integers) to reduce the memory footprint and speed up computations, while the accuracy on certain tasks can stay almost the same! However, this compression brings challenges when trying to execute matrix multiplication directly on quantized data.
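To make this concrete, here is a minimal numpy sketch of symmetric per-tensor int8 quantization. It is only an illustration: real formats (for example llama.cpp's block-wise schemes such as Q4_0) quantize in small blocks with a scale per block, and all names and sizes below are made up for the example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0                         # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)    # stand-in for a trained weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("fp32 size:", w.nbytes // 2**20, "MiB")               # 64 MiB
print("int8 size:", q.nbytes // 2**20, "MiB")               # 16 MiB, 4x smaller
print("max rounding error:", float(np.abs(w - w_hat).max()))
```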
So how do low-level optimizations for AI models deal with the intricacies of working with quantized data, particularly in the context of the llama.cpp library, which supports quantized models? Here is the breakdown:
- Matrix multiplication is at the heart of neural network operations. When dealing with quantized data, the common approach is to first dequantize the compressed data (convert it back to floating-point values) before applying traditional matrix multiplication libraries such as cuBLAS. However, this dequantization step adds time and memory overhead, which undermines the performance benefits of quantization; a rough sketch of this conventional path follows the list below.
- The real challenge is developing matrix multiplication algorithms (or "kernels") that can operate directly on quantized data without needing to dequantize it first. This would eliminate the intermediate conversion and reduce the I/O (input/output) overhead, improving the overall speed of computations.
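For reference, the conventional path from the first bullet looks roughly like the sketch below, with np.matmul standing in for a cuBLAS GEMM call. The point to notice is the intermediate w_float, which is as large as the original unquantized matrix, so extra memory traffic and an extra conversion pass are paid on every multiplication. Again, this is a simplified illustration, not how any particular library implements it.

```python
import numpy as np

def matmul_dequantize_first(x, q_w, scale):
    """Conventional path: expand int8 weights back to float, then do a float GEMM."""
    w_float = q_w.astype(np.float32) * scale   # dequantization: extra time and memory traffic
    return x @ w_float                         # np.matmul stands in for a cuBLAS GEMM here

rng = np.random.default_rng(1)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
scale = np.abs(w).max() / 127.0
q_w = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

x = rng.standard_normal((1, 4096)).astype(np.float32)       # a single activation vector
y = matmul_dequantize_first(x, q_w, scale)
print(y.shape)                                               # (1, 4096)
```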
Matrix Multiplication Scaling Problem
In mathematical terms, matrix multiplication scales with the size of the matrices. The time complexity of multiplying two n x n matrices is proportional to the cube of the matrix size (O(n^3)), while the amount of data moved in and out of memory scales with the square of the size (O(n^2)). This mismatch means that large multiplications are dominated by arithmetic, but smaller matrices and matrix-vector products leave the computation I/O-bound. Dequantizing the data exacerbates this problem by increasing the amount of data that needs to be moved in and out of memory.
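A rough back-of-the-envelope model (assuming square n x n matrices, counting 2n^3 multiply-add FLOPs and only the raw matrix traffic of 3n^2 elements) shows how the compute-to-I/O ratio grows with size, and why small multiplications starve on data movement:

```python
def compute_vs_io(n, bytes_per_element):
    """Very rough model for multiplying two n x n matrices.

    flops ~ 2 * n^3                  (one multiply + one add per inner-product term)
    bytes ~ 3 * n^2 * element size   (read A, read B, write C)
    """
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_element
    return flops, bytes_moved, flops / bytes_moved           # FLOPs available per byte moved

for n in (64, 512, 4096):
    flops, io, intensity = compute_vs_io(n, bytes_per_element=2)   # fp16 operands
    print(f"n={n:5d}  flops={flops:.2e}  bytes={io:.2e}  flops/byte={intensity:8.1f}")
```

Keeping the operands in int8 halves bytes_per_element relative to fp16 (and quarters it relative to fp32), which is exactly the traffic a dequantize-first pipeline gives back.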
An Alternative, Direct Approach: Custom Matrix Multiplication Kernels
The solution some community explorers are pursuing is to develop custom matrix multiplication kernels that operate directly on quantized data. This allows them to skip the dequantization step and instead use integer operations, which are more efficient than floating-point operations. While this is theoretically faster, it requires writing highly specialized code that can take months to optimize, often yielding only marginal performance gains.
In practice, these kernels aren't yet faster than established libraries like cuBLAS, but they offer a path forward to more efficient computations by leveraging the greater throughput of integer operations (int8) compared to floating-point operations (fp16).
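Here is a toy numpy version of the direct idea: keep the weights in int8, accumulate the integer dot products, and apply the two scales once at the end, so no dequantized copy of the weight matrix is ever materialized. Real kernels (CPU SIMD or CUDA, as in llama.cpp's implementations) do this per tile in registers; numpy has no true int8 GEMM, so the widening to int32 below is only there to make the toy runnable.

```python
import numpy as np

def quantize_int8(m):
    """Symmetric per-tensor int8 quantization (same scheme as in the earlier sketch)."""
    scale = np.abs(m).max() / 127.0
    return np.clip(np.round(m / scale), -127, 127).astype(np.int8), scale

def matmul_quantized_direct(q_x, x_scale, q_w, w_scale):
    """Multiply directly on quantized data: integer accumulation, one rescale at the end."""
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32)        # integer dot products
    return acc.astype(np.float32) * (x_scale * w_scale)      # single conversion, on the small output

rng = np.random.default_rng(2)
x = rng.standard_normal((1, 1024)).astype(np.float32)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

q_x, sx = quantize_int8(x)
q_w, sw = quantize_int8(w)

y_direct = matmul_quantized_direct(q_x, sx, q_w, sw)
y_reference = x @ w                                          # full-precision reference
print("max abs difference vs fp32:", float(np.abs(y_direct - y_reference).max()))
```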
The Goal: Training Quantized Models Directly
A major goal of this work is to make it possible to train models directly on quantized data without reverting to floating-point precision. Currently, training is typically done at higher precision (such as fp16), and the model is quantized afterward, which introduces rounding errors. The aim is to train models on quantized data from the start, which could reduce these errors and lead to more efficient training processes.
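One well-known ingredient of low-precision training schemes, mentioned here purely as an illustration and not as what the community kernels actually implement, is stochastic rounding: if weights live on a coarse grid, a gradient update smaller than one grid step would otherwise be rounded away on every iteration. A toy sketch of the effect:

```python
import numpy as np

rng = np.random.default_rng(4)

def stochastic_round(x):
    """Round to a neighbouring integer with probability proportional to proximity."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

n = 10_000
w = np.zeros(n)                    # weights stored on an integer grid (grid step = 1.0 here)
update = np.full(n, 0.004)         # a gradient step worth far less than one grid step

deterministic = np.round(w + update)        # every update is rounded back to zero
stochastic = stochastic_round(w + update)   # ~0.4% of weights move by one full step

print("mean weight after deterministic rounding:", deterministic.mean())        # 0.0
print("mean weight after stochastic rounding:   ", round(stochastic.mean(), 4)) # ~0.004 on average
```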
Making AI Affordable: Own Your Own AI
Additionally, the ability to train on lower-end hardware, such as consumer GPUs, is another target. While high-end GPUs like the H100 are costly, enabling training on more accessible hardware (like an RTX 4090), even at reduced speed, would open up fine-tuning capabilities to a broader range of users. An important goal for a significant part of the community is to make model training and inference approachable for a bigger audience, one that is genuinely engaged in and inspired by contributing to tech development.
To conclude, optimizing matrix multiplication for quantized models is both complex and important. By skipping the dequantization step and operating directly on compressed data, AI practitioners can improve speed and reduce memory overhead. However, achieving this requires significant development effort in writing custom matrix multiplication kernels. The future of quantized model training may involve a hybrid approach, where models are trained on lower-precision data from the start, allowing broader accessibility to fine-tuning on consumer-grade hardware.
___
AIFoundry.org Team