Local AI: what in fact is local models inference?

So this is definitely not an instruction post, but rather my naive effort to build a top-down picture of a system we can call “local models inference”. I am doing this by looking for the right questions and asking them - we all remember The Hitchhiker's Guide to the Galaxy, right? Having an answer does not get you anywhere if you do not have the right question.

So, what does the local models inference stack actually include? Which subset of engineering building blocks and functionalities can we label this way?

 

What does local inference look like in theory?

Building an open-source local AI model inference engine involves incorporating several core components, each playing a key role in ensuring that the model functions effectively and efficiently. At a high level, an inference engine theoretically works like this (a minimal sketch follows the list):

  1. Load Model: the model loader reads and loads the trained model into memory.
  2. Preprocessing: input data is prepared, such as tokenization for text or resizing for images.
  3. Run Inference: preprocessed data is fed to the model using the inference engine, which utilizes framework-specific libraries.
  4. Optimize Execution: apply optimizations, e.g. pruning, adding/removing layers, and quantization, to speed up inference.
  5. Hardware Utilization: use hardware-specific operations, such as selecting an appropriate instance/cluster and enabling GPU acceleration.
  6. Postprocessing: format the raw output to meaningful data, e.g., labels or bounding boxes.
  7. Interface Handling: the input-output interface handles incoming requests and returns model predictions.
  8. Log and Monitor: track the requests, performance metrics, and errors.
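
To make the flow above concrete, here is a minimal sketch of steps 1-3 and 6 in Python, assuming the Hugging Face transformers library; the model name and generation parameters are purely illustrative assumptions, not a recommendation.

```python
# Minimal sketch of a local inference pipeline (steps 1-3 and 6 above).
# Assumes the Hugging Face `transformers` library; the model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # hypothetical choice

# 1. Load Model: read the trained weights into memory (CPU or GPU).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def infer(prompt: str) -> str:
    # 2. Preprocessing: tokenize the input text.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # 3. Run Inference: feed the preprocessed data to the model.
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # 6. Postprocessing: turn raw token ids back into meaningful text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(infer("What is local model inference?"))
```

Steps 4, 5, 7 and 8 (optimization, hardware selection, the request/response interface and logging) would wrap around this core rather than replace it.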

 

In terms of how far Inference-as-a-Service can be abstracted away, companies currently doing this seem to provide two levels of local system configuration:

  • API-only experience: no GPU/CPU or hardware customization available here
  • Certain customization of the local inference system. In this case you can basically customize two blocks: 
    • Container image configuration: defining images in Python or via a YAML file
    • GPU resources: define which GPUs to use, in which clusters, and for which tasks (a hypothetical sketch of both blocks follows this list)
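
To give a feel for what those two blocks might look like as configuration, here is a purely hypothetical Python sketch; no real provider's API is implied, and every name in it is invented.

```python
# Purely hypothetical sketch of the two customizable blocks; all names are invented.
from dataclasses import dataclass, field

@dataclass
class ContainerImage:
    base: str = "python:3.11-slim"             # base image to build on
    pip_packages: list = field(default_factory=lambda: ["torch", "transformers"])
    env: dict = field(default_factory=dict)    # environment variables

@dataclass
class GPUResources:
    gpu_type: str = "A100"                     # which GPUs to use
    gpu_count: int = 1
    cluster: str = "on-prem-k8s"               # in which cluster
    task: str = "text-generation"              # for which task

@dataclass
class InferenceDeployment:
    image: ContainerImage
    gpus: GPUResources

deployment = InferenceDeployment(
    image=ContainerImage(env={"HF_HOME": "/models"}),
    gpus=GPUResources(gpu_type="L4", gpu_count=2),
)
print(deployment)
```

The same information could equally be expressed as a YAML file; the point is that these two blocks are the surface the provider chooses to expose.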

 

So why exactly are these two blocks the ones predefined as customizable? Most probably because they are essential components that directly impact the performance and the environment in which the model operates. But what else can be customized?

 

Other customizable aspects include model deployment logic (e.g., batching strategies), pre-/post-processing data pipelines (e.g., normalization, tokenization), model optimization techniques (e.g., quantization), networking and APIs (latency and security), and memory/storage configurations (shared memory and caching), all of which further enhance the adaptability and performance of the system.
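
As one concrete example of that deployment logic, below is a minimal, framework-agnostic sketch of a dynamic batching strategy: requests are collected until a size or time limit is hit and then run through the model together. The `run_model` callable and the limits are placeholder assumptions, not any particular engine's API.

```python
# Minimal sketch of dynamic batching: group incoming requests so the model
# runs once per batch instead of once per request. `run_model` is a placeholder.
import queue
import threading
import time

MAX_BATCH_SIZE = 8        # illustrative limits
MAX_WAIT_SECONDS = 0.05

request_queue = queue.Queue()   # items are (prompt, reply_queue) pairs

def run_model(prompts):
    # Placeholder for a real batched forward pass.
    return [f"response to: {p}" for p in prompts]

def batching_loop():
    while True:
        batch = [request_queue.get()]            # block until at least one request
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        prompts = [prompt for prompt, _ in batch]
        for (_, reply_q), output in zip(batch, run_model(prompts)):
            reply_q.put(output)                  # hand each result back to its caller

def submit(prompt):
    reply_q = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()                         # wait for the batched result

threading.Thread(target=batching_loop, daemon=True).start()
print(submit("hello"))
```

Real engines refine this with padding-aware bucketing and continuous batching, but the trade-off stays the same: a little added latency per request in exchange for much better hardware utilization.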



In general, where should Inference-as-a-Service begin and end?

 

Moving bottom-up now, let's look at the three levels of abstraction when using models locally - how far can it go? Broadly, these are:

  • Model Deployment: streamlined deployment process for pre-trained and fine-tuned models.
  • Inference Optimization: advanced features such as batching, caching, and distributed processing for high-throughput environments (a small caching sketch follows this list).
  • Training & Fine-Tuning: support for distributed training and fine-tuning of models across multiple nodes and frameworks (PyTorch, TensorFlow).
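
To illustrate the optimization level with the simplest possible case, here is a sketch of response caching; `generate` is a stand-in for whatever engine sits underneath, and the LRU policy and size are assumptions of this example.

```python
# Simple sketch of response caching at the inference-optimization level.
# `generate` is a stand-in for the underlying engine; the LRU policy is illustrative.
from functools import lru_cache

def generate(prompt: str) -> str:
    # Placeholder for an expensive model call.
    return f"model output for: {prompt}"

@lru_cache(maxsize=1024)
def cached_generate(prompt: str) -> str:
    # Identical prompts are served from the cache instead of re-running the model.
    return generate(prompt)

print(cached_generate("What is local inference?"))  # computed by the model
print(cached_generate("What is local inference?"))  # returned from the cache
```

Production systems cache at finer granularity (for example, attention key/value state inside the model), but the abstraction question is the same: does the platform own this layer, or does the user?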

For me the question here is: which level(s) of abstraction should IaaS cover? How does that depend on a company's tech and infrastructure stack, data-privacy must-haves, the need for composability and controllability, and the regulations of a given jurisdiction?



What if we go deeper in the infrastructure?

 

Further questions come up when we dive into how those levels of abstraction can be organized in terms of the infrastructure.

    1. Inference Server Strategy: build a custom inference server or rely on existing solutions?
    2. Model Storage: should we integrate with public model repositories (e.g., Hugging Face) or build private storage?
    3. Model Transport & Optimization: what protocols should we support for transporting and optimizing models?
    4. Model Reference for Implementation: which models should we support by default (e.g., Llama3, GPT-based models)?
    5. Hardware Requirements: what hardware should be our primary focus for model inference and training?
    6. Kubernetes: do we focus on Kubernetes as a necessary technology for implementing the platform?

 

Building a use-case: what can the boundaries be?

Defining boundaries is always hard. If you build a local models inference engine, which lines should be drawn, and where? As I see it, there are two generic levels of such an engine: everything related to model handling and everything related to data handling.

Model handling:

  • Storage in a private repository
  • Fine tuning
  • Serving

Data handling:

  • Tools for data scientists/ML engineers, e.g. notebooks
  • Support for data pipelines
  • Data management, e.g. external dataset management, data augmentation

 

Should an effective local inference engine provide both levels as an AI-in-the-box? Or would that be too complicated (and, for example, lock in the company using it), meaning only the model handling level needs to be abstracted away?

 

We invite you to explore this with us - join our online and in-person events: https://lu.ma/aifoundryorg