This is definitely not a how-to post, but rather my naive effort to build a top-down picture of a system we can call “local model inference”. I am doing this by looking for the right questions and asking them - you all remember The Hitchhiker's Guide to the Galaxy, right? Having an answer does not get you anywhere if you do not have the right question.
So, what does the local model inference stack actually include? Which subset of engineering building blocks and functionalities can we call by that name?
What does local inference look like in theory?
Building an open-source local AI model inference engine involves several core components, each playing a key role in ensuring that the model runs effectively and efficiently. Here is how an inference engine works, theoretically, at a high level:
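Roughly: load the weights, pre-process and tokenize the input, run the model autoregressively token by token, sample, and detokenize the output. To make that shape concrete, here is a minimal, self-contained sketch in Python. Everything in it (ToyTokenizer, ToyModel, generate) is a stand-in I made up for illustration - it shows the pipeline, not the math or any real engine's API.

```python
# Minimal sketch of the classic inference loop:
# load weights -> tokenize prompt -> run the model step by step -> sample -> detokenize.
# The "model" is a deterministic stand-in (it always scores the next character in
# the alphabet highest), so the point is the shape of the pipeline, not the math.

from dataclasses import dataclass


@dataclass
class ToyTokenizer:
    """Character-level tokenizer: every character is its own token id."""
    vocab: str = "abcdefghijklmnopqrstuvwxyz "

    def encode(self, text: str) -> list[int]:
        return [self.vocab.index(c) for c in text if c in self.vocab]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.vocab[i] for i in ids)


@dataclass
class ToyModel:
    """Stand-in for the real network: maps a context to next-token scores."""
    vocab_size: int

    def forward(self, context: list[int]) -> list[float]:
        # A real engine would run the transformer here (ideally with a KV cache);
        # we just favour the token after the last one to stay deterministic.
        nxt = (context[-1] + 1) % self.vocab_size
        return [1.0 if i == nxt else 0.0 for i in range(self.vocab_size)]


def generate(prompt: str, max_new_tokens: int = 5) -> str:
    tokenizer = ToyTokenizer()                    # 0. load tokenizer and model ("weights")
    model = ToyModel(vocab_size=len(tokenizer.vocab))

    ids = tokenizer.encode(prompt)                # 1. pre-processing / tokenization
    for _ in range(max_new_tokens):               # 2. autoregressive decode loop
        scores = model.forward(ids)               # 3. model execution (the heavy part)
        next_id = scores.index(max(scores))       # 4. sampling (greedy here)
        ids.append(next_id)
    return tokenizer.decode(ids)                  # 5. post-processing / detokenization


if __name__ == "__main__":
    print(generate("abc"))  # -> "abcdefgh"
```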
In terms of how far Inference-as-a-Service can be abstracted away, the companies currently doing this seem to provide two levels of local system configuration:
So why exactly are these two blocks the ones predefined to be customizable? Most probably because they are essential components that directly impact the performance and the environment in which the model operates. But what else can be customized?
Other customizable aspects include model deployment logic (e.g., batching strategies), pre/post-processing data pipelines (e.g., normalization, tokenization), model optimization techniques (e.g., quantization), networking and APIs (latency and security), and memory/storage configurations (shared memory and caching), all of which further enhance the adaptability and performance of the system.
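If we tried to write those knobs down as a single configuration surface, it might look something like the sketch below. This is purely illustrative: the field names and defaults are my assumptions, not the API of any existing engine.

```python
# A hypothetical configuration surface for a local inference engine, mirroring the
# customizable aspects listed above. None of these field names come from a real
# product; they are assumptions to show what "customizable" could translate into.

from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineConfig:
    # Model deployment logic
    batching: str = "dynamic"             # e.g. "static", "dynamic", "continuous"
    max_batch_size: int = 8

    # Pre/post-processing data pipeline
    tokenizer: str = "sentencepiece"      # which tokenizer implementation to load
    normalize_unicode: bool = True

    # Model optimization
    quantization: Optional[str] = "int8"  # e.g. "int8", "int4", or None for fp16

    # Networking and API
    listen_addr: str = "127.0.0.1:8080"
    request_timeout_s: float = 30.0
    tls_enabled: bool = False

    # Memory / storage
    kv_cache_gb: float = 4.0              # memory budget for the KV cache
    mmap_weights: bool = True             # share weights across processes via mmap


if __name__ == "__main__":
    cfg = EngineConfig(quantization="int4", max_batch_size=16)
    print(cfg)
```

Even a toy config like this makes the next question sharper: which of these fields should an Inference-as-a-Service provider own, and which should stay with the user?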
In general, where should Inference-as-a-Service begin and end?
Moving bottom-up now, let's look at the three levels of abstraction when using models locally - how far can it go? Basically, there are:
For me, the question here is: which level(s) of abstraction should or need to be covered by IaaS? How does that depend on a company's tech and infrastructure stack, its data privacy requirements, its need for composability and controllability, and the regulations of its particular jurisdiction?
What if we go deeper into the infrastructure?
Further questions come up when we dive into how those levels of abstraction can be organized at the infrastructure level.
Building a use case: what can the boundaries be?
Defining boundaries is always hard. If you build a local model inference engine, which lines should be drawn, and where? As I understand it, there are two generic levels of such an engine: everything related to model handling and everything related to data handling.
Model handling:
Data handling:
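Even without pinning down exactly what falls under each bucket, the split itself can be sketched as two separate interfaces. The names below are invented for illustration only - this is one guess at where the boundary could sit, not a proposal for a concrete API.

```python
# A speculative sketch of the model-handling / data-handling boundary as two
# separate interfaces. The names are invented for illustration; a real engine
# could expose either one, both, or a single "AI-in-the-box" facade over them.

from typing import Protocol


class ModelHandler(Protocol):
    """Everything about the model itself: weights, runtime, execution."""

    def load(self, model_path: str) -> None: ...
    def unload(self) -> None: ...
    def infer(self, token_ids: list[int]) -> list[int]: ...


class DataHandler(Protocol):
    """Everything about the data around the model: pre- and post-processing."""

    def preprocess(self, raw_input: str) -> list[int]: ...
    def postprocess(self, token_ids: list[int]) -> str: ...


def run(model: ModelHandler, data: DataHandler, prompt: str) -> str:
    """The composition point where the two halves meet."""
    return data.postprocess(model.infer(data.preprocess(prompt)))
```

The `run` function is where an "AI-in-the-box" would wire the two halves together; whether the engine should own that wiring is exactly the open question.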
Should an effective local inference engine provide both levels as an AI-in-the-box? Or would that be too complicated (and, for example, lock in the company using it), so that only the model handling level needs to be abstracted away?
We invite you to explore this with us - join our online and in-person events: https://lu.ma/aifoundryorg