
Sizing Up Components of an AI Model & Prompt Testing System

[Image: a machine that looks like it's from a 1950s laboratory, with an analog meter showing the accuracy of an AI prompt.]

On Saturday, July 13, the AIFoundry.org community met for its first virtual AI Hack Lab. Our theme was building an automated prompt testing engine that could run application prompts many times against many models. The goals of such a system are to determine which models and model versions work best for a given series of prompts, and to test the accuracy of the responses to those prompts. It was a fun, fast few hours of lightning talks, collaborative design discussions, and, at the end, a bit of mob programming.

We started with Paul Zabelin demoing an example problem: how to keep your AI-driven application from talking like a pirate. Paul then demoed an internal prompt testing engine that his consulting company, Artium, uses to increase the accuracy of responses from OpenAI.
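The core idea behind such an engine can be sketched in a few lines: run the same prompt repeatedly, apply a pass/fail check to each response, and report a pass rate. The sketch below is illustrative only and not Artium's actual engine; `call_model` is a stub standing in for a real LLM API call, and the pirate-word check is a toy assertion.

```python
# Minimal sketch of a repeated prompt test with a pass/fail check.
# call_model is a stub; a real engine would call an LLM API here.

PIRATE_WORDS = {"arr", "matey", "avast", "shiver me timbers"}

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned reply."""
    return "Here is a concise, professional answer to your question."

def is_pirate_free(response: str) -> bool:
    """Pass/fail check: the reply must not contain pirate phrases.
    (Crude substring matching, fine for a sketch.)"""
    lowered = response.lower()
    return not any(word in lowered for word in PIRATE_WORDS)

def run_prompt_test(prompt: str, check, runs: int = 5) -> float:
    """Run the same prompt several times and report the pass rate,
    since LLM output is non-deterministic."""
    passes = sum(check(call_model(prompt)) for _ in range(runs))
    return passes / runs

pass_rate = run_prompt_test("Summarize our refund policy.", is_pirate_free)
print(f"pass rate: {pass_rate:.0%}")  # prints "pass rate: 100%"
```

Running the check many times, rather than once, is what turns a demo into a test: with non-deterministic models, a single passing reply tells you very little.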


Next up, Jesse Alford gave us a quick motivational talk on why a good testing system is needed for choosing the right models and verifying the accuracy of responses to prompts: without accuracy, AI applications fail to get past the proof-of-concept stage.


We then heard about several components that could form part of such a system.

First up, we showcased Llamagator, a new project being built by the engineers at AIFoundry.org. Llamagator connects multiple LLMs into a single chat interface, making it possible to compare replies to a single prompt across different models or different versions of the same model. Our next virtual AI Hack Lab will focus on helping people use Llamagator to connect multiple models and model versions, and on exploring how to automate the prompt side.
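The "one prompt, many models" pattern at the heart of Llamagator can be sketched generically. This is not Llamagator's actual API; the model registry and model names below are hypothetical stand-ins, with each stub returning a canned reply.

```python
# Hedged sketch of fanning one prompt out to several models and
# comparing the replies. The registry entries are stand-ins for
# real model backends (e.g. different versions served by llama.cpp).

def model_v1(prompt: str) -> str:
    return "Paris is the capital of France."

def model_v2(prompt: str) -> str:
    return "The capital of France is Paris."

MODELS = {"example-7b-v1": model_v1, "example-7b-v2": model_v2}

def fan_out(prompt: str) -> dict:
    """Send one prompt to every registered model and collect replies."""
    return {name: fn(prompt) for name, fn in MODELS.items()}

def all_mention(replies: dict, keyword: str) -> bool:
    """Simple accuracy check: every model's reply must contain the keyword."""
    return all(keyword.lower() in r.lower() for r in replies.values())

replies = fan_out("What is the capital of France?")
for name, reply in replies.items():
    print(f"{name}: {reply}")
print("all correct:", all_mention(replies, "Paris"))
```

A side-by-side view like this is what makes the comparison useful: the same check applied across models and versions surfaces which one drifts on a given prompt.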


Next, Victor Miller gave an overview of LLM360, a recently released family of fully open LLMs. These models make their training data, code, and weights available under an Apache 2.0 license, unlike Llama 3, which provides only weights under a limited license. LLM360 can be considered a similar choice to OLMo, which we discussed in the prior demo. A question arose as to whether LLM360 runs well on Llama.cpp; this is an area for further investigation.


Finally, we heard from Tyler Slaton of Acorn Labs about GPTScript. This open-source function-calling prompting framework provides an option for implementing a significant portion of the prompt automation framework.


We then weighed several possible next steps and eventually settled on mob programming: setting up a manual way to test prompts in an open Google Colab notebook.


The day ended with us landing our first code in a new AIFoundry.org project repo called TDD-Prompt.

The AIFoundry.org Community would like to extend a huge thank you to Paul Zabelin and Jesse Alford for helping organize this event.

At our next virtual AI Hack Lab, we’ll be exploring the LLM side of this idea by getting our hands dirty with Llamagator and connecting different models and model versions running on Llama.cpp. If you’d like to join us, sign up here: