Testing OpenAI Applications - Elastic APJ Virtual User Group

Testing OpenAI Applications Elastic APJ Virtual User Group

Introductions! I’m Adrian from the Elastic Observability team, mostly work
on OpenTelemetry. Baby in GenAI programming (<9 months full-time), but spent a lot of time testing it recently! Co-led wazero, zero dependency webassembly runtime for go, a couple years ago. Tons of portability things in my open source history github.com/codefromthecrypt x.com/adrianfcole

Agenda • Introduction to OpenAI and ChatGPT • Using OpenAI
Playground to learn how to code • How to make a basic integration test • How to use recorded HTTP requests in unit tests • Introduction to Ollama for OpenAI API hosting • Walk through of Elastic’s OpenTelemetry Python tests • Teaser on Evals • Summing up and Q&A

Introduction to OpenAI Generative Models generate new data based on
known data • Causal language models are unidirectional: predict words based on previous ones • Large Language Models (LLMs) are typically trained for chat, code completion, etc. OpenAI is an organization and cloud platform, 49% owned by Microsoft • OpenAI develops the GPT (Generative Pre-trained Transformer) LLM • GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT

Introduction to ChatGPT The LLM can answer correctly if geography
was a part of the text it was trained on ChatGPT is natural language processing (NLP) application that uses the GPT LLM • Like the GPT model family ChatGPT is hosted by OpenAI and Azure

Context window The context window includes all messages considered when
generating a response. In other words, any chat message replays all prior questions and answers. • It can include chat history, instructions, documents, images (if multimodal), etc. • Limits are not in characters or words, but tokens roughly ¾ of a word. 41 tokens in the context window 33 tokens generated

OpenAI playground teaches you the API github.com/openai/openai-openapi platform.openai.com/playground/chat

Key things about programming OpenAI This part is required This
part is optional This part reads ENV variables

Making a basic chat library Taking the code from the
playground we can separate input from implementation • This hides parts we don’t know well • This gives us a way to test the code

First test goes against the real platform You need to
minimally set OPENAI_API_KEY and choose a model • Tests need access to the platform and will cost a little each time

Why can’t we test for an exact response? You cannot
100pct control the output of an LLM, the answer may vary • There are settings like seed and temperature to reduce creativity • Real runs may include keywords, but not in an exact order

Recording exact requests for exact tests OpenAI is an HTTP
service, which makes it unit testable • You can record real HTTP responses and play them back with VCR • This allows us to make exact assertions • pytest-vcr is an easy way to use VCR pytest-vcr.readthedocs.io

Exact responses can be chatty though..

There’s still a bit more work to do.. OpenAI’s client
is designed to require OPENAI_API_KEY!

Oh!!! Even more work! VCR recordings don’t scrub any secrets
by default. You must be careful about this data! • Do you have any auth keys? • Are you leaking org info or personal data?

Scrub conﬁg Good VCR conﬁg • Makes visible what is
logged • Considers requests and responses • Considers case sensitivity • Still needs users to pay attention!

What about CI? VCR tests are just normal unit tests,
so should run in CI • These only require real credentials on change, to produce new recordings Integration tests require credentials for OpenAI, so some may choose against it • Credentials can be misconﬁgured and leak the account • Repetitive revisions or model choices can cost a considerable amount Do we just skip integration tests? • Skipping in CI is better than deleting tests • There are alternatives to running tests against OpenAI the platform

Not all LLMs are only available as a service OpenAI
is an organization and cloud platform, 49% owned by Microsoft • OpenAI develops the GPT (Generative Pre-trained Transformer) LLM • GPT is hosted by OpenAI and Azure, accessible via API or apps like ChatGPT LLaMa is an LLM developed by Meta • You can download it for free from meta and run it with llama-stack • It is also hosted on platforms, and downloadable in llama.cpp’s GGUF format Thousands of LLMs exist • Many types of LLMs, diﬀering on source, size, training, modality (photo, sound, etc) • We’ll use the small and well documented Qwen2.5 LLM developed by Alibaba

Let’s use Qwen 2.5! Where do I get this? •
Dense, easy-to-use, decoder-only language models, available in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes, and base and instruct variants. • Pretrained on our latest large-scale dataset, encompassing up to 18T tokens. • Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. • More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. • Context length support up to 128K tokens and can generate up to 8K tokens. • Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Hello Ollama github.com/ollama/ollama Written in Go Docker like experience Llama.cpp
backend Simpliﬁed model ﬂow $ ollama serve $ ollama pull qwen2.5:0.5b $ ollama run qwen2.5:0.5b "What is the name of the ocean that contains Bouvet Island?" The ocean that contains Bouvet Island is the Atlantic Ocean.

OpenAI API is a de’facto standard Ollama includes a lot
of features in the OpenAI API • Change the base URL to localhost • Add a fake API key • Leave the rest alone! github.com/openai/openai-openapi

You may have to clarify your requests a bit

Integration tests look like they’ll always pass!

But sometimes, it doesn’t?! Ocean of Icebergs, huh…

Hallucination LLMs can hallucinate, giving irrelevant, nonsense or factually incorrect
answers • LLMs rely on statistical correlation and may invent things to ﬁll gaps • Mitigate with selecting relevant models, prompt engineering, etc. User: What is the name of the ocean that contains Bouvet Island? Assistant: Ocean of icebergs. $ ollama ls qwen2.5:14b 7cdf5a0187d5 9.0 GB 3 days ago qwen2.5:latest 845dbda0ea48 4.7 GB 3 days ago qwen2.5:0.5b a8b0c5157701 397 MB 5 days ago More parameters can be more expensive, but might give more relevant answers *

For starters, maybe just retry the test High parameters results
in a lot more resources • As a human, you’d just retry if something ﬂaked • Use pytest-retry to do this for you pypi.org/project/pytest-retry

One way to CI in GitHub Run unit and integration
tests on pull requests • Use separate jobs for unit and integration Run integration tests against ollama, not openai • Always use the latest version of ollama • Pull the model in a separate step, before tests execute Keep a CONTRIBUTING.md with instructions on how to do everything • Don’t assume people will remember or same will always be around • Keep documentation up to to date with practice

Real life example! Elastic Distribution of OpenTelemetry (EDOT) Python instrumentations
(openai logs, traces and metrics!) https://github.com/elastic/elastic-otel-python-instrumentations

What’s next? We discussed how to do unit and integration
tests at an API call level • This is good to ensure you are using what you think you are We didn’t discuss how to understand if the LLM giving sensible responses • Quality is subjective and domain speciﬁc, it can be evaluated by humans or LLMs. • The keyword is model Evaluations and exists both in MLOps and LLMOps Connect to some of my open source friends until I make a presentation on this ;) • Karthik Kalyanaraman from Langtrace • Mikyo King from Arize AI Phoenix https://www.linkedin.com/in/karthikkalyanaraman https://www.linkedin.com/in/mikeldking

Takeaways and Thanks! OpenAI requires your best and most creative
testing skills Unit Tests should record real HTTP requests in whatever way is best for your language If using python, use pytest-vcr Integration Tests should use OpenAI, but allow local model usage as well. Ollama is a very good option for local model hosting, and Qwen 2.5 is a great model Tests themselves should be strict in unit tests and flexible in integration tests LLMs responses are not entirely predictable, and can sometimes miss. Be aware of this. Evaluations come next, but not yet as practiced in integration test frameworks Due to subjective nature, thresholds are needed. Share practice until this is normal! github.com/codefromthecrypt x.com/adrianfcole www.linkedin.com/in/adrianfcole

Testing OpenAI Applications - Elastic APJ Virtu...

Testing OpenAI Applications - Elastic APJ Virtual User Group

Adrian Cole

More Decks by Adrian Cole

Other Decks in Technology

Featured

Transcript