Lessons learned building a GenAI powered app

Marc Cohen, Developer Advocate at Google, [email protected]
Mete Atamel, Developer Advocate at Google, @meteatamel, atamel.dev, speakerdeck.com/meteatamel

Agenda
01 Before GenAI
02 GenAI arrives
03 Architecture
04 After GenAI
05 Lessons Learned

Before GenAI

Every great invention started out as someone’s weekend project. This
is gonna be huge!

August 2016

Demo: Initial app with Open Trivia DB
https://opentdb.com/api_config.php
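To make the starting point concrete, here is a minimal sketch of how a client might consume Open Trivia DB. It assumes the `https://opentdb.com/api.php` endpoint with `amount` and `type` parameters and the default HTML-entity encoding of text fields. The helper names and the sample payload are hypothetical, made up for illustration, and are not taken from the demo app.

```python
import html
import json
import random
from urllib.parse import urlencode

# Assumed endpoint; the api_config.php page generates URLs of this shape.
API_URL = "https://opentdb.com/api.php"


def build_url(amount=5):
    """Build a request URL for multiple-choice questions."""
    return f"{API_URL}?{urlencode({'amount': amount, 'type': 'multiple'})}"


def parse_questions(payload, rng=random):
    """Turn an API payload into question dicts with shuffled choices."""
    if payload.get("response_code") != 0:
        return []  # a non-zero code means no usable results
    questions = []
    for item in payload.get("results", []):
        # Text fields arrive HTML-encoded by default, so decode them.
        correct = html.unescape(item["correct_answer"])
        choices = [html.unescape(a) for a in item["incorrect_answers"]]
        choices.append(correct)
        rng.shuffle(choices)
        questions.append({
            "question": html.unescape(item["question"]),
            "choices": choices,
            "correct": correct,
        })
    return questions


# Illustrative payload in the shape the API returns (not real data).
SAMPLE = json.loads("""
{"response_code": 0,
 "results": [{"type": "multiple",
              "question": "Who wrote &quot;Hamlet&quot;?",
              "correct_answer": "William Shakespeare",
              "incorrect_answers": ["Christopher Marlowe", "Ben Jonson", "John Milton"]}]}
""")
```

Fetching is left out on purpose: the parsing step is where the fixed format (one correct answer, three incorrect ones) becomes visible.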

Initial problems
• Limited list of topics
• Limited questions and answers
• Limited format: multiple choice with 4 answers
• English only
• No images
• Expanding quiz content is difficult

GenAI arrives

March 2023
Is it possible to have a more dynamic quiz app with infinite content using GenAI?

(pronounced like mosaic)

Demo: GenAI powered app

Architecture

Flutter for client

Cloud Run hosts the UI and API servers

Cloud Firestore as backend

Five Key Data Structures
1. admins
2. generators
3. quizzes
4. sessions
5. results

Four Key Personas
1. admin
2. creator
3. host
4. player

Data Access Model

Vertex AI on Google Cloud for LLMs

Quiz Generators
Name | Type | Format
OpenTrivia | static | multiple choice
PaLM | genAI | multiple choice (possible: free-form)
Gemini (pro, ultra) | genAI | multiple choice (possible: free-form)

Image Generator
Name | Type | Description
ImageGen (v1, v2) | genAI | Uses ImageGen model to generate images for quizzes

After GenAI

Revisit: Initial problems
• Limited ⇒ Unlimited list of topics
• Limited ⇒ Unlimited questions and answers
• Limited ⇒ Unlimited format
• English only ⇒ Any language
• No ⇒ Unlimited images
• Expanding quiz content is difficult ⇒ easy with GenAI

New problems with GenAI
• Learning curve with GenAI
• Inconsistent or no outputs from LLMs
• Slow LLM calls
• Hallucinations
• Hard to check the accuracy and quality of LLM outputs
• Fast changing landscape (models, APIs, libraries, etc.)

Lessons Learned

🎓 General
Surprisingly easy to do hard things with GenAI
• Quiz/image generation with a single API call
Hard to do things well and consistently
• Good results require prompt engineering
• You will get inconsistent outputs
• Hard to measure the output quality

🎓 General
Accept uncertainty of LLMs
• Same prompt, same model ⇒ different output
• Same prompt, same model gets updated ⇒ different output
• Same prompt, different model ⇒ different output

🎓 General
Free upgrades with new/updated models
• PaLM ⇒ Gemini-Pro: better quizzes
• Gemini-Pro ⇒ Gemini-Ultra: even better quizzes
• Imagen v1 ⇒ v2: better images
• Little or no code changes

🎓 General
Do you even need an LLM?
• In grading free-form ⇒ LLM vs. TheFuzz library
• Image of the app ⇒ ImageGen vs. good old photo editor
• Sometimes you don't need an expensive LLM call
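As a rough illustration of grading free-form answers without an LLM call, the sketch below scores string similarity. The slide names the TheFuzz library; this example swaps in the standard library's `difflib.SequenceMatcher` so it runs without extra dependencies. The `grade_free_form` helper and the 0.8 threshold are assumptions for illustration, not the app's actual logic.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def grade_free_form(answer, expected, threshold=0.8):
    """Accept the answer if it is close enough to the expected one."""
    return similarity(answer, expected) >= threshold
```

A call such as `grade_free_form("george washingtn", "George Washington")` tolerates the typo, which covers many cases where an LLM would otherwise be called only to forgive spelling.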

🎓 Prompting
Be specific and clear with prompts
More detailed prompts != better results
Manage prompts like code
• Version prompts for safe iteration
• Prompt + output parsers go hand-in-hand
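One way to read "manage prompts like code" is to keep each prompt version next to the parser that understands its output, so a change to one forces a look at the other. The sketch below is a minimal illustration under that assumption. The version keys, templates, and helper names are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable


def parse_lines(text):
    """Parser for v1 output: one 'question | answer' pair per line."""
    pairs = []
    for line in text.strip().splitlines():
        question, _, answer = line.partition("|")
        pairs.append({"question": question.strip(), "answer": answer.strip()})
    return pairs


def parse_json(text):
    """Parser for v2 output: a JSON array of question objects."""
    return json.loads(text)


@dataclass(frozen=True)
class PromptVersion:
    template: str
    parser: Callable[[str], list]


# Each version pairs a template with the parser written for its output format.
PROMPTS = {
    "quiz/v1": PromptVersion(
        "Write {n} questions about {topic}, one per line as: question | answer",
        parse_lines,
    ),
    "quiz/v2": PromptVersion(
        "Write {n} questions about {topic}. Reply with only a JSON array "
        'of objects with keys "question" and "answer".',
        parse_json,
    ),
}


def render(version, **params):
    """Fill in the template for a given prompt version."""
    return PROMPTS[version].template.format(**params)


def parse(version, raw_output):
    """Parse model output with the parser that matches the prompt version."""
    return PROMPTS[version].parser(raw_output)
```

Keeping old versions in the table makes it possible to roll back, or to run two versions side by side while iterating.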

🎓 Coding with LLMs
Code defensively
• LLM call can fail ⇒ Retry and keep the user informed
• LLM can give you malformed JSON ⇒ Can you still parse JSON somehow?
• LLM can return empty results ⇒ Can you live with no quizzes or no image?
• LLM can be too cautious ⇒ Do you need to change safety settings?
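The first two bullets can be sketched with plain Python. The example below assumes the LLM call is wrapped in a zero-argument function. `call_with_retry`, `parse_json_leniently`, and the backoff values are hypothetical names and defaults chosen for illustration.

```python
import json
import re
import time


def call_with_retry(call, attempts=3, base_delay=1.0, on_retry=None, sleep=time.sleep):
    """Call `call()` until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to show
            if on_retry:
                on_retry(attempt, error)  # e.g. update a progress message
            sleep(base_delay * 2 ** (attempt - 1))


def parse_json_leniently(text):
    """Try strict parsing first, then the outermost [...] or {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Models often wrap JSON in prose or code fences; pull out the JSON part.
    match = re.search(r"\[.*\]|\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```

A `None` result from the parser is the point where the third bullet applies: the caller has to decide whether an empty quiz is acceptable or whether to retry the generation.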

🎓 Coding with LLMs
Pin model versions
• gemini-1.0-pro refers to the latest version, which can change: gemini-1.0-pro@001, gemini-1.0-pro@002, …
• Use a specific version such as gemini-1.0-pro@001
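A small guard can keep an unpinned name from slipping into configuration. This sketch assumes the `@NNN` suffix convention shown on the slide. Other platforms or later releases may name versions differently, and `require_pinned` is a hypothetical helper.

```python
import re

# Matches the "@NNN" suffix used on the slide, e.g. gemini-1.0-pro@001.
PINNED = re.compile(r"@\d{3}$")


def is_pinned(model_name):
    """Return True if the model name ends with an explicit version suffix."""
    return bool(PINNED.search(model_name))


def require_pinned(model_name):
    """Fail fast at startup instead of silently tracking the latest model."""
    if not is_pinned(model_name):
        raise ValueError(f"model name is not pinned to a version: {model_name}")
    return model_name


QUIZ_MODEL = require_pinned("gemini-1.0-pro@001")
```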

🎓 Coding with LLMs
Consider using a higher level library like LangChain
• You can use Gemini from Google AI Studio and Vertex AI but each has different libraries
• In Vertex AI, libraries for PaLM and Gemini are different
• Other non-Google models have their own libraries
• LangChain can help to abstract all of this away

🎓 Coding with LLMs
Good old software engineering tricks
• Minimize LLM calls by batching prompts
• Use parallel calls (e.g., quiz and image generation run in parallel)
• Cache common responses
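The parallel-call and caching bullets can be sketched with the standard library. The two generator functions below are placeholders standing in for slow model calls, and `create_quiz` is a hypothetical name. Batching is not shown.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def generate_quiz(topic):
    """Placeholder for a slow LLM call (hypothetical)."""
    return [f"Question about {topic}"]


def generate_image(topic):
    """Placeholder for a slow image-model call (hypothetical)."""
    return f"image-for-{topic}.png"


@lru_cache(maxsize=128)
def create_quiz(topic):
    """Run both generators in parallel; repeat topics come from the cache."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        quiz_future = pool.submit(generate_quiz, topic)
        image_future = pool.submit(generate_image, topic)
        # Return immutable values so cached results cannot be mutated by callers.
        return tuple(quiz_future.result()), image_future.result()
```

Threads suit this case because the work is waiting on network calls. The trade-off of caching is that a repeated topic returns the same quiz, which may or may not be what a quiz app wants.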

🎓 Testing and Validation
Unit/functional tests are as important as ever
• Easy to check existence or format
• Is this a quiz with 5 questions and 4 answers?
• Is the image generated or not?
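A format check of this kind needs no LLM at all. The sketch below assumes a quiz is a list of dicts with `question`, `answers`, and `correct` keys. That shape and the `validate_quiz` name are assumptions for illustration.

```python
def validate_quiz(quiz, num_questions=5, num_answers=4):
    """Return a list of structural problems; an empty list means the quiz passes."""
    problems = []
    if len(quiz) != num_questions:
        problems.append(f"expected {num_questions} questions, got {len(quiz)}")
    for i, item in enumerate(quiz, start=1):
        answers = item.get("answers", [])
        if not item.get("question"):
            problems.append(f"question {i} is empty")
        if len(answers) != num_answers:
            problems.append(f"question {i}: expected {num_answers} answers, got {len(answers)}")
        if len(set(answers)) != len(answers):
            problems.append(f"question {i}: duplicate answers")
        if item.get("correct") not in answers:
            problems.append(f"question {i}: correct answer is not among the choices")
    return problems
```

Returning a list of problems rather than a boolean makes test failures readable and gives a retry loop something specific to log.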

🎓 Testing and Validation
Testing quality and accuracy is more difficult
• Is the quiz actually on the topic of history?
• Is the answer actually correct?
• Is the generated image appropriate for the quiz? (still an open question)
Need a way to measure LLM outputs
• Automate it, and use as a benchmark to work towards

🎓 Testing and Validation
Use an LLM to evaluate LLM outputs

🎓 Testing and Validation
How do you know if the validator works?
• Use OpenTrivia as a corpus of accurate quizzes
• See how the validator performs against OpenTrivia

Every multiple choice question with four answers can be decomposed into four assertions of the form:
Q: question A: answer
For example:
Who was the first US president?
A. Thomas Jefferson
B. Alexander Hamilton
C. George Washington
D. Bill Clinton
can be decomposed into these four assertions:
• Q: Who was the first US president? A: Thomas Jefferson is False
• Q: Who was the first US president? A: Alexander Hamilton is False
• Q: Who was the first US president? A: George Washington is True
• Q: Who was the first US president? A: Bill Clinton is False
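The decomposition above can be sketched in a few lines. The `decompose` function and its argument shape are hypothetical. Each assertion is paired with the truth value a validator is expected to return.

```python
def decompose(question, choices, correct):
    """Split one multiple choice question into (assertion, expected truth) pairs."""
    return [(f"Q: {question} A: {choice}", choice == correct) for choice in choices]


ASSERTIONS = decompose(
    "Who was the first US president?",
    ["Thomas Jefferson", "Alexander Hamilton", "George Washington", "Bill Clinton"],
    "George Washington",
)
```

Because the expected truth values come straight from the quiz's answer key, a corpus of known-good quizzes yields labeled test data at no extra cost.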

Evaluation
“In one (and only one) word, are the following assertions true or false?”
Q: Who was the first US president? A: Thomas Jefferson
Q: Who was the first US president? A: Alexander Hamilton
Q: Who was the first US president? A: George Washington
Q: Who was the first US president? A: Bill Clinton
LLM: False False True False
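The evaluation step can be sketched as three small functions: build the prompt, parse the one-word verdicts, and score them against the known truth values. The instruction text is quoted from the slide. The function names and the choice to score a malformed reply as zero are assumptions, and the model call itself is left out.

```python
INSTRUCTION = "In one (and only one) word, are the following assertions true or false?"


def build_prompt(assertions):
    """Put the instruction first, then one assertion per line."""
    return "\n".join([INSTRUCTION, *assertions])


def parse_verdicts(reply):
    """Extract True/False verdicts from a reply, ignoring any other words."""
    verdicts = []
    for word in reply.split():
        token = word.strip(".,;:").lower()
        if token == "true":
            verdicts.append(True)
        elif token == "false":
            verdicts.append(False)
    return verdicts


def accuracy(verdicts, expected):
    """Fraction of verdicts that match; a malformed reply scores 0.0."""
    if not expected or len(verdicts) != len(expected):
        return 0.0
    matches = sum(v == e for v, e in zip(verdicts, expected))
    return matches / len(expected)
```

Running this over a corpus with known answers gives a single accuracy number per model, which is the kind of benchmark that makes model comparisons possible.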

🎓 Testing and Validation
PaLM initially got around 80% accuracy
Gemini Ultra got 91% accuracy

🎓 Testing and Validation
Ultimately, you need grounding for more accuracy (e.g., grounding with Google Search)

Is it possible to have a more dynamic and richer quiz app with the help of GenAI?
7 years to 7 weeks

Thank you
Marc Cohen, Developer Advocate at Google, [email protected]
Mete Atamel, Developer Advocate at Google, @meteatamel, atamel.dev, speakerdeck.com/meteatamel
