
Putting The Genie in the Bottle - A Crash Course on running LLMs on Android

Think running LLMs requires massive data centers? Think again. A new generation of efficient models is making on-device AI a reality, even on Android.

This session cuts through the hype to explore the practicalities of "Edge Inference." Learn why you'd ditch cloud API calls in favor of on-device processing.

We'll compare Google's AI Edge SDK and the MediaPipe LLM Inference API, highlighting the trade-offs Android developers face today. Understand the current limitations, the surprising capabilities of modern small models, and why experimenting with on-device LLMs now could give your app a competitive edge tomorrow.

Iury Souza
September 05, 2025

Transcript

  1. The Model Iceberg: Gemini 2.5 Pro, GPT-5, Claude Opus 4.1, Grok 4, Deepseek V3, Mistral Small 3.1, Llama 3.1, Gemma 3, SmolLM, Phi-3
  2–8. Top Left Corner Models
     •There's another category of breakthroughs happening in parallel
     •Highly efficient and capable open-source models
     •These models are hitting benchmarks that were SOTA six months ago
     [Charts: Gemma 3 model performance vs. size; Mistral 3.1 model performance vs. latency; model size vs. eval; Gemma 3 27B ranking in the LLM Arena, circa March 2025; # of GPUs required vs. Elo score]
  9. Why Run Locally? Even if they're getting good, why should you still bother?
     •Performance & Latency
     •Privacy & Security
     •Cost & Accessibility
     •Offline functionality
  10–11. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  12. AI Edge SDK
     •Easiest approach for developers
     •The model is shared by different apps
     •Support for LoRA adapters
     •Model updates are managed by the platform
     •Built-in safety features
     •Available on selected high-end devices only
  13. AI Edge SDK
     •Requirements: a compatible device
       •Google: Pixel 9 & 10 series
       •Samsung: Galaxy S25 series
       •OnePlus: 13 series
       •Top models from: Honor, Realme, Vivo, Xiaomi, iQOO, POCO, OPPO, Motorola
  14. AI Edge SDK
     •Requirements
       • A compatible device
       • AICore APK
       • Private Compute Services APK
     [Screenshot: the Private Compute Services app]
  15–17. AI Edge SDK: Using it in your app

     private val model = GenerativeModel(
         generationConfig = generationConfig {
             context = applicationContext
             maxOutputTokens = 600
             temperature = 0.9f
             topK = 16
             topP = 0.1f
         }
     )
     val response = model.generateContentStream(prompt)
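     If generateContentStream here behaves like its counterpart in the Google AI client SDKs and returns a Kotlin Flow of partial responses (an assumption; check the AI Edge SDK version you target), collecting it could look like this minimal sketch:

     // Minimal sketch, assuming generateContentStream returns a Flow of
     // partial responses that each expose a nullable `text` field.
     suspend fun streamAnswer(prompt: String, onChunk: (String) -> Unit) {
         model.generateContentStream(prompt).collect { chunk ->
             chunk.text?.let(onChunk) // append each partial chunk as it arrives
         }
     }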
  18. AI Edge SDK: Using it in your app. With it you can:
     •Check if the device is supported.
     •Tune safety settings.
     •Run inference at high performance.
     •Optionally, provide a LoRA adapter.
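     Because availability is limited to selected devices, a natural pattern is to probe support once and fall back to a server-side model elsewhere. In this sketch both helpers are hypothetical placeholders, not AI Edge SDK APIs: wire isGeminiNanoAvailable() to the SDK's actual support check and cloudGenerate() to your own backend.

     // Hypothetical graceful-degradation wrapper (placeholder names).
     suspend fun generateAnywhere(prompt: String): String =
         if (isGeminiNanoAvailable()) {
             model.generateContent(prompt).text.orEmpty() // on-device path
         } else {
             cloudGenerate(prompt)                        // network fallback
         }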
  19. AI Edge SDK: How does it work? AICore
     •OS-level integration
     •Model deployment management
     •Hardware abstraction layer
     [Diagram: AI Edge SDK architecture]
  20–22. AI Edge SDK: How does it work? Private Compute Services/Core
     •PCS + PCC
       • Privacy through data isolation
       • PCC: process data safely
       • PCS: model updates
     [Diagram: AI Edge SDK architecture. Examples: Now Playing and Smart Reply both use PCC]
  23. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  24. ML Kit GenAI APIs
     •High-level APIs for:
       • Summarization
       • Proofreading
       • Rewriting
       • Image Description
  25–28. ML Kit GenAI APIs: Adding it to your app

     val articleToSummarize = "We are excited to announce a set of on-device ... "
     val options = SummarizerOptions.builder(context)
         .setInputType(InputType.ARTICLE)
         .setOutputType(OutputType.ONE_BULLET)
         .setLanguage(Language.ENGLISH)
         .build()
     val summarizer = Summarization.getClient(options)
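     Running the summarizer then follows the ML Kit GenAI request/streaming pattern. The names below (SummarizationRequest, runInference) are taken from that documented pattern but should be verified against the current ML Kit docs; resultTextView is a hypothetical UI target.

     // Sketch: wrap the input in a request and stream the summary out.
     val request = SummarizationRequest.builder(articleToSummarize).build()
     summarizer.runInference(request) { newText ->
         resultTextView.append(newText) // called repeatedly with incremental output
     }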
  29. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  30. The DIY alternative: Self-managed models
     •You're in control
       • Model management
       • Setup, initialization, and resource usage
     •You can run many different models
     •Bring your own fine-tuned model
     •Wider device compatibility
  31–39. What are these models? What's under the hood

     ❯ unzip -l gemma3-1b-it-int4.task
     Archive: gemma3-1b-it-int4.task
        Length       Date    Time   Name
     ---------  ---------- -----    ----
     549971728  03-07-2025 07:36    TF_LITE_PREFILL_DECODE
       4689074  03-07-2025 07:36    TOKENIZER_MODEL
            90  03-07-2025 07:36    METADATA
     ---------                      -------
     554660892                      3 files

     Reading the file name, gemma3-1b-it-int4.task breaks down as: "gemma3" is the model family; "1b" means ~1 billion parameters; "it" means instruction-tuned (fine-tuned to follow instructions/commands); "int4" means the weights are quantized to 4-bit integers; ".task" is the MediaPipe bundle format.
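     A quick sanity check on those numbers: roughly 10^9 parameters at 4 bits (half a byte) each works out to about 500 MB, which lines up with the 549,971,728-byte TF_LITE_PREFILL_DECODE entry; the small surplus is plausibly graph structure plus a few tensors kept at higher precision.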
  40–41. What's under the hood: TOKENIZER_MODEL
     •Converts text to and from token IDs
     •Contains the vocabulary, e.g. { "explain": 123, "this": 456, "model": 789 }
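     As a toy illustration of that round trip (deliberately simplified: real tokenizers such as SentencePiece split text into subwords, not whole words):

     // Toy whole-word tokenizer, for intuition only.
     val vocab = mapOf("explain" to 123, "this" to 456, "model" to 789)
     val inverse = vocab.entries.associate { (word, id) -> id to word }

     fun encode(text: String): List<Int> = text.split(" ").mapNotNull { vocab[it] }
     fun decode(ids: List<Int>): String = ids.mapNotNull { inverse[it] }.joinToString(" ")

     // encode("explain this model") == [123, 456, 789]
     // decode(listOf(123, 456, 789)) == "explain this model"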
  42–43. What's under the hood: TF_LITE_PREFILL_DECODE
     •Model weights & architecture
     •TensorFlow Lite
     •Prefill & decode operations
  44. What's under the hood: what are they doing when running inference? Making next-token predictions.
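     Conceptually the whole job is one loop: the prompt is processed in a single pass (prefill), then the model is invoked repeatedly, each call predicting one more token (decode). A schematic sketch, with a hypothetical Model interface rather than any real API:

     // Schematic prefill/decode loop; Model is a made-up interface.
     interface Model {
         fun prefill(promptIds: List<Int>) // process the whole prompt once
         fun decodeNext(): Int             // predict the next token ID
     }

     fun generate(model: Model, promptIds: List<Int>, maxTokens: Int, eosId: Int): List<Int> {
         model.prefill(promptIds)
         val output = mutableListOf<Int>()
         repeat(maxTokens) {
             val next = model.decodeNext()
             if (next == eosId) return output // stop at end-of-sequence
             output += next
         }
         return output
     }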
  45–47. How to use this in my app? MediaPipe LLM Inference API
     •Thin Java layer
     •C++ implementation uses the LiteRT APIs
  48. MediaPipe LLM Inference API: Adding it to your app

     implementation("com.google.mediapipe:tasks-genai:0.10.24")
  49. MediaPipe LLM Inference API: Adding it to your app

     # Push the model to the device
     MODEL_FILE="gemma3-1b-it-int4.task"
     TARGET_DIR="/storage/emulated/0/Android/data/com.myapp/files/"
     adb push "$MODEL_FILE" "$TARGET_DIR"
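     The target directory above is what context.getExternalFilesDir(null) resolves to for com.myapp, so before creating the engine it's worth checking that the push actually landed (plain Kotlin, no extra APIs):

     val modelFile = File(context.getExternalFilesDir(null), "gemma3-1b-it-int4.task")
     check(modelFile.exists()) { "Model not found on device; run the adb push above first." }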
  50–56. MediaPipe LLM Inference API: Adding it to your app

     fun mediaPipeBasicExample(
         context: Context,
         modelName: String,
         prompt: String
     ) {
         val modelPath = File(
             context.getExternalFilesDir(null),
             modelName
         )

         // Create LLM engine
         val llmInference = LlmInference.createFromOptions(
             context,
             LlmInference.LlmInferenceOptions.builder()
                 .setModelPath(modelPath.absolutePath) // setModelPath takes a String
                 .setMaxTokens(1024)
                 .build()
         )

         // Create session
         val session = LlmInferenceSession.createFromOptions(
             llmInference,
             LlmInferenceSession.LlmInferenceSessionOptions.builder()
                 .setTopK(50)
                 .setTemperature(0.8f)
                 .build()
         )

         // Send prompt
         session.addQueryChunk(prompt)
         session.generateResponseAsync { partialResult, done ->
             // Handle output
             Log.d("LlmInference", partialResult)
             if (done) {
                 // Cleanup
                 session.close()
                 llmInference.close()
             }
         }
     }
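     Callback-style APIs like generateResponseAsync compose awkwardly with coroutine-based UIs. One option is to adapt it to a Flow with callbackFlow; the snippet below uses only the MediaPipe calls shown above, but the wrapper itself is a sketch of one possible app-side structure, not a library feature.

     import kotlinx.coroutines.channels.awaitClose
     import kotlinx.coroutines.flow.Flow
     import kotlinx.coroutines.flow.callbackFlow

     // Adapt the callback API to a Kotlin Flow of partial results.
     fun LlmInferenceSession.responses(prompt: String): Flow<String> = callbackFlow {
         addQueryChunk(prompt)
         generateResponseAsync { partialResult, done ->
             trySend(partialResult) // emit each partial chunk
             if (done) close()      // complete the flow when generation ends
         }
         awaitClose { /* cancel or clean up here if needed */ }
     }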
  57. Bottom line: What to use? Gemini Nano
     •Via AI Edge & ML Kit GenAI
     •Limited to one model
     •Customization via LoRA adapters
     •Guard rails
     •Limited availability
     •OEM adoption is key for its future
  58. Caveats: size still matters
     •Don't expect SOTA performance
     •Focus on a narrow use case
     •Use LoRA adapters or fine-tuned models
     •Handling resource usage is challenging
  59. It's a movie, not a picture: "Keep an eye on the trend lines"
     3 ideas:
     •Race to the bottom
     •History
     •Specialization + ubiquity
  60. Specialization & ubiquity: Mini models everywhere
     "For many agentic workloads, Small Language Models (SLMs) are a superior default to Large Language Models (LLMs) due to cost, latency, controllability, and fine-tuning ease."
     "10–30× cheaper, lower energy, faster responses, easier local deployment"
  61. Specialization & ubiquity: Mini models everywhere
     Cost reduction enables different business models:
     •Free apps
     •Premium app tiers
     •Single-payment apps
     •No variable costs per user interaction with LLMs
  62. @iurysza / iurysouza.dev
     Putting The Genie in the Bottle: A Crash Course on running LLMs on Android
     • Google AI Edge SDK Documentation
     • MediaPipe LLM Inference API
     • AI on Android Spotlight Week
     • Paper: Small Language Models are the Future of Agentic AI
     • Deep Dive into LLMs like ChatGPT
     • Google AI Edge