
Putting The Genie in the Bottle - A Crash Course on running LLMs on Android

Think running LLMs requires massive data centers? Think again. A new generation of efficient models is making on-device AI a reality, even on Android.

This session cuts through the hype to explore the practicalities of "Edge Inference." Learn why you'd ditch cloud API calls in favor of on-device processing.

We'll compare Google's AI Edge SDK and the MediaPipe LLM Inference API, highlighting the trade-offs Android developers face today. Understand the current limitations, the surprising capabilities of modern small models, and why experimenting with on-device LLMs now could give your app a competitive edge tomorrow.

Iury Souza
September 05, 2025

Transcript

  1. The Model Iceberg: Gemini 2.5 Pro, GPT-5, Claude Opus 4.1, Grok 4, Deepseek V3, Mistral Small 3.1, Llama 3.1, Gemma 3, SmolLM, Phi-3
  2–8. Top Left Corner Models
     •There's another category of breakthroughs happening in parallel
     •Highly efficient and capable open-source models
     •These models are hitting benchmarks that were SOTA six months ago
     [Charts: Gemma 3 model performance vs. size; Mistral 3.1 model performance vs. latency; model size vs. eval; Gemma 3 27B ranking in the LLM Arena, circa March 2025; # of GPUs required vs. Elo score]
  9. Why Run Locally? Even if they're getting good, why should you still bother?
     •Performance & Latency
     •Privacy & Security
     •Cost & Accessibility
     •Offline functionality
  10–11. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  12. AI Edge SDK
     •Easiest approach for developers
     •The model is shared by different apps
     •Support for LoRA adapters
     •Model updates are managed by the platform
     •Built-in safety features
     •Available on selected high-end devices only
  13. AI Edge SDK
     •Requirements: a compatible device
       •Google: Pixel 9 & 10 series
       •Samsung: Galaxy S25 series
       •OnePlus: 13 series
       •Top models from: Honor, Realme, Vivo, Xiaomi, iQOO, POCO, OPPO, Motorola
  14. AI Edge SDK
     •Requirements
       • A compatible device
       • AICore APK
       • Private Compute Services APK
     [Screenshot: the Private Compute Services app]
  15–17. AI Edge SDK: Using it in your app

     private val model = GenerativeModel(
         generationConfig = generationConfig {
             context = applicationContext
             maxOutputTokens = 600
             temperature = 0.9f
             topK = 16
             topP = 0.1f
         }
     )
     val response = model.generateContentStream(prompt)
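     If generateContentStream here behaves like its counterpart in the Google AI client SDKs and returns a Kotlin Flow of partial responses (an assumption; check the AI Edge SDK version you target), collecting it could look like this minimal sketch:

     // Minimal sketch, assuming generateContentStream returns a Flow of
     // partial responses that each expose a nullable `text` field.
     suspend fun streamAnswer(prompt: String, onChunk: (String) -> Unit) {
         model.generateContentStream(prompt).collect { chunk ->
             chunk.text?.let(onChunk) // append each partial chunk as it arrives
         }
     }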
  18. AI Edge SDK: Using it in your app. With it you can:
     •Check if the device is supported.
     •Tune safety settings.
     •Run inference at high performance.
     •Optionally, provide a LoRA adapter.
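     Because availability is limited to selected devices, a natural pattern is to probe support once and fall back to a server-side model elsewhere. In this sketch both helpers are hypothetical placeholders, not AI Edge SDK APIs: wire isGeminiNanoAvailable() to the SDK's actual support check and cloudGenerate() to your own backend.

     // Hypothetical graceful-degradation wrapper (placeholder names).
     suspend fun generateAnywhere(prompt: String): String =
         if (isGeminiNanoAvailable()) {
             model.generateContent(prompt).text.orEmpty() // on-device path
         } else {
             cloudGenerate(prompt)                        // network fallback
         }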
  19. AI Edge SDK: How does it work? AICore
     •OS-level integration
     •Model deployment management
     •Hardware abstraction layer
     [Diagram: AI Edge SDK architecture]
  20–22. AI Edge SDK: How does it work? Private Compute Services/Core
     •PCS + PCC
       • Privacy through data isolation
       • PCC: process data safely
       • PCS: model updates
     [Diagram: AI Edge SDK architecture. Examples: Now Playing and Smart Reply both use PCC]
  23. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  24. ML Kit GenAI APIs
     •High-level APIs for:
       • Summarization
       • Proofreading
       • Rewriting
       • Image Description
  25–28. ML Kit GenAI APIs: Adding it to your app

     val articleToSummarize = "We are excited to announce a set of on-device ... "
     val options = SummarizerOptions.builder(context)
         .setInputType(InputType.ARTICLE)
         .setOutputType(OutputType.ONE_BULLET)
         .setLanguage(Language.ENGLISH)
         .build()
     val summarizer = Summarization.getClient(options)
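     Running the summarizer then follows the ML Kit GenAI request/streaming pattern. The names below (SummarizationRequest, runInference) are taken from that documented pattern but should be verified against the current ML Kit docs; resultTextView is a hypothetical UI target.

     // Sketch: wrap the input in a request and stream the summary out.
     val request = SummarizationRequest.builder(articleToSummarize).build()
     summarizer.runInference(request) { newText ->
         resultTextView.append(newText) // called repeatedly with incremental output
     }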
  29. A system-wide LLM
     •Built directly into the platform
     •Gemini Nano
       • via AI Edge SDK (Oct-24)
       • via ML Kit GenAI APIs (May-25)
  30. The DIY alternative: Self-managed models
     •You're in control
       • Model management
       • Setup, initialization, and resource usage
     •You can run many different models
     •Bring your own fine-tuned model
     •Wider device compatibility
  31–39. What are these models? What's under the hood

     ❯ unzip -l gemma3-1b-it-int4.task
     Archive: gemma3-1b-it-int4.task
        Length       Date    Time   Name
     ---------  ---------- -----    ----
     549971728  03-07-2025 07:36    TF_LITE_PREFILL_DECODE
       4689074  03-07-2025 07:36    TOKENIZER_MODEL
            90  03-07-2025 07:36    METADATA
     ---------                      -------
     554660892                      3 files

     Reading the file name, gemma3-1b-it-int4.task breaks down as: "gemma3" is the model family; "1b" means ~1 billion parameters; "it" means instruction-tuned (fine-tuned to follow instructions/commands); "int4" means the weights are quantized to 4-bit integers; ".task" is the MediaPipe bundle format.
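     A quick sanity check on those numbers: roughly 10^9 parameters at 4 bits (half a byte) each works out to about 500 MB, which lines up with the 549,971,728-byte TF_LITE_PREFILL_DECODE entry; the small surplus is plausibly graph structure plus a few tensors kept at higher precision.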
  40–41. What's under the hood: TOKENIZER_MODEL
     •Converts text to and from token IDs
     •Contains the vocabulary, e.g. { "explain": 123, "this": 456, "model": 789 }
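     As a toy illustration of that round trip (deliberately simplified: real tokenizers such as SentencePiece split text into subwords, not whole words):

     // Toy whole-word tokenizer, for intuition only.
     val vocab = mapOf("explain" to 123, "this" to 456, "model" to 789)
     val inverse = vocab.entries.associate { (word, id) -> id to word }

     fun encode(text: String): List<Int> = text.split(" ").mapNotNull { vocab[it] }
     fun decode(ids: List<Int>): String = ids.mapNotNull { inverse[it] }.joinToString(" ")

     // encode("explain this model") == [123, 456, 789]
     // decode(listOf(123, 456, 789)) == "explain this model"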
  42–43. What's under the hood: TF_LITE_PREFILL_DECODE
     •Model weights & architecture
     •TensorFlow Lite
     •Prefill & decode operations
  44. What's under the hood: what are they doing when running inference? Making next-token predictions.
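     Conceptually the whole job is one loop: the prompt is processed in a single pass (prefill), then the model is invoked repeatedly, each call predicting one more token (decode). A schematic sketch, with a hypothetical Model interface rather than any real API:

     // Schematic prefill/decode loop; Model is a made-up interface.
     interface Model {
         fun prefill(promptIds: List<Int>) // process the whole prompt once
         fun decodeNext(): Int             // predict the next token ID
     }

     fun generate(model: Model, promptIds: List<Int>, maxTokens: Int, eosId: Int): List<Int> {
         model.prefill(promptIds)
         val output = mutableListOf<Int>()
         repeat(maxTokens) {
             val next = model.decodeNext()
             if (next == eosId) return output // stop at end-of-sequence
             output += next
         }
         return output
     }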
  45–47. How to use this in my app? MediaPipe LLM Inference API
     •Thin Java layer
     •C++ implementation uses the LiteRT APIs
  48. MediaPipe LLM Inference API: Adding it to your app

     implementation("com.google.mediapipe:tasks-genai:0.10.24")
  49. MediaPipe LLM Inference API: Adding it to your app

     # Push the model to the device
     MODEL_FILE="gemma3-1b-it-int4.task"
     TARGET_DIR="/storage/emulated/0/Android/data/com.myapp/files/"
     adb push "$MODEL_FILE" "$TARGET_DIR"
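     The target directory above is what context.getExternalFilesDir(null) resolves to for com.myapp, so before creating the engine it's worth checking that the push actually landed (plain Kotlin, no extra APIs):

     val modelFile = File(context.getExternalFilesDir(null), "gemma3-1b-it-int4.task")
     check(modelFile.exists()) { "Model not found on device; run the adb push above first." }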
  50–56. MediaPipe LLM Inference API: Adding it to your app

     fun mediaPipeBasicExample(
         context: Context,
         modelName: String,
         prompt: String
     ) {
         val modelPath = File(
             context.getExternalFilesDir(null),
             modelName
         )

         // Create LLM engine
         val llmInference = LlmInference.createFromOptions(
             context,
             LlmInference.LlmInferenceOptions.builder()
                 .setModelPath(modelPath.absolutePath) // setModelPath takes a String
                 .setMaxTokens(1024)
                 .build()
         )

         // Create session
         val session = LlmInferenceSession.createFromOptions(
             llmInference,
             LlmInferenceSession.LlmInferenceSessionOptions.builder()
                 .setTopK(50)
                 .setTemperature(0.8f)
                 .build()
         )

         // Send prompt
         session.addQueryChunk(prompt)
         session.generateResponseAsync { partialResult, done ->
             // Handle output
             Log.d("LlmInference", partialResult)
             if (done) {
                 // Cleanup
                 session.close()
                 llmInference.close()
             }
         }
     }
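     Callback-style APIs like generateResponseAsync compose awkwardly with coroutine-based UIs. One option is to adapt it to a Flow with callbackFlow; the snippet below uses only the MediaPipe calls shown above, but the wrapper itself is a sketch of one possible app-side structure, not a library feature.

     import kotlinx.coroutines.channels.awaitClose
     import kotlinx.coroutines.flow.Flow
     import kotlinx.coroutines.flow.callbackFlow

     // Adapt the callback API to a Kotlin Flow of partial results.
     fun LlmInferenceSession.responses(prompt: String): Flow<String> = callbackFlow {
         addQueryChunk(prompt)
         generateResponseAsync { partialResult, done ->
             trySend(partialResult) // emit each partial chunk
             if (done) close()      // complete the flow when generation ends
         }
         awaitClose { /* cancel or clean up here if needed */ }
     }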
  57. Bottom line: What to use? Gemini Nano
     •Via AI Edge & ML Kit GenAI
     •Limited to one model
     •Customization via LoRA adapters
     •Guard rails
     •Limited availability
     •OEM adoption is key for its future
  58. Caveats: size still matters
     •Don't expect SOTA performance
     •Focus on a narrow use case
     •Use LoRA adapters or fine-tuned models
     •Handling resource usage is challenging
  59. It's a movie, not a picture: "Keep an eye on the trend lines"
     3 ideas:
     •Race to the bottom
     •History
     •Specialization + ubiquity
  60. Specialization & ubiquity: Mini models everywhere
     "For many agentic workloads, Small Language Models (SLMs) are a superior default to Large Language Models (LLMs) due to cost, latency, controllability, and fine-tuning ease."
     "10–30× cheaper, lower energy, faster responses, easier local deployment"
  61. Specialization & ubiquity: Mini models everywhere
     Cost reduction enables different business models:
     •Free apps
     •Premium app tiers
     •Single-payment apps
     •No variable costs per user interaction with LLMs
  62. @iurysza / iurysouza.dev
     Putting The Genie in the Bottle: A Crash Course on running LLMs on Android
     • Google AI Edge SDK Documentation
     • MediaPipe LLM Inference API
     • AI on Android Spotlight Week
     • Paper: Small Language Models are the Future of Agentic AI
     • Deep Dive into LLMs like ChatGPT
     • Google AI Edge