but I'm clean now.” Model: The joke is a pun on the word "clean". It has two meanings: Model: User: The joke is funny because it is unexpected. The listener expects the speaker to be talking about the first meaning of "clean", but they are actually talking about the second meaning. This creates a surprise and a laugh. To be free of dirt or other unwanted substances. To be free of addiction.
fair project that is both fun and educational. You can learn about the science of volcanoes, earthquakes, and plate tectonics. 2. **Make a solar oven.** This is a great project to learn about renewable energy and how to harness the power of the sun. You can use your solar oven to cook food, dry fruit, or even melt chocolate. 3. **Create a 3D printer.** This is a more challenging project, but it is also very rewarding. You can learn about 3D printing technology and design your own custom objects. 4. **Build a robot.** This is a great project for those who are interested in engineering and robotics. You can learn about basic electronics, mechanics, and programming. 5. **Design a water filtration system.** This is a project that can help you learn about the importance of clean water and how to filter it. You can use your water filtration system to clean water from a stream or river. Give me five neat ideas for science fair projects. User:
or the ability to perform tasks that were not present in smaller models. LLMs contextual understanding of human language changes how we interact with data and intelligent systems. LLMs can find patterns and connections in massive, disparate data corpora. Why are large language models different?
Researchers 7,000 Publications Built & Tested for Safety Privacy in design Upholds high scientific standards Accountable to People Socially Beneficial Avoid creating unfair bias 2015 Google DeepMind AlphaGo defeats Go champion 2016 Google’s DeepMind helps detect eye disease 2017 Google invents Transformer kickstarting LLM revolution 2018 Google’s groundbreaking large language model, BERT 2019 Text-to-Text Transfer Transformer LLM 10B P Model Open Sourced 2020 Google LaMDA Model Trained to converse 2022 AlphaFold predicts structures of all known proteins 2023 A conversational AI Service powered by PaLM2 2024 Family of multimodal LLMs & products
open development Distbelief: First distributed system for training neural nets First DL model in production (Deep Neural Nets for acoustic modeling) “Cats paper”: First network trained in a distributed way Word2Vec: Proof of viability of word embeddings to understand content Seq2Seq: Foundational framework leading to Transformers and BERT GoogleNet “Going Deeper with Convolutions” Brain’s first live DL experiments in Ads Distillation: compression model reducing size of complex ML models RankBrain first launches TensorFlow released TPUs announced JAX AutoML: Success in learning to learn Transformers: Rethinking Seq-to-Seq BERT in Search AlphaFold LaMDA 2, PaLM, AI Test Kitchen Imagen, Parti, Imagen-Video, Phenaki Bard , Gemini, Search Labs, and more 2012 2013 2014 2015 2016 2017 2018 2019 2022 2024
model development Already 500+ variants Fostering trust and responsible AI Safety by design + tools Enabling continuous learning Events, community contributions Shaping the AI future Partner with NVIDIA, Hugging Face, and you!
available across Kaggle, Vertex AI and Hugging Face - choose the model hub that makes the most sense for you. Just request access on the model hub of your choice, agree to the Gemma Terms of Use and you’ll be ready to get started!
KerasNLP with Keras 3 into your environment, then import the keras_nlp module. !pip install --upgrade keras-nlp !pip install --upgrade keras import keras_nlp # https://keras.io/api/keras_nlp/models/gemma/gemma_causal_lm/ g_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en") Then, load Gemma from a preset configuration! Presets are available for the 2B and 7B parameter versions of Gemma.
the model. For instance, if we wanted to ask Gemma what comes after “it was a dark and stormy night”... Gemma will give us: txt = g_lm.generate("It was a dark and stormy night.", max_length=64) print(txt) It was a dark and stormy night. The rain was pouring down, and the wind was howling. But that didn’t stop a group of friends from going out for a night of fun. They were all dressed up in their best clothes, and they were ready to have a good time.
sizes (larger = more powerful)! Option for instruction-tuned models for specific tasks. Your use cases and target hardware will help shape your choice of Gemma model to use. For more information, see https://ai.google.dev/gemma
Easily launch into working with Gemma in notebooks. • Explore how the community uses Gemma and connect with fellow data enthusiasts. • Compete for cash prizes, dive into past competitions for cutting-edge techniques. • Be part of shaping the future of AI alongside Google and the vibrant Kaggle community. https://www.kaggle.com/models/google/gemma Kaggle: Your Gateway to AI Innovation
for the model and set the LoRA rank to 4. gemma_lm.backbone.enable_lora(rank=4) # Use AdamW (a common optimizer for transformer models). optimizer = keras.optimizers.AdamW( learning_rate=5e-5, weight_decay=0.01, ) # Exclude layernorm and bias terms from decay. optimizer.exclude_from_weight_decay(var_names=["bias", "scale"]) gemma_lm.compile( loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer=optimizer, weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()], ) gemma_lm.fit(data, epochs=1, batch_size=1)
you trained gemma_lm.backbone.enable_lora(rank=4) # Load pre-trained LoRA weights gemma_lm.backbone.load_lora_weights("my_model.lora.h5") gemma_lm.compile(sampler=keras_nlp.samplers.TopKSampler(k=3, temperature=0.7)) Save / Load pre-trained LoRA # Save LoRA weights after gemma_lm.fit(), filename must end in ".lora.h5" gemma_lm.backbone.save_lora_weights("my_model.lora.h5")
(GPUs or TPUs) for faster results. Why? • Large Models: Complex models with many parameters require significant time to train on a single device. • Massive Datasets: Distributed training accelerates the process of working through extremely large datasets. • Time Constraints: Distributing training can drastically reduce the time needed to reach optimal model performance.
'token_embedding/embeddings' will be sharded on 8 TPUs layout_map["token_embedding/embeddings"] = (model_dim, None) # Regex to match against the query, key and value matrices in the decoder # attention layers layout_map["decoder_block.*attention.*(query|key|value).*kernel"] = ( model_dim, None, None) layout_map["decoder_block.*attention_output.*kernel"] = ( model_dim, None, None) layout_map["decoder_block.*ffw_gating.*kernel"] = (None, model_dim) layout_map["decoder_block.*ffw_linear.*kernel"] = (model_dim, None) Specifying how the weights and tensors should be distributed Distributed Tuning (cont’ed)
embedded systems Host (Conversion) TensorFlow Graph TensorFlow APIs Keras TensorFlow Lite Converter Optimization & Conversion Passes TensorFlow Lite Model (flatbuffer) Device (Inference) TensorFlow Lite Model (flatbuffer) TensorFlow Lite C/C++ APIs Java TensorFlow Lite Runtime CPU Kernels Python Swift Obj-C XNNPACK TPU CoreML HW Accelerator Delegates DSP GPU
rewrite • Summarization • Document / Corpus Q&A • Text classification (toxicity, sentiment, etc) What can you do with Gemma on-device? Without cloud bills Without privacy concerns Without connectivity
engine for Google’s Gemma series of foundation models. It prioritizes simplicity and portability, and being built with C++, can be easily integrated into other native projects.
of a black box, researchers can easily experiment with direct code modifications • C++ developers with LLM needs Developers building or maintaining existing native apps that want to add LLM capabilities • Anyone who thinks it’s cool! (Maybe that’s you.) Who is this project for?
This can take a few minutes, should produce a file containing model weights such as “2b-it-sfp.sbs” and a tokenizer file “tokenizer.model”. You may want to move these files to a convenient directory (e.g. gemma.cpp/build/ directory)
cmake $ cd gemma.cpp/build && cmake .. Then run make to build the gemma executable $ make gemma If this is successful, you should now have a gemma executable in the gemma.cpp/build/ directory.
| 7b-it | 7b-pt --compressed_weights compressed weights file 2b-it-cpp-sfp.sbs --tokenizer tokenizer file tokenizer.spm How to use gemma.cpp? Run gemma has the following required arguments:
AI developer: • Extensive quick start library with code samples • Free developer labs and training resources at https://cloudskillsboost.google/ • Robust integrations with popular third party developer tools • Packages and extensions to natively support Google Cloud foundation models Vertex AI is built for developers Interfaces for all developers Firebase Colab Vertex AI Flutter
You have a Google Cloud Project (GCP) with Vertex AI enabled. ◦ You have fine-tuned Gemma Model in local • Steps 1. Export Gemma Model: 2. Prepare Model for Vertex AI: 3. Upload Model to Google Cloud Storage (GCS): 4. Create a Vertex AI Model Resource: 5. Import Model to Vertex AI: 6. Deploy as an Endpoint (Optional): 7. Test and Integrate: https://cloud.google.com/vertex-ai/docs/model-registry/import-model
frameworks and ecosystem tools that are container portable Orchestrate AI models at massive scale across specialized compute Flexibility Performance Optimize valuable compute resources while reducing operational complexity Efficiency
and serve AI models, with portability, scalability, and easy administration GKE is a robust, production ready AI and ML platform with all the benefits of managed Kubernetes Efficient resource management, consistent ops environments, and autoscaling Easy orchestration of Google Cloud AI accelerators, including GPUs and TPUs, for faster training and inference
pipelines • Single line of code for multiple sources and destinations • Pre and post processing steps for real world use cases • Simple to complex ensambles with multiple models / frameworks model_handler = TFModelHandlerNumpy(model_path, inference_fn=gemma_inference_function) with beam.Pipeline() as p: _ = (p | beam.io.ReadFromPubSub(topic=topic) # Read from Source | beam.Map(lambda x:x.decode("utf-8")) # Decode the source data | RunInference(model_handler, inference_args={'max_length': 32}) # Send prompt to Gemma. | beam.ParDo(FormatOutput()) # Format the output. | beam.Map(print) # Print the formatted output. )
with vLLM ◦ on GKE with Hugging Face TGI ◦ on GKE with Triton and TensorRT-LLM • Serve Gemma using TPUs ◦ on GKE with Saxml • Gemma deployment to GKE using TGI on GPU • Use Gemma with Dataflow
for safety Be accountable to people Incorporate privacy design principles Uphold high standards of scientific excellence Be made available for uses that accord with these principles Avoid creating or reinforcing unfair bias Be socially beneficial
sensitive data Filtered pre-training data Safety by design Supervised Fine Tuning & Reinforcement Learning with Human Feedback for safety Safety tuning
compared Gemma and Mistral over 400 prompts testing basic for safety protocols. Human SxS Gemma results on 9 academic authoritative safety and fairness benchmarks. Academic benchmarks Advanced red teaming. Manual testing for advanced capabilities (chemical, biological risks). Internal safety evaluations Tested for safety, privacy, societal risks, data memorization and dangerous capabilities.
prompt-debugger, based on saliency methods. Model debugging Guidance on developing Responsible models RAI guidance A hate speech classifier. Methodology to build any classifier with limited data points. safety classifiers
training data examples 2. Parameter efficient tune using LoRA 3. Get model scores or predictions and evaluate Custom classifier 200 data points SOTA on ETHOS leaderboard F1 : 0.8 Hate speech classifier
is an open model – accessible for further development. • Those adapting Gemma must uphold safety and quality standards. • Responsibility extends to creators and end-users.
knowledge generation and reasoning. They are pre-trained on large amounts of publicly available data. Typical usage of LLMs LLM • Text generation • Summarization • Q&A
the information • That they were trained on • That they are explicitly given in the prompt They will often assume that the premise of a prompt is true. LLMs don’t have the capability to ask for more information and potentially need some outside input. Prompt What is langchain? Response LangChain is a blockchain-based platform that allows users to create, share, and learn languages. It is a decentralized platform that is not controlled by any single entity. This means that users have complete control over their data and content. LangChain is also a secure platform, as all data is encrypted and stored on the blockchain. This makes it very difficult for hackers to access or steal data. But…. The Grounding Problem (aka Hallucinations)
Check Prompt Engineering Run fine-tuning to teach the LLM by baking knowledge into the weights of the network. - Data preparation effort - Expensive to do - Online learning/ updating new data - Still may not work Put a human in-between the output of the LLM and the user. - Very latent and expensive - Humans untrustworthy or inattentive Induce the LLM by adding any known information to the prompt. - Based on LLMs existing trained knowledge - Token limit for LLMs - Error prone: Retrieving the right context - Trade off: performance, latency, cost Some naive solutions
do not know your business’s proprietary or domain specific data • LLMs do not have real-time information • LLMs find it challenging to provide accurate citations from their parametric knowledge The Solution: Feed the LLM *relevant* context in real-time, by using an information retrieval system Text Generation Information Retrieval (IR) System External Knowledge Retriever (BM25, encoder, vector search) Generator (LLM: PaLM) question related context Input Prompt ranked results Retrieval Augmented Generator Answer Generative AI applications with Vertex AI PaLM 2 Models and LangChain Retrieval Augmented Generation (RAG) ‘Grounding’ on user data
Agents Semantic search and/or summarization over unstructured documents or structured data sources. Can involve breaking down complex question, combining heterogeneous data sources or multiple documents. Instead of a single question and answer, a chatbot can handle multiple back-and-forth queries and answers, getting clarification or answering follow-up questions. An “agent” is an automated reasoning and decision engine that takes in a user input/query and make internal decisions for executing that query to return results. Involves breaking down complex question, choosing external tools, planning tasks and caching completed tasks. Common use cases / applications
vs RAG? RAG vs Finetuning — Which Is the Best Tool to Boost Your LLM Application? Fine-Tuning RAG External knowledge required? ❌ ✅ Model adaptation required? ✅ ❌ Minimize hallucinations? ❌ ✅ Is training data available? ✅ ❌ How dynamic is the data? ❌ ✅ Interpretability required? ❌ ✅
relatively low-dimensional vector into which you can translate high-dimensional vectors. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space." From: Google Machine Learning Crash Course Data (104~106 dims) DL models Embs (102~104 dims) Meet AI’s multitool: Vector embeddings Embeddings
• Split up document(s) into even chunks. • Each chunk is a piece of raw text. • Generate embedding for each chunk • Store each chunk into a vector database Chunk Chunk Chunk Chunk Vector Database Vector Database Chunk LLM Querying • Generate embedding for query • Find top-k most similar chunks from vector database • Plug into LLM response synthesis Retrieval Synthesis RAG workflow for building a QA System
Feature Store now natively supports embeddings storage and retrieval. “.... lorem ipsum lorem ipsum….” User Prompt Vertex AI: Text Embeddings API Document Corpus or Knowledge Base (1) Index the relevant content Embeddings + Content Index Vertex AI: Vector Search Your favourite low latency key-value store (2) Fetch relevant info and augment prompt Vertex AI: Text Embeddings API Fetch top-k semantic matches and append to prompt “.... lorem ipsum lorem ipsum…. Context: ….doc 1…. ….doc 2…. … ….doc k…. ” Augmented Prompt Vertex AI: PaLM2 LLM Response (long docs to be split into smaller chunks) Retrieval Augmented Generation: Example architecture powered by Vertex AI Text Embeddings and Vector Search