Generative AI for Data Platforms - Databricks Data Intelligence Platform

©2023 Databricks Inc. — All rights reserved 1 Frank Munz,
Principal TM Engineer, Databricks / April 2024 Generative AI for Data Platforms Cutting to the Chase

©2022 Databricks Inc. — All rights reserved Hi, I am
Frank! • Principal @Databricks. TMM for Data, Analytics and AI products • Large scale data & compute • Based in 🍻 ⛰ 🥨 󰎲 Munich • Formerly AWS Tech Evangelist, SW architect, data scientist, published author etc. • @frankmunz / LindedIn

©2023 Databricks Inc. — All rights reserved 10,000+ global customers
$1.5B+ in revenue $4B in investment Inventor of the lakehouse & Pioneer of generative AI Gartner-recognized Leader Database Management Systems + Data Science and Machine Learning Platforms The data and AI company Creator of

©2023 Databricks Inc. — All rights reserved Conﬁdential and Proprietary
4 Streaming Data

©2023 Databricks Inc. — All rights reserved Streaming Data •
Small sized data • Continuously produced • Expectation -> processed in time • Programming paradigm ◦ Right-time vs real-time 5

©2023 Databricks Inc. — All rights reserved Streaming Data Think
“right-time” instead of “real-time” 6 Manually Continually Scheduled Latency Cost

7 Latency vs Throughput

©2023 Databricks Inc. — All rights reserved TPC-DS Benchark from
Barcelona HPC Center 2.2x faster with Photon than previous record for DWH 9

10 But how about Latency?

©2023 Databricks Inc. — All rights reserved 11 Project Lightspeed
https://www.databricks.com/blog/project-lightspeed-update-advancing-apache-spark-structured-streaming

©2023 Databricks Inc. — All rights reserved Subsecond Latency for
Stateless Pipelines 12

13 Serverless

©2023 Databricks Inc. — All rights reserved SIMPLE and FAST
EFFICIENT RELIABLE Serverless Compute for Data Platforms Serverless Compute Hands-off auto-optimized compute No knobs Fast startup For any practitioner Fully managed and versionless Paying only what you use Strong cost governance Secure by default Stable with smart fail-overs Storage Notebooks with Spark Pipelines AI Model hosting SQL DWH "Put your vendor T-shirts down" 14 multi-cloud

15 System Architecture

Architecture 16

Walk trough 17 • Single Page App (S3) • Kinesis Stream ◦ JSON Structure ◦ Kinesis Ingest with EFO • Delta Live Tables (ETL) • Spark Streaming Data Analytics ◦ Histogram streaming data ◦ Window-based aggregation • Databricks Workﬂows • Databricks SQL

©2024 Databricks Inc. — All rights reserved Databricks Data Intelligence
Platform Use generative AI to understand the semantics of your data Data Intelligence Engine Open Data Lake (lake ﬁrst approach: S3, ADLS, GCS) Databricks SQL Text-to-SQL Workﬂows optimized based on past runs Delta Live Tables Automated data qualility Mosaic AI Create, tune, and serve custom LLMs Unity Catalog Securely get insights in natural language Delta Lake with Delta UniForm Data layout is automatically optimized based on usage patterns

©2023 Databricks Inc. — All rights reserved Streaming ETL with
Delta Live Tables Pipelines Python or SQL. STs for ingestion and MVs for transformation Bronze cloud_files CREATE STREAMING TABLE Use a short retention period to avoid compliance risks and reduce costs Avoid complex transformations that could have bugs or drop important data Retain inﬁnite history Easy to perform GDPR and other compliance tasks CREATE MATERIALIZED VIEW Materialized views automatically handle complex joins / aggregations, and propagate updates and deletes. Silver/Gold Ad-hoc DML for GDPR / Corrections

©2023 Databricks Inc. — All rights reserved 20 Delta Live
Tables • Serverless Compute (zero compute settings) • Streaming Ingest from Message Buses with SQL read_kafka(), read_kinesis(), … • Incrementally computed Materialized Views Link to blog

©2023 Databricks Inc. — All rights reserved Building Blocks of
Databricks Workflows 21 A unit of orchestration in Databricks Workflows is called a Job. Databricks Notebooks Python Scripts Python Wheels SQL Files/Queries Delta Live Tables Pipeline dbt Java JAR file Spark Submit Jobs consist of one or more Tasks Sequential Parallel Conditionals (Run If) Jobs-as-a-Task (Modular) Control flows can be established between Tasks. Jobs supports different Triggers DBSQL Dashboards Manual Trigger Scheduled (Cron) API Trigger File Arrival Triggers Table Triggers Continuous (Streaming) Preview Coming Soon

22 gen AI for Data Platforms

©2024 Databricks Inc. — All rights reserved Databricks Data Intelligence
Platform Use generative AI to understand the semantics of your data Data Intelligence Engine Open Data Lake (lake ﬁrst approach: S3, ADLS, GCS) Databricks SQL Text-to-SQL Workﬂows optimized based on past runs Delta Live Tables Automated data qualility Mosaic AI Create, tune, and serve custom LLMs Unity Catalog Securely get insights in natural language Delta Lake with Delta UniForm Data layout is automatically optimized based on usage patterns

We’re infusing AI in our experiences AI-generated docs + semantic search in Catalog Explorer Databricks Assistant SQL to Dashboard Data Rooms (Project Genie)

25 Data Platforms for gen AI

©2023 Databricks Inc. — All rights reserved MosaicML Model Serving
MosaicML Model Serving Vector Search MLflow AI Gateway Model Serving MLflow AI Gateway MLflow Evaluation MLflow Prompt Engg Generative AI Solutions Enable every architectural pattern Prompt Engineering and Chains Retrieval Augmented Generation (RAG) Fine-tuning Pre-training Unity Catalog | Lakehouse Monitoring Crafting specialized prompts to guide LLM behavior Combining an LLM with enterprise data Adapting a pre-trained LLM to specific data sets or domains Training an LLM from scratch Complexity / Compute-intensiveness

©2023 Databricks Inc. — All rights reserved Hallucination Lacking enterprise
context Gen AI gone wrong

Model Serving Custom Models Foundation Models APIs External Models Deploy
any model as a REST API with Serverless compute, managed via MLﬂow. CPU and GPU. Integration with Feature Store and Vector Search. Govern external models and APIs. This provides the governance of MLﬂow Deployments for LLMs, plus the monitoring and payload logging of traditional Databricks Model Serving. Databricks curates top Foundation Models and provides them behind simple APIs. You can start experimentation immediately, without setting up serving yourself. Databricks Model Serving Unified UI, API & SDK for managing all types of AI Models

©2024 Databricks Inc. — All rights reserved Built-in governance with
permissions and lineage Automatically synchronizes streaming source data with vector db. No separate data pipelines Vector DB Serverless vector database for RAG

©2024 Databricks Inc. — All rights reserved Finetuning Finetune your
LLM on your data Serverless: no need to reserve or pick GPUs Pick the data from Unity Catalog or from Huggingface Maintain control and ownership of the model. It is your Intellectual Property.

©2024 Databricks Inc. — All rights reserved Mosaic AI Training
Up to 7X faster and cheaper training of large AI Models Simpliﬁed, scalable, and cost-effective training of large AI models. Train or ﬁne-tune your own generative AI model with your data in your secure environment. Full control of your model and privacy of your data. Your data, your model, built in your secure environment.

©2024 Databricks Inc. — All rights reserved Model architecture •
Sparse Mixture-of-Experts (MoE) • 4 of 16 experts for a given input Model training • Pre-trained on 3072 NVIDIA H100s in 3 months. • on Databricks Data Intelligence Platform, Notebooks, Jobs, etc. The models • DBRX Base for ﬁne-tuning • DBRX Instruct for RAG chains • 132B parameters • 32k token context length License and data • Open-source for commercial use • Pretrained on publicly available 12T tokens • Designed for enterprises Introducing DBRX’s details

39 Gen AI meets Data Platforms Data Intelligence Engine + Uniﬁed Governance -> Assistant, Intelligent Search, automated documentation, natural language queries and better scheduling, automated data quality

40 Data Platform meets gen AI "There is no good model with bad data"

Generative AI for Data Platforms - Databricks D...

Generative AI for Data Platforms - Databricks Data Intelligence Platform

More Decks by Frank Munz

Other Decks in Programming

Featured

Transcript