Data Engineering in the Large Language Models era by Ismaël Mejía

Data Engineering in the Large Language Models (LLM) era
Ismaël Mejía, Senior Cloud Advocate

About me
• Software/Data Engineer
• ~10y experience in ‘Big-data’ / cloud systems
• Real-time (and batch) data at scale
• Apache Avro and Beam PMC/committer
• Apache Software Foundation (ASF) member
‘An open-source data systems person’, so why do I care about AI/LLMs?

September 28, 2022

November 30, 2022

• Artificial Intelligence (1956): The field of computer science that seeks to create intelligent machines that can replicate or exceed human intelligence
• Machine Learning (1997): Subset of AI that enables machines to learn from existing data and improve upon that data to make decisions or predictions
• Deep Learning (2017): A machine learning technique in which layers of neural networks are used to process data and make decisions
• Generative AI (2021): Create new written, visual, and auditory content given prompts or existing data

 2016  Human parity  2017  Human parity
 2018  Human parity  2019  Human parity  2020  Human parity  2021  Human parity  2021  Human parity

AI innovation fueled by research
Global research centers: Redmond WA, Montreal QC, New York NY, Boston MA, Cambridge UK, India, Beijing China, Shanghai China
• Researchers employed worldwide
• AI-related patents
• AI research papers published
• To human parity on vision, speech, and language

AI is already mainstream
Top 3 commonly adopted AI use cases and benefits
1. Intelligent document automation: automate processes and improve operational efficiency
2. Sales and Demand Forecasting / Inventory Management: accelerate time-to-market
3. Hyper-personalization for up-sell and cross-sell: build digital trust and improve user experiences

Also: an LLM revolution of new products / features
• Content generation
• Summarization
• Semantic search
• Prompt Engineering
• Copilots
• Agents / Assistants

User expectations: what if I had that!
• Be able to interact like that, have an ‘expert’ assistant
• Use my own data
• Integrate my systems / workflow

AI/LLM revolution consequences for data practitioners
1. Better tools to do our work faster
2. Make structured and unstructured data accessible
3. Offer similar features/services to others

Microsoft Analytics Portfolio
Data Factory, Synapse DW, Purview, Event Hub, Data Explorer, Azure AI, Power BI, Synapse Spark, Azure Databricks

Data analytics for the era of AI

Power BI + Synapse
Marrying the ease of use of Power BI with the scalability and depth of Synapse
Data Integration, Data Warehouse, Real Time Analytics, Business Intelligence, Data Science, Data Lake, Governance, Spark Engines

Microsoft Fabric: Unified analytics platform
Data Integration, Data Engineering, Data Warehousing, Data Science, Real Time Analytics, Business Intelligence
OneLake
• Lake centric and open
• Empower every Office user
• Pervasive security and governance

AI/LLM revolution consequences for data practitioners
1. Better tools to do our work faster
2. Make structured and unstructured data accessible
3. Offer similar features/services to others

Microsoft Fabric: Data analytics for the era of AI
• Complete Analytics Platform: Everything, unified; SaaS-ified; Secured and governed
• Lake Centric and Open: OneLake; One copy; Open at every tier
• Empower Every Business User: Familiar and intuitive; Built into Microsoft 365; Insight to action
• AI Powered: Copilot accelerated; GPT on your data; AI-driven insights

Microsoft Fabric: Complete Analytics Platform
• Everything, unified
• SaaS-ified
• Secured and governed

Persona optimized experiences

Microsoft Fabric: AI Powered
• Copilot accelerated
• GPT on your data
• AI-driven insights

Copilot in Power BI
• Create beautiful and insightful reports just by chatting with Copilot
• Define metrics and calculations for your data model just by describing them in natural language
• Use Copilot to find and summarize insights in your data
• Stay focused on your business outcomes and unlock insights in your data with Copilot

Copilot in Power BI

Copilot in Notebooks
• Use Copilot to enrich, model, analyze and explore your data in notebooks
• Work with Copilot to understand how best to analyze your data
• Chat with Copilot to create and configure ML models
• Write code faster with inline code suggestions from Copilot
• Use Copilot to summarize and explain code to understand how it works

Copilot in Notebooks

at scale

Microsoft Fabric: Lake Centric and Open
• OneLake
• One copy
• Open at every tier

OneLake for all Data: “The OneDrive for Data”
• A single SaaS lake for the whole organization
• Provisioned automatically with the tenant
• All workloads automatically store their data in the OneLake workspace folders
• All the data is organized in an intuitive hierarchical namespace
• The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance and compliance

One Copy for all computes: real separation of compute and storage
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Analytics vNext
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• Shared universal security model is enforced across all the engines
[Diagram: serverless compute engines (T-SQL, Spark, KQL, Analysis Services) over the data items Customers 360, Finance, Service Telemetry and Business KPIs, each stored in Delta – Parquet format]

Taking One Copy to the Next Level: Shortcuts
• Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
• With shortcuts, data throughout OneLake can be composed together without any data movement
• Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement, making OneLake a multi-cloud data lake
• With support for industry standard APIs, OneLake data can be directly accessed by any application or service
[Diagram: shortcuts linking Customers 360, Finance, Service Telemetry and Business KPIs to data in Amazon, Google and Azure]

that different

Making unstructured data accessible <the old days>
• Query/index logs to extract information, e.g. observability
• Just put everything in some database; we will care about it later

Does all data have some hidden structure?
• File format metadata (e.g. headers in images)
• Automatic structure ‘extraction’: parsing
• Look for structure conceptually: features
• Manual structure (aka human labeling)
LLMs
• Represent data in a model space (aka embeddings)
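
To make the "file format metadata" point concrete, here is a minimal sketch of pulling structure out of an otherwise opaque image: it reads the width and height from the IHDR chunk at the start of a PNG file. The function names and the synthetic header builder are illustrative only and do not come from the talk.

```python
import struct
import zlib
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) taken from the IHDR chunk of a PNG byte string."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # After the 8-byte signature: 4-byte chunk length, then 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # The IHDR payload starts with width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def make_png_header(width: int, height: int) -> bytes:
    """Build only the signature + IHDR chunk (hypothetical helper for the demo)."""
    # Width, height, bit depth 8, colour type 2 (RGB), then three zero fields.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    crc = struct.pack(">I", zlib.crc32(chunk))
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + chunk + crc


header = make_png_header(640, 480)
print(read_png_size(header))
```

The same idea applies to other formats (EXIF in JPEG, ID3 in MP3): a few bytes of header already give queryable columns before any model is involved.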

Semantic Search & Power Of Embeddings
[Diagram: Image, Audio, Text → Embedding Model → vector (0.027, -0.001, 0.002, …, 0.011)]

Cosine Similarity
Source: Cosine similarity - Wikipedia
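
As a small illustration of the measure named on this slide, the sketch below computes cosine similarity, i.e. the dot product of two vectors divided by the product of their lengths. The example vectors are made up; real embeddings would have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values, not real embeddings).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```

Because the result depends only on direction, two texts of very different length can still score as close when their embeddings point the same way.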

Vector Databases
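
Conceptually, a vector database stores embeddings and answers "which stored vectors are closest to this query vector?". The sketch below is a brute-force, in-memory stand-in for that idea, assuming cosine similarity as the distance; the class name, keys and vectors are hypothetical, and real products add approximate indexes, persistence and filtering on top.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class InMemoryVectorIndex:
    """Toy vector store: exact nearest-neighbour search by scanning every vector."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, key: str, vector: Sequence[float]) -> None:
        self._vectors[key] = list(vector)

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        """Return the k most similar (key, score) pairs, best first."""
        scored = ((key, _cosine(query, vec)) for key, vec in self._vectors.items())
        return heapq.nlargest(k, scored, key=lambda item: item[1])


index = InMemoryVectorIndex()
# Illustrative 3-dimensional "embeddings"; real ones come from an embedding model.
index.add("invoice", [0.9, 0.1, 0.0])
index.add("receipt", [0.8, 0.2, 0.1])
index.add("holiday photo", [0.0, 0.1, 0.9])

results = index.search([1.0, 0.0, 0.0], k=2)
print([key for key, _score in results])
```

The linear scan is fine for a few thousand vectors; the point of a dedicated vector database is to keep this query fast when the collection is much larger.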

OpenAI Fine-Tuning API
Source: Fine-tuning - OpenAI API
{"messages": [
  {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
  {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
  {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}
]}

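Training data in the chat format shown on the fine-tuning slide is usually shipped as JSON Lines, one example per line. The sketch below serializes examples that way and runs a few sanity checks first. The checks (three allowed roles, last message from the assistant) are assumptions made for this illustration, not the API's own validation rules, and the helper names are hypothetical.

```python
import json
from typing import Dict, List

# Roles used on the slide; treated here as the only valid ones (an assumption).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: Dict) -> None:
    """Raise ValueError if an example does not look like a chat training record."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError("unexpected role: %r" % message.get("role"))
        if not isinstance(message.get("content"), str):
            raise ValueError("message content must be a string")
    if messages[-1]["role"] != "assistant":
        raise ValueError("last message should be the assistant's answer")


def to_jsonl(examples: List[Dict]) -> str:
    """Validate every example and return one JSON document per line."""
    for example in examples:
        validate_example(example)
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


marv_example = {
    "messages": [
        {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
        {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"},
        {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"},
    ]
}

print(to_jsonl([marv_example]))
```

For a data engineer, this is the interesting part of fine-tuning: producing, cleaning and validating these records is an ordinary batch data pipeline.
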
• Turing: Rich language understanding
• Z-Code: 100 languages translation
• Florence: Breakthrough visual recognition
• OpenAI GPT-3/GPT-4: Human-like language generation
• OpenAI DALL-E: Realistic image generation
• OpenAI Codex: Advanced code generation
• ChatGPT: Conversation generation
Azure AI services: Vision, Speech, Language, Decision, OpenAI Service, Cognitive Search, Form Recognizer, Immersive Reader, Bot Service, Video Analyzer
• Better search and Q&A
• Better customer engagement and support
• Better matching and content moderation
• Better email management and meeting preparation
• Better knowledge management
• Better meeting management
• Better reading and writing assistance
• Better content moderation

Data Science in Microsoft Fabric
End-to-end data science for predictive business insights
Developer friendly
• SaaS experiences with quick setup
• Starter pools with fast cluster startup
• Code authoring experiences in Notebooks and IDE
• VSCode integration
• Git integration (CI/CD)
Data Centric
• Easy and secure access to lake centric data
• Open Delta Lake support promotes reproducibility
• Native integration with data infrastructure
Secure collaboration
• Unified platform for all analytics roles incl. data scientists
• Secure and easy sharing of data, code, models and experiments
Rich ML tools
• Supports MLFlow model and experiment management
• MLFlow Autologging
• Large set of built-in, scalable ML tools with SynapseML library
• Serve predictions swiftly to PowerBI with Direct Lake mode

The Data Science Process in Fabric
[Diagram: Problem formulation/ideation → Data discovery and pre-processing → Experiment & Model → Enrich & Operationalize → Insight; steps: Prepare, Model, Evaluate, Explore; tools: Data Wrangler, SynapseML, Batch PREDICT, Direct Lake]

Notebooks

Models and Experiments with MLFlow
• Built-in model & experiment tracking enables data scientists to track and compare their different experiment runs and model versions
• Automatically capture model metrics & parameters with built-in support for MLFlow auto-logging
• Users can create and manage model artifacts in Trident
• MLFlow compatible
• Model registry is powered by AzureML

SynapseML and Microsoft AI services
Distributed ML Model Training, MLflow support, Cognitive Services, OpenAI LLMs

SynapseML: Cognitive Services Built-in to Trident
Service Name: API Type
• Vision: OCR, Analyze Image, Recognize Text, Read Image, Recognize Domain Specific Content, Generate Thumbnails, Tag Image, Describe Image
• Form Recognizer: Analyze Layout, Analyze Receipts, Analyze Business Cards, Analyze Invoices, Analyze ID Documents, Analyze Custom Model, Analyze Documents, List Custom Models
• Bing: Bing Image Search
• Face: Detect Face, Find Similar Face, Group Faces, Identify Faces, Verify Faces
• Speech: Speech to Text, Conversation Transcription, Text to Speech, Emotion Recognition
• OpenAI: Completion*, Chat*, Embeddings*
• Text: Entity Detector*, Key Phrase Extractor*, Language Detector*, PII*, Sentiment*, Healthcare Analyze Text
• Translation: Translate*, Transliterate*, Detect Language*, Break Sentence*, Dictionary Lookup*, Dictionary Examples*, Document Translator
• Azure Search: Add Documents
• Anomaly Detection: Detect Last Anomaly*, Detect Anomalies*, Simple Detect Anomalies, Custom Detection, Multivariate Detection
• Azure Maps: Address Geocoding, Reverse Geocoding, Check Point in Polygon
*Supported in Trident built-in endpoint at //build
Private Preview

AI Plug-ins for your data
• Create AI plug-ins to deliver custom generative AI experiences for your data
• Enable custom Q&A on your data in Fabric
• Define custom business semantics and grounding unique to your organization
• Deploy plug-ins to work seamlessly with Copilot in Business Chat

Want to learn more? Microsoft Fabric: microsoft.com/fabric

aka.ms/learnlive-get-started-microsoft-fabric
Date: Title
• August 29, 2023: Get started with end-to-end analytics and Lakehouses in Microsoft Fabric
• September 5, 2023: Use Apache Spark in Microsoft Fabric
• September 12, 2023: Work with Delta Lake tables in Microsoft Fabric
• September 19, 2023: Use Data Factory pipelines in Microsoft Fabric
• September 26, 2023: Ingest Data with Dataflows Gen2 in Microsoft Fabric
• October 3, 2023: Get started with data warehouses in Microsoft Fabric
• October 10, 2023: Get started with Real-Time Analytics in Microsoft Fabric
• October 17, 2023: Get started with data science in Microsoft Fabric
• October 24, 2023: Administer Microsoft Fabric

aka.ms/fabric-csc
• Compete: Benchmark your progress against friends and coworkers. It's always better when we learn together.
• Learn: Increase your understanding with easy-to-read instruction and stay up on the bleeding-edge of technology.
• Develop skills: By the end of the challenge, you will have marketable skills to better yourself and your career.

Microsoft Fabric Community Resources
✓ Try Microsoft Fabric for free: https://aka.ms/try-fabric
✓ Join the Fabric community: https://aka.ms/fabriccommunity
✓ Share and vote for ideas to improve Fabric: https://aka.ms/fabricideas
✓ Read and comment on our blog: https://aka.ms/fabricblog
▪ Product announcement: https://aka.ms/fabric
▪ Digital Event at Build (videos): https://aka.ms/build-with-analytics
▪ Product website: https://aka.ms/microsoft-fabric
▪ Documentation: https://aka.ms/fabric-docs
▪ Fabric e-book: https://aka.ms/fabric-get-started-ebook
▪ Microsoft Learn: https://aka.ms/learn-fabric
▪ End-to-end scenario tutorials: https://aka.ms/fabric-tutorials
▪ Fabric Notes: https://aka.ms/fabric-notes

Thank you!

What’s coming next?

Roadmap
Stages: Problem formulation/ideation, Data discovery and pre-processing, Experiment & Model, Enrich & Operationalize, Insight
• Tagging support
• Hyperparameter tuning
• AutoML (FLAML)
• Model interpretability
• DNN training
• Built-in pre-trained AI models (Cognitive Services)
• Data Wrangler on Spark
• Explore PowerBI datasets from Notebooks (Semantic Link)
• CI/CD
• ALM support for ML models
• Trident SDK for ML
• Improved model batch scoring with containerization
• Model endpoints
• Trident ML model support in PBI dataflows
• Monitoring of models
• Feature store
• Integration with PowerBI Metrics
• Open Notebook from BI report visual (Semantic link)
Copilot experiences and Azure Open AI Integration

AutoML with FLAML (Private Preview)
• Automate the process of building machine learning models with FLAML
• Code-first integration to parallelize AutoML trials with Spark
• Run AutoML
• Integrated with MLFlow to automatically capture runs & metrics
Microsoft Confidential: Content is shared under NDA

Extra slides

Microsoft Fabric simplicity
Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Microsoft Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use.
• COMPUTE: A shared pool of capacity that powers all capabilities in Microsoft Fabric, from data modeling and data warehousing to business intelligence. Pay-as-you-go (per sec billing with one minute minimum).
• STORAGE: A single place to store all data. Pay-as-you-go ($ per GB / month).
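
The two-variable bill described here can be written down as simple arithmetic. The sketch below reads "per sec billing with one minute minimum" as "any use shorter than 60 seconds is charged as 60 seconds", which is an interpretation; every rate in it is a made-up placeholder, not a Microsoft price.

```python
def compute_cost(seconds_used: float, rate_per_second: float) -> float:
    """Pay-as-you-go compute: billed per second, with a 60-second minimum (assumed)."""
    if seconds_used <= 0:
        return 0.0
    billable_seconds = max(seconds_used, 60.0)
    return billable_seconds * rate_per_second


def storage_cost(gigabytes: float, rate_per_gb_month: float, months: float = 1.0) -> float:
    """Pay-as-you-go storage: a flat price per GB per month."""
    return gigabytes * rate_per_gb_month * months


def monthly_bill(seconds_used: float, rate_per_second: float,
                 gigabytes: float, rate_per_gb_month: float) -> float:
    """Total bill = compute provisioned + storage used."""
    return compute_cost(seconds_used, rate_per_second) + storage_cost(gigabytes, rate_per_gb_month)


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
RATE_PER_SECOND = 0.5
RATE_PER_GB_MONTH = 0.25

print(compute_cost(30, RATE_PER_SECOND))    # 30 s of use is billed as 60 s
print(monthly_bill(3600, RATE_PER_SECOND, 1000, RATE_PER_GB_MONTH))
```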

Pricing Pay as you go

Azure OpenAI + Cognitive Search

Azure OpenAI + Plug-in
Source: Introduction - OpenAI API
[Diagram: App, Orchestrator, OpenAI API, Plug-Ins, API, Data Source, Others]
