

Scalable LLM APIs for AI and Generative AI Application Development
Ettikan Karuppiah, Director/Technologist - NVIDIA

Apidays Singapore 2024: Connecting Customers, Business and Technology (April 17 & 18, 2024)

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

apidays
May 04, 2024
Transcript

  1. Scalable LLM APIs for AI and Generative AI Application Development

    Ettikan Kandasamy Karuppiah (Ph.D) Director/Technologist, NVIDIA ROAP Region
  2. Generative AI Can Learn and Understand Everything

    Slide diagram: many input modalities (text, sound, speech, image, video, multi-modal data, amino acids, brainwaves) map to many generated outputs (text, speech, image, video, 3D, animation, manipulation, protein).
    Example prompts:
    • “An adorable cat in 3D confidently riding a flying, rocket-powered bike, adorned with a sleek black leather jacket.”
    • “A close shot of a cat in a futuristic space suit confidently operating controls in the cockpit of a sci-fi spaceship. The cockpit has lots of lights and holographic screens with data in cool colors. The spaceship is traveling through a warm, colorful nebula. Shot on 35mm, vivid colors.”
  3. Generative AI Across Every Industry and Job Function

    Transform data into business insights and automation. Example applications:
    • Coding copilot
    • Summarization (Kinetica)
    • Telco copilot
    • Telco customer service avatar
    • SAP CRM assistant
    • Software security analysis
    • Car configurator
    • ServiceNow customer relations management
  4. Enterprises Face Challenges Experimenting with Generative AI

    Organizations must choose between ease of use and control.
    Managed generative AI services:
    • Easy-to-use APIs for development; fast path to getting started with AI
    • Data and prompts are shared externally
    • Infrastructure limited to the managed environment
    • Limited control over the overall generative AI strategy
    Open-source deployment (enterprise-controlled environment):
    • Run anywhere across data center and cloud; securely manage data in a self-hosted environment
    • Custom code for APIs and fine-tuned models
    • Ongoing maintenance and updates
    • Tuning required for different infrastructure
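    For a concrete sense of the trade-off, the sketch below shows that the same OpenAI-compatible client (used later in this deck) can target either a hosted endpoint or a self-hosted NIM; only the base URL and credentials change. The internal hostname and port are hypothetical placeholders.

    from openai import OpenAI

    # Managed path: hosted API endpoint; easy to start, but prompts leave the enterprise.
    managed = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="$NVIDIA_API_KEY",
    )

    # Self-hosted path: a NIM container running in your own data center or cloud.
    # Hostname and port are placeholders for wherever the container is deployed.
    self_hosted = OpenAI(
        base_url="http://nim.internal.example:8000/v1",
        api_key="not-needed-inside-the-private-network",
    )

    # Application code is identical against either client, for example:
    # managed.chat.completions.create(model="meta/llama2-70b", messages=[...])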
  5. Enterprises Are on the Generative AI Journey

    • Explosion (2022): ChatGPT is announced late in 2022, gaining over 100 million users in just two months. Users of all levels can experience AI and feel the benefits firsthand.
    • Experimentation (2023): Enterprise application developers kick off POCs for generative AI applications with API services and open models, including Llama 2, Mistral, NVIDIA models, and others.
    • Production (2024): Organizations have set aside budget and are ramping up efforts to build accelerated infrastructure to support generative AI in production.
  6. Anatomy of a NIM

    A prompt event enters through industry-standard APIs into a cloud-native container that wraps the AI model with pre-processing, post-processing, in-flight batching, and a customization cache, served by Triton Inference Server on a TensorRT engine.
    • Industry-standard APIs: text, speech, image, video, 3D, biology
    • Cloud-native container: K8s support, metrics and monitoring, identity, secret management, liveness probe
    • Pre- and post-processing: cuDF, CV-CUDA, DALI, NCCL
    • TensorRT engine: cuBLAS, cuDNN, in-flight batching, memory optimization, FP8 quantization
    • Customization cache: LoRA, p-tuning
    • Model types: text-to-text, text-to-image, text-to-3D, multimodal, ASR, text-to-speech
    Triton bundles 417 packages/libraries across OSS, third-party, and NVIDIA sources; TensorRT bundles 333.
  7. NVIDIA NIM

    The NIM stack builds on an installed base of hundreds of millions of CUDA GPUs:
    • NVIDIA CUDA
    • Cloud-native stack: GPU Operator, Network Operator; Kubernetes
    • Triton Inference Server: cuDF, CV-CUDA, DALI, NCCL, post-processing decoder
    • TensorRT and TensorRT-LLM: cuBLAS, cuDNN, in-flight batching, memory optimization, FP8 quantization
    • Optimized model: single GPU, multi-GPU, multi-node
    • Customization cache: p-tuning, LoRA, model weights
    • Enterprise management: health check, identity, metrics, monitoring, secrets management
    • Industry-standard APIs: text, speech, image, video, 3D, biology
  8. Enterprise Ready

    Certified on NVIDIA AI Enterprise:
    • AI and data science development and deployment tools, with MLOps and cloud-native management and orchestration
    • Infrastructure optimization across cloud, data center, embedded, and edge accelerated infrastructure
    • AI workflows, frameworks, and pretrained models* spanning speech AI, video analytics, recommenders, medical imaging, autonomous vehicles, logistics, communications, physics ML, robotics, conversational AI, customer service, and cybersecurity
  9. NVIDIA NIM: Setup

    1. Download the model from NGC.
    2. Unpack the downloaded artifact into a model repository.
    3. Launch the NIM container with your desired model.

    ngc registry model download-version "ohlfw0olaadg/ea-participants/llama-2-7b:LLAMA-2-7B-4K-FP16-1-A100.24.01"
    tar -xzf llama-2-7b_vLLAMA-2-7B-4K-FP16-1-A100.24.01/LLAMA-2-7B-4K-FP16-1-A100.24.01.tar.gz
    docker run --gpus all --shm-size 1G -v $(pwd)/model-store:/model-store --net=host nvcr.io/ohlfw0olaadg/ea-participants/nemollm-inference-ms:24.01 nemollm_inference_ms --model llama-2-7b --num_gpus=1
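    Once the container from step 3 is running, a quick smoke test is to send one request to the local service. This is a minimal sketch that assumes the microservice exposes an OpenAI-compatible endpoint on localhost port 8000; the actual port, path, and client to use are documented with the container image.

    from openai import OpenAI

    # Assumption: the locally launched NIM serves an OpenAI-compatible API on port 8000.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

    resp = client.chat.completions.create(
        model="llama-2-7b",  # the model name passed to the container at launch
        messages=[{"role": "user", "content": "Reply with one short sentence to confirm you are up."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)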
  10. NVIDIA NIM: Deployment

    Once NIM is deployed, you can start making requests using a standard REST API. Example response: “Triton, a product offering from Cohesity, is a data protection and management solution designed to simplify and streamline the backup and recovery of data across various environments. Triton supports several interfaces, including…”

    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC"
    )

    completion = client.chat.completions.create(
        model="meta/llama2-70b",
        messages=[{"role": "user", "content": "What interfaces does Triton support?"}],
        temperature=0.5,
        top_p=1,
        max_tokens=1000,
        stream=True
    )

    for chunk in completion:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
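    For reference, the same request can be made without streaming, in which case the full answer arrives in a single response object rather than as incremental chunks. The endpoint, model, and parameters below are the ones from the slide.

    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC"
    )

    # Non-streaming variant: block until the complete answer is returned.
    completion = client.chat.completions.create(
        model="meta/llama2-70b",
        messages=[{"role": "user", "content": "What interfaces does Triton support?"}],
        temperature=0.5,
        top_p=1,
        max_tokens=1000,
    )
    print(completion.choices[0].message.content)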
  11. NVIDIA NIM Is the Fastest Path to AI Inference

    NIM reduces the engineering resources required to deploy optimized, accelerated models (comparison shown for Llama 2 and Nemotron):
    • Deployment time: 5 minutes with NIM vs. roughly one week with open-source Triton + TRT-LLM.
    • API standardization: NIM exposes industry-standard protocols (OpenAI for LLMs, Google Translate for speech); with open source, the user creates a shim layer (reducing performance) or modifies Triton to generate custom endpoints.
    • Pre-built engine: NIM ships pre-built TRT-LLM engines for NVIDIA and community models; with open source, the user converts the checkpoint to TRT-LLM format and runs sweeps through different parameters to find the optimal config.
    • Triton ensemble / BLS backend: NIM is pre-built with TRT-LLM to handle pre- and post-processing (tokenization); with open source, the user manually sets up and configures it.
    • Triton deployment: automated with NIM; manual setup and configuration with open source.
    • Customization: NIM supports p-tuning and LoRA, with more planned; with open source, the user needs to create custom logic.
    • Container validation: NIM is pre-validated with QA testing; open source has no pre-validation.
    • Support: NIM includes NVIDIA AI Enterprise security, CVE scanning/patching, and tech support; open source has no enterprise support.
  12. Inference Microservices for Generative AI

    NVIDIA NIM is the fastest way to deploy AI models on accelerated infrastructure across cloud, data center, and PC. The NVIDIA API catalog includes Mixtral 8x7B, Vista-3D, DiffDock, Gemma 7B, Fuyu, AI generator, Kosmos-2, Audio2Face, ESMFold, MolMIM, NeMo Retriever, and 3D generator.
  13. NVIDIA NIM for Every Domain

    • Language NIMs: Code Llama 70B, Nemotron-3 22B Persona, Gemma 7B, Llama 2 70B, Mistral 7B, Mixtral 8x7B, Adept 110B, Jamba, Cohere 35B, Phi-2
    • Visual / multimodal NIMs: Deplot, Edify (Getty), Edify (Shutterstock), Fuyu 8B and 55B, Kosmos-2, NeVA, SDXL 1.0, SDXL Turbo
    • Digital human NIMs: Audio2Face, Riva ASR
    • Optimization / simulation NIMs: cuOpt, Earth-2
    • Application NIMs: Llama Guard, Retrieval Embedding, Retrieval Reranking
    • Digital biology NIMs: DeepVariant, DiffDock, ESMFold, MolMIM, Vista-3D
  14. NVIDIA AI Enterprise

    Enterprise-ready generative AI with RAG and NVIDIA NIM, easing the journey from pilot to production across development and deployment.
    https://www.nvidia.com/en-us/ai-data-science/ai-workflows/generative-ai-chatbots/
  15. Gen AI for Technician Support

    Information retrieval for technical documents.
    Corpus: the two-volume FAA “Aviation Maintenance Technician Handbook—Airframe” manual (~1,200 pages in total).
    NeMo LLM models:
    • 43B model (4K token limit)
    • 22B model (16K token limit)
    Features:
    • Ingest (embed) large documents into a vector database for semantic search (see the sketch after this slide)
    • Cite sources in retrieved answers
    • Extract and cite images and captions*
    • Guardrails (no hallucination)
    * on the product roadmap
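    The ingestion and semantic-search feature can be pictured with a small sketch: split the handbook text into overlapping passages, embed each passage, and rank passages by similarity to a technician's question. The hashed bag-of-words embed() below is only a stand-in for a real embedding service (such as the Retrieval Embedding NIM listed earlier), and the input file name is hypothetical.

    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Placeholder embedding (hashed bag-of-words) standing in for a real embedding model.
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def chunk(text: str, size: int = 800, overlap: int = 100):
        # Split the document into overlapping passages for indexing.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    manual_text = open("amt_airframe_handbook.txt").read()   # hypothetical extracted text
    passages = chunk(manual_text)
    index = np.stack([embed(p) for p in passages])           # the "vector database"

    # Semantic search: rank passages by cosine similarity to the technician's question.
    query = embed("How do I inspect a wing spar for corrosion?")
    for i in np.argsort(index @ query)[::-1][:3]:
        print(f"[passage {i}] {passages[i][:120]}...")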
  16. Gen AI for Document Summarization

    • LLMs excel at understanding and synthesizing text.
    • Given a set of documents, LLMs can summarize the text, for example a 183-page NIST publication like the one shown on the slide.
    NeMo LLM models of interest:
    • 43B model (4K token limit)
    • 22B model (16K token limit)
    Features:
    • Can be fine-tuned on a custom dataset (various adaptation methods: p-tuning, LoRA, adapters, etc.)
    • Can handle arbitrary-length documents (see the sketch after this slide)
    • More features to come soon
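    One common way to handle documents far longer than a 4K or 16K token limit is map-reduce summarization: summarize fixed-size chunks, then summarize the summaries. The sketch below reuses the hosted endpoint and model from the deployment slide rather than the NeMo 43B/22B models, and the input file name is hypothetical.

    from openai import OpenAI

    client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY")
    MODEL = "meta/llama2-70b"  # any NIM-served LLM would work the same way

    def summarize(text: str, instruction: str = "Summarize the following text:") -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
            temperature=0.2,
            max_tokens=400,
        )
        return resp.choices[0].message.content

    # Map step: summarize each chunk. Reduce step: summarize the partial summaries.
    document = open("nist_publication.txt").read()   # hypothetical extracted text
    chunks = [document[i:i + 8000] for i in range(0, len(document), 8000)]
    partial = [summarize(c) for c in chunks]
    final = summarize("\n\n".join(partial),
                      "Combine these partial summaries into one concise summary:")
    print(final)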
  17. Applying Multi-Modal Foundation Models

    Use cases: video search and summarization, real-time asset tracking, customer assistance, content generation, detecting hazardous conditions, human-robot interaction.
  18. NVIDIA Optimized Visual Foundation Models

    • NV-DINOv2: vision-only backbone for downstream vision AI tasks such as image classification, detection, and segmentation.
    • NV-CLIP: image-text matching model that aligns image features with text features; backbone for downstream vision AI tasks such as image classification, detection, and segmentation.
    • Grounding-DINO: open-vocabulary object detection with text prompts as input.
    • EfficientViT-SAM: faster, more efficient version of SAM (Segment Anything Model), a visual foundation model that segments any object based on different types of visual prompts such as a single coordinate or a bounding box.
    • VILA: family of visual language models for image and video understanding and Q&A.
    • LITA: visual language model for video understanding and context with spatial and temporal localization.
    • Foundation Pose: 6-DoF object pose estimation and tracking, providing the object pose and 3D bounding box.
    • BEVFusion: sensor fusion model that fuses multiple input sensors (cameras, LiDAR, radar, etc.) to create a bird's-eye view of the scene with 3D bounding-box representations of the objects.
    • NeVA: multi-modal visual language model for image understanding and Q&A.
    • LiDARSAM: segments any object based on user-provided text prompts on 3D point-cloud LiDAR data.
  19. Multi-Modal Foundation Backbone: NV-CLIP

    • Commercially viable: trained on ethically sourced data and compares favorably to other non-commercial public models.
    • Trained on a very large dataset: 700M image-text pairs for text and image embeddings.
    • Foundation backbone for vision AI: used in many downstream vision tasks such as zero-shot detection/segmentation, VLMs, and more.
    Zero-shot accuracy (ImageNet-1K): ViT-B, NV-CLIP 70.4 vs. OpenAI CLIP† 68.6; ViT-H, NV-CLIP 77.4†† vs. OpenAI CLIP 78.0.
    † Non-commercial use only. †† Trained on 700M image-text pairs vs. 2B for CLIP.
    Available in April 2024.
    Slide figures: the CLIP-style image-text embedding similarity matrix, and a model inference performance (FPS) chart for ViT-B and ViT-H across H100, A100, L40, A30, L4, and A2 GPUs.
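    The image-text similarity matrix on the slide is the core of zero-shot classification with a CLIP-style model: an image is assigned the label whose text embedding is most similar to its image embedding. The sketch below illustrates only the mechanics; random vectors stand in for the NV-CLIP text and image encoder outputs.

    import numpy as np

    rng = np.random.default_rng(0)

    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a circuit board"]
    text_emb = normalize(rng.normal(size=(len(labels), 512)))   # placeholder for the text encoder
    image_emb = normalize(rng.normal(size=(512,)))              # placeholder for the image encoder

    # One row of the image-text similarity matrix, softmaxed into label probabilities.
    similarity = text_emb @ image_emb
    probs = np.exp(similarity) / np.exp(similarity).sum()
    print(labels[int(np.argmax(probs))], probs)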
  20. Using Foundation Backbones for Downstream CV Tasks

    The foundation backbones (NV-DINO / NV-CLIP) turn input data (images, or text and images) into feature vectors that feed downstream tasks: classification (class label), detection (bounding box and labels), segmentation (per-pixel class label / mask), zero-shot tasks (class label, bounding box, mask, text), image retrieval, VLMs, and diffusion.
  21. Fine-Tune with 100 or Fewer Samples for Image Classification

    NV-DINOv2 is a foundational model trained on more than 100M image/text pairs. For few-shot classification the backbone stays frozen: it maps each image to a feature vector, and only a lightweight head is fine-tuned with TAO against the ground-truth labels before inference (see the sketch after this slide).
    • Train with as few as 10 samples, e.g. PCB defect classification.
    • Slide chart: few-shot learning curves comparing NV-DINOv2 with GC-ViT as the number of training samples grows from 10 to 1,000.
    Demo: foundational model fine-tuning.
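    A minimal way to picture the frozen-backbone workflow: the backbone produces feature vectors, and only a small classification head is trained on a handful of labeled samples. In the sketch below, synthetic clustered vectors stand in for frozen NV-DINOv2 features, and a scikit-learn logistic regression plays the role of the head that TAO would fine-tune.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_classes, dim = 3, 768
    centers = rng.normal(size=(n_classes, dim))   # fixed "class concepts" in feature space

    def backbone_features(n):
        # Placeholder for frozen backbone features: class-dependent clusters plus noise.
        labels = rng.integers(0, n_classes, size=n)
        feats = centers[labels] + 0.5 * rng.normal(size=(n, dim))
        return feats, labels

    # "Fine-tune" on roughly 10 samples per class: only the head is trained.
    X_train, y_train = backbone_features(30)
    X_test, y_test = backbone_features(300)

    head = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("few-shot accuracy:", head.score(X_test, y_test))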
  22. Enterprise Gen AI with RAG

    Enhance the accuracy and reliability of generative AI with RAG (see the sketch after this slide):
    • Improve accuracy and security
    • Control costs
    • Increase productivity
    • Avoid vendor lock-in
    Slide diagram: a user prompt goes to NVIDIA NIM, which ranks and retrieves enterprise data (images, office docs, text, PDFs) before returning a response.
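    Putting the diagram into code: retrieve the most relevant enterprise passages for the user's prompt, then ask the NIM-served model to answer using only that context. The retriever below is a keyword-overlap placeholder for a real embedding and reranking pipeline, the passages are invented examples, and the endpoint and model are the ones shown earlier in the deck.

    from openai import OpenAI

    client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="$NVIDIA_API_KEY")

    # Tiny in-memory "knowledge base" standing in for indexed enterprise data
    # (PDFs, office docs, images) behind a retrieval embedding / reranking NIM.
    passages = [
        "Invoices over $10,000 require approval from a finance director.",
        "Employees may carry over up to five unused vacation days per year.",
        "VPN access requires enrollment in the corporate MFA program.",
    ]

    def retrieve(question, k=2):
        # Placeholder retriever: keyword overlap instead of vector similarity.
        score = lambda p: len(set(question.lower().split()) & set(p.lower().split()))
        return sorted(passages, key=score, reverse=True)[:k]

    question = "How many vacation days can I carry over?"
    context = "\n".join(retrieve(question))
    answer = client.chat.completions.create(
        model="meta/llama2-70b",
        messages=[{"role": "user",
                   "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
        temperature=0.2,
        max_tokens=300,
    )
    print(answer.choices[0].message.content)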
  23. For start-ups, do join the NVIDIA Inception program. All are welcome to join the NVIDIA Developer Program at developer.nvidia.com. Visit AI.NVIDIA.COM. Contact: [email protected]