apidays Paris 2025 | Architecting Intelligence: Building at Scale with NVIDIA Nemotron

Architecting Intelligence: Building at Scale with NVIDIA Nemotron
Ziv Ilan, Super AI Startups, Dec 2025

AI Is Not a One-Size-Fits-All Problem
Advancing Open Source and Specialized AI
• AI systems of models
• Specialized AI
• AI efficiency

So what is Nemotron, and how does it fit into the picture?

Nemotron: building blocks that enable you to architect your intelligent AI agent
NVIDIA Nemotron: Open Family of AI Models, Datasets and Techniques
• Data: 9T tokens, 30M samples, 1M compute hours
• Models: Nano, Super, Ultra; 1000 derivatives
• Libraries: NeMo-RL, Minitron, NAS
• Research: 200+ papers

NVIDIA Nemotron Models Available Today
Categories: Reasoning / Agentic, Vision Language, Information Retrieval, Content Safety
• Nemotron Nano 2: High-efficiency SLM
• Llama Nemotron Super: Optimal accuracy-efficiency LLM
• Llama Nemotron Ultra: Highest accuracy, powered by world knowledge
• Nemotron Nano 2 VL: Video understanding and image reasoning
• Nemotron Parse 1.1: Document extraction
• Nemotron Page Elements: Structured images detection & classification
• Nemotron Table Structure: Rows, columns detection & preservation in markdown
• Nemotron Graphic Elements: Chart component detection & classification
• Llama Nemotron Embed: Multilingual text question-answering retrieval
• Llama Nemotron Rerank: Multilingual fine-tuned reranking model
• Llama Nemotron Safety Guard: Multilingual content safety

Nemotron Datasets
Open, transparent training data for every stage of AI development
Categories: Foundations, Agentic AI, Sovereign AI, Alignment
• Nemotron-Pre v1: Code, STEM, safety, chat, multilingual
• Nemotron-ClimbLab: Optimized pre-training mix
• Nemotron-Post v1: Code, STEM, chat, reasoning
• Nemotron-VLM2: OCR training
• Nemotron-MIND: Structured mathematical dialogues
• OpenCodeReasoning: Synthetic reasoning for code
• OpenMathReasoning: Large-scale math reasoning
• OpenCodeInstruct: Largest open instruction tuning
• Granary: ASR speech data (25 languages)
• Nemotron-Personas: Synthesized region-specific demographics
• Nemotron-PII: Sensitive data mitigation
• Nemotron-Content-Safety 2: Human-LLM safe interactions
• Nemotron-AIQ-Agentic-Safety: Contextual safety risks
• Nemotron-Safety-Guard 3: Multilingual safety training
• HelpSteer 2: Alignment (helpful, factual, coherent)

Leverage Open Techniques to Build Custom, Efficient Models
Nemotron Advanced Using NVIDIA Techniques
Categories: Architecture, Accuracy, Efficiency, Recipes
• Hybrid Mamba-Transformer: Efficient architecture for best accuracy and throughput
• Mixture of Experts: Improve compute and latency
• NeMo RL: Align models with human preferences
• NeMo Data curation: Accelerates training and improves accuracy
• Minitron, Neural Arch Search: Right-size models for higher efficiency
• Thinking Budget: Minimize model overthinking
• Efficient Video Sampling: Process longer videos
• Quantization: Lowers inference cost while maintaining accuracy
• Technical Reports: Detailed model development recipes
• Starter Kits: Instructions to build custom agents with sample Jupyter Notebooks
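To make the quantization item concrete, here is a minimal sketch of generic symmetric int8 quantization in plain Python. The slides do not describe NVIDIA's actual quantization recipe, so this is only an illustration of the general idea; the function names and the sample weights are assumptions made for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


# Illustrative weights (not taken from any real model).
weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)      # q == [51, -127, 32, 0]
restored = dequantize(q, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_error, 5))
```

Each value is stored in one byte instead of the four bytes of a float32, which is where the lower inference cost comes from; the bounded rounding error is why accuracy can often be kept close to the original.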

From (foundation) models to agentic AI applications in production

How AI Agents Work
AI agents use tools and collaborate to complete complex projects
[Diagram labels: Thinking, Tool Use, Computer Use, Memory, I/O, Prompt, Other Agents]
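The interplay of thinking, tool use, and memory can be sketched as a small loop. This is a toy illustration, not how any particular framework implements agents: `stub_llm`, the `CALL`/`ANSWER` text protocol, and the `add` tool are all hypothetical stand-ins for a real model and real tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_llm(prompt: str, memory: List[str]) -> str:
    """Hypothetical model stand-in: call a tool first, then answer from memory."""
    if not memory:
        return "CALL add 2 3"
    return f"ANSWER {memory[-1]}"


@dataclass
class Agent:
    llm: Callable[[str, List[str]], str]
    tools: Dict[str, Callable[..., str]]
    memory: List[str] = field(default_factory=list)
    max_steps: int = 5

    def run(self, prompt: str) -> str:
        for _ in range(self.max_steps):
            decision = self.llm(prompt, self.memory)        # thinking
            kind, _, rest = decision.partition(" ")
            if kind == "CALL":
                name, *args = rest.split()
                result = self.tools[name](*args)             # tool use
                self.memory.append(f"{name}({', '.join(args)}) = {result}")  # memory
            else:
                return rest                                  # final output
        return "step limit reached"


agent = Agent(llm=stub_llm, tools={"add": lambda a, b: str(int(a) + int(b))})
answer = agent.run("What is 2 + 3?")
print(answer)  # add(2, 3) = 5
```

The step limit is a simple guard against a model that keeps calling tools without ever answering.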

Building Agentic Systems is Complex
Agentic applications should integrate with existing tools and heterogeneous data, and perform reliably
• Performance: System-level accelerations require knowledge of the entire system
• Architectural Complexity: Many tools exist, but don't work together
• Repeatability: Challenging to guarantee consistent results
• Code Reuse: Fragmented solutions result in duplicate work

Typical Agent Chatbot Architecture
Projects start small and increase in complexity over time
[Diagram components: FastAPI Server, Agent, Tool 1, Tool 2, LLM, Evaluation System, Profiler, Memory, Telemetry, Observability, Retriever, Config]

NeMo Agent Toolkit: OSS to enable rapid agentic development
An open-source Python library for scaling enterprise-ready agentic AI systems with evaluation-driven development
[Diagram components: Agents, Guardrails, Evaluation, NVIDIA NIM, Tools/Workflows, Observability, Profiling, Optimization]

NeMo Agent Toolkit Reduces Agentic System Complexity
Flexible integration, configuration, monitoring, and optimization
• Safety & Security: Function-level security controls for agent workflows*
• Observability Tracing: End-to-end tracing across frameworks and agent invocations
• Evaluation System: Customizable evaluation framework for agent workflow components
• Configuration System: YAML-based configuration for rapid agent deployment and prototyping
• Workflow Optimization: Agent Hyperparameter Optimizer, NVIDIA Dynamo integration for accelerated performance*
• Profiler: Performance analysis and runtime bottleneck identification
• Framework Agnostic: Multi-framework support for LangChain, LangGraph, CrewAI, Semantic Kernel, Google ADK
• Agentic Ecosystem Connectors: Easily connect with MCP tools, memory providers, custom plugins
* Coming in a future release
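The idea behind configuration-driven workflows can be illustrated with a small registry pattern. The slides do not show the toolkit's real YAML schema or Python API, so everything here is hypothetical: the `CONFIG` dict stands in for a parsed YAML file, and `register`, `build_workflow`, and the two components are names invented for this sketch.

```python
from typing import Any, Callable, Dict

# Maps a component type name to a factory that builds a text-processing step.
REGISTRY: Dict[str, Callable[..., Callable[[str], str]]] = {}


def register(name: str):
    """Decorator that makes a component factory available to configs by name."""
    def wrap(factory):
        REGISTRY[name] = factory
        return factory
    return wrap


@register("truncate")
def make_truncate(max_chars: int = 10):
    return lambda text: text[:max_chars]


@register("uppercase")
def make_uppercase():
    return lambda text: text.upper()


# Hypothetical config; in practice this structure would be loaded from a YAML file.
CONFIG: Dict[str, Any] = {
    "workflow": [
        {"type": "truncate", "max_chars": 5},
        {"type": "uppercase"},
    ]
}


def build_workflow(config: Dict[str, Any]) -> Callable[[str], str]:
    """Instantiate each configured step and chain them in order."""
    steps = []
    for spec in config["workflow"]:
        params = {k: v for k, v in spec.items() if k != "type"}
        steps.append(REGISTRY[spec["type"]](**params))

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run


workflow = build_workflow(CONFIG)
result = workflow("hello world")
print(result)  # HELLO
```

Because components are looked up by name, changing the workflow means editing the config rather than the code, which is the sense in which a component is written once and reused.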

Boost Agent Performance, Reduce Development Time by Adding NeMo Agent Toolkit to Your AI Agent
Example: Healthcare Virtual Assistant for patient Q&A + NeMo Agent Toolkit
Results: 1.7X Faster Response Times, 1.4X Higher Throughput, 57% Fewer Lines of Code
• YAML enables component reuse
• Write once, reuse anywhere
• Higher code quality
• Lower developer onboarding time
• Supports more concurrent users with stable response times
• Increases request processing speed
• Handles more requests in less time through efficient execution
• Smaller code base, shortens execution time

NVIDIA NeMo Developer Workflow
Build, Deploy and Optimize Specialized AI Agents
[Diagram stages: Define Use Case; Prepare, Generate Data; Select Model; Connect to Data; Build Agent; Guardrail; Optimize; Monitor; Deploy; Data Flywheel]
[Diagram tools: NeMo Curator, NeMo Data Designer, NVIDIA Nemotron, NeMo Retriever, NeMo Evaluator, NeMo Agent Toolkit, NVIDIA NIM, NeMo RL, NeMo Customizer, NeMo Guardrails]

AI-Q (Pronounced ‘IQ’) NVIDIA Deep Researcher
NVIDIA AI Blueprint Uses NeMo Agent Toolkit to Connect AI Agents, Data, and Tools
[Diagram: User or Machine Prompt; Report Agent; Reason (Llama Nemotron); Web Search (Tavily); Generate (Llama Nemotron); RAG with Embedding, Reranking, and Extraction (NeMo Retriever); Vector Database (NVIDIA cuVS); Enterprise Files; Reflect; Plan; Refine; Report Generation (Llama 3.3)]
• Open-source blueprint for research & reporting
• Provides observability and transparency
• Increases agent accuracy with reasoning
• Upload your own data sources
• Guide plan and report in real-time
• Change report output dynamically
• Synthesize hours of material in minutes
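The embedding and reranking stages of a RAG pipeline can be sketched as a two-step search: a cheap similarity lookup narrows the corpus, then a second scorer reorders the shortlist. This is a toy under stated assumptions, not the blueprint's implementation: `embed` (bag-of-words counts) and `rerank` (query-term coverage) are hypothetical stand-ins for the embedding and reranking models, and the documents are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k):
    """Stage 1: keep the k documents most similar to the query embedding."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def rerank(query, candidates):
    """Stage 2: toy reranker scoring the fraction of query terms each document covers."""
    terms = set(query.lower().split())

    def score(doc):
        return len(terms & set(doc.lower().split())) / len(terms)

    return sorted(candidates, key=score, reverse=True)


DOCS = [
    "gpu memory bandwidth limits inference throughput",
    "quarterly sales report for the retail team",
    "inference throughput improves a lot with careful batching on every gpu",
]
QUERY = "gpu inference throughput batching"

candidates = retrieve(QUERY, DOCS, k=2)   # unrelated sales report is filtered out
ranked = rerank(QUERY, candidates)        # full-coverage document moves to the top
print(ranked[0])
```

In this toy data the shorter document wins on cosine similarity, while the reranker prefers the document that covers every query term, which shows why the two stages can disagree and why both are used.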

Decoding Agent Performance

Profiling Functions
A generic method for monitoring and measuring functions
• Running alongside evaluation, the profiler can simulate large numbers of users with multiple requests
• The telemetry stream is collected under load to track each component as it runs in the workflow
• The profiler can perform many different types of analysis:
  • Token efficiency
  • Bottleneck analysis
  • Tool count
  • Prompt uniqueness
• Support for profile-guided optimization is coming in the future
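A rough sketch of how per-function telemetry can feed the analyses listed above (tool count, bottleneck analysis, prompt uniqueness). This is not the NeMo Agent Toolkit profiler API, which the slides do not show; the `Profiler` class, its methods, and the sample calls are assumptions made for illustration.

```python
import time
from collections import defaultdict
from functools import wraps


class Profiler:
    """Collects (function name, seconds, prompt) events and summarizes them."""

    def __init__(self):
        self.events = []

    def record(self, name, seconds, prompt=None):
        self.events.append((name, seconds, prompt))

    def track(self, fn):
        """Decorator that times each call of a single-argument function."""
        @wraps(fn)
        def wrapper(prompt):
            start = time.perf_counter()
            try:
                return fn(prompt)
            finally:
                self.record(fn.__name__, time.perf_counter() - start, prompt)
        return wrapper

    def call_counts(self):
        """Tool count: how often each function was invoked."""
        counts = defaultdict(int)
        for name, _, _ in self.events:
            counts[name] += 1
        return dict(counts)

    def bottleneck(self):
        """Bottleneck analysis: the function with the largest total time."""
        totals = defaultdict(float)
        for name, seconds, _ in self.events:
            totals[name] += seconds
        return max(totals, key=totals.get) if totals else None

    def prompt_uniqueness(self):
        """Share of distinct prompts; low values suggest caching opportunities."""
        prompts = [p for _, _, p in self.events if p is not None]
        return len(set(prompts)) / len(prompts) if prompts else 1.0


profiler = Profiler()


@profiler.track
def fetch_source(prompt):
    return f"source for {prompt}"


fetch_source("pkg-a")
fetch_source("pkg-a")
# Manually recorded event standing in for a slow model call.
profiler.record("llm_call", 2.0, "summarize pkg-a")

print(profiler.call_counts(), profiler.bottleneck())
```

Recording events through one `record` method keeps the collector independent of how the measured function was implemented, which mirrors the framework-agnostic goal stated in the slides.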

End-to-End Agentic Profiler and Optimizer
Identify bottlenecks and improve the performance of agentic workflows

Visualizing Other Behavior
NeMo Agent Toolkit Profiler also collects fine-grained telemetry per function, tool, and LLM, regardless of framework
Visualized are just some examples of workflow behavior extracted from Profiler output.
• Certain CVEs can be quite large and contain a lot of information, increasing the prompt length
• It appears the tool to retrieve source code was called the most (including in retries) as the agent attempted to confirm the presence of vulnerabilities

Q & A