

LLM Telemetry & Evals as First Class Rails Concerns - BlueRidgeRuby 2026

Your LLM calls deserve the same observability and tests as your database queries. This talk shows how to build the full Ruby stack — telemetry, evals, and a feedback loop — so you can review prompts, track costs, score quality with custom rubrics, and turn your worst production responses into your best test cases. The running example: Tex Support, an AI Texas concierge that judges brisket recipes with statistical rigor.


David Paluy

May 01, 2026



Transcript

  1. LLM Telemetry & Evals as First-Class Rails Concerns

    BlueRidgeRuby 2026 · Asheville, NC
    David Paluy · @dpaluy
  2. Problems When Shipping LLM Features

    1. How do you know the answers are good?
    2. How do you see what’s happening in production?
    3. How do you keep improving?
  3. About Me

    David Paluy · @dpaluy
    Doing Rails since version 1.8
    Did AI since it was called Machine Learning
    Austin, TX 🤠
    Previously: CTO at Suppli (fintech/construction)
    Today: Agents Maestro
  4. What's your go-to LLM provider?

    A. OpenAI (GPT) 0% 0
    B. Anthropic (Claude) 0% 0
    C. Open Source (Llama, Mistral) 0% 0
    D. I don't use LLMs yet 0% 0

    tex-support.majesticlabs.dev/quiz.html
  5. The Idea: Tex Support

    Your AI-powered Texas concierge, built with Rails + RubyLLM.
    Ask it anything about Texas: culture, BBQ, etiquette, survival tips.
    One interface. Any provider. Pure Ruby.
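    To make "one interface, any provider" concrete, here is a minimal RubyLLM sketch (not from the deck; the environment variables, model string, and question are placeholders):

    require "ruby_llm"

    RubyLLM.configure do |config|
      config.openai_api_key    = ENV["OPENAI_API_KEY"]
      config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
    end

    # Swap the model string to change providers; the calling code stays the same.
    chat = RubyLLM.chat(model: "gpt-4.1")
    response = chat.ask("How do I make the best brisket at home?")
    puts response.content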
  6. Tex Support Chat

    Questions people actually ask:
    "My guest ordered a well-done ribeye with ketchup at a TX restaurant. What should I do?"
    "How do I make the best brisket at home?"
    "What’s the proper response when someone says 'bless your heart'?"
  7. The Model Decision

    We want to test three categories:

    Model Type | Example | Why
    Open Source | Llama 3, Kimi 2.5 | Cost control, self-hosting
    Proprietary | GPT-4.1, Claude | Best quality (maybe?)
    Fine-tuned | Our custom model | Texas-specific knowledge

    Using RubyLLM — one interface, any provider:

    RubyLLM.chat(model: "gpt-4.1")
    RubyLLM.chat(model: "llama-3-70b")
    RubyLLM.chat(model: "tex-support-ft-v1")
  8. "It Works" Is Not a Test Product says: "It works."

    QA says: "Looks fine." Your guest says: "The ketchup answer was helpful." But LLM output is non-deterministic. You wouldn’t deploy an application without tests. Why deploy a prompt without evals?
  9. How do you evaluate LLM output today?

    A. Manual review / vibes 0% 0
    B. Ruby tools 0% 0
    C. Python tools (DeepEval) 0% 0
    D. We don't. Yet. 0% 0

    tex-support.majesticlabs.dev/quiz.html
  10. The Eval Landscape

    | leva | eval-ruby | RubricLLM
    What it is | Rails engine with UI | Generic RAG metrics | Lightweight eval framework
    LLM access | You implement it | Raw HTTP (OpenAI/Anthropic) | RubyLLM (any provider)
    Rails required? | Yes | No | No
    Custom metrics | Yes | No (fixed set) | Yes

    The problem: Ruby’s eval ecosystem is young — but growing.
  11. A/B comparison | No | Basic | Paired t-test with p-values
    Test assertions | No | Minitest + RSpec | Minitest + RSpec

    Three tools, three philosophies. All Ruby, all open source.

    Enter RubricLLM. 6 built-in metrics: Faithfulness · Relevance · Correctness · Context Precision · Context Recall · Factual Accuracy

    RubricLLM.configure do |c|
      c.judge_model = "gpt-5.5"
      c.judge_provider = :openai
    end

    result = RubricLLM.evaluate(
      question: "Best way to cook brisket?",
      answer: tex_support_response,
      context: texas_bbq_docs,
      ground_truth: "Low and slow, post oak, 12-14 hrs"
    )

    result.correctness # => 0.95
    result.relevance   # => 0.92
    result.pass?       # => true
  12. The Cat Guarding the Milk

    Letting an LLM judge another LLM without rules is like asking a cat to guard the milk. It will bend criteria to make the answer pass, rationalize mediocre output as "good enough", and grade on a curve it invented just now.
    That’s why RubricLLM uses rubrics. The judge doesn’t decide what "good" means. You do.
    Warning: LLM-as-Judge without specific rules is dangerous.
  13. RubricLLM Metrics

    LLM-as-Judge — uses a judge LLM to score 0.0–1.0:
    Faithfulness — claims supported by context?
    Relevance — actually answers the question?
    Correctness — matches ground truth?
    Context Precision — retrieved chunks relevant?
    Context Recall — context covers ground truth?
    Factual Accuracy — no factual discrepancies?

    Retrieval Metrics — no LLM calls needed, pure math:
    Precision@K — relevant results in top K
    Recall@K — coverage of relevant docs
    MRR — ranking of first relevant result
    NDCG — position-weighted relevance
    Hit Rate — any relevant docs found?
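    The retrieval metrics really are plain arithmetic over ranked document IDs. An illustrative sketch (hypothetical document IDs, not RubricLLM's internal code):

    relevant_ids  = %w[doc_brisket doc_post_oak]                        # what should be retrieved
    retrieved_ids = %w[doc_post_oak doc_ketchup doc_brisket doc_tacos]  # ranked retriever output

    k = 3
    top_k = retrieved_ids.first(k)
    precision_at_k = top_k.count { |id| relevant_ids.include?(id) } / k.to_f  # => ~0.67
    recall_at_k    = (top_k & relevant_ids).size / relevant_ids.size.to_f     # => 1.0

    # MRR: reciprocal rank of the first relevant hit (1-indexed), 0.0 if none found
    first_hit = retrieved_ids.index { |id| relevant_ids.include?(id) }
    mrr = first_hit ? 1.0 / (first_hit + 1) : 0.0                             # => 1.0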
  14. Evaluating Tex Support

    Test case: "Best way to smoke a brisket?"

    Metric | GPT-4.1 | Llama 3 | tex-ft-v1
    Faithfulness | 0.92 | 0.78 | 0.97
    Relevance | 0.95 | 0.85 | 0.98
    Correctness | 0.88 | 0.65 | 0.99
    Overall | 0.91 | 0.76 | 0.98

    Paired t-tests. Not vibes — statistics.

    comparison = RubricLLM.compare(gpt41_report, llama_report)
    # faithfulness  0.92  0.78  -0.14  p=0.003 **
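    For readers who want the statistics behind that p-value: a paired t-test scores the same test cases under two models and asks whether the mean per-case difference is distinguishable from zero. An illustrative computation with made-up per-case scores (per the deck, RubricLLM.compare handles this for you):

    scores_a = [0.92, 0.88, 0.95, 0.90, 0.86]   # hypothetical per-case scores, model A
    scores_b = [0.78, 0.80, 0.82, 0.74, 0.79]   # hypothetical per-case scores, model B

    diffs = scores_a.zip(scores_b).map { |a, b| a - b }
    n     = diffs.size
    mean  = diffs.sum / n.to_f
    var   = diffs.sum { |d| (d - mean)**2 } / (n - 1).to_f  # sample variance of the differences
    t     = mean / Math.sqrt(var / n)                       # t-statistic, n - 1 degrees of freedom

    # Compare t against the t-distribution with n - 1 degrees of freedom
    # (or let the library do it) to get the reported p-value.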
  15. Custom Metrics — The Pitmaster Council

    5 custom metrics: TexasCrutch · Patience · SauceHeresy · WoodSnob · RubPhilosophy

    class SauceHeresyMetric < RubricLLM::Metrics::Base
      SYSTEM_PROMPT = <<~PROMPT
        You are a Central Texas BBQ traditionalist.
        If brisket needs sauce, the pitmaster has failed.
        Score 1.0 (no sauce) to 0.0 (sauce-drenched).
      PROMPT

      def call(answer:, **)
        normalize judge_eval(
          system_prompt: SYSTEM_PROMPT,
          user_prompt: "Recipe:\n#{answer}"
        )
      end
    end
  16. The Brisket Trial

    The Abomination Recipe: 3-hour oven + Sweet Baby Ray’s. Score: 4%. Verdict: BANISHED FROM TEXAS. "This is not brisket, this is beef torture."

    The Righteous Recipe: Post oak, 14 hours, salt & pepper. Score: 100%. Verdict: TEXAS APPROVED. "Pink butcher paper is the only civilized way to handle the stall."

    The audience laughs — but the API is real. Swap brisket for your production domain.
  17. Evals in CI

    Minitest:
    include RubricLLM::Assertions
    assert_faithful answer, texas_docs, threshold: 0.8
    assert_relevant question, answer
    refute_hallucination answer, texas_docs

    RSpec:
    include RubricLLM::RSpecMatchers
    expect(answer)
      .to be_faithful_to(texas_docs)
      .with_threshold(0.8)
    expect(answer)
      .not_to hallucinate_from(texas_docs)

    report = RubricLLM.evaluate_batch(dataset, concurrency: 4)
    report.worst(3) # find weakest outputs

    comparison = RubricLLM.compare(report_a, report_b)
    # faithfulness  0.880  0.920  +0.040  p=0.023 *
  18. Evals: Solved ✅

    ✅ Evaluate answers across models
    ✅ Custom metrics encode our domain’s quality bar
    ✅ Evals run in CI — regressions caught before deploy
    ✅ A/B comparisons with statistical rigor

    But wait… Users are using the service. And we have no idea what’s happening.
  19. The Blind Spot

    Tex Support is live. Three things can go wrong:

    1. Silent Regression. A model update changes brisket answers from "low and slow" to "air fryer works too." Nobody notices for two weeks.
    2. Cost Explosion. One user asks 400 questions in a day. We find out at the end of the month: $1,500.
    3. The Ketchup Incident. A user reports: "Tex Support told me ketchup on steak is fine in Austin." We can’t trace which model, prompt, or version caused it.
  20. Logger Soup vs. Real Telemetry

    What most teams do (can’t search, can’t aggregate, can’t alert):

    Rails.logger.info("Calling OpenAI...")
    result = client.chat(prompt: user_input)
    Rails.logger.info("Got: #{result[0..50]}")
    Rails.logger.info("Tokens: #{result.usage}")

    What we need:
    Structured events with timing + payload
    Cost tracking per interaction
    Privacy-first: no raw prompts by default
    A review workflow for quality assurance
  21. Solution 1: TraceBook

    Rails engine for LLM telemetry and review · github.com/dpaluy/tracebook

    Privacy-First: PII redacted, optional encryption at rest
    Cost Tracking: per-interaction, (input_tokens × price) + (output_tokens × price)
    Review Workflow: Approve / Flag / Reject with audit trail
    Analytics: daily rollups, dashboards, CSV/NDJSON export

    bundle add tracebook
    rails generate tracebook:install
    rails db:migrate
    TraceBook::Adapters::RubyLLM.enable!
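    The cost formula above is easy to sanity-check by hand. An illustrative calculation (the per-million-token prices are placeholders, not any provider's current rates):

    INPUT_PRICE_PER_MTOK  = 2.50    # USD per 1M input tokens (hypothetical)
    OUTPUT_PRICE_PER_MTOK = 10.00   # USD per 1M output tokens (hypothetical)

    def interaction_cost(input_tokens:, output_tokens:)
      (input_tokens / 1_000_000.0) * INPUT_PRICE_PER_MTOK +
        (output_tokens / 1_000_000.0) * OUTPUT_PRICE_PER_MTOK
    end

    interaction_cost(input_tokens: 1_200, output_tokens: 450)
    # => 0.0075 (well under a cent for this single call)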
  22. PII: Content Capture is Policy

    Users ask Tex Support personal questions: "My wife and I are moving to city. She’s pregnant. Best hospital near address?"

    Default (Safe): prompt fingerprint (SHA256), token counts only, no raw content stored
    Opt-in (Debug): truncated previews, custom redactors (lambdas), per-environment toggle

    "Input messages likely contain sensitive/PII data" — OpenTelemetry GenAI Conventions
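    A "custom redactor" can be as small as a lambda that strips obvious PII patterns before anything is persisted. A generic illustration (not TraceBook's actual configuration API):

    pii_redactor = ->(text) do
      text
        .gsub(/[\w.+-]+@[\w-]+\.[\w.]+/, "[EMAIL]")
        .gsub(/\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/, "[PHONE]")
    end

    pii_redactor.call("Reach me at jane@example.com or 512-555-0134")
    # => "Reach me at [EMAIL] or [PHONE]"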
  23. Solution 2: OTel Instrumentation

    opentelemetry-instrumentation-ruby_llm (by Thoughtbot)
    Captures automatically: Chat completions · Tool calls · Token counts · Nested traces
    Exports to: Langfuse · Arize · LangSmith · Datadog — any OTel backend

    # Gemfile
    gem "opentelemetry-sdk"
    gem "opentelemetry-exporter-otlp"
    gem "opentelemetry-instrumentation-ruby_llm"

    # config/opentelemetry.rb
    OpenTelemetry::SDK.configure do |c|
      c.use "OpenTelemetry::Instrumentation::RubyLLM"
    end
  24. Two Tools, One Stack

    Layer | Tool | What it does
    LLM Interface | RubyLLM | Unified API — swap models without code changes
    Telemetry + Review | TraceBook | Rails-native storage, cost tracking, review UI, PII redaction
    Export + APM | OTel (Thoughtbot) | Vendor-agnostic traces to Langfuse/Datadog/Arize
    Evaluation | RubricLLM | Quality scoring, custom metrics, CI integration

    Together: you know what your LLM did, what it cost, and whether it was any good. Not competing — composing.
  25. The Full Architecture

    User Question
      ↓
    Rails Controller → RubyLLM (any model)
      ↓
    ActiveSupport::Notifications event fired
      ↓
    ┌─────────────────┬─────────────────┐
    │ TraceBook       │ OTel Gem        │
    │ - Redact PII    │ - Trace spans   │
    │ - Calc cost     │ - Export OTLP   │
    │ - Persist async │ - DataDog       │
    │ - Review UI     │                 │
    └─────────────────┴─────────────────┘
      ↓
    RubricLLM (in CI or batch)
    - Score quality
    - A/B model comparison
    - Custom metrics
    - Test assertions
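    The ActiveSupport::Notifications step in the diagram is standard Rails plumbing. A sketch of what that can look like (the "llm.chat" event name, payload keys, and token accessors are assumptions for illustration, not a published contract):

    ActiveSupport::Notifications.subscribe("llm.chat") do |event|
      # Structured, searchable fields instead of logger soup
      Rails.logger.info({
        model:         event.payload[:model],
        input_tokens:  event.payload[:input_tokens],
        output_tokens: event.payload[:output_tokens],
        duration_ms:   event.duration.round(1)
      }.to_json)
    end

    # Emitting side, wrapped around the RubyLLM call:
    ActiveSupport::Notifications.instrument("llm.chat", model: "gpt-4.1") do |payload|
      response = chat.ask(user_question)
      payload[:input_tokens]  = response.input_tokens
      payload[:output_tokens] = response.output_tokens
      response
    end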
  26. The Feedback Loop

    The loop that makes Tex Support better over time:

    1. TraceBook captures a bad interaction → "Tex Support said ketchup on steak is fine"
    2. Flag it in the review dashboard
    3. Export to your eval dataset
    4. RubricLLM tests prompt changes against this real case
    5. New prompt passes → ship it

    Your worst production responses become your best test cases.
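    A hedged sketch of step 4, reusing the RSpec matchers from the CI slide (the spec path, fixture file, and tex_support_answer helper are placeholders for your own app code):

    # spec/evals/ketchup_incident_spec.rb
    RSpec.describe "Ketchup incident regression" do
      include RubricLLM::RSpecMatchers

      let(:question)   { "Is ketchup on steak fine in Austin?" }
      let(:texas_docs) { File.read("spec/fixtures/texas_etiquette.txt") }
      let(:answer)     { tex_support_answer(question) }   # hypothetical app helper

      it "stays faithful to the etiquette docs" do
        expect(answer).to be_faithful_to(texas_docs).with_threshold(0.8)
        expect(answer).not_to hallucinate_from(texas_docs)
      end
    end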
  27. The Ribeye Question — Revisited

    User: "My guest ordered a well-done ribeye with ketchup. What should I do?"

    GPT-4.1: "Respect their choices. Everyone has different preferences…"
    Relevance: 0.90 · TexasSpirit: 0.35 FAIL ❌

    tex-ft-v1: "First: breathe. This is a character test, not a steak test. In Texas, we don’t judge people by their steak preferences — we judge them loudly and then buy them a proper brisket to make up for it."
    Relevance: 0.95 · TexasSpirit: 0.98 PASS ✅

    Custom metrics tell you what generic ones can’t.
  28. Demo: Tex Support in Action

    1. Ask Tex Support a question → RubyLLM processes it
    2. TraceBook captures the interaction → cost, tokens, latency
    3. Run RubricLLM eval → quality scores
    4. Flag a bad response → it becomes a test case
  29. What To Do Monday

    # | Action | Effort
    1 | bundle add tracebook | 5 min
    2 | Enable the RubyLLM adapter | 1 line
    3 | Set content capture policy | Default: safe
    4 | Review 10 interactions this week | 30 min
    5 | bundle add rubric_llm — write your first eval | 15 min

    Steps 1–3 take 15 minutes. Step 5 is how you build confidence.
  30. The Tex Support Stack

    Gem | Role | One-liner
    RubyLLM | Interface | One API, any provider
    TraceBook | Telemetry | What happened, what it cost, who saw it
    OTel Gem | Export | Vendor-agnostic traces to your APM
    RubricLLM | Evals | Is it any good? Prove it.

    All Ruby. All open source. All composable.
  31. Resources

    TraceBook — github.com/dpaluy/tracebook
    RubricLLM — github.com/dpaluy/rubric_llm
    eval-ruby — github.com/johannesdwicahyo/eval-ruby
    leva — github.com/kieranklaassen/leva
    Thoughtbot OTel Blog — thoughtbot.com/blog/observability-for-your-llm-powered-apps
    RubyLLM — rubyllm.com
    OTel GenAI Conventions — opentelemetry.io/docs/specs/semconv/gen-ai/
  32. Bonus: The Actual Brisket Recipe - The Texas Way

    The Coffee Rub (12-15 lb packer):
    Kosher salt 2 tbsp · Instant coffee / espresso grind 2 tbsp · Garlic powder 2 tbsp · Smoked paprika 2 tbsp · Coarse black pepper 1 tbsp · Crushed coriander seed 1 tbsp · Onion powder 1 tbsp · Chili powder 1 tsp · Cayenne (optional) 1/2 tsp

    The Prep — Trim fat cap to 1/4". Season generously. Dry-brine uncovered in fridge 12-24 hrs. Pull out 30-60 min before cook.
    The Cook — 235-250°F. Post oak. Fat side up.
    The Rest — Wrapped in cooler at ~150°F. 2-3 hours ideal.
    The Slice — Separate point and flat, against the grain, pencil-thick.
    The Serve — Saltines, pickles, onions. No sauce needed.