

LLM Telemetry & Evals as First Class Rails Concerns - BlueRidgeRuby 2026

Your LLM calls deserve the same observability and tests as your database queries. This talk shows how to build the full Ruby stack — telemetry, evals, and a feedback loop — so you can review prompts, track costs, score quality with custom rubrics, and turn your worst production responses into your best test cases. The running example: Tex Support, an AI Texas concierge that judges brisket recipes with statistical rigor.


David Paluy

May 01, 2026



Transcript

  1. LLM Telemetry & Evals as First-Class Rails Concerns

    BlueRidgeRuby 2026 · Asheville, NC
    David Paluy · @dpaluy
  2. Problems When Shipping LLM Features

    1. How do you know the answers are good?
    2. How do you see what’s happening in production?
    3. How do you keep improving?
  3. About Me

    David Paluy · @dpaluy
    Doing Rails since version 1.8
    Did AI since it was called Machine Learning
    Austin, TX 🤠
    Previously: CTO at Suppli (fintech/construction)
    Today: Agents Maestro
  4. What's your go-to LLM provider?

    A. OpenAI (GPT) 0% 0
    B. Anthropic (Claude) 0% 0
    C. Open Source (Llama, Mistral) 0% 0
    D. I don't use LLMs yet 0% 0

    tex-support.majesticlabs.dev/quiz.html
  5. The Idea: Tex Support

    Your AI-powered Texas concierge, built with Rails + RubyLLM.
    Ask it anything about Texas: culture, BBQ, etiquette, survival tips.
    One interface. Any provider. Pure Ruby.
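    To make "one interface, any provider" concrete, here is a minimal RubyLLM sketch (not from the deck; the environment variables, model string, and question are placeholders):

    require "ruby_llm"

    RubyLLM.configure do |config|
      config.openai_api_key    = ENV["OPENAI_API_KEY"]
      config.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
    end

    # Swap the model string to change providers; the calling code stays the same.
    chat = RubyLLM.chat(model: "gpt-4.1")
    response = chat.ask("How do I make the best brisket at home?")
    puts response.content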
  6. Tex Support Chat

    Questions people actually ask:
    "My guest ordered a well-done ribeye with ketchup at a TX restaurant. What should I do?"
    "How do I make the best brisket at home?"
    "What’s the proper response when someone says 'bless your heart'?"
  7. The Model Decision

    We want to test three categories:

    Model Type | Example | Why
    Open Source | Llama 3, Kimi 2.5 | Cost control, self-hosting
    Proprietary | GPT-4.1, Claude | Best quality (maybe?)
    Fine-tuned | Our custom model | Texas-specific knowledge

    Using RubyLLM — one interface, any provider:

    RubyLLM.chat(model: "gpt-4.1")
    RubyLLM.chat(model: "llama-3-70b")
    RubyLLM.chat(model: "tex-support-ft-v1")
  8. "It Works" Is Not a Test Product says: "It works."

    QA says: "Looks fine." Your guest says: "The ketchup answer was helpful." But LLM output is non-deterministic. You wouldn’t deploy an application without tests. Why deploy a prompt without evals?
  9. How do you evaluate LLM output today?

    A. Manual review / vibes 0% 0
    B. Ruby tools 0% 0
    C. Python tools (DeepEval) 0% 0
    D. We don't. Yet. 0% 0

    tex-support.majesticlabs.dev/quiz.html
  10. The Eval Landscape

    | leva | eval-ruby | RubricLLM
    What it is | Rails engine with UI | Generic RAG metrics | Lightweight eval framework
    LLM access | You implement it | Raw HTTP (OpenAI/Anthropic) | RubyLLM (any provider)
    Rails required? | Yes | No | No
    Custom metrics | Yes | No (fixed set) | Yes

    The problem: Ruby’s eval ecosystem is young — but growing.
  11. A/B comparison | No | Basic | Paired t-test with p-values
    Test assertions | No | Minitest + RSpec | Minitest + RSpec

    Three tools, three philosophies. All Ruby, all open source.

    Enter RubricLLM. 6 built-in metrics: Faithfulness · Relevance · Correctness · Context Precision · Context Recall · Factual Accuracy

    RubricLLM.configure do |c|
      c.judge_model = "gpt-5.5"
      c.judge_provider = :openai
    end

    result = RubricLLM.evaluate(
      question: "Best way to cook brisket?",
      answer: tex_support_response,
      context: texas_bbq_docs,
      ground_truth: "Low and slow, post oak, 12-14 hrs"
    )

    result.correctness # => 0.95
    result.relevance   # => 0.92
    result.pass?       # => true
  12. The Cat Guarding the Milk

    Letting an LLM judge another LLM without rules is like asking a cat to guard the milk. It will bend criteria to make the answer pass, rationalize mediocre output as "good enough", and grade on a curve it invented just now.
    That’s why RubricLLM uses rubrics. The judge doesn’t decide what "good" means. You do.
    Warning: LLM-as-Judge without specific rules is dangerous.
  13. RubricLLM Metrics

    LLM-as-Judge — uses a judge LLM to score 0.0–1.0:
    Faithfulness — claims supported by context?
    Relevance — actually answers the question?
    Correctness — matches ground truth?
    Context Precision — retrieved chunks relevant?
    Context Recall — context covers ground truth?
    Factual Accuracy — no factual discrepancies?

    Retrieval Metrics — no LLM calls needed, pure math:
    Precision@K — relevant results in top K
    Recall@K — coverage of relevant docs
    MRR — ranking of first relevant result
    NDCG — position-weighted relevance
    Hit Rate — any relevant docs found?
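    The retrieval metrics really are plain arithmetic over ranked document IDs. An illustrative sketch (hypothetical document IDs, not RubricLLM's internal code):

    relevant_ids  = %w[doc_brisket doc_post_oak]                        # what should be retrieved
    retrieved_ids = %w[doc_post_oak doc_ketchup doc_brisket doc_tacos]  # ranked retriever output

    k = 3
    top_k = retrieved_ids.first(k)
    precision_at_k = top_k.count { |id| relevant_ids.include?(id) } / k.to_f  # => ~0.67
    recall_at_k    = (top_k & relevant_ids).size / relevant_ids.size.to_f     # => 1.0

    # MRR: reciprocal rank of the first relevant hit (1-indexed), 0.0 if none found
    first_hit = retrieved_ids.index { |id| relevant_ids.include?(id) }
    mrr = first_hit ? 1.0 / (first_hit + 1) : 0.0                             # => 1.0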
  14. Evaluating Tex Support

    Test case: "Best way to smoke a brisket?"

    Metric | GPT-4.1 | Llama 3 | tex-ft-v1
    Faithfulness | 0.92 | 0.78 | 0.97
    Relevance | 0.95 | 0.85 | 0.98
    Correctness | 0.88 | 0.65 | 0.99
    Overall | 0.91 | 0.76 | 0.98

    Paired t-tests. Not vibes — statistics.

    comparison = RubricLLM.compare(gpt41_report, llama_report)
    # faithfulness  0.92  0.78  -0.14  p=0.003 **
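    For readers who want the statistics behind that p-value: a paired t-test scores the same test cases under two models and asks whether the mean per-case difference is distinguishable from zero. An illustrative computation with made-up per-case scores (per the deck, RubricLLM.compare handles this for you):

    scores_a = [0.92, 0.88, 0.95, 0.90, 0.86]   # hypothetical per-case scores, model A
    scores_b = [0.78, 0.80, 0.82, 0.74, 0.79]   # hypothetical per-case scores, model B

    diffs = scores_a.zip(scores_b).map { |a, b| a - b }
    n     = diffs.size
    mean  = diffs.sum / n.to_f
    var   = diffs.sum { |d| (d - mean)**2 } / (n - 1).to_f  # sample variance of the differences
    t     = mean / Math.sqrt(var / n)                       # t-statistic, n - 1 degrees of freedom

    # Compare t against the t-distribution with n - 1 degrees of freedom
    # (or let the library do it) to get the reported p-value.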
  15. Custom Metrics — The Pitmaster Council

    5 custom metrics: TexasCrutch · Patience · SauceHeresy · WoodSnob · RubPhilosophy

    class SauceHeresyMetric < RubricLLM::Metrics::Base
      SYSTEM_PROMPT = <<~PROMPT
        You are a Central Texas BBQ traditionalist.
        If brisket needs sauce, the pitmaster has failed.
        Score 1.0 (no sauce) to 0.0 (sauce-drenched).
      PROMPT

      def call(answer:, **)
        normalize judge_eval(
          system_prompt: SYSTEM_PROMPT,
          user_prompt: "Recipe:\n#{answer}"
        )
      end
    end
  16. The Brisket Trial

    The Abomination Recipe: 3-hour oven + Sweet Baby Ray’s. Score: 4%. Verdict: BANISHED FROM TEXAS. "This is not brisket, this is beef torture."

    The Righteous Recipe: Post oak, 14 hours, salt & pepper. Score: 100%. Verdict: TEXAS APPROVED. "Pink butcher paper is the only civilized way to handle the stall."

    The audience laughs — but the API is real. Swap brisket for your production domain.
  17. Evals in CI

    Minitest:
    include RubricLLM::Assertions
    assert_faithful answer, texas_docs, threshold: 0.8
    assert_relevant question, answer
    refute_hallucination answer, texas_docs

    RSpec:
    include RubricLLM::RSpecMatchers
    expect(answer)
      .to be_faithful_to(texas_docs)
      .with_threshold(0.8)
    expect(answer)
      .not_to hallucinate_from(texas_docs)

    report = RubricLLM.evaluate_batch(dataset, concurrency: 4)
    report.worst(3) # find weakest outputs

    comparison = RubricLLM.compare(report_a, report_b)
    # faithfulness  0.880  0.920  +0.040  p=0.023 *
  18. Evals: Solved ✅

    ✅ Evaluate answers across models
    ✅ Custom metrics encode our domain’s quality bar
    ✅ Evals run in CI — regressions caught before deploy
    ✅ A/B comparisons with statistical rigor

    But wait… Users are using the service. And we have no idea what’s happening.
  19. The Blind Spot

    Tex Support is live. Three things can go wrong:

    1. Silent Regression. A model update changes brisket answers from "low and slow" to "air fryer works too." Nobody notices for two weeks.
    2. Cost Explosion. One user asks 400 questions in a day. We find out at the end of the month: $1,500.
    3. The Ketchup Incident. A user reports: "Tex Support told me ketchup on steak is fine in Austin." We can’t trace which model, prompt, or version caused it.
  20. Logger Soup vs. Real Telemetry

    What most teams do (can’t search, can’t aggregate, can’t alert):

    Rails.logger.info("Calling OpenAI...")
    result = client.chat(prompt: user_input)
    Rails.logger.info("Got: #{result[0..50]}")
    Rails.logger.info("Tokens: #{result.usage}")

    What we need:
    Structured events with timing + payload
    Cost tracking per interaction
    Privacy-first: no raw prompts by default
    A review workflow for quality assurance
  21. Solution 1: TraceBook

    Rails engine for LLM telemetry and review · github.com/dpaluy/tracebook

    Privacy-First: PII redacted, optional encryption at rest
    Cost Tracking: per-interaction, (input_tokens × price) + (output_tokens × price)
    Review Workflow: Approve / Flag / Reject with audit trail
    Analytics: daily rollups, dashboards, CSV/NDJSON export

    bundle add tracebook
    rails generate tracebook:install
    rails db:migrate
    TraceBook::Adapters::RubyLLM.enable!
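    The cost formula above is easy to sanity-check by hand. An illustrative calculation (the per-million-token prices are placeholders, not any provider's current rates):

    INPUT_PRICE_PER_MTOK  = 2.50    # USD per 1M input tokens (hypothetical)
    OUTPUT_PRICE_PER_MTOK = 10.00   # USD per 1M output tokens (hypothetical)

    def interaction_cost(input_tokens:, output_tokens:)
      (input_tokens / 1_000_000.0) * INPUT_PRICE_PER_MTOK +
        (output_tokens / 1_000_000.0) * OUTPUT_PRICE_PER_MTOK
    end

    interaction_cost(input_tokens: 1_200, output_tokens: 450)
    # => 0.0075 (well under a cent for this single call)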
  22. PII: Content Capture is Policy

    Users ask Tex Support personal questions: "My wife and I are moving to city. She’s pregnant. Best hospital near address?"

    Default (Safe): prompt fingerprint (SHA256), token counts only, no raw content stored
    Opt-in (Debug): truncated previews, custom redactors (lambdas), per-environment toggle

    "Input messages likely contain sensitive/PII data" — OpenTelemetry GenAI Conventions
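    A "custom redactor" can be as small as a lambda that strips obvious PII patterns before anything is persisted. A generic illustration (not TraceBook's actual configuration API):

    pii_redactor = ->(text) do
      text
        .gsub(/[\w.+-]+@[\w-]+\.[\w.]+/, "[EMAIL]")
        .gsub(/\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/, "[PHONE]")
    end

    pii_redactor.call("Reach me at jane@example.com or 512-555-0134")
    # => "Reach me at [EMAIL] or [PHONE]"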
  23. Solution 2: OTel Instrumentation

    opentelemetry-instrumentation-ruby_llm (by Thoughtbot)
    Captures automatically: Chat completions · Tool calls · Token counts · Nested traces
    Exports to: Langfuse · Arize · LangSmith · Datadog — any OTel backend

    # Gemfile
    gem "opentelemetry-sdk"
    gem "opentelemetry-exporter-otlp"
    gem "opentelemetry-instrumentation-ruby_llm"

    # config/opentelemetry.rb
    OpenTelemetry::SDK.configure do |c|
      c.use "OpenTelemetry::Instrumentation::RubyLLM"
    end
  24. Two Tools, One Stack

    Layer | Tool | What it does
    LLM Interface | RubyLLM | Unified API — swap models without code changes
    Telemetry + Review | TraceBook | Rails-native storage, cost tracking, review UI, PII redaction
    Export + APM | OTel (Thoughtbot) | Vendor-agnostic traces to Langfuse/Datadog/Arize
    Evaluation | RubricLLM | Quality scoring, custom metrics, CI integration

    Together: you know what your LLM did, what it cost, and whether it was any good. Not competing — composing.
  25. The Full Architecture

    User Question
      ↓
    Rails Controller → RubyLLM (any model)
      ↓
    ActiveSupport::Notifications event fired
      ↓
    ┌─────────────────┬─────────────────┐
    │ TraceBook       │ OTel Gem        │
    │ - Redact PII    │ - Trace spans   │
    │ - Calc cost     │ - Export OTLP   │
    │ - Persist async │ - DataDog       │
    │ - Review UI     │                 │
    └─────────────────┴─────────────────┘
      ↓
    RubricLLM (in CI or batch)
    - Score quality
    - A/B model comparison
    - Custom metrics
    - Test assertions
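    The ActiveSupport::Notifications step in the diagram is standard Rails plumbing. A sketch of what that can look like (the "llm.chat" event name, payload keys, and token accessors are assumptions for illustration, not a published contract):

    ActiveSupport::Notifications.subscribe("llm.chat") do |event|
      # Structured, searchable fields instead of logger soup
      Rails.logger.info({
        model:         event.payload[:model],
        input_tokens:  event.payload[:input_tokens],
        output_tokens: event.payload[:output_tokens],
        duration_ms:   event.duration.round(1)
      }.to_json)
    end

    # Emitting side, wrapped around the RubyLLM call:
    ActiveSupport::Notifications.instrument("llm.chat", model: "gpt-4.1") do |payload|
      response = chat.ask(user_question)
      payload[:input_tokens]  = response.input_tokens
      payload[:output_tokens] = response.output_tokens
      response
    end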
  26. The Feedback Loop

    The loop that makes Tex Support better over time:

    1. TraceBook captures a bad interaction → "Tex Support said ketchup on steak is fine"
    2. Flag it in the review dashboard
    3. Export to your eval dataset
    4. RubricLLM tests prompt changes against this real case
    5. New prompt passes → ship it

    Your worst production responses become your best test cases.
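    A hedged sketch of step 4, reusing the RSpec matchers from the CI slide (the spec path, fixture file, and tex_support_answer helper are placeholders for your own app code):

    # spec/evals/ketchup_incident_spec.rb
    RSpec.describe "Ketchup incident regression" do
      include RubricLLM::RSpecMatchers

      let(:question)   { "Is ketchup on steak fine in Austin?" }
      let(:texas_docs) { File.read("spec/fixtures/texas_etiquette.txt") }
      let(:answer)     { tex_support_answer(question) }   # hypothetical app helper

      it "stays faithful to the etiquette docs" do
        expect(answer).to be_faithful_to(texas_docs).with_threshold(0.8)
        expect(answer).not_to hallucinate_from(texas_docs)
      end
    end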
  27. The Ribeye Question — Revisited

    User: "My guest ordered a well-done ribeye with ketchup. What should I do?"

    GPT-4.1: "Respect their choices. Everyone has different preferences…"
    Relevance: 0.90 · TexasSpirit: 0.35 FAIL ❌

    tex-ft-v1: "First: breathe. This is a character test, not a steak test. In Texas, we don’t judge people by their steak preferences — we judge them loudly and then buy them a proper brisket to make up for it."
    Relevance: 0.95 · TexasSpirit: 0.98 PASS ✅

    Custom metrics tell you what generic ones can’t.
  28. Demo: Tex Support in Action

    1. Ask Tex Support a question → RubyLLM processes it
    2. TraceBook captures the interaction → cost, tokens, latency
    3. Run RubricLLM eval → quality scores
    4. Flag a bad response → it becomes a test case
  29. What To Do Monday

    # | Action | Effort
    1 | bundle add tracebook | 5 min
    2 | Enable the RubyLLM adapter | 1 line
    3 | Set content capture policy | Default: safe
    4 | Review 10 interactions this week | 30 min
    5 | bundle add rubric_llm — write your first eval | 15 min

    Steps 1–3 take 15 minutes. Step 5 is how you build confidence.
  30. The Tex Support Stack

    Gem | Role | One-liner
    RubyLLM | Interface | One API, any provider
    TraceBook | Telemetry | What happened, what it cost, who saw it
    OTel Gem | Export | Vendor-agnostic traces to your APM
    RubricLLM | Evals | Is it any good? Prove it.

    All Ruby. All open source. All composable.
  31. Resources

    TraceBook — github.com/dpaluy/tracebook
    RubricLLM — github.com/dpaluy/rubric_llm
    eval-ruby — github.com/johannesdwicahyo/eval-ruby
    leva — github.com/kieranklaassen/leva
    Thoughtbot OTel Blog — thoughtbot.com/blog/observability-for-your-llm-powered-apps
    RubyLLM — rubyllm.com
    OTel GenAI Conventions — opentelemetry.io/docs/specs/semconv/gen-ai/
  32. Bonus: The Actual Brisket Recipe - The Texas Way

    The Coffee Rub (12-15 lb packer):
    Kosher salt 2 tbsp · Instant coffee / espresso grind 2 tbsp · Garlic powder 2 tbsp · Smoked paprika 2 tbsp · Coarse black pepper 1 tbsp · Crushed coriander seed 1 tbsp · Onion powder 1 tbsp · Chili powder 1 tsp · Cayenne (optional) 1/2 tsp

    The Prep — Trim fat cap to 1/4". Season generously. Dry-brine uncovered in fridge 12-24 hrs. Pull out 30-60 min before cook.
    The Cook — 235-250°F. Post oak. Fat side up.
    The Rest — Wrapped in cooler at ~150°F. 2-3 hours ideal.
    The Slice — Separate point and flat, against the grain, pencil-thick.
    The Serve — Saltines, pickles, onions. No sauce needed.