
Data Observability by OpenLineage

suci
October 14, 2025


A sharing session at Hello-World Dev Conference 2025.

https://hwdc.ithome.com.tw/2025/session-page/4042

As data processing and AI workloads become increasingly complex, data observability is crucial for platform stability and trustworthiness. OpenLineage, an open-source standard, offers a new way to track data flow and lineage relationships, helping teams understand data movement, quickly pinpoint issues, and enhance transparency.

This session will share our practical experience in exploring and adopting OpenLineage, covering its integration strategies in cloud-native environments, the challenges encountered and how they were addressed, and its actual impact on observability, data quality, and cross-team collaboration. We will also discuss its potential applications in AI or data platforms.

Whether you are focused on observability, data governance, or the stability of AI/ML pipelines, this presentation will provide practical insights and directions for consideration.


Transcript

  1. Data Observability by OpenLineage. Shuhsi Lin, 2025-10-15, Hello-World
    Dev Conference: The Smart Pizza & AI Way. (Slide theme and illustrations
    made with Gemini/Perplexity.)
  2. About Me: Shuhsi (Suci), sciwork member, working in smart manufacturing
    & AI, with data and people. Focus areas: • Agile/engineering culture and
    developer experience • Team coaching • Data engineering
  3. Outline: • Start with the basics • Observability and data observability
    • From monitoring to observability • Common data engineering challenges
    • Lineage stories in Smart Pizza & AI • The evolution of data
    observability • OpenLineage • Takeaway (Slide link)
  4. Smart Pizza Dashboard Anomaly • The daily sales dashboard suddenly
    showed a 99% drop in sales. • The analytics model had failed. • Three
    hours of backtracking revealed that the POS (point-of-sale) data import
    had broken, leaving sales data out of the dashboard entirely. • Isolating
    the root cause took much longer because no clear lineage showed data flow
    and dependencies. The need for metadata: • What is the data source?
    (enables traceability) • What is the schema? (defines structure) • Who
    owns the data? (clarifies accountability) • How often is it updated?
    (shows freshness) • Where does it originate? (ensures transparency) • Who
    uses the data? (tracks impact) • What has changed? (supports auditing and
    trust)
  5. ETL/ELT Data Engineering Challenges. Inspired by Joe Reis and Matt
    Housley, “Chapter 1,” in Fundamentals of Data Engineering (O’Reilly
    Media, 2022).
  6. Data Engineering Challenges: how to manage pipelines effectively.
    • Complex data pipelines • No central view, leading to: limited data
    asset discoverability; difficult error detection and root cause analysis;
    scattered monitoring and time-consuming troubleshooting; hard-to-optimize
    and hard-to-monitor workloads at scale; inefficient pipelines that hurt
    data quality and costs
  7. Many Flows of Data in the Real World • Data movement as flow • Moving
    data content from A to B (Diagram: a simplistic data flow across Teams A,
    B, C and Teams X, Y, Z)
  8. From Monitoring to Observability: why traditional monitoring is not
    enough. Monitoring tells you WHERE the issue is: • Measures and reports
    specific metrics in a system • Reactive: collects data to identify
    abnormal systems • Answers WHEN and WHAT: when did the system error
    occur, and what failed • Smart Pizza & AI example: checking only whether
    the pizza oven is on and the thermometer works; alerting when 'Order Sync
    Task failed' or 'Order DB CPU > 90%'. Observability tells you WHY it
    happened: • Collects metrics, events, logs, and traces across distributed
    systems • Proactive: investigates root causes of abnormal systems
    • Answers WHY and HOW the system error occurred • Smart Pizza & AI
    example: a smart oven that not only tracks temperature but also analyzes
    pizza color, dough rise, and past baking data to diagnose why a pizza
    burned; model accuracy dropped because a schema change in an upstream API
    nullified the pizza_type field.
  9. 3 Pillars of Observability vs. 5 Pillars of Data Observability. Tracing
    focuses on: • Performance metrics • Error tracking • System behavior
    • Service interactions. Lineage focuses on: • Data transformation
    tracking • Compliance and governance • Data quality
  10. Three Key Focus Areas of Data Observability. Infrastructure: • Focus:
    hardware & services running pipelines • Metrics: CPU, memory, disk,
    network • What we want to know: is the ML cluster (training AI models)
    overloaded? Pipeline: • Focus: data transfer & processing flow • Metrics:
    task duration, success/failure rate, retries • What we want to know: did
    the daily ETL for orders finish on time? Data: • Focus: data content,
    structure & quality • Metrics: freshness, volume, distribution, schema,
    lineage • What we want to know: are order fields valid? Any anomaly in
    new order volume?
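The data-layer metrics on this slide (freshness, volume) can be sketched as plain checks. This is a minimal illustration, not code from the talk; the thresholds and row counts are hypothetical:

```python
from datetime import datetime, timedelta, timezone

def freshness_ok(last_loaded_at: datetime, max_age: timedelta) -> bool:
    """A dataset is 'fresh' if its latest load is within the allowed age."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

def volume_anomalous(today_rows: int, history: list, tolerance: float = 0.5) -> bool:
    """Flag today's row count if it deviates more than `tolerance` (as a
    fraction) from the historical mean."""
    mean = sum(history) / len(history)
    return abs(today_rows - mean) > tolerance * mean

# Hypothetical daily order counts: a 99% drop, like the Smart Pizza
# dashboard incident, is flagged; normal variation is not.
print(volume_anomalous(12, [1000, 1100, 950, 1020]))   # True
print(volume_anomalous(990, [1000, 1100, 950, 1020]))  # False
```

A real system would read `last_loaded_at` and row counts from pipeline metadata rather than hard-coded values.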
  11. What is Data Lineage? The process of mapping and tracking how data
    flows throughout its lifecycle in an organization: • Where a piece of
    data came from (origin/source) • How it has changed (transformations,
    calculations, integrations) • Where it is used or stored (destination or
    endpoint) • Every system and process it touches along the way
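The "where did this data come from" question above amounts to walking a lineage graph. A minimal sketch, with a made-up edge map for the Smart Pizza pipeline (the dataset names are illustrative, not from the talk):

```python
def upstream_sources(lineage: dict, dataset: str) -> set:
    """Return all transitive upstream datasets of `dataset`.
    `lineage` maps each dataset to the datasets it is derived from."""
    seen = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    return seen

# Hypothetical pipeline: daily_sales_dashboard <- sales_agg <- pos_orders
lineage = {
    "daily_sales_dashboard": ["sales_agg"],
    "sales_agg": ["pos_orders", "menu_prices"],
}
print(upstream_sources(lineage, "daily_sales_dashboard"))
```

With such a graph, a broken `pos_orders` import is immediately visible as an upstream dependency of the dashboard, instead of taking three hours of backtracking.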
  12. OpenLineage • An open framework for data lineage collection and
    analysis • Integrations can be pushed to the underlying scheduler and/or
    data processing framework, so users no longer need to play catch-up to
    ensure compatibility. https://openlineage.io/docs
  13. Scope: OpenLineage defines the metadata for running jobs and the
    corresponding events. A configurable backend lets the user choose which
    protocol to send the events over. https://openlineage.io/
  14. Core Concepts (https://openlineage.io/docs/spec/object-model/): Run,
    Dataset, Facet. Run event lifecycle: START, RUNNING, ABORT, FAIL,
    COMPLETE. Each event carries a runId, an eventType, an eventTime, and a
    producer; the START event lists input datasets, RUNNING events carry
    run-specific facets, and the COMPLETE event lists output datasets.
    https://openlineage.io/docs/spec/run-cycle
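The run-cycle events above can be sketched as plain JSON using the spec's top-level field names. This is a minimal hand-built START event; the namespace, job name, and producer URL are made-up examples, and a real integration would build this via an OpenLineage client library:

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal START event following the OpenLineage object model: a run of a
# job, with its input datasets. All names below are hypothetical.
run_id = str(uuid.uuid4())
start_event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": run_id},
    "job": {"namespace": "smart-pizza", "name": "daily_sales_etl"},
    "inputs": [{"namespace": "postgres://pos-db", "name": "public.pos_orders"}],
    "outputs": [],
    "producer": "https://example.com/smart-pizza-etl",  # identifies the emitting integration
}
print(json.dumps(start_event, indent=2))
# A matching COMPLETE event would reuse the same runId and list the
# output datasets (e.g., the aggregated sales table).
```

Reusing the same `runId` across START, RUNNING, and COMPLETE is what lets a backend stitch the events into one run.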
  15. Best Practices for Implementing OpenLineage. 1. Design principles:
    • Start simple (job level, then message level) • Prioritize coverage
    (basic lineage across all pipelines before detailed lineage)
    • Standardize naming (e.g., in Airflow). 2. Metadata strategy: • Trace ID
    propagation • Enrich context (business-layer information) • Appropriate
    granularity (row/message level, column/field level, schema level, and
    change-management level). 3. Operational considerations: • Performance
    impact: e.g., avoid heavyweight processing in execute() methods • Error
    handling: ensure lineage-tracking failures don't affect primary data
    tasks • Retention policy: define retention periods for different types of
    lineage data • Access control: lineage data may reveal sensitive
    architecture information. 4. Data governance integration: • Connect to
    data catalogs: integrate with OpenMetadata, etc. • Support compliance:
    use lineage information to meet GDPR and CCPA requirements • Impact
    analysis: assess downstream impacts before changes
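The error-handling practice above (lineage failures must not affect primary data tasks) can be sketched as a guard around the emit call. `emit` here is a stand-in for whatever client function the pipeline actually uses:

```python
import logging

logger = logging.getLogger("lineage")

def emit_lineage_safely(emit, event: dict) -> bool:
    """Send a lineage event, swallowing any error so the primary data
    task keeps running; failures are logged, never raised."""
    try:
        emit(event)
        return True
    except Exception:
        logger.warning("lineage emission failed; continuing data task",
                       exc_info=True)
        return False

def broken_emit(event):
    """Stand-in for a client whose lineage backend is unreachable."""
    raise ConnectionError("lineage backend unreachable")

# The primary task survives a lineage backend outage:
ok = emit_lineage_safely(broken_emit, {"eventType": "START"})
print(ok)  # False, but no exception escaped to the caller
```

The same pattern applies inside operator hooks: keep the lineage path strictly best-effort so observability never becomes a new source of pipeline failures.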
  16. Data Observability Design Patterns. Data detectors: flow interruption
    detector, skew detector. Time detectors: lag detector, SLA misses
    detector. Lineage trackers: dataset tracker, fine-grained tracker.
    (Konieczny, Bartosz. Data Engineering Design Patterns: Recipes for
    Solving the Most Common Data Engineering Problems. O'Reilly Media, 2025.)
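The time detectors named above (lag detector, SLA misses detector) reduce to simple comparisons. A minimal sketch; the deadline, offsets, and threshold are hypothetical, not from the book:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def sla_missed(expected_by: datetime, completed_at: Optional[datetime]) -> bool:
    """SLA misses detector: the run finished late, or has not finished
    and the deadline has already passed."""
    if completed_at is None:
        return datetime.now(timezone.utc) > expected_by
    return completed_at > expected_by

def consumer_lag_exceeded(produced_offset: int, consumed_offset: int,
                          max_lag: int) -> bool:
    """Lag detector: flag when a consumer falls too far behind the producer."""
    return produced_offset - consumed_offset > max_lag

# Hypothetical daily ETL with a 06:00 UTC deadline:
deadline = datetime(2025, 10, 15, 6, 0, tzinfo=timezone.utc)
print(sla_missed(deadline, deadline + timedelta(minutes=30)))  # True: finished late
print(consumer_lag_exceeded(10_500, 10_480, max_lag=100))      # False: within bounds
```

In practice these checks run on a schedule against pipeline metadata (run completion times, stream offsets) and feed the alerting layer.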
  17. Smart Pizza & AI Takeaway • Smart Pizza & AI • 3 observability pillars
    (logs, traces, metrics) • 5 data observability pillars • Metadata & data
    lineage • OpenLineage • Observability design patterns • Context-aware &
    intelligent observability. Let's make data observable & AI trustworthy.