Sharon Xie
May 10, 2024

Old Dogs, New Tricks – A pragmatic guide for modern data movement platforms

This slide deck is for RTA Summit 2024 (https://www.rtasummit.com/agenda/sessions/569719)

Abstract:
Data platforms have evolved in amazing ways, tackling increasingly specific and difficult problems. Central to these advancements is data movement - the crucial pathway that gets data to the right place, in the right format. Over the years, numerous methodologies have emerged, including ETL, ELT, event streaming, stream processing, and Change Data Capture (CDC). This talk will take you on a journey through the evolving landscape of data movement technologies and tools, highlighting the trade-offs inherent in each method. We conclude with a pragmatic guide to architecting a unified data movement platform, drawing upon the accumulated knowledge and best practices to date.

Transcript

  1. Old Dogs, New Tricks: A pragmatic guide for data movement

    Sharon Xie, Founding Engineer, Decodable
  2. Agenda

    • Where is Data Movement?
    • Data Movement Patterns
    • A Unified Approach to Data Movement
  3. Data Movement Use Cases

    • Online
      ◦ Caches and search index
      ◦ User-facing analytics
      ◦ Monitoring and alerting
    • Offline
      ◦ Data analytics
      ◦ Business intelligence
      ◦ ML model training
  4. Old Dogs

    • Batch ETL (Extract, Transform, Load)
    • ELT (Extract, Load, Transform)
    • Point-to-point
  5. Batch ETL

    • ✅ Well-known pattern
    • ✅ Robust systems
    • ❌ Online use cases
  6. ELT

    • ✅ Simple to use
    • ❌ Online use cases
    • ❌ When data must be transformed before storing
      ◦ E.g., security and compliance
  7. Point-to-point

    • ✅ Specialized for use cases
    • 🟡 Lack of abstraction and data inconsistency issues
  8. Change Data Capture (CDC)

    • ✅ Enables real-time data movement
      ◦ ✅ Online use cases
    • 🟡 Must integrate with other technologies
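
As a rough illustration of the CDC slide, the sketch below replays an ordered stream of change events into a downstream copy, such as a cache. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are assumptions made for this example; real CDC tools define their own event formats.

```python
from typing import Any, Dict, Iterable

# Hypothetical change-event shape (an assumption for illustration, not any
# specific CDC tool's format): {"op": "insert" | "update" | "delete",
# "key": primary key, "row": new row contents, or None for deletes}.
ChangeEvent = Dict[str, Any]


def apply_changes(replica: Dict[Any, dict], events: Iterable[ChangeEvent]) -> Dict[Any, dict]:
    """Apply an ordered stream of change events to an in-memory replica."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            replica[key] = event["row"]   # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)        # tolerate deletes for unknown keys
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return replica


events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Lin", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "plan": "pro"}},
    {"op": "delete", "key": 2, "row": None},
]
cache = apply_changes({}, events)
print(cache)  # {1: {'name': 'Ada', 'plan': 'pro'}}
```

The replica only stays correct if events are applied in the order they were committed at the source, which is one reason CDC is usually paired with an ordered transport.
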
  9. Event Streaming

    • ✅ Online use cases
    • ✅ Source once, consume multiple times
    • 🟡 Processing is limited
      ◦ Additional infrastructure for complex transformations
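
"Source once, consume multiple times" can be pictured as an append-only log in which every consumer tracks its own read position. The toy `EventLog` class below is a hypothetical in-memory stand-in, not the API of any real event streaming system.

```python
from typing import Any, Dict, List


class EventLog:
    """A toy append-only log: written once, read independently by each consumer."""

    def __init__(self) -> None:
        self._events: List[Any] = []
        self._offsets: Dict[str, int] = {}  # next unread position per consumer

    def append(self, event: Any) -> None:
        self._events.append(event)

    def poll(self, consumer: str) -> List[Any]:
        """Return events this consumer has not seen yet and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = EventLog()
log.append({"order_id": 1, "total": 30})
log.append({"order_id": 2, "total": 12})

search_batch = log.poll("search-indexer")  # first two events
log.append({"order_id": 3, "total": 7})
alert_batch = log.poll("alerting")         # all three events, read from the start
search_next = log.poll("search-indexer")   # only the third event
```

Because reading does not remove events, adding a new consumer needs no change at the source. The log itself does no transformation, which matches the slide's caveat that complex processing needs additional infrastructure.
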
  10. Technology - Apache Flink

    • Highly Scalable
    • Exactly-once processing semantics
    • Layered APIs: Streaming SQL (easy to use) ↔ DataStream (expressive)
  11. Stream Processing

    • ✅ Online use cases
    • ✅ Supports complex transformations
    • 🟡 Hard to operationalize
  12. Conclusion

    • Data stack is heterogeneous
    • Many patterns for data movement with different trade-offs
    • Newer patterns focus on online use cases
  13. As an engineer

    • In which systems does the data I need live? Where does it need to go?
    • How does that data need to be queried?
    • What are the latency characteristics?
    • Can this data be updated or is it immutable?
    • Do I need to do any transformation before it hits the target system?
    • What is the schema and format of the source and destination?
    • What kind of guarantees are required on this data?
    • How should failures be handled?
  14. A Unified Data Movement Platform Should:

    • Abstract away the technologies
    • Automatically choose the most appropriate technologies
    • Support the full range of use cases, from simple to complex
  15. ETL > ELT

    • E, T, L all interface with streaming data
    • Ability to transform data when needed
    • Zero additional cost or latency when it is not needed
  16. Unified UX - Abstraction

    • Source Connectors
      ◦ Turn all data into streaming data
    • Stream Processing (Optional)
      ◦ Continuously process streaming data
    • Sink Connectors
      ◦ Consume streaming data and write it to the destination systems
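
The three-part abstraction on this slide can be sketched with plain Python generators. Every name here (`table_source`, `mask_email`, `list_sink`, `run_pipeline`) is hypothetical and chosen for illustration; the sketch only shows how an optional processing stage lets one pipeline shape cover both load-as-is and transform-in-flight cases.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, Any]
Transform = Callable[[Iterator[Record]], Iterator[Record]]


def table_source(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical source connector: exposes existing rows as a stream."""
    for row in rows:
        yield row


def mask_email(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical processing step: redact a sensitive field before storing."""
    for record in records:
        yield {**record, "email": "***"}


def list_sink(records: Iterator[Record], destination: List[Record]) -> int:
    """Hypothetical sink connector: writes each record to the destination."""
    count = 0
    for record in records:
        destination.append(record)
        count += 1
    return count


def run_pipeline(source: Iterator[Record], destination: List[Record],
                 transform: Optional[Transform] = None) -> int:
    """Wire source to sink; the processing stage is skipped when not given."""
    stream = transform(source) if transform is not None else source
    return list_sink(stream, destination)


rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]

raw_copy: List[Record] = []     # load as-is: no processing stage in the path
run_pipeline(table_source(rows), raw_copy)

masked_copy: List[Record] = []  # transform in flight, before the data is stored
run_pipeline(table_source(rows), masked_copy, transform=mask_email)
```

When no transform is supplied, the source iterator is handed directly to the sink, so the optional stage adds no extra hop in this sketch.
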
  17. Unified UX

    Declarative YAML for:
    • ETL or ELT
    • SQL or language-specific processing
    • Online or offline use cases
  18. Summary

    • Data movement is a stubborn problem
    • A unified data movement platform should be the equivalent of Kubernetes for data movement