

PyCon APAC 2021 - Designing Functional Data Pipelines for Reproducibility and Maintainability

Designing data pipelines at scale is often a challenge, because testing and debugging across compute units are complicated by dependencies at runtime. In this talk, I explore the use of functional programming in Python to design data pipelines that are reproducible and maintainable at scale.

Ong Chin Hwee

November 20, 2021


Transcript

  1. By: Chin Hwee Ong, 20 - 21 November 2021
     Designing Functional Data Pipelines for Reproducibility and Maintainability
     @ongchinhwee
  2. About me
     Ong Chin Hwee 王敬惠
     • Data Engineer @ DT One
     • Aerospace Engineering + Computational Modelling
     • Speaker and (occasional) writer on data processing
     Slides link: bit.ly/pa2021-design-fp-data
  3. Designing a Data Pipeline at Scale
     • Reliable
       ◦ Data pipeline must produce the desired output → Reproducibility
     • Scalable
       ◦ Data pipeline must run independently across multiple nodes → Parallelism
     • Extensible
       ◦ Able to extend data pipeline with changing business logic → Maintainability
  4. Challenges in Designing Data Pipelines at Scale
     • Reproducibility during Testing
       ◦ Dependencies in data pipeline design
         ▪ Data source
         ▪ Computation logic
  5. Challenges in Designing Data Pipelines at Scale
     • Reproducibility during Testing
       ◦ Challenge: Given the same data source, how do we ensure that we replicate the same result every time we re-run the same process?
  6. Challenges in Designing Data Pipelines at Scale
     • Reproducibility in Production
       ◦ Debugging parallel/concurrent code at runtime due to shared states
         ▪ E.g. What is the current state of the data source?
  7. Challenges in Designing Data Pipelines at Scale
     • Reproducibility in Production
       ◦ Challenge: How do we design data pipelines that run the same computation logic across multiple nodes and reproduce predictable results every time?
  8. Challenges in Designing Data Pipelines at Scale
     • Maintainability during Debugging
       ◦ “Works in testing, breaks in production” 😔
         ▪ Edge cases and inefficiencies not detected in test cases, causing performance issues and/or failures in production
         ▪ Complexities in debugging and logging for parallelism
  9. Challenges in Designing Data Pipelines at Scale
     • Maintainability during Debugging
       ◦ Challenge: How do we design data pipelines that are readable and maintainable at their core to reduce inefficiencies in production debugging at scale?
  10. Challenges in Designing Data Pipelines at Scale
      • Maintainability when Adding New Features
        ◦ Adding new features to an evolving (growing) codebase
          ▪ Code reasoning becomes more challenging with increasing code complexity
          ▪ Risk of introducing unintended behaviour due to dependencies
  11. Challenges in Designing Data Pipelines at Scale
      • Maintainability when Adding New Features
        ◦ Challenge: How do we design data pipelines that adapt well to changing business and technical requirements and ensure developer productivity?
  12. What is Functional Programming?
      • Declarative style of programming that emphasizes writing software using only:
        ◦ Pure functions; and
        ◦ Immutable values.
  13. 3 Key Principles of Functional Programming
      • Pure functions and avoiding side effects
      • Immutability
      • Referential transparency
  14. The concept of a “pure function”
      • Pure function
        ◦ Output depends only on its input parameters and its internal algorithm
        ◦ No side effects
      ⇒ same function f, same input parameter x → same result y regardless of number of invocations
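The pure-function definition on this slide can be illustrated with a minimal sketch. The names `scale`, `scale_impure` and `call_count` are hypothetical and not taken from the talk; the impure variant leans on module-level state purely to make the contrast visible.

```python
def scale(value: int, factor: int) -> int:
    """Pure: the result depends only on the two arguments."""
    return value * factor


call_count = 0  # hypothetical module-level state, used only to show impurity


def scale_impure(value: int, factor: int) -> int:
    """Impure: reads and mutates state outside the function scope."""
    global call_count
    call_count += 1
    return value * factor + call_count


# Same function, same input, same result, however often it is called.
pure_results = [scale(3, 4) for _ in range(3)]

# Same input, but the result drifts with the hidden counter.
impure_results = [scale_impure(3, 4) for _ in range(3)]
```

Because `scale` never touches anything outside its arguments, it could be re-run on any node of a pipeline and give the same answer; `scale_impure` gives that guarantee up.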
  15. Pure Function: Making Pizza
      160°C, 10 mins
      PUT THEM TOGETHER
  16. “Impure” Function: Making Pizza with Side Effects
      160°C, 10 mins
      PUT THEM TOGETHER
      Side Effect: Radiation Heat
  17. “Impure” Function: Making Pizza with Side Effects
      180°C, 10 mins
      PUT THEM TOGETHER
      Side Effect: Oven Overheat, Burnt Pizza! 😖
  18. What is a side effect?
      • A function with side effects changes state outside the local function scope
        ◦ Examples:
          ▪ modifying a variable or data structure in place
          ▪ modifying a global state
          ▪ performing any I/O operation
          ▪ throwing an exception with an error
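As one concrete case of the first example above (modifying a data structure in place), here is a short sketch contrasting an in-place transformation with one that returns a new collection. The function names and the `name` field are assumptions made up for illustration.

```python
def normalise_in_place(rows: list) -> None:
    """Side effect: mutates the caller's dictionaries in place."""
    for row in rows:
        row["name"] = row["name"].strip().lower()


def normalise(rows: list) -> list:
    """No side effect: builds new dictionaries and leaves the input untouched."""
    return [{**row, "name": row["name"].strip().lower()} for row in rows]


raw = [{"name": "  Alice "}, {"name": "BOB"}]

clean = normalise(raw)           # raw is unchanged after this call
raw_before = [dict(row) for row in raw]

normalise_in_place(raw)          # raw itself has now been modified
```

With `normalise`, any other step that still holds a reference to `raw` sees the original data; with `normalise_in_place`, every holder of that reference silently sees the change.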
  19. The concept of Immutability
      • Immutability of an assigned variable
        ◦ Once a value is assigned to a variable, the state of the variable cannot be changed.
      ⇒ Disciplined state management
      ⇒ Prevents side effects resulting from state change → “pure function”
  20. The concept of Immutability: Key Implication
      • Key implication: Ease of writing parallel/concurrent programs
  21. The concept of Referential Transparency
      A function is referentially transparent when an expression that calls it can be substituted by its equivalent result without affecting the program logic, for all programs
  22. Conditions for Referential Transparency
      • Pure function
      • Deterministic
        ◦ Expression always returns the same output given the same input
  23. Conditions for Referential Transparency
      • Pure function
      • Deterministic
        ◦ Expression returns the same output given the same input
      • Idempotent
        ◦ Expression can be applied multiple times without changing the result beyond its initial application
  24. Equational Reasoning
      • A key consequence of referential transparency
        ◦ Expression can be replaced with its equivalent result
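The conditions above (deterministic, idempotent) and the substitution step of equational reasoning can be sketched as follows. `square` and `dedupe_sorted` are hypothetical helpers chosen only because their behaviour is easy to verify by hand.

```python
def square(x: int) -> int:
    """Pure and deterministic: same input, same output."""
    return x * x


def dedupe_sorted(values: list) -> list:
    """Idempotent: applying it again does not change the result."""
    return sorted(set(values))


# Equational reasoning: square(3) can be replaced by its result, 9,
# without changing what the surrounding expression evaluates to.
with_calls = square(3) + square(3)
with_values = 9 + 9

# Idempotence: one application and two applications agree.
once = dedupe_sorted([3, 1, 3, 2])
twice = dedupe_sorted(dedupe_sorted([3, 1, 3, 2]))
```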
  25. Functions are Values
      • In Python, functions are first-class objects.
      • A function can be:
        ◦ assigned to a variable
        ◦ passed as a parameter to other functions
        ◦ returned as a value from other functions
  26. Higher-order Functions
      • Key consequence of first-class functions
      • A higher-order function has at least one of these properties:
        ◦ Accepts functions as parameters
        ◦ Returns a function as a value
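Both properties of a higher-order function, along with the three things a first-class function can do, show up in a small composition helper. This is only a sketch: `pipe` and `clean_name` are hypothetical names, not something the talk defines.

```python
from typing import Callable


def pipe(*funcs: Callable) -> Callable:
    """Higher-order: accepts functions as parameters and returns a function."""
    def run(value):
        # Apply each function in order, feeding the result into the next one.
        for func in funcs:
            value = func(value)
        return value
    return run


strip = str.strip                      # a function assigned to a variable
clean_name = pipe(strip, str.lower)    # functions passed in, a function returned

result = clean_name("  Hello World ")
```

Building each pipeline step as a plain function and composing them this way keeps the transformation logic testable one step at a time.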
  27. Anonymous Functions
      • Also known as “lambda expressions” in Python
      • Using a function as input without defining a named function object
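A brief sketch of a lambda expression used as input to a built-in function, with made-up sample records:

```python
# Hypothetical (label, count) records.
records = [("b", 2), ("a", 3), ("c", 1)]

# The key function is supplied inline, without defining a named function.
by_count = sorted(records, key=lambda record: record[1])

# The equivalent with a named function, for comparison.
def count_of(record):
    return record[1]

by_count_named = sorted(records, key=count_of)
```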
  28. Recursion as a form of “functional iteration”
      • Recursion is a form of self-referential function composition
        ◦ Takes the results of itself as inputs into another instance of itself
        ◦ To prevent infinite recursive loop, base case required as terminating condition
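A minimal sketch of recursion standing in for a loop, with the base case acting as the terminating condition. The function name `total` is an assumption for illustration.

```python
def total(values: tuple) -> int:
    """Sum a tuple of integers recursively instead of with a loop."""
    if not values:                 # base case: stops the recursion
        return 0
    head, *tail = values
    # The result of the recursive call feeds into this call's result.
    return head + total(tuple(tail))


example_total = total((1, 2, 3, 4))
```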
  29. Recursion as a form of “functional iteration”
      • Tail-call optimization
        ◦ Objective: reduce stack frame consumption in call stack
        ◦ Tail call: a call after which the caller does nothing other than return the value of that function call
        ◦ Identify tail calls and compile them to iterative loops
  30. Built-in Higher-order Functions
      • map/filter vs list comprehensions
        ◦ List comprehensions are syntactic sugar for map/filter operations on a data collection (list)
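The correspondence between map/filter and a list comprehension can be seen side by side in this short sketch with made-up sample values:

```python
values = [1, 2, 3, 4, 5, 6]

# filter keeps the even numbers, map squares them; both return lazy iterators,
# so list() is needed to materialise the result.
via_map_filter = list(
    map(lambda x: x * x, filter(lambda x: x % 2 == 0, values))
)

# The same selection and transformation written as a list comprehension.
via_comprehension = [x * x for x in values if x % 2 == 0]
```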
  31. Built-in Higher-order Functions
      • Benefits of using map/filter in data transformations
        ◦ Keeping data and transformation logic separate
          ▪ Improved code reusability with better transparency of transformation logic
  32. Extending map/filter to parallel/concurrent programming
      • Using multiprocessing.Pool or concurrent.futures
        ◦ Generate iterator using map, then filter results to a collection (list)
      More details on parallel processing and concurrent.futures: My EuroPython 2020 talk "Speed Up Your Data Processing"
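The "map, then filter to a list" pattern could look roughly like the sketch below. It uses `concurrent.futures.ThreadPoolExecutor` to stay self-contained; `ProcessPoolExecutor` exposes the same `map` method when CPU-bound work calls for separate processes. The helpers `parse_amount` and `parse_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def parse_amount(raw: str) -> Optional[float]:
    """Convert one raw string to a float, or None if it is not numeric."""
    try:
        return float(raw)
    except ValueError:
        return None


def parse_all(raw_values: list, max_workers: int = 4) -> list:
    """Map parse_amount over the inputs concurrently, then filter out failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns an iterator that yields results in input order.
        parsed = executor.map(parse_amount, raw_values)
        return list(filter(lambda value: value is not None, parsed))


amounts = parse_all(["1.5", "oops", "3"])
```

Because `parse_amount` depends only on its argument and shares no state, the workers can run it in any order without coordinating.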
  33. Immutable Data Structures
      • Once an immutable data structure is created, it cannot be changed
      • Benefits:
        ◦ Easier to reason - “what you see is what you get”
        ◦ Easier to test - worry about the logic, not the state
        ◦ Thread-safe - easier for parallelism
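One way to get an immutable record in Python is a frozen dataclass; the sketch below assumes a hypothetical `Reading` type. An "update" produces a new object through `dataclasses.replace`, and attempting to assign to a field raises `FrozenInstanceError`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Reading:
    """Hypothetical immutable record for one sensor reading."""
    sensor_id: str
    value: float


original = Reading("s1", 20.5)

# "Changing" a field creates a new object; the original is left as it was.
corrected = replace(original, value=21.0)

try:
    original.value = 99.0        # assignment on a frozen dataclass is rejected
    assignment_allowed = True
except FrozenInstanceError:
    assignment_allowed = False
```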
  34. Structural Pattern Matching (PEP 634)
      • Python 3.10 feature inspired by similar syntax in Scala
      • Especially useful for conditional matching of data structure patterns
      match Item:
          case Something:
              do_something()
  35. Structural Pattern Matching (PEP 634)
      • Pattern matching for maintainability of data schema
      Note: Example based on case classes and pattern matching syntax in Scala. Dataclasses used as the Python equivalent of Scala case classes
  36. Recursion in Python
      • Tail-call optimization not supported in Python
        ◦ Optimization has to be implemented manually
      • Recursion limit of 1000 (by default) as a prevention mechanism against call stack overflow in CPython implementation
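A sketch of what "implemented manually" can mean in practice: a function written in tail-recursive form still consumes one stack frame per call in CPython, so it is rewritten by hand as a loop. The names `sum_to_tail` and `sum_to_loop` are hypothetical.

```python
import sys

# 1000 by default in CPython; can be inspected at runtime.
recursion_limit = sys.getrecursionlimit()


def sum_to_tail(n: int, acc: int = 0) -> int:
    """Tail-recursive in form, but CPython does not optimise the tail call."""
    if n == 0:
        return acc
    return sum_to_tail(n - 1, acc + n)


def sum_to_loop(n: int) -> int:
    """Manual rewrite: the accumulator and the argument become loop variables."""
    acc = 0
    while n > 0:
        acc, n = acc + n, n - 1
    return acc


small = sum_to_tail(100)        # fine: well within the recursion limit
large = sum_to_loop(100_000)    # fine: no extra stack frames are used
```

Calling `sum_to_tail` with an argument above the recursion limit raises `RecursionError`, whereas the loop version has no such ceiling.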
  37. Type Systems
      • Python has support for type hints (though not enforced at runtime)
  38. Type Systems
      • Type checking with mypy
      • Preventing bugs at runtime by ensuring type safety and consistency across the data pipeline
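A small sketch of type hints on a pipeline step, assuming a hypothetical `Row` record. Running mypy over this file would check every call to `total_amount` against the annotated signature; the Python interpreter itself does not enforce the hints.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    """Hypothetical typed record flowing through the pipeline."""
    user_id: str
    amount: float


def total_amount(rows: Iterable[Row]) -> float:
    """Sum the amount field across all rows."""
    return sum(row.amount for row in rows)


grand_total = total_amount([Row("a", 1.0), Row("b", 2.5)])

# A static checker such as mypy would reject the call below because a list of
# str is not an Iterable[Row]; without the checker, the mistake only surfaces
# at runtime as an AttributeError inside the function.
# total_amount(["not", "rows"])
```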
  39. Can we write a purely functional data pipeline in Python?
      Short Answer: Not really.
  40. “Functional Core, Imperative Shell”
      • I/O operations still needed for reading and writing data outside of the application domain
      • Keeping core domain logic and infrastructure code separate
      Ref: Gary Bernhardt's PyCon 2013 talk on "Boundaries"
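The separation described above might be sketched as follows, under the assumption of a CSV input with `user` and `amount` columns (both invented for this example). `total_by_user` is the functional core, with no I/O; `run` is the imperative shell that does all the reading and writing.

```python
import csv
import io
from typing import Iterable, TextIO


def total_by_user(rows: Iterable[dict]) -> dict:
    """Functional core: plain data in, plain data out, no I/O."""
    totals: dict = {}          # local accumulator, never visible to callers
    for row in rows:
        user = row["user"]
        totals[user] = totals.get(user, 0.0) + float(row["amount"])
    return totals


def run(source: TextIO, sink: TextIO) -> None:
    """Imperative shell: reads input, calls the core, writes output."""
    rows = list(csv.DictReader(source))
    totals = total_by_user(rows)
    writer = csv.writer(sink)
    writer.writerow(["user", "total"])
    for user, total in sorted(totals.items()):
        writer.writerow([user, total])


# In-memory streams stand in for real files or network resources.
source = io.StringIO("user,amount\na,1.5\nb,2\na,0.5\n")
sink = io.StringIO()
run(source, sink)
```

The core can be unit-tested with ordinary lists of dictionaries, while only the thin shell needs real or simulated I/O.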
  41. Key Takeaways
      • Adopt functional design patterns when designing data pipelines at scale (parallel and distributed workflows)
        ◦ Reproducible
        ◦ Scalable
        ◦ Maintainable
      • “Functional Core, Imperative Shell” to manage side effects separately from data pipeline logic
  42. Reach out to me!
      • ongchinhwee
      • @ongchinhwee
      • hweecat
      • https://ongchinhwee.me
      And check out my ongoing series on Functional Programming at: https://ongchinhwee.me/tag/functional-programming