Easy Batch

The simple, stupid batch processing framework for Java

Mahmoud Ben Hassine

February 08, 2020

Transcript

  1. Easy Batch: The simple, stupid batch processing framework for Java

    Mahmoud Ben Hassine, https://benas.github.io, @b_e_n_a_s
  2. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  3. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  4. Batch vs Stream processing

    Batch processing      Stream processing
    Bounded data set      Unbounded data stream
    High latency          Low latency
    Static data set       Dynamic data stream
  5. Batch processing

    • Long-running jobs
    • No human interaction
    • No fancy GUIs
    • OutOfMemory errors!
  6. State of the art*

    JSR-352
    Excellent solutions! But...
    *: Big data tools like Spark, Flink, etc. are out of scope
  7. What’s wrong with Spring Batch / JSR 352?

    “I have to admit I got a little overwhelmed by the complexity and amount of configuration needed for even a simple example”, Jeff Zapotoczny

    “What should we think of the Spring Batch solution? Complex. Obviously, it looks more complicated than the simple approaches. This is typical of a framework: the learning curve is steeper”, Arnaud Cogoluègnes

    “Recently evaluated Spring Batch, and quickly rejected it once I realized that it added nothing to my project aside from bloat and overhead”, RT. Person

    Complex configuration + Steep learning curve
  8. What’s wrong with Spring Batch / JSR 352?

    “The context of a Spring Batch application grows pretty quick and involves configuring a lot of stuff that, at the outset, it just doesn't seem like you should need to configure. A "job repository" to track the status and history of job executions, which itself requires a data source - just to get started? Wow, that's a bit heavy handed.”, Jeff Zapotoczny

    “We can see that a transaction manager is needed. This property is mandatory, which in my opinion is a pity for simple cases like ours where we do not use transactions.”, Julien Jakubowski

    “Spring Batch or How Not to Design an API.. Why do I Need a Transaction Manager? Why do I Need a Job Repository?”, William Shields

    Mandatory components that you might not need
  9. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  10. Motivations (1/2)

    Goals
    • Keep it simple, stupid
    • Flexible and extensible API
    • Modular architecture
    • Reduce boilerplate code

    Non Goals
    Build yet another:
    - big data
    - cloud-native
    - map-reduce
    - fault-tolerant
    - ultra high-performance
    - massively parallel
    - distributed
    - reactive
    - real-time
    - resilient
    - [put buzzword here]
    processing framework. No, this is not the goal.
  11. Motivations (2/2)

    products.csv

        #id,name,description,price,published,lastUpdate
        0001,product1,description1,2500,true,2014-01-01
        000x,product2,description2,2400,true,2014-01-01
        0003,,description3,2300,true,2014-01-01
        0004,product4,description4,-2200,true,2014-01-01
        0005,product5,description5,2100,true,2024-01-01
        0006,product6,description6,2000,true,2014-01-01,Blah!

    Product.java

        import java.util.Date;

        public class Product {
            private long id;
            private String name;
            private String description;
            private double price;
            private boolean published;
            private Date lastUpdate;
            // getters, setters omitted
        }

    Common requirements (boilerplate):
    - Read file line by line
    - Filter header record
    - Parse and map data to the Product bean
    - Validate product data
    - Do something with the product (business logic)
    - Log errors
    - Report statistics

    The goal is to keep focus on business logic!
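
To make the amount of boilerplate concrete, here is a minimal framework-free sketch of the requirements listed on the slide. It is not Easy Batch code: the class names, the in-memory list of lines (instead of a real file), the use of LocalDate in place of java.util.Date, and the two validation rules (non-empty name, non-negative price) are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Framework-free sketch of the requirements listed on the slide.
// Every name below is hypothetical; none of it is Easy Batch API.
class ProductImportSketch {

    // Simplified stand-in for the Product bean (LocalDate instead of java.util.Date).
    static final class Product {
        final long id;
        final String name;
        final String description;
        final double price;
        final boolean published;
        final LocalDate lastUpdate;

        Product(long id, String name, String description,
                double price, boolean published, LocalDate lastUpdate) {
            this.id = id;
            this.name = name;
            this.description = description;
            this.price = price;
            this.published = published;
            this.lastUpdate = lastUpdate;
        }
    }

    // Counters reported at the end of the run.
    static final class Report {
        int total;
        int filtered;
        int errors;
        int processed;
    }

    // Parses one CSV line and maps it to a Product; throws a RuntimeException on bad input.
    static Product parse(String line) {
        String[] f = line.split(",", -1); // -1 keeps empty fields
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length);
        }
        return new Product(Long.parseLong(f[0]), f[1], f[2], Double.parseDouble(f[3]),
                Boolean.parseBoolean(f[4]), LocalDate.parse(f[5]));
    }

    // Assumed constraints: a non-empty name and a non-negative price.
    static void validate(Product p) {
        if (p.name.isEmpty()) throw new IllegalArgumentException("name is empty");
        if (p.price < 0) throw new IllegalArgumentException("price is negative");
    }

    // Reads the lines in order, filters the header, and collects valid products in sink.
    static Report run(List<String> lines, List<Product> sink) {
        Report report = new Report();
        for (String line : lines) {
            report.total++;
            if (line.startsWith("#")) { // header record
                report.filtered++;
                continue;
            }
            try {
                Product product = parse(line);
                validate(product);
                sink.add(product); // placeholder for the business logic
                report.processed++;
            } catch (RuntimeException e) {
                report.errors++;
                System.err.println("Skipping '" + line + "': " + e.getMessage());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "#id,name,description,price,published,lastUpdate",
                "0001,product1,description1,2500,true,2014-01-01",
                "000x,product2,description2,2400,true,2014-01-01",
                "0003,,description3,2300,true,2014-01-01",
                "0004,product4,description4,-2200,true,2014-01-01",
                "0005,product5,description5,2100,true,2024-01-01",
                "0006,product6,description6,2000,true,2014-01-01,Blah!");
        Report r = run(lines, new ArrayList<Product>());
        System.out.println("total=" + r.total + " filtered=" + r.filtered
                + " errors=" + r.errors + " processed=" + r.processed);
    }
}
```

Only the single `sink.add(product)` line is business logic; everything else is plumbing of the kind the slide calls boilerplate.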
  12. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  13. Easy Batch in a nutshell

    • Name: Easy Batch
    • Date of birth: 13/08/2012
    • Weight: 108 KB (v6)
    • DNA: https://github.com/j-easy/easy-batch
  14. The Record abstraction (2/2)

    Record = Header (No, Source, etc) + Payload (Raw Data)

    Record.java

        public interface Record<P> {
            /**
             * Header of the record.
             */
            Header getHeader();

            /**
             * Payload of the record.
             */
            P getPayload();
        }

    Multiple implementations: FlatFileRecord, XmlRecord, JsonRecord, JdbcRecord, JmsRecord, etc.
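
As an illustration of the header/payload split, the following sketch implements an interface of the same shape for plain lines of text. The Header fields (a record number and a source name) are guessed from the "No, Source, etc" hint on the slide, and LineRecord and fromLines are hypothetical names, not classes shipped with Easy Batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; the real Easy Batch types may differ.
class RecordSketch {

    // Assumed header contents: the slide only hints at "No, Source, etc".
    static final class Header {
        final long number;
        final String source;

        Header(long number, String source) {
            this.number = number;
            this.source = source;
        }
    }

    // Same shape as the Record interface shown on the slide.
    interface Record<P> {
        Header getHeader();
        P getPayload();
    }

    // Hypothetical implementation whose payload is one raw line of text.
    static final class LineRecord implements Record<String> {
        private final Header header;
        private final String payload;

        LineRecord(Header header, String payload) {
            this.header = header;
            this.payload = payload;
        }

        @Override
        public Header getHeader() {
            return header;
        }

        @Override
        public String getPayload() {
            return payload;
        }
    }

    // Wraps each raw line in a record, numbering records from 1.
    static List<Record<String>> fromLines(String source, List<String> lines) {
        List<Record<String>> records = new ArrayList<>();
        long number = 1;
        for (String line : lines) {
            records.add(new LineRecord(new Header(number++, source), line));
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record<String> r : fromLines("products.csv", Arrays.asList("a,b", "c,d"))) {
            System.out.println(r.getHeader().source + "#" + r.getHeader().number
                    + ": " + r.getPayload());
        }
    }
}
```

Keeping metadata in the header means later stages can replace the payload (for example, raw text mapped to a bean) while error messages can still point back to the original record number and source.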
  15. The Batch abstraction

    Batch = { record 1, record 2, ... record n }

    Batch.java

        public class Batch implements Iterable<Record> {
            private List<Record> records;
            // constructor and iterator() omitted
        }
  16. The Job abstraction

        public interface Job extends Callable<JobReport> {
            String getName();
        }

        class BatchJob implements Job { }

    • Synchronous execution: JobReport report = jobExecutor.execute(job);
    • Asynchronous execution: Future<JobReport> report = jobExecutor.submit(job);
    • Parallel execution: jobExecutor.submitAll(job1, job2);
    • Scheduled execution: scheduledExecutorService.schedule(job, 2, MINUTES);
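
Because a Job is a Callable, the four execution modes can be tried with nothing but java.util.concurrent. The sketch below is a rough stand-in under that assumption: it uses the JDK's ExecutorService and ScheduledExecutorService rather than the jobExecutor shown on the slide, and JobReport, countingJob and runAll are hypothetical, simplified names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch with JDK executors only; it does not use Easy Batch's own executor.
class JobExecutionSketch {

    // Hypothetical, much-reduced report: a job name and a record count.
    static final class JobReport {
        final String jobName;
        final int recordCount;

        JobReport(String jobName, int recordCount) {
            this.jobName = jobName;
            this.recordCount = recordCount;
        }
    }

    // Same shape as the Job interface on the slide.
    interface Job extends Callable<JobReport> {
        String getName();
    }

    // Toy job that only counts the records it is given.
    static Job countingJob(final String name, final List<String> records) {
        return new Job() {
            @Override
            public String getName() {
                return name;
            }

            @Override
            public JobReport call() {
                return new JobReport(name, records.size());
            }
        };
    }

    // Parallel execution: runs all jobs on a pool, returns reports in submission order.
    static List<JobReport> runAll(List<Job> jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, jobs.size()));
        try {
            List<JobReport> reports = new ArrayList<>();
            for (Future<JobReport> future : pool.invokeAll(jobs)) {
                reports.add(future.get());
            }
            return reports;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job1 = countingJob("job1", Arrays.asList("a", "b"));
        Job job2 = countingJob("job2", Arrays.asList("c"));

        // Synchronous: call the job on the current thread.
        System.out.println(job1.call().recordCount);

        // Asynchronous: submit and wait for the Future.
        ExecutorService single = Executors.newSingleThreadExecutor();
        System.out.println(single.submit(job2).get().recordCount);
        single.shutdown();

        // Parallel: one thread per job.
        System.out.println(runAll(Arrays.asList(job1, job2)).size());

        // Scheduled: run once after a short delay.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        System.out.println(scheduler.schedule(job1, 100, TimeUnit.MILLISECONDS).get().jobName);
        scheduler.shutdown();
    }
}
```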
  17. Batch Jobs

    • Read records in sequence
    • Process records in pipeline
    • Write records in batches
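
The read, process, write cycle described on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not how Easy Batch implements it: the reader is an Iterator, each pipeline stage is a Function (returning null is assumed here to mean "filter this record out"), and the writer is a Consumer that receives one batch at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of a read/process/write loop; names are not Easy Batch API.
class BatchPipelineSketch {

    // Reads records one by one, runs them through the pipeline, writes them in batches.
    // Returns the number of records handed to the writer.
    static <T> int run(Iterator<T> reader, List<Function<T, T>> pipeline,
                       int batchSize, Consumer<List<T>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int written = 0;
        while (reader.hasNext()) {
            T record = reader.next();
            for (Function<T, T> stage : pipeline) {
                record = stage.apply(record);
                if (record == null) {
                    break; // assumed convention: null means the record is filtered
                }
            }
            if (record == null) {
                continue;
            }
            batch.add(record);
            if (batch.size() == batchSize) {
                writer.accept(new ArrayList<>(batch)); // hand over a copy
                written += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) { // flush the last, possibly smaller, batch
            writer.accept(new ArrayList<>(batch));
            written += batch.size();
        }
        return written;
    }

    public static void main(String[] args) {
        List<Function<String, String>> pipeline = new ArrayList<>();
        pipeline.add(s -> s.isEmpty() ? null : s); // filter empty records
        pipeline.add(String::toUpperCase);         // transform the rest

        int written = run(Arrays.asList("a", "", "b", "c").iterator(), pipeline, 2,
                batch -> System.out.println("writing " + batch));
        System.out.println("written=" + written);
    }
}
```

Writing in batches keeps memory bounded by the batch size rather than by the size of the data set, which is one way to avoid the OutOfMemory errors mentioned earlier in the deck.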
  18. Validating data

    • Validate data against application’s constraints
    • Declarative approach: Bean Validation API (JSR 303)

        public class Tweet {
            private int id;

            @NotNull
            private String user;

            @Size(min=0, max=280)
            private String message;
        }
  19. Writing data

    • Hide low-level APIs
    • Write records in batches
    • Transaction management for relational databases
  20. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  21. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced topics • Wrap-up
  22. Parallel processing

    • Jobs are Callable objects => jobExecutor.submitAll(job1, job2)
    • ReportMerger API to merge partial reports
    • Suitable for physical/logical partitioning
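
The partition-then-merge idea can be sketched with the JDK alone. In the example below, a list is split into logical partitions, each partition is processed by its own Callable, and the partial reports are summed. Report, merge and runPartitioned are hypothetical names used for illustration and do not reproduce the ReportMerger API; the "work" done per record (parsing an integer) is likewise a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of logical partitioning with merged partial reports.
class ParallelPartitionSketch {

    static final class Report {
        final long readCount;
        final long errorCount;

        Report(long readCount, long errorCount) {
            this.readCount = readCount;
            this.errorCount = errorCount;
        }
    }

    // One job per partition; a record counts as an error if it is not an integer.
    static Callable<Report> job(final List<String> partition) {
        return () -> {
            long errors = 0;
            for (String record : partition) {
                try {
                    Integer.parseInt(record);
                } catch (NumberFormatException e) {
                    errors++;
                }
            }
            return new Report(partition.size(), errors);
        };
    }

    // Merges partial reports by summing their counters.
    static Report merge(List<Report> partials) {
        long read = 0;
        long errors = 0;
        for (Report partial : partials) {
            read += partial.readCount;
            errors += partial.errorCount;
        }
        return new Report(read, errors);
    }

    // Splits the records into at most `partitions` contiguous chunks and runs them in parallel.
    static Report runPartitioned(List<String> records, int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            int chunk = (records.size() + partitions - 1) / partitions; // ceiling division
            List<Callable<Report>> jobs = new ArrayList<>();
            for (int from = 0; from < records.size(); from += chunk) {
                jobs.add(job(records.subList(from, Math.min(from + chunk, records.size()))));
            }
            List<Report> partials = new ArrayList<>();
            for (Future<Report> future : pool.invokeAll(jobs)) {
                partials.add(future.get());
            }
            return merge(partials);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Report merged = runPartitioned(Arrays.asList("1", "2", "x", "4", "y"), 2);
        System.out.println("read=" + merged.readCount + " errors=" + merged.errorCount);
    }
}
```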
  23. Fault tolerance

    • Retry feature
        • Retryable record reader/processor/writer
        • Custom RetryPolicy + RetryTemplate if needed
    • Skip feature
        • Batch scanning in case of write error
        • Skip bad records instead of failing the whole job
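
The two ideas on this slide, retrying a failing operation and scanning a failed batch record by record, can be approximated as follows. This is a simplified sketch and not the library's RetryPolicy or RetryTemplate: retry and writeWithScanning are hypothetical helpers, there is no delay between attempts, and the writer is assumed to be all-or-nothing (for example, transactional), so a failed batch write leaves nothing behind.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helpers illustrating retry and skip; not Easy Batch API.
class FaultToleranceSketch {

    // Retry: calls the action up to maxAttempts times and rethrows the last failure.
    static <T> T retry(Supplier<T> action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    // Skip: if writing the whole batch fails, write records one by one and
    // return the records that still fail, instead of failing the whole job.
    // Assumes the writer is all-or-nothing for each call.
    static <T> List<T> writeWithScanning(List<T> batch, Consumer<List<T>> writer) {
        List<T> skipped = new ArrayList<>();
        try {
            writer.accept(batch);
        } catch (RuntimeException batchFailure) {
            for (T record : batch) {
                try {
                    writer.accept(Collections.singletonList(record));
                } catch (RuntimeException recordFailure) {
                    skipped.add(record);
                }
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        final List<Integer> sink = new ArrayList<>();
        // Toy writer: rejects the whole call if any record is negative.
        Consumer<List<Integer>> writer = records -> {
            for (Integer record : records) {
                if (record < 0) {
                    throw new IllegalStateException("bad record " + record);
                }
            }
            sink.addAll(records);
        };
        List<Integer> skipped = writeWithScanning(Arrays.asList(1, -2, 3), writer);
        System.out.println("written=" + sink + " skipped=" + skipped);
    }
}
```

Scanning trades throughput for resilience: the slow record-by-record path is only taken for batches that actually fail.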
  24. Agenda

    • Introduction • State of the art • Motivations • Easy Batch • Overview • Basic usage • Advanced usage • Wrap-up
  25. Wrap-up

    The good ones
    • Lightweight, free and open source
    • Easy to learn, configure and use
    • Flexible & extensible API
    • Modular architecture
    • Fault tolerance features
    • Declarative data validation
    • Real-time monitoring

    The not so good ones
    • No step concept with flows
    • No remote partitioning
    • No remote chunking
    • Not suitable for big data
  26. FAQs

    • How does Easy Batch compare to Spring Batch?
    • Why does Easy Batch not persist job state in a database like Spring Batch?
    • Why does Easy Batch not provide a Step concept like Spring Batch?
  27. Who is using Easy Batch?

    Community feedback

    “I use this framework in production (and love it)”, chsFleury / @github

    “Try EasyBatch. The simple stupid Batch framework. Try it once and use it forever.”, Eddy Bayonne / @stackoverflow

    “Loving it so far. Making something I'm working on very simple”, zackehh_ / @twitter

    “Thanks @easy_batch. You guys rock - especially your use of fluent interfaces in your APIs :-) #cleancode”, NorthConcepts / @twitter

    “we have successfully used @easy_batch in production at Leroy Merlin and we love it”, benensi / @twitter