Current 2023: 3 Mistakes We Made So You Won't Have To

3 Flink Mistakes We Made So You Wonʼt Have To
Robert Metzger, Staff Engineer @ Decodable Apache Flink Committer and PMC Chair Sharon Xie, Founding Engineer @ Decodable

What weʼll be talking about today #1 Data Loss with
Flink Exactly-Once Delivery to Kafka #2 Inefficient Memory Configuration #3 Inefficient Checkpointing Config

#1 Data Loss with Flink Exactly-Once Delivery to Kafka

Two Phase Commit for EO  Happy Path

Two Phase Commit for EO  Phase 1 Failure

Two Phase Commit for EO  Phase 2 Failure

Life is doomed when… Phase 2 can’t be successful 💣🔥

Important Kafka Broker Configurations transaction.max.timeout.ms • Default: 900000 (15 minutes)
transactional.id.expiration.ms • Default: 604800000 (7 days)

Timeout Causes Data Loss

• Flink Kafka Producer creates a new transaction id for
each checkpoint per task • transactional.id.expiration.ms = 604800000 (7 days) Excessive Memory Usage

• transaction.max.timeout.ms = 604800000 (7 days) ◦ From default: 15min
• transactional.id.expiration.ms = 3600000 (1 hour) ◦ From default: 7 days Better Kafka Transaction Configuration

When a checkpoint/savepoint to restore is over 1 hour (the
new transactional.id.expiration.ms) old org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id. InvalidPidMappingException

Short-term: Ignore InvalidPidMappingException 😇 • ONLY when transaction.timeout.ms (Kafka client
configuration in Flink) > transactional.id.expiration.ms Long-term: 🤝 • KIP-939: Support Participation in 2PC • FLIP-319: Integrate with Kafka's Support for Proper 2PC Participation Fix InvalidPidMappingException

Flink Exactly-Once Delivery to Kafka ✅ #2 Inefficient Memory Configuration #3 Inefficient Checkpointing Config

#2 Inefficient Memory Configuration

How to Tune TaskManager Memory • Flink automatically computes memory
budgets Just provide total process size. • Main memory consumers ◦ Framework + Task heap ◦ RocksDB State backend (off-heap) ◦ Network stack (off-heap) ◦ JVM internal structures [metaspace, thread stacks] (off-heap)

How to Tune TaskManager Memory • Example: taskmanager.memory.process.size: 8gb JVM
internal structures [metaspace, thread stacks] (off-heap) Framework + Task heap RocksDB State backend (off-heap) Network stack (off-heap)

How to Tune TaskManager Memory • Letʼs tune for this
particular job 150mb 700mb 2300mb = 3150mb unused memory

How to Tune TaskManager Memory • Give as much memory
as possible to Managed Memory = RocksDB taskmanager.memory.task.heap.size: 1 gb taskmanager.memory.managed.size: 5800 mb taskmanager.memory.network.min: 32 mb taskmanager.memory.network.max: 32 mb taskmanager.memory.jvm-metaspace.size: 120 mb

• Stateful workloads with RocksDB benefit most from as much
memory as possible → Check out the full documentation: https://nightlies.apache.org/flink/flink-doc s-master/docs/deployment/memory/mem_ setup/ Memory Configuration Wrap Up

Flink Exactly-Once Delivery to Kafka ✅ #2 Inefficient Memory Configuration ✅ #3 Inefficient Checkpointing Config

execution.checkpointing.interval: 10s execution.checkpointing.min-pause: 10s Make sure your job is not
spending all the time checkpointing Image source: https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/large_state_tuning/#tuning-checkpointing #3 Reliable, Fast Checkpointing

state.backend: rocksdb state.backend.incremental: true Only upload the diff to the
last checkpoint #33 full #34 incremental #35 incremental Reliable, Fast Checkpointing

state.backend.local-recovery: true Local recovery: Only re-download the state on failed
machines After a failure without local recovery: All TaskManagers download the state TM1 TM2 TM3 TM4 1  TM4 fails TM1 TM2 TM3 TM4 2  Recovery With local recovery: Most machines use local disks, only one needs to download TM1 TM2 TM3 TM4 1  TM4 fails TM1 TM2 TM3 TM4 2  Recovery Reliable, Fast Checkpointing

Fast Checkpointing and State Put your RocksDB state on the
fastest available disk. Typically a local SSD. TaskManager Your Flink Worker Remote EBS Volume Your Flink Worker TaskManager Local SSD

The End – Q&A Robert Metzger, Staff Engineer @ Decodable
Apache Flink Committer and PMC Chair Sharon Xie, Founding Engineer @ Decodable Get your free decodable.co account today if you want us to handle the issues discussed in the talk. Visit the Decodable Booth 201) for any Flink related questions.

Fast Checkpointing and State • RocksDB stores your state on
the /tmp directory • On AWS Kubernetes, thatʼs by default an EBS volume Type Size IOPS (max) Throughput Price per Month io1 950 GB 64000 $4278 io2 block express 950 GB 256000 $9769 gp3 950 GB 16000 1000 mb/s $176 M6gd.4xlarge 64g | 16c 950 GB Read: 93000 Write: 222000 $78 per instance for a local NVMe SSD → Using an instance type with a local SSD gives you by far the best performance per $ We just mount the entire Docker working directory on the local SSD.

• Flink EO with Kafka can still cause data loss
• Transaction timeout is the key • Flink EO implementation can consume excessive memory from Kafka • A better approach with Flink + Kafka is under way Recap

Current 2023: 3 Mistakes We Made So You Won't H...

Current 2023: 3 Mistakes We Made So You Won't Have To

Sharon Xie

More Decks by Sharon Xie

Featured

Transcript

3 Flink Mistakes We Made So You Wonʼt Have To

What weʼll be talking about today #1 Data Loss with

#1 Data Loss with Flink Exactly-Once Delivery to Kafka

Two Phase Commit for EO  Happy Path

Two Phase Commit for EO  Phase 1 Failure

Two Phase Commit for EO  Phase 2 Failure

Life is doomed when… Phase 2 can’t be successful 💣🔥

Important Kafka Broker Configurations transaction.max.timeout.ms • Default: 900000 (15 minutes)

Timeout Causes Data Loss

• Flink Kafka Producer creates a new transaction id for

• transaction.max.timeout.ms = 604800000 (7 days) ◦ From default: 15min

When a checkpoint/savepoint to restore is over 1 hour (the

Short-term: Ignore InvalidPidMappingException 😇 • ONLY when transaction.timeout.ms (Kafka client

What weʼll be talking about today #1 Data Loss with

#2 Inefficient Memory Configuration

How to Tune TaskManager Memory • Flink automatically computes memory

How to Tune TaskManager Memory • Example: taskmanager.memory.process.size: 8gb JVM

How to Tune TaskManager Memory • Letʼs tune for this

How to Tune TaskManager Memory • Give as much memory

• Stateful workloads with RocksDB benefit most from as much

What weʼll be talking about today #1 Data Loss with

execution.checkpointing.interval: 10s execution.checkpointing.min-pause: 10s Make sure your job is not

state.backend: rocksdb state.backend.incremental: true Only upload the diff to the

state.backend.local-recovery: true Local recovery: Only re-download the state on failed

Fast Checkpointing and State Put your RocksDB state on the

The End – Q&A Robert Metzger, Staff Engineer @ Decodable

Fast Checkpointing and State • RocksDB stores your state on

• Flink EO with Kafka can still cause data loss