OF TUNING TO GET A HADOOP CLUSTER RUNNING WELL. There are large companies that make money soley by configuring and supporting hadoop clusters for enterprise customers.
version of a cloudera python script to launch and bootstrap a cluster. We ran on spot instances Cluster boot up time SUCKED and often failed. We paid for instances during bootstrap and configuration Our jobs weren't designed to tolerate things like spot instances going away in the middle of a job. Drinking heavily dulled the pain a little.
(only took nine months!) Data pipelines were broken out into a handful of fault- tolerant jobflow steps; each steps writes output to S3. EMR supported spot instances at this point.
into stages; write out to S3 after each stage. Bake in (a lot) of variability into your expected jobflow run times. Compress the data your are reading and writing from S3 as much as possible. Drinking helps.