From Git Pulls to K8S - Capillary's journey to Zero Touch Deployments

From Git Pulls to K8S: Capillary’s Journey to Zero Touch Deployments
Anshul Sao & Piyush Goel
SaaS @ Scale, July 2020

About Capillary
• Leading SaaS platform for omni-channel customer engagement, commerce and analytics for retail
• 450 million consumers on the platform
• 400+ brands
• 35K stores powered
• 650 employees worldwide
• 14 offices
• 30+ countries
• Singapore, China, India, Indonesia, South Africa, Malaysia, UAE, KSA, Thailand
• Product portfolio
  ◦ Loyalty+
  ◦ SmartStore+
  ◦ Anywhere Commerce+
  ◦ Insights+
  ◦ Engage+

About Capillary - Tech Stack & Scale
• Fully multitenant architecture
• 100+ microservices
• 5 global deployments
• 50 master data shards
• 100-node Spark clusters
• 1000+ servers in AWS regions
• 20 TB ETL runs daily
• 450 million users and counting
• Billions of transactions processed annually
• Polyglot data sources - MySQL, Mongo, Dynamo, HDFS, Parquet

The “Before Christ” Era

The “Before Christ” Era
• Product ran on 4 servers (3 app, 1 db)
• PHP monolith + 2 Java apps
• Code on prod vs code on local box
• Manual syncs and SVN pulls
• Manual verification and QA
• Life was simple!

Initial Growth

Initial Growth
• 10-15 servers (8-10 app, 4 dbs)
  ◦ PHP monolith + 4 Java services
• Service Oriented Architecture
  ◦ Static discovery
• Git with SVN flavour
• Git pulls + rsyncs still rule!
  ◦ Tag-based deployments
  ◦ Should have automated!

Enter the VCs - Series A

Series A - 2012

Series A - 2012
• Launched second cluster - ap-southeast-1
• ~40 servers (30 app servers, 10 dbs)
  ◦ PHP monolith + 10 Java services
  ◦ Discovery via hard-coded ELBs
  ◦ No rolling deployments
• Automation Round - 1
  ◦ Python Fabric to the rescue!
  ◦ Tag instances for discovery
    ▪ PHP - Fabric scripts pull Git tags on app servers and reload
    ▪ Java - Scripts pull JARs from the custom Maven repos and restart the JVM
• Rollbacks were painful... Argh!!

Growth Continues - 2013 Q1

Growth Continues - 2013 Q1
• Entered the Middle East, Africa & Australia in Q1!
• Launched another cluster in eu-west-1
• 100+ servers (~80 apps, 20 dbs)
• Service discovery
  ◦ ZooKeeper - Exhibitor
  ◦ Apache Curator!
  ◦ Deployment order and dependencies made easy!
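
The "deployment order and dependencies" point can be illustrated with a topological sort. This is a minimal sketch, assuming a hypothetical dependency map. The service names are invented for illustration, and the deck does not show how the order was actually derived from ZooKeeper data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: service -> services it depends on.
# The names are illustrative only, not Capillary's real services.
DEPENDENCIES = {
    "php-monolith": {"loyalty-service", "campaign-service"},
    "campaign-service": {"loyalty-service"},
    "loyalty-service": set(),
}


def deployment_order(dependencies):
    """Return services ordered so that each one comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for service in deployment_order(DEPENDENCIES):
        print("deploy", service)
```

A cycle in the map raises `graphlib.CycleError`, which is a useful early signal that two services cannot be released independently.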

Growth Continues - 2013 Q1
• Post-release monitoring became a problem
• Observability wasn’t cool, yet!
  ◦ Logs - 400 GB per day
    ▪ Elasticsearch - too heavy to operate for a 2-member DevOps team
    ▪ Log streaming (Apache Flume) + alerting framework (rule engine) + MongoDB (storage)
    ▪ Hive jobs for log processing and metric aggregation
    ▪ Splunk did exist!
  ◦ Metrics
    ▪ Custom implementation of a time-series store on MySQL
    ▪ Google Charts for visualisation
    ▪ Graphite did exist!
• Re-invented the wheel, unnecessarily!

Entered US - 2013 Q2

Entered US - 2013 Q2
• Yet another cluster - us-west-2
• What are we looking at?
  ◦ 150 servers (~120 apps, 30 dbs)
  ◦ PHP monolith + 15 Java services
    ▪ Too many tags!
    ▪ Different versions
  ◦ Databases
    ▪ 100 schemas - 1600 tables (25 schemas and 400+ tables in each of 4 regions)
    ▪ Inconsistent schemas
  ◦ Too many releases to babysit
    ▪ DevOps team going crazy!

2013 Q3 : Automation Round - 2
• Deployment automation
  ◦ Took inspiration from Yahoo! days - YPM / Igor for the win!
    ▪ Move to self-contained bundles - Debian packages
    ▪ Automated release distribution via Jenkins - Testing, Staging, Production
    ▪ Templatize server states
    ▪ Easy to deploy & roll back (up to 3 versions)
    ▪ Pre-install and post-install steps allow seamless deployments & restarts
  ◦ Deployment times reduced by 75%
  ◦ Did someone say containers?
    ▪ Meh.. Too early for us!
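
The "roll back up to 3 versions" behaviour can be pictured as a bounded history of installed package versions. This is only a sketch under that assumption. The `ReleaseHistory` class is hypothetical and is not the tooling described in the deck.

```python
from collections import deque


class ReleaseHistory:
    """Tracks the current package version plus a bounded list of earlier ones."""

    def __init__(self, max_rollbacks=3):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._previous = deque(maxlen=max_rollbacks)
        self.current = None

    def deploy(self, version):
        """Record a new deployment, keeping the replaced version for rollback."""
        if self.current is not None:
            self._previous.append(self.current)
        self.current = version

    def rollback(self):
        """Return to the most recent retained version."""
        if not self._previous:
            raise RuntimeError("no earlier version retained")
        self.current = self._previous.pop()
        return self.current
```

After five deployments only the three versions before the current one remain available, so a fourth consecutive rollback fails loudly instead of guessing.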

2013 Q3 : Automation Round - 2
• Databases need deployments - duh!!
  ◦ Need version control for DDLs
  ◦ Enter - DBDeploy
  ◦ Customized wrapper on top
    ▪ Reduced inconsistencies significantly
    ▪ DevOps & DBAs were happy!
• Only devs to blame now
• Monitoring & logs
  ◦ Home-grown tools still holding strong!
• No more problems - yay!!
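
The idea behind version-controlled DDLs can be sketched as numbered delta scripts plus a changelog table that records which deltas a database has already received. The example below is an illustration only: it uses SQLite so that it runs anywhere, the table and delta contents are hypothetical, and it does not reproduce DBDeploy itself or the custom wrapper mentioned above.

```python
import sqlite3

# Hypothetical numbered DDL deltas. In a dbdeploy-style setup these would
# be files kept in version control.
DELTAS = {
    1: "CREATE TABLE customer (id INTEGER PRIMARY KEY)",
    2: "ALTER TABLE customer ADD COLUMN email TEXT",
}


def apply_pending(conn, deltas):
    """Apply deltas not yet recorded in the changelog table, in numeric order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (change_number INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT change_number FROM changelog")}
    newly_applied = []
    for number in sorted(deltas):
        if number in applied:
            continue  # this database already has the change
        conn.execute(deltas[number])
        conn.execute("INSERT INTO changelog (change_number) VALUES (?)", (number,))
        newly_applied.append(number)
    conn.commit()
    return newly_applied


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(apply_pending(connection, DELTAS))  # first run applies everything
    print(apply_pending(connection, DELTAS))  # second run finds nothing to do
```

Because the changelog lives in each database, running the same set of deltas against every cluster brings lagging schemas forward without re-applying changes elsewhere.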

2014 Q2 - New Products & Verticals!
• Hypermarkets - data explosion
• Key tables go beyond 500M records each
• Core entities are transactional by nature - MySQL is the king!
• Sharded the DB and the service layers
• Home-grown implementation
• Vitess wasn’t widely popular, yet!
• Multiple copies of the schema in the same cluster
• DBDeploy is still going strong - maintains state on the db instance
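
The deck does not describe how the home-grown sharding picked a shard. As a purely illustrative sketch, assume that data is routed by a tenant identifier, with a stable hash as the default and an explicit override table for very large tenants. Both the shard key and the override mechanism are assumptions.

```python
import zlib


def shard_for(tenant_id, shard_count, overrides=None):
    """Pick a shard for a tenant: explicit override first, else a stable hash."""
    if overrides and tenant_id in overrides:
        return overrides[tenant_id]
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(str(tenant_id).encode("utf-8")) % shard_count


if __name__ == "__main__":
    # Hypothetical tenants: one pinned to a dedicated shard, one hashed.
    pinned = {"hypermarket-chain": 3}
    print(shard_for("hypermarket-chain", 15, pinned))
    print(shard_for("small-brand", 15, pinned))
```

Plain modulo hashing reshuffles most tenants when the shard count changes, which is one reason a lookup table of explicit assignments is often kept alongside it.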

Lessons Learned So Far!
• Automate deployment workflows early, regardless of company stage
• Home-grown tools can lead to confirmation bias!
• Deployment troubles grow exponentially as you add more clusters and microservices
• Schema management should be a part of deployment workflows!

Growth Continues - 2015

Growth Continues - 2015
• 4 clusters
• 500+ servers
• 325 apps + 75 dbs
• 100 devs + 30 QAs
• PHP monolith + 25 Java services in each cluster
• Package-based deployments
• Home-grown tools for post-deployment monitoring & alerts
• 15 master data shards across clusters
• DBDeploy for schema management
• PHP servers were not scaling: unpredictable loads, underutilization of infra resources
• Scale up and down driven by DevOps tickets

Gitflow, CI and Rundeck

No Late night releases!
• Release management was a pain ‘again’
• Branching!! Too many branches to manage, too much code to be merged
• Number of microservices exploded
• Which commits should be merged? Cherry-picking and manual merges
• Stay back to release, keep it safe!
• Gitflow branching model adopted
• Jenkins to build and push artifacts to Debian repos; promotion of packages for full QA control
• Rundeck and rolling releases! Takes care of taking a server offline, releasing to it and moving on to the next
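
The rolling release loop described above (take a server offline, release, move on) can be sketched in a few lines. The four callables are hypothetical hooks standing in for whatever the real job steps do, and the health check is an added assumption that the deck does not mention.

```python
def rolling_release(servers, take_offline, deploy, health_check, bring_online):
    """Release to one server at a time and stop at the first failed health check.

    Returns (released_servers, failed_server). failed_server is None when
    the whole rollout succeeds.
    """
    released = []
    for server in servers:
        take_offline(server)      # e.g. remove from the load balancer
        deploy(server)            # e.g. install the new package version
        if not health_check(server):
            # Leave the failed server out of rotation and halt the rollout.
            return released, server
        bring_online(server)
        released.append(server)
    return released, None


if __name__ == "__main__":
    done, failed = rolling_release(
        ["app1", "app2", "app3"],
        take_offline=lambda s: print("offline", s),
        deploy=lambda s: print("deploy", s),
        health_check=lambda s: True,
        bring_online=lambda s: print("online", s),
    )
    print(done, failed)
```

Only one server is out of rotation at any time, so capacity drops by a single node during the release and a bad build stops after touching one machine.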

Growth Continues - 2016

Docker and Kubernetes

No prior notice for End of Season Sale
• Managing multiple clusters, serving different geographies
• Server estimation, upscaling and downscaling were manual tasks, with devs and DevOps fulfilling requests via tickets
• Lots of wastage, as the cluster size stayed constant during non-peak hours too
• Increased downtimes in case of performance bugs
• Debian to Docker
• Gitflow along with Docker images (the same image is still not promoted from environment to environment. Limitation!!)
• ECR for the image repository
• Migrated all config files to env variables
• Created Capillary custom CI
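
Moving configuration from files to environment variables usually means the application builds its settings from the process environment at start-up. A minimal sketch, assuming hypothetical variable names (`DB_HOST`, `DB_PORT`, `LOG_LEVEL`) that are not taken from the deck:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    db_host: str
    db_port: int
    log_level: str


def load_config(env=None):
    """Build the config from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    return AppConfig(
        db_host=env["DB_HOST"],                   # required: KeyError if missing
        db_port=int(env.get("DB_PORT", "3306")),  # optional, with a default
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


if __name__ == "__main__":
    print(load_config({"DB_HOST": "db.internal"}))
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, while the same container image can run in any environment with only its variables changed.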

Deployer (Capillary CI)

Jenkins is good. We are too opinionated.
• How to manage Kubernetes deployments?
• Should developers have kubectl access?
• Every deployment meant different env vars; Makefiles are so old school and unmanageable
• Build selection and updating in YAML is a pain
• Created a Helm-based build & deployment system
• Build selection is UI driven, with SSO and access control
• HPA configs can be easily defined with CPU thresholds in the UI
• Deployment-specific environment variable management in the UI
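
For context on what a CPU threshold in an HPA config controls: the Kubernetes Horizontal Pod Autoscaler documentation gives the scaling rule as desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below is a simplified illustration of that rule. The real controller also accounts for pod readiness, missing metrics and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA rule: scale in proportion to current vs target CPU."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% average CPU against a 60% target.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
```

A lower CPU target scales out earlier and leaves more headroom for sudden load, at the cost of running more pods in steady state.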

YAML Fatigue is Real!

Replacing DBDeploy

So Long!
• Mutations are tracked and version controlled, but status in each cluster can vary
• A developer can write a bad query or a bad undo query, wreaking havoc
• Versions can grow really fast and it can be overwhelming to comprehend the final state
• How to avoid alters on big tables?
• Track the final version rather than mutations
• Limit permissible operations, no drops
• Schema diff to find the transformation to get to the final state. More predictable!
• Types of data
  ◦ schema
  ◦ database_view
  ◦ seed_data

State vs Transition Management
• Manage Transitions
  ◦ Used in current db migrate
  ◦ Each change is versioned
  ◦ Change 1: CREATE TABLE tbl1( col1 int PRIMARY KEY );
  ◦ Change 2: ALTER TABLE tbl1 ADD COLUMN col2 int;
• Manage States
  ◦ To be used in Capillary Cloud
  ◦ States are versioned
  ◦ State 1: CREATE TABLE tbl1( col1 int PRIMARY KEY );
  ◦ State 2: CREATE TABLE tbl1( col1 int PRIMARY KEY, col2 int );
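
State-based management means the tool derives the mutations itself by comparing the current schema with the desired one. The sketch below shows the idea on a simplified model where a schema is a dict of table name to column definitions. It is an illustration under that assumption, not the schema sync built for Capillary Cloud. In line with the "no drops" rule it never emits destructive statements and reports such differences for review instead.

```python
def schema_diff(current, desired):
    """Return (statements, needs_review) to move `current` towards `desired`.

    Both arguments map table name -> {column name: column definition}.
    Only additive DDL is generated; anything else is listed for review.
    """
    statements, needs_review = [], []
    for table, columns in desired.items():
        if table not in current:
            cols = ", ".join(f"{name} {ddl}" for name, ddl in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        existing = current[table]
        for name, ddl in columns.items():
            if name not in existing:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ddl};")
            elif existing[name] != ddl:
                needs_review.append(f"{table}.{name}: definition changed")
        for name in existing:
            if name not in columns:
                needs_review.append(f"{table}.{name}: not in desired state")
    for table in current:
        if table not in desired:
            needs_review.append(f"{table}: not in desired state")
    return statements, needs_review


if __name__ == "__main__":
    state_1 = {"tbl1": {"col1": "int PRIMARY KEY"}}
    state_2 = {"tbl1": {"col1": "int PRIMARY KEY", "col2": "int"}}
    print(schema_diff(state_1, state_2))
```

Applied to the `tbl1` example, diffing state 1 against state 2 yields the same `ALTER TABLE ... ADD COLUMN col2 int` that the transition-based approach had to store by hand.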

Capillary Cloud - Idea

The Uber Orchestrator
• Single reusable definition of the stack, which can be easily launched and managed
  ◦ Opinionated JSON files to define everything in the stack
• Application service discovery is redundant
  ◦ Kubernetes namespaces
• Manage application access to DB
  ◦ TF providers to fulfill application requests with access control
• How to avoid messy nginx configurations and domain SSL management?
  ◦ Application definitions contain ingress rules
  ◦ TF providers to manage domains
• How to reference dynamic cloud objects without hardcoding or maintaining region-specific configs?
  ◦ DSL to refer to any entity in the stack
• Monitoring, APM and alerts
  ◦ Reusable hardened TF modules to achieve high observability. Prometheus!
• What about schema management?
  ◦ A new kind of schema sync which compares end states and does a schema diff to get the mutations
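
The deck does not show the reference DSL itself. To illustrate the general idea of referring to stack entities instead of hardcoding region-specific values, the sketch below invents a `${kind.name.attribute}` syntax and resolves it against a dict of provisioned resources. The syntax, the resource names and the attribute names are all hypothetical.

```python
import re

# Hypothetical reference syntax: ${kind.name.attribute}
REF = re.compile(r"\$\{([A-Za-z0-9_.-]+)\}")


def lookup(stack, path):
    """Walk a dotted path through the nested stack dict."""
    node = stack
    for part in path.split("."):
        node = node[part]  # KeyError signals a dangling reference
    return node


def resolve(value, stack):
    """Recursively replace ${...} references inside a JSON-like structure."""
    if isinstance(value, dict):
        return {key: resolve(item, stack) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve(item, stack) for item in value]
    if isinstance(value, str):
        return REF.sub(lambda match: str(lookup(stack, match.group(1))), value)
    return value


if __name__ == "__main__":
    stack = {"mysql": {"orders": {"host": "orders.db.internal", "port": 3306}}}
    app = {"env": {"DB_URL": "mysql://${mysql.orders.host}:${mysql.orders.port}/orders"}}
    print(resolve(app, stack))
```

The application definition stays identical across regions; only the resolved values differ, because they come from whatever each stack actually provisioned.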

A Typical deployment

Automated Rollbacks
• Pipeline for automated validation of builds

Capillary 2020

Learnings
• Late adoption of containers because of complex infra and priority constraints
  ◦ Make room for deployment debts along with business growth and priorities
• We tried Deis for deployments, which had limited community and support; wasted months of effort
  ◦ Always choose external dependencies with good community support
• Trying to make everything generic consumes a lot of time with limited immediate benefits
  ◦ It’s OK to be opinionated, to move fast and to align with internal company development processes
• Manual interventions/steps will cause problems in the long run
  ◦ Automate everything

Co-ordinates
• Anshul Sao
• Piyush Goel (@pigol1)
