Twitter @sora_h Site Reliability Engineer at Cookpad Global Cookpad TechConf 2017 NOC Rubyist, Ruby committer Interests: Site Reliability, Networking, Distributed systems
in the new region • Building better infra than the existing one, based on our past experience with AWS EC2 and VPC in Japan • e.g. JP: CentOS → Ubuntu, US: Ubuntu only • e.g. JP: weird subnetting, US: private/public subnets 3/3
DNS returns the IP of the region closer to the resolver • If a requested service lives in another region, reverse-proxy to that region • Also, terminating TCP/TLS as close to the user as possible is better for latency. (But serving from only 2 regions is not enough…) 4/7
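The routing rule on this slide can be sketched as a small decision function. This is a minimal illustration, not the actual setup: the service names, the region placement table, and the `route` helper are all hypothetical.

```python
# Hypothetical service placement, for illustration only.
SERVICE_REGIONS = {
    "global-web": {"us-east-1"},
    "jp-web": {"ap-northeast-1"},
    "static": {"us-east-1", "ap-northeast-1"},
}


def route(service, edge_region):
    """Decide what the edge in `edge_region` does with a request for `service`.

    `edge_region` is the region DNS sent the client to (the one closer
    to the client's resolver).
    """
    regions = SERVICE_REGIONS[service]
    if edge_region in regions:
        # The service runs here: serve it locally.
        return ("serve_local", edge_region)
    # The service lives elsewhere: TCP/TLS is still terminated at this
    # edge, and the request is reverse-proxied to a region that has it.
    return ("reverse_proxy", sorted(regions)[0])
```

For example, `route("jp-web", "us-east-1")` yields a reverse-proxy decision towards `ap-northeast-1`, while `route("static", "us-east-1")` is served locally.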
Using consul + consul-template to apply the latest instance list to configurations • Recent AWS Auto Scaling Groups (ASG) allow suspending actions via API, so the global infra relies on ASG (JP uses its own implementation) 6/7
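To make the "apply the latest instance list to configurations" step concrete, here is a rough Python sketch of what a templating step does. It is not consul-template itself (which watches Consul and re-renders Go templates); the `render_upstream` helper, the nginx-style output, and the port are assumptions chosen for illustration.

```python
def render_upstream(name, instances, port=8080):
    """Render an nginx-style upstream block from (ip, healthy) pairs.

    Only healthy instances are listed. The output is sorted so that the
    rendered config is stable when the instance set has not changed.
    """
    lines = [f"upstream {name} {{"]
    for ip, healthy in sorted(instances):
        if healthy:
            lines.append(f"    server {ip}:{port};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical instance list, as a service catalog might report it.
    print(render_upstream("app", [("10.0.1.5", True), ("10.0.0.9", False)]))
```

A stable, sorted rendering matters here because the consumer would typically reload only when the rendered file actually changes.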
running on CircleCI.com • (JP uses GitHub Enterprise) • Deploy: Capistrano-based • Deploy server runs Capistrano in us-east-1 (latency, poor office internet, etc.)
calendar • Muslims refrain from consuming food during Ramadan, fasting from dawn until sunset • They enjoy cooking after sunset • This is the biggest occasion in MENA/Indonesia, for which higher traffic than usual is expected https://en.wikipedia.org/wiki/Ramadan
a lot before Ramadan 2016 than 2015 • So we had to take extra care for the traffic expected in 2016. We did not think our infra and application could survive Ramadan without any preparation. 1/2
DB migration: RDS MySQL (standard EBS) → Amazon Aurora for MySQL • Capacity: Expanding the target of autoscaling • CDN: Switching to Fastly • App: Making a lot of performance improvements 2/2
a lot than usual: disks were getting full early and we had to review the log retention or implement S3 archival • Fixing slow queries was required at higher priority: their impact became far larger than usual
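A log retention review of this kind boils down to deciding, per file, whether to keep it on disk or move it to an archive. The sketch below shows only that decision; the `plan_retention` helper, the 7-day default, and the file list are hypothetical, and the actual upload to S3 is left out.

```python
from datetime import datetime, timedelta


def plan_retention(files, now, keep_days=7):
    """Split (path, mtime) pairs into files to keep and files to archive.

    Files modified within the last `keep_days` days stay on local disk;
    older ones become candidates for archival (e.g. to S3) and deletion.
    """
    cutoff = now - timedelta(days=keep_days)
    keep, archive = [], []
    for path, mtime in files:
        (keep if mtime >= cutoff else archive).append(path)
    return keep, archive


if __name__ == "__main__":
    now = datetime(2016, 6, 10)
    files = [
        ("app.log", datetime(2016, 6, 9)),
        ("app.log.1", datetime(2016, 5, 20)),
    ]
    print(plan_retention(files, now))
```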
is many simple operation requests… • We have to reduce simple “applications” or operations, by: • delegating permissions to devs • automation • Reduce SRE blockers to enable asynchronous work, because developers live all over the world 6/8
• Performance • Architecture • Developers’ Productivity • JP has a lot of useful things; time to import those into global • Be good with developers (DevOps…!) 1/2