No. 1 News App User Base*1 — an everyday habit of consumers makes up the largest user base in Japan
… users
Used for about 16.7 minutes*2 per day per person (consolidated for Japan and US, as of August 2019)
*1 Source: Nielsen Mobile NetView, January 2021 (SmartNews app user base calculated from the number of installs of the SmartNews app)
*2 In-house figures, average for January 2021
FastAPI makes prompt service launch possible!
• Simple DI (dependency injection) function
• Easy to learn, development can begin quickly
"FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints." — FastAPI
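Not from the slides, but as an illustration of how little code a first endpoint needs, a minimal sketch (route and names hypothetical):

```python
from typing import Optional

from fastapi import FastAPI

app = FastAPI()

# Type hints drive request validation and the auto-generated OpenAPI docs.
@app.get("/items/{item_id}")
async def read_item(item_id: int, q: Optional[str] = None):
    return {"item_id": item_id, "q": q}
```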
2-1. Determine performance targets
There is no end to tuning if there is no target! (Because it's so much fun!)
• Throughput: how many requests can be handled in a given time?
• Latency: how much time does it take to handle one request?
2-1. Determine performance targets
Little's law: L = λW
L … the average number of items in the queuing system
λ … the average number of items arriving at the system per unit of time
W … the average time an item spends waiting in the queuing system
In terms of web application performance, this means...
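As a worked reading of the law in web terms (not on the slide; it uses the figures from the example that follows and assumes one request in flight per core):

```latex
L = \lambda W
\quad\Longrightarrow\quad
W = \frac{L}{\lambda}
  = \frac{5\ \text{requests in flight (one per core)}}{100\ \text{rps}}
  = 50\ \text{ms}
```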
2-1. Determine performance targets
Method of determining performance targets: example
Assume a target of 1000 rps for the whole service and the use of ten 5-core instances (or Pods).
Since each 5-core instance should then handle 1000/10 = 100 rps, set the performance target per instance as:
Throughput: 100 rps (or more) / Latency: 50 ms (or less)
Note: it may not always be possible to use 100% of the CPU, I/O and concurrency also need to be considered, and the number of CPU cores a K8s pod can use does not correspond exactly to its limit, so the actual calculation is quite complex; this is just a rough estimate.
2-2. Identify measurements and bottlenecks
Do not guess, measure. Bottlenecks cannot be identified without measuring, and there is no point in tuning what is not a bottleneck.
"Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is. Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest." — Rob Pike: Notes on Programming in C — [Wikipedia] Pike: Notes on Programming in C (Japanese)
2-2. Identify measurements and bottlenecks
3. Resolve bottlenecks — the feedback loop:
1. Prepare the environment for load testing
2. Add load and check the load situation of the whole system
3. Is the application the bottleneck? NO → correct until the bottleneck becomes part of the application; YES → continue
4. Check the application's load situation: recreate the same conditions locally as much as possible, and measure while adding load
5. Correct the part that is most likely to be the bottleneck
6. Is the bottleneck resolved? Has performance improved? Deploy in the load-testing environment and measure
7. Met performance targets? YES → END 🎉 / NO → repeat from step 2
Tools for measurement: APM, fastapi_profiler, py-spy — they speed up the feedback loop
2-2. Identify measurements and bottlenecks
Add load: Locust
• Write scenarios with a Python API
• Simple and adequate settings
• Can be run in clusters for large-scale loads
"Locust is an easy to use, scriptable and scalable performance testing tool." — Locust
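A minimal scenario sketch (not from the slides; class name and endpoint are hypothetical). Running `locust -f locustfile.py --host http://localhost:8000` drives it from the web UI:

```python
from locust import HttpUser, between, task

class ApiUser(HttpUser):
    # each simulated user waits 0.5–2 s between tasks
    wait_time = between(0.5, 2.0)

    @task
    def read_item(self):
        self.client.get("/items/1")  # hypothetical endpoint of the service under test
```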
2-2. Identify measurements and bottlenecks
Add load while increasing the number of simultaneous connections until latency worsens, and check the maximum throughput along with the rps and median/95th-percentile latency.
2-2. Identify measurements and bottlenecks
Check the load situation of the whole system: Datadog
• Q: Has the application become the bottleneck as load is added?
• There is no point in tuning the application until the application becomes the bottleneck.
• Middleware, such as databases, usually becomes the bottleneck first.
• Tune until the application throughput stops increasing even though resources, such as upstream services and connected middleware, still have capacity available.
• Today, I will only be talking about tuning the application 🙇
So, the rest of the presentation is based on the assumption that the application is the bottleneck.
2-2. Identify measurements and bottlenecks
Check the CPU usage of middleware and the changes in the number of connections, along with CPU/memory usage, etc.
Although detailed status can also be checked by attaching to the containers, this view is very convenient for grasping the big picture.
2-2. Identify measurements and bottlenecks
Application measurement (1): Datadog APM
• Needs only a few lines of launch code
• Gets data from the real operating environment
• Helps find bottlenecks, as most of the necessary information is available
• But it is expensive 💸, so workarounds are needed, such as applying it only to specific instances...
"Datadog APM and Continuous Profiler provide detailed visibility into the application with standard performance dashboards for web services, queues, and databases to monitor requests, errors, and latency." — Datadog APM
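One common way to add those launch lines with dd-trace-py is `ddtrace-run uvicorn main:app`; an in-code sketch (my assumption, not the slide's setup) patches before the app modules are imported:

```python
# Enable Datadog tracing first, so later imports get auto-instrumented.
from ddtrace import patch_all

patch_all()

from fastapi import FastAPI  # imported after patching

app = FastAPI()
```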
2-2. Identify measurements and bottlenecks
Application measurement (2): fastapi_profiler
"A FastAPI middleware of pyinstrument to check your service code performance." — fastapi_profiler
• Integrates pyinstrument as a FastAPI middleware
• Useful for measuring CPU time while modifying code locally
• Cannot be used in the production application or under load, because the impact on performance is high
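A minimal wiring sketch (assuming the middleware class name from the fastapi_profiler README); for local development only, given the overhead:

```python
from fastapi import FastAPI
from fastapi_profiler import PyInstrumentProfilerMiddleware

app = FastAPI()
# pyinstrument profiles every request and reports where CPU time was spent
app.add_middleware(PyInstrumentProfilerMiddleware)
```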
2-2. Identify measurements and bottlenecks
Application measurement (3): py-spy
• Useful for measuring CPU in a local/load environment
• Also useful when the GIL or multi-threaded processing could be the bottleneck (py-spy top shows GIL usage)
• fastapi_profiler cannot measure threads that are not under the FastAPI request path, but py-spy can
"It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way." — py-spy
The serving stack, top to bottom:
• <Process Manager> Gunicorn — launches the process and moves forks and sockets to the workers
• <ASGI Server> Uvicorn — request/response, asynchronous processing
• <ASGI Framework> Starlette — func call / callback
• <Web Framework> FastAPI — extends the layer below
• <Application> Worker applications — business logic
"Gunicorn supports working as a process manager and allowing users to tell it which specific worker process class to use. Then Gunicorn would start one or more worker processes using that class. And Uvicorn has a Gunicorn-compatible worker class." — Server Workers - Gunicorn with Uvicorn
The same stack, annotated with what we found:
• <Process Manager> Gunicorn — no impact on performance, because it only deals with process management
• <ASGI Server> Uvicorn — "Fastest Python ASGI"
• <Web Framework> FastAPI — it is not faster than Starlette, but...
• Items below Starlette did not pose a bottleneck this time
3-1. Slow I/O bound processing
✤ Symptom: throughput does not increase even when the CPU is not used up
• Check CPU usage with top, vmstat, etc.
✤ Causes
• Processing that requires network I/O
• E.g.: access to a DB or other middleware
• File writing
• etc., etc.
3-1. Slow I/O bound processing
Countermeasure (1): Use asyncio
"asyncio is a library to write concurrent code using the async/await syntax." — asyncio
• Use asyncio-compatible libraries
• E.g.: aioredis/httpx …
On FastAPI, simply define the routing method as async and use await inside the async method. We set this as the default for anything involving the network, so there were hardly any problems with I/O.
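A sketch of that pattern (endpoint and backend URL are hypothetical): the route is declared async and the network call is awaited, so the event loop can serve other requests while the I/O is in flight:

```python
import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/profile")
async def get_profile(user_id: str):
    # await yields the event loop to other requests during the round trip
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://backend.example.com/users", params={"id": user_id}
        )
    return resp.json()
```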
3-1. Slow I/O bound processing
Countermeasure (2): Cache
• Even within the same datacenter, a network round trip takes on the order of thousands of times longer than a main-memory reference (500,000 ns vs 100 ns)
• Cache retrieved results in application memory if some staleness in the results can be tolerated (see the sketch below)
Latency Comparison Numbers (~2012)
----------------------------------
Main memory reference: 100 ns
Round trip within same datacenter: 500,000 ns = 500 µs
— Latency Numbers Every Programmer Should Know
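A minimal in-process TTL cache sketch (names and TTL are hypothetical); as the slide notes, this only fits data where some staleness is acceptable:

```python
import time
from typing import Any, Callable, Dict, Tuple

_cache: Dict[str, Tuple[float, Any]] = {}

def get_with_cache(key: str, loader: Callable[[str], Any], ttl: float = 60.0) -> Any:
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < ttl:
        return hit[1]  # memory reference: ~100 ns order
    value = loader(key)  # network round trip: ~500 µs order
    _cache[key] = (now, value)
    return value
```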
3-2. Slow CPU bound processing — now, let's focus on application tuning!
✤ Symptoms
• High CPU usage / high LA
• Throttling in a k8s environment
✤ Main causes
• Simple heavy calculations such as floating-point calculations
• Inefficient algorithms
• etc., etc.
3-2. Slow CPU bound processing
Countermeasure (1): Use prior calculation/caching
Either calculate offline in advance, or bypass heavy calculations by caching calculation results.
Example: urllib/parse.py takes up a lot of time → cache the URL calculation results (see the sketch below).
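A sketch of the caching side (function name hypothetical): if the same URLs recur across requests, memoizing the parse bypasses the repeated urllib/parse.py work:

```python
from functools import lru_cache
from urllib.parse import urlparse

@lru_cache(maxsize=4096)
def cached_urlparse(url: str):
    # first call pays the parsing cost; repeats are a dict lookup
    return urlparse(url)
```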
3-2. Slow CPU bound processing
Countermeasure (2): Cython
"Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language" — Cython
• For calculations that run on each request and have a high computational load
• E.g.: calculating a geohash for the requested location information
• Pre-compiles to a low-level language such as C
• Can be applied partially, such as per function
Example: calculate a geohash (map tile) from the specified location information.
The implementation is written into a .pyx file and compiled; the caller stays the same as usual Python.
Prepare a setup.py like this; it compiles during docker build.
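A minimal build sketch (module name hypothetical; the slide's actual setup.py is not reproduced here):

```python
# setup.py — compiled during docker build, e.g. `python setup.py build_ext --inplace`,
# so the C extension ships inside the image.
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("geohash_fast.pyx", language_level=3))
```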
3-2. Slow CPU bound processing
Countermeasure (3): numpy
• Optimize the use of the CPU cache by using numpy vector calculation instead of loop processing
• E.g.: calculating the distance to each coupon from the requested location information
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference: 0.5 ns
L2 cache reference: 7 ns (14x L1 cache)
Main memory reference: 100 ns (20x L2 cache, 200x L1 cache)
— Latency Numbers Every Programmer Should Know
We want to calculate online the distance between the requested location and each coupon. Precompute the coupon location vectors, then use vector calculation for future requests.
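A sketch of the idea (names and coordinates are placeholders): the coupon coordinates are vectorized once up front, so each request is a single broadcast operation instead of a Python loop:

```python
import numpy as np

# precomputed once: an (N, 2) array of coupon coordinates (placeholder data here)
coupon_xy = np.random.rand(100_000, 2)

def distances_from(request_xy: np.ndarray) -> np.ndarray:
    diff = coupon_xy - request_xy          # broadcasting: (N, 2) - (2,)
    return np.sqrt((diff * diff).sum(axis=1))

nearest = distances_from(np.array([0.5, 0.5])).argmin()
```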
✤ Symptoms
• Time is being spent in threads other than FastAPI's
• Check with Datadog
• GIL values are low in py-spy top
✤ Causes
• A library called loguru was writing logs in a separate thread (multi-threaded), and that was taking a long time
• Python can handle threads, but it is not efficient because of the GIL
GIL / Global Interpreter Lock
"In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once." — Global Interpreter Lock
• Even if you set up multiple threads, only one thread runs at a time within a process
• In Python, parallel processing is therefore handled with multi-processing; combining that with concurrent processing via asyncio can solve the problem (see the sketch below)
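A sketch of that combination (not the slide's code): a CPU-bound function runs in worker processes, each with its own GIL, while the event loop keeps serving requests:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

pool = ProcessPoolExecutor()  # separate processes, each with its own GIL

def heavy_calc(n: int) -> int:
    # CPU-bound work that would otherwise hold this process's GIL
    return sum(i * i for i in range(n))

async def handler() -> int:
    loop = asyncio.get_running_loop()
    # awaiting the executor keeps the event loop free for other requests
    return await loop.run_in_executor(pool, heavy_calc, 10_000_000)

if __name__ == "__main__":
    print(asyncio.run(handler()))
```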
Writing the log information is taking a long time — it seems to be retrieving items from a queue and deserializing them.
If _enqueue is true, loguru creates a thread and a SimpleQueue for log output.
multiprocessing.SimpleQueue is actually a pipe for communication between processes.
Data transfer to and from the WriterThread seems to be slow.
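The slides do not show the final fix; a hypothetical way to confirm the queue is the cost is to toggle enqueue on the sink:

```python
import sys

from loguru import logger

logger.remove()                        # drop the default sink
logger.add(sys.stderr, enqueue=False)  # log synchronously: no SimpleQueue/WriterThread hop
logger.info("hello")
```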
3-4. Improving overall performance through changes in the processing system
• Alternative processing systems to the CPython we usually use
• Processing systems that compile to a lower-level language, that have a JIT compiler, etc.
• I tried two of them, PyPy and cinder, and selected PyPy this time
• If the CPU is the bottleneck, an overall performance improvement can be expected
3-4. Improving overall performance through changes in the processing system
"Cinder is Instagram's internal performance-oriented production version of CPython 3.8. It contains a number of performance optimizations, including bytecode inline caching, eager evaluation of coroutines, …" — cinder
• A performance-improved CPython from Facebook (Instagram)
• We saw a performance improvement of about +10% in our environment
• It is highly compatible with CPython and most of its libraries
• However, you need to build it yourself, and I gave up on it because the GitHub repo is not very active and there was no documentation, so it would have been difficult to manage
3-4. Improving overall performance through changes in the processing system
"On average, PyPy is 4.2 times faster(!) than CPython" — PyPy
• JIT compiler / incminimark GC
• It is also quite compatible with CPython; there have been no compatibility-related problems so far
• The performance benefits were so great that we now use it in production
• We confirmed a performance improvement of nearly 40%
3-4. Improving overall performance through changes in the processing system
Problems faced in replacement and operation:
• The latest version is 3.7
• Some libraries are not available
• OOM death due to omission of GC option specification
• Memory leak when combined with FastAPI?
3-4. Improving overall performance through changes in the processing system
Problems in PyPy (1): 3.7 is the latest version
✤ Problem: 3.7 is the latest version as of 2021 (3.8 is in beta)
✤ Solution
• Since I was developing on 3.8, I downgraded some parts (see the sketch below):
• Walrus operator :=
• Positional-only arguments def huga(hoge, /, …)
• I have not faced any real problems
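A sketch of the two rewrites (hoge/huga are the slide's placeholder names):

```python
data = list(range(20))

# Walrus operator (Python 3.8):  if (n := len(data)) > 10: ...
# Python 3.7 rewrite: assign first, then test.
n = len(data)
if n > 10:
    pass

# Positional-only parameters (Python 3.8):  def huga(hoge, /, *args): ...
# Python 3.7 rewrite: drop the `/` marker (hoge can then also be passed by keyword).
def huga(hoge, *args):
    return hoge
```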
3-4. Improving overall performance through changes in the processing system
✤ Problem: PEP 517 — libraries that involve building external sources are almost always a problem
• E.g.:
• orjson … high-speed JSON serializer written in Rust 😭
• dd-trace-py … for measuring Datadog APM 😭
• fastapi-profiler … the profiler that has been working wonders 😭
• black (caused by typed-ast) … linter/formatter, used through pysen 😭
✤ Solution
• Use separate processing systems for the development and deployment environments
3-4. Improving overall performance through changes in the processing system
Problems in PyPy (2): Some libraries cannot be used
• For development: CPython 3.7 (we also run UT on CI)
• For production: PyPy3.7
• Since code can pass on CPython but not in PyPy, we run e2e tests using the production containers during CD
3-4. Improving overall performance through changes in the processing system
Problems in PyPy (3): OOM death due to omission of GC option specification
✤ Problem: killed by the OOMKiller
• PyPy's GC is incminimark
• GC options must be specified
✤ Solution
• Output GC debug information with PYPY_GC_DEBUG=2
• Restrict the heap limit with PYPY_GC_MAX
3-4. Improving overall performance through changes in the processing system
GC option specification: at least specify PYPY_GC_MAX, and ensure that MaxWorker × PYPY_GC_MAX fits within the memory available to the pod.
3-4. Improving overall performance through changes in the processing system
Problems in PyPy (4): Memory leak when combined with FastAPI?
✤ Problem: memory increases to the maximum and the process dies with the OOMKiller
• Even on echo servers, memory increases monotonically under load
• I had not set KeepAlive on nginx...
✤ Solution
• Resolved by connecting with HTTP/1.1 (KeepAlive by default)
• Also set timeout-keep-alive on uvicorn (see the sketch below)
• I have not dug deeply into this, but it seems to be a connection problem, so it may be due to uvicorn rather than FastAPI
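A sketch of the uvicorn side (the value is hypothetical; timeout_keep_alive is uvicorn's knob, in seconds, for how long idle HTTP/1.1 connections stay open):

```python
import uvicorn

if __name__ == "__main__":
    # align keep-alive with the upstream nginx settings so connections are reused
    uvicorn.run("main:app", host="0.0.0.0", port=8000, timeout_keep_alive=75)
```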