Communication Planning Div., Open Source Program Group (Toyota OSPO), TOYOTA MOTOR CORPORATION
Works: R&D for Connected Vehicle Infrastructure (E2E Observability, Standardization, …); Toyota OSPO Co-Lead
Keywords: Operating System, Cloud Infrastructure, etc.
https://github.com/thatsdone
https://www.linkedin.com/in/masanori-itoh-6401603/
A High Level R&D POC
A POC System of Connected Vehicles Service Data Processing
[Architecture diagram: pseudo vehicle generators (CAN, Camera, DynamicMap, others) attach via UEs over 5G slices (Slice#1 latency-oriented, Slice#2 bandwidth-oriented) through gNBs and UPFs to two on-premise edge locations (Auth GW, Dispatcher, Offload Process) and public clouds (LB, Ops Subsystem, Data Accumulation, Orchestration), interconnected by dedicated lines/VPNs with failover paths. Boxes with red dashed lines mark the subsystems of interest here.]
Motivation and the Idea (a bit Low-Level)
Motivation
- To have fleets of virtual connected vehicles in my lab
Idea
- Run AGL/aarch64 instances using Linux/KVM via libvirt on ARM servers
- Libvirt is commonly and widely used in virtualization/private cloud infrastructure, so rich know-how is available on the internet
Design
Design Points
- Use libvirt with KVM acceleration on an ARM server (no emulation!) instead of using 'runqemu'
  - Libvirt also enables centralized control/management
- Use the 'backing_file' feature to reduce disk space consumption
Additional Software Components on top of AGL
- Install UERANSIM to make AGL instances 5G-UE equipped
  - https://github.com/aligungr/UERANSIM
  - UERANSIM communicates with an OSS-based 5GC (built using Free5GC)
- Add observability-related components
  - Tracing: OpenTelemetry (otelcol-contrib)
  - Metrics: Prometheus Node Exporter
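For reference, a minimal sketch of launching one AGL guest under libvirt this way. The guest name, image path, sizing, bridge name, and boot method below are illustrative assumptions, not taken from the actual setup; depending on how the AGL image was built, a direct kernel boot (--boot kernel=...,initrd=...) may be needed instead of UEFI.

# Define and start an AGL aarch64 guest under libvirt/KVM (run on the ARM host).
virt-install \
  --name agl-vm-1 \
  --virt-type kvm \
  --arch aarch64 \
  --boot uefi \
  --memory 4096 --vcpus 2 \
  --disk path=/var/lib/libvirt/images/agl-19.0.0-1.img,format=qcow2 \
  --network bridge=br0 \
  --os-variant generic \
  --import \
  --noautoconsole

# Afterwards the guest can be managed centrally with virsh:
virsh list --all
virsh console agl-vm-1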
Performance Evaluation Results
Notes
- Scores are the "System Benchmarks Index Score" of UnixBench
  - Geometric average of relative measured results (like Dhrystone) against the base hardware (SUN SPARCstation 20 SM61)
- #2~4 are results on the same physical server (AWS EC2 Graviton3 bare-metal instance, c7g.metal)
- OSes of #2~5 are aarch64; the RasPi of #1 is 32-bit (armhf)
- The ARM server of #8~9 is a SuperMicro ARS-110M-NR (Ampere Altra Max, 128 cores / 3.0 GHz)
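As a rough sketch of how that index is composed (this is the general UnixBench scoring scheme, not something stated on the slide; the factor of 10 is the conventional per-test scaling):

$$ \text{Index} \approx 10 \times \left( \prod_{i=1}^{n} \frac{\text{result}_i}{\text{baseline}_i} \right)^{1/n} $$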
Performance Evaluation Results
Conclusion
- We can run AGL with sufficient practical performance using VMs (qemu) with KVM acceleration on physical ARM CPU servers
- With KVM acceleration, the overhead is about 15% compared to bare-metal servers
- KVM acceleration gives about a 150x better score than pure emulation
Performance Evaluation - Setup
Motivation
- What if we run many VMs on a single machine?
Test Setup
- Run one UnixBench process in each VM, with up to 100 VMs concurrently
- CPU affinity configuration
  - Pinned 1:1 to physical cores
  - Assigned physical cores with appropriate spacing, e.g. core #11 for VM#1 and core #60 for VM#2 in the 2-VM measurement
  - The first 10 cores (out of 128) are reserved for the host OS
- Run UnixBench (almost) simultaneously using GNU parallel from the host side
  - Note: a bit unnatural as a generic workload
- Used the 'backing_file' feature of the qcow2 virtual disk image format
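A minimal sketch of how such pinning and a parallel launch could look. The VM names, SSH access path, the in-guest UnixBench location, and the simple core offset below are illustrative assumptions; the actual spacing scheme is only outlined on the slide.

# Pin each VM's single vCPU to a dedicated physical core,
# keeping the first 10 cores free for the host OS.
for n in $(seq 1 100)
do
    core=$((10 + n))           # hypothetical simple offset, not the exact scheme used
    virsh vcpupin agl-vm-${n} 0 ${core}
done

# Start UnixBench in all guests (almost) simultaneously from the host side.
parallel -j 100 \
    'ssh root@agl-vm-{} "cd /opt/byte-unixbench/UnixBench && ./Run"' \
    ::: $(seq 1 100)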
Performance Evaluation – Results
- "System Benchmarks Index Score" and breakdown into each micro-benchmark
- Dhrystone/Whetstone/Pipe/System Call: no performance impact
- Tests involving I/O are generally affected
- Performance degradation (per VM) is roughly 50% at 100 VMs
Performance Evaluation – Analysis
Quick Analysis
- We see single-node-level performance degradation from around 10 VMs (concurrency)
- At the maximum concurrency (100 VMs), the single-node-level UnixBench score is roughly 50% of the 1-VM score
- CPU-intensive tests (e.g., Dhrystone) are not affected as concurrency increases
- I/O-intensive tests (e.g., File Copy) are noticeably affected
  - Most of the I/O-intensive tests degrade roughly linearly
- A deeper analysis is required, but the root cause is likely I/O and mutual-exclusion contention around the qcow2 backing_file feature
- When using qemu/KVM for CI/CD infrastructure, it may be better to avoid the qcow2 backing_file feature or to use disaggregated storage (e.g., Ceph)
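One way to avoid the backing_file contention, if disk space allows, is to flatten each overlay into a standalone image. A minimal sketch (the image names follow the appendix; the flattening itself is generic qemu-img usage rather than something done in this POC):

# Flatten a backed image into a standalone qcow2 image,
# removing the shared backing file from the I/O path.
qemu-img convert -O qcow2 agl-19.0.0-1.img agl-19.0.0-1-flat.img

# Verify that the new image has no backing file.
qemu-img info agl-19.0.0-1-flat.img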
Evaluation
Effectiveness of AGL as a (High-Level) POC Testbed
E2E Connectivity
- Through the (emulated) 5G cellular communication by UERANSIM/Free5GC
- Worked without any trouble
- Building UERANSIM is straightforward: just 'cmake' and 'make', following the UERANSIM documentation
E2E Observability
- Showed the effectiveness of distributed tracing (with OpenTelemetry (OTEL)) for E2E (Vehicle ~ Mobile NW ~ Backbone NW ~ Cloud)
- Excerpted from an earlier talk (OSSJ2023, https://sched.co/1R2oh)
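A rough sketch of that build flow, for reference. The dependency package names follow the UERANSIM documentation for Ubuntu and may differ on other distributions.

# Install build dependencies (Ubuntu-style package names).
sudo apt install make gcc g++ libsctp-dev lksctp-tools iproute2 cmake

# Fetch and build UERANSIM; the binaries (nr-gnb, nr-ue, ...) end up under build/.
git clone https://github.com/aligungr/UERANSIM
cd UERANSIM
make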
Takeaways
- AGL can be used as a testbed for high-level R&D POCs, not only for IVI/IC
  - In this sense, it is better to create a recipe with a reduced feature set
  - e.g., in my case, graphics support is not necessary
- Libvirt is another way to run AGL in virtualized environments
  - Good for centralized management/control
- qemuarm64 with KVM acceleration gives us enough performance
  - But be careful when using the qcow2 backing_file feature for I/O-intensive workloads
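As a small illustration of the centralized management point, libvirt can manage guests on remote hosts through standard connection URIs (the host and user names below are placeholders):

# Manage AGL guests on remote ARM hosts from a single management node.
virsh -c qemu+ssh://ops@arm-server-1/system list --all
virsh -c qemu+ssh://ops@arm-server-2/system start agl-vm-1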
Future Works / Random Thoughts
- Update host/guest-side software
  - Build a custom AGL image based on an appropriate profile (gateway, IMHO)
  - Add some components (otelcol-contrib, UERANSIM, eBPF-related tools, etc.)
- Linux kernel (new) features (including the host/qemu side)
  - Real-Time Kernel
  - Extensible Scheduler Class
- Functional Safety – working with ELISA?
- Standardization – e.g., SOVD/OBD
- Edge Computing – working with AECC (https://aecc.org/)
- Others
  - SBOM improvement
  - Etc.
Reduction of virtual disk size
- When running a massive number of VMs from the same AGL image, there is room to reduce the total size of the virtual disks, because most of the contents are identical
  - Otherwise, you may run into disk-full trouble (as I did)
- Use the 'backing_file' feature of qcow2
  - Copy-on-write style size reduction
  - Pros: can reduce (physical) disk usage dramatically
  - Cons: can become a performance bottleneck for I/O-intensive workloads
Procedure
- Create the base OS image (with your favorite components)
- Create many OS images referring to the base OS image via 'backing_file' (-b)

#!/bin/bash
# Create thin qcow2 overlay images backed by the shared base AGL image.
START=${START:-1}
END=${END:-100}
for n in $(seq ${START} ${END})
do
    qemu-img create -b agl-19.0.0-base.img -f qcow2 -F qcow2 agl-19.0.0-${n}.img 20G
done
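To confirm that the images were created as thin copy-on-write overlays, the backing chain and the allocated size can be inspected. This check is a generic qemu-img usage note, not part of the original procedure.

# Show the backing chain and the actual (allocated) disk usage of one overlay.
qemu-img info --backing-chain agl-19.0.0-1.img
du -h agl-19.0.0-1.img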
DHCP/Connectivity Stability Issue
- AGL instances failed to acquire an IP address via DHCP
  - connman has a limited set of DHCP functionalities
  - Switched to systemd-networkd
- AGL instances eventually lost their emulated 5G connections
  - The systemd-networkd default behavior on DHCP lease expiry was to release the IP once and acquire it again
  - The UERANSIM 5G emulation protocol (RLS) heartbeat feature detected this
  - Changed the systemd-networkd configuration

# cat /etc/systemd/network/50-agl.network
[Match]
Name=enp3s0

[Network]
DHCP=yes
UseDNS=yes
KeepConfiguration=dhcp
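For completeness, a sketch of how such a change would typically be applied and verified inside a guest. The interface name enp3s0 comes from the config above; the commands are generic systemd-networkd usage, not taken from the slides.

# Apply the new network configuration and check the DHCP lease state.
systemctl restart systemd-networkd
networkctl status enp3s0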