
(Re)Designed for High Performance: Solaris and Multi-Core Systems

Solaris 11 provided a rare opportunity to redesign and reengineer many components of the Solaris kernel.

Designed to scale efficiently to the largest available systems (currently 256 cores and 64 TB of memory per system on SPARC, a little less on x64), Solaris 11 saw significant work to remove obsolete tuning concepts and to ensure that the OS can handle the large datasets (both in-memory and on disk) which are now commonplace.

The continuing growth in virtualisation plays to Solaris’ strengths: with the built-in hypervisor in the multi-threaded multi-core SPARC T-series we can provide hardware partitions even within a single processor socket. Solaris also features support for Zones (somewhat similar to the BSD Jails concept) which provide soft partitioning and allow presentation of an environment which mimics earlier releases of Solaris. This allows software limited to running on those older releases to obtain some of the performance and observability benefits of the host operating system. Many years of development and support experience have given the Solaris engineering division an acute awareness that new features must include the capacity to observe them.

While we have extensive testing, benchmarking and workload simulation capabilities to help bring a new release to market, building in tools to help customers and support teams diagnose problems in their real-world usage is essential. The Solaris 11 release extended the work done with DTrace in Solaris 10, providing more probe points than ever before.
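To get a sense of the scale of that probe coverage, the simplest illustration is to count what a running system publishes; a trivial sketch (the exact count varies with platform, loaded drivers and providers):

    # dtrace -l | wc -l    # list every published probe and count them; typically tens of thousands on Solaris 11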

Multicore World 2013

February 20, 2013

Transcript

  1. (Re)Designed for High Performance
     James C. McPherson, Principal Software Engineer, Solaris Modernisation Group, Solaris Core Technologies
  2. Topics for discussion
     • _init()
     • CPUs and Memory
     • Device drivers: network and storage
     • Fault Management and System Topology
     • Observability
     • Filesystems
     • _fini()
  3. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  4. A bit of history to get started
     • Solaris has been multi-processor since v2.1 (1992)
       – Multiple system (cpu+memory) boards per host
     • Until 2004: one core per socket or module
       – Up to 4 cpu modules per board
     • Largest systems
       – Sun Enterprise 10000 – 64 cpus (64 sockets)
       – Sun Fire 6800 – 24 cpus (24 sockets)
  5. A bit of history (2)
     • From 2004 onwards: 2+ cores per socket
     • Largest systems
       – Sun Fire E6900 – 48 cores (24 sockets, UltraSPARC-IV+)
       – Fujitsu M9000 – 256 cores (64 sockets, SPARC64 VII+)
     • For sun4v systems, 4 or 8 cores per socket, building up to
       – Oracle SPARC T4-4 – 32 cores (4 sockets, SPARC T4)
       – sun4v introduced as “CoolThreads”
  6. Obsolete cpus limit the future
     • Support for booting obsolete cpus removed
       – 32-bit Intel and AMD cpus
       – UltraSPARC-I/II/III/IV (sun4u from Sun Microsystems)
     • Design limitations in the hardware required software workarounds
       – Virtually addressed caches (sun4u)
       – No hardware virtualisation support (sun4u, Intel/AMD)
     • Interconnect technologies very limited
  7. Living in the [CPU] future
     • Everything that can be virtualised, will be
       – Requires on-die support [sun4v, VT-x/VT-d]
     • Multiple hardware threads per core
     • Multiple cores per socket
     • Multiple sockets per system
     • On-die support for crypto offload engines, N-gigabit Ethernet, PCI Express controllers, …
  8. Hardware threads: kernel view
     • Solaris views each hardware thread as a virtual cpu
     • Large systems today report hundreds of “cpus” to userland/application queries
     • Useful properties of each cpu exposed via APIs (see the sketch below)
       – NUMA-IO
       – Lgroup (Locality Group)
     • Some APIs make this available to non-global zones
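     The same locality data is visible from a shell with the stock tools; a minimal sketch (the shell's own PID is used merely as a convenient target):
     $ lgrpinfo -a    # display the lgroup hierarchy: CPUs, memory and latencies per locality group
     $ plgrp $$       # show the home lgroup of the current shell's threads
     $ pmap -L $$     # show which lgroup backs each mapping of a process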
  9. Hardware threads: view from userland
     $ prtdiag -v    (output elided)
     System Configuration:  Oracle Corporation  sun4v SPARC T3-1
     Memory size: 128 Gigabytes
     ================ Virtual CPUs ================
     CPU ID  Frequency  Implementation  Status
     ------  ---------  --------------  -------
     0       1649 MHz   SPARC-T3        on-line
     1       1649 MHz   SPARC-T3        on-line
     …
     125     1649 MHz   SPARC-T3        on-line
     126     1649 MHz   SPARC-T3        on-line
     127     1649 MHz   SPARC-T3        on-line
     • The SPARC T3-1 has a single socket: 16 cores with 8 hw threads per core.
  10. Hardware threads: view from userland
     $ prtdiag -v    (output elided)
     System Configuration: LENOVO 2742AG2
     BIOS Configuration: LENOVO 9SKT58AUS 12/24/2012
     ==== Processor Sockets ==================================
     Version                                  Location Tag
     ---------------------------------------  ------------
     Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz  SOCKET 0
     $ psrinfo -v    (output elided)
     Status of virtual processor 0 as of: 02/14/2013 13:07:17
       on-line since 02/14/2013 10:44:38.
       The i386 processor operates at 3292 MHz,
       and has an i387 compatible floating point processor.
     …
     Status of virtual processor 3 as of: 02/14/2013 13:07:17
       on-line since 02/14/2013 10:44:42.
       The i386 processor operates at 3301 MHz,
       and has an i387 compatible floating point processor.
  11. Oodles of memory, too
     • In 1992 (Solaris 2.1 released): one page was 1/1,000th of system memory
     • In 2006 (when work started on redesigning the Solaris VM system): one page was 1/100,000,000th of system memory
     • 2007: the Fujitsu M9000 supports 4 TB of memory
     • Old assumptions and architectural limitations are a significant roadblock to growth
  12. So what do we do?
     • Virtualize!
       – VM pages become memory objects
     • Memory has important locality qualities which the kernel and applications can utilise:
       – Core- and socket-associativity
       – Memory Placement Optimization
     • Enable larger page sizes for greater efficiency (and to support ever larger configurations); see the sketch below
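     The page sizes a platform supports can be listed, and a preferred size hinted per process; a small sketch (the 4M value and the target PID are illustrative):
     $ pagesize -a                 # list every page size this platform supports
     # ppgsz -o heap=4M -p 1234    # hint that 4 MB pages should back the heap of process 1234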
  13. Pulling CPUs and memory together
     • Multiple hardware threads, cache and FPU sharing, intra- and extra-socket communication: a modern OS is a very complex beast.
     • Something needs to tell the system what to do (and when to do it). Enter … The Scheduler
  14. Scheduling (1)
     • Solaris 11's scheduler rewritten to avoid migrating threads across cores (and sockets) where possible
     • API provided to bind processes to specific hw threads or groups (processor sets); see the sketch below
     • System processes and threads are named (Critical Threads) to improve observability for the sysadmin
     • Projects and tasks used for userland facilities
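     From a shell, the same binding facilities are exposed through pbind(1M) and psrset(1M); a hedged sketch (the CPU IDs and the PID are placeholders):
     # pbind -b 4 1234        # bind process 1234 to virtual CPU 4
     # pbind -u 1234          # remove that binding again
     # psrset -c 8 9 10 11    # create a processor set from CPUs 8-11; the new set ID is printed
     # psrset -b 1 1234       # bind process 1234 to processor set 1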
  15. Scheduling (2)
     • Different scheduling classes; dispadmin(1M) has more detail (and warnings)
       Class  Description
       -----  -----------
       SYS    System
       TS     Timesharing
       SDC    System Duty Cycle
       FX     Fixed Priority
       IA     Interactive
       RT     Realtime
       FSS    Fair Share Scheduler
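     Alongside dispadmin(1M), priocntl(1) is the usual way to inspect a process's class or move it to another; a brief sketch (the PID and priority values are placeholders):
     $ priocntl -l                                  # list the scheduling classes configured on this system
     $ priocntl -d -i pid 1234                      # display the class and parameters of process 1234
     # priocntl -s -c FX -m 30 -p 30 -i pid 1234    # move it into the Fixed Priority class at priority 30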
  16. How do I see this?
     $ ps -o pid,class,pri,time,comm -p 1,898,3719,6000,25930
       PID  CLS  PRI      TIME  COMMAND
         1   TS   59     00:27  /usr/sbin/init
       898  SDC   99  03:28:52  zpool-sink
      3719   TS   59     16:21  /usr/java/bin/java
      6000   IA   59     00:52  prstat
     25930   IA   59     44:35  /opt/local/firefox-17/firefox
  17. Complexity and layers
     • Effort begun in Solaris 10 to remove tendrils (aka layering violations) continued with Solaris 11
     • IHVs and ISVs writing for Solaris provided with well-defined frameworks / interfaces
     • Storage and network driver frameworks (SCSAv3, GLDv3) evolved to support more features and require less internal knowledge
     • Improved out-of-the-box performance; removed knobs to twiddle wherever possible
  18. No pretty stack picture
     • Solaris 11 feature and performance goals:
       – VIRTUALISE ALL THE THINGS
       – Remove unnecessary layers
       – Move from message passing to function-call APIs
       – Improve interrupt performance and become NUMA-aware
       – Move config into SMF rather than depending on itty bitty files
     • Provide new commands for administration where appropriate, rather than bolting on to existing ones
  19. How does this show up? (1)
     New command: dladm(1M). “MB” in the LOC column means the device is on the motherboard.
     Sun Fire X4450:
     $ dladm show-phys -L
     LINK     DEVICE   LOC
     iscsi0   ixgbe0   PCIe-2
     e1000g0  e1000g0  MB
     e1000g1  e1000g1  MB
     e1000g2  e1000g2  MB
     e1000g3  e1000g3  MB
     Lenovo M82:
     $ dladm show-phys -L
     LINK      DEVICE    LOC
     e1000g0   e1000g0   –
     rtls0     rtls0     Slot1
     vboxnet0  vboxnet0  --
  20. How does this show up? (2)
     New command: ipadm(1M)
     $ dladm show-vnic
     LINK       OVER     SPEED  MACADDRESS       MACADDRTYPE  VIDS
     vniclocl0  rtls0    100    2:8:20:4d:64:eb  random       0
     vnsnvfcs0  e1000g0  1000   2:8:20:dc:4c:cc  random       0
     vns11u1v0  e1000g0  1000   2:8:20:86:3e:30  random       0
     vboxvnic0  rtls0    100    8:0:27:de:f4:33  fixed        0
     Global zone:
     $ ipadm show-addr
     ADDROBJ     TYPE      STATE  ADDR
     lo0/v4      static    ok     127.0.0.1/8
     e1000g0/v4  static    ok     192.168.1.20/24
     lo0/v6      static    ok     ::1/128
     e1000g0/v6  addrconf  ok     fe80::fe4d:d4ff:fe2f:82d1/10
     e1000g0/v6  addrconf  ok     2001:44b8:2188:f000:fe4d:d4ff:fe2f:82d1/64
     VirtualBox instance:
     $ ipadm show-addr
     ADDROBJ  TYPE      STATE  ADDR
     lo0/v4   static    ok     127.0.0.1/8
     net0/v4  static    ok     192.168.1.31/24
     lo0/v6   static    ok     ::1/128
     net0/v6  addrconf  ok     fe80::a00:27ff:fede:f433/10
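     Creating the kind of VNIC shown above takes only a few commands; a minimal sketch, assuming a physical link named net0 and an address chosen purely for illustration:
     # dladm create-vnic -l net0 vnic0                            # layer a virtual NIC over the physical link
     # ipadm create-ip vnic0                                      # plumb an IP interface on it
     # ipadm create-addr -T static -a 192.168.1.50/24 vnic0/v4    # give it a static IPv4 address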
  21. Solaris 10 (and earlier) storage stack
     Userland and applications
     Kernel space:
       Target drivers
       3rd-party multipathing
       SCSA framework (with MPxIO)
       HBA drivers
       Physical layer
  22. The Solaris 11 storage stack
     Userland and applications
     Kernel space:
       Target drivers
       Generic block device framework
       MPxIO
       SCSA framework
       HBA drivers
       Physical layer
  23. What do I see? (1)
     # format < /dev/null
     AVAILABLE DISK SELECTIONS:
       0. c3t0d0 <ATA-ST3320620AS-K cyl 38909 alt 2 hd 255 sec 63>
          /pci@0,0/pci108e,534d@5/disk@0,0
       1. c3t1d0 <ATA-SAMSUNG HD321KJ-0-12 cyl 38910 alt 2 hd 255 sec 63>
          /pci@0,0/pci108e,534d@5/disk@1,0
       2. c5t5000C5004E87D24Fd0 <ATA-ST2000DM001-9YN1-CC4H-1.82TB>
          /scsi_vhci/disk@g5000c5004e87d24f
       3. c5t5000C5004EA36039d0 <ATA-ST2000DM001-9YN1-CC4H-1.82TB>
          /scsi_vhci/disk@g5000c5004ea36039
       4. c5t5000C500204C9482d0 <ATA-ST31000528AS-CC38...>
          /scsi_vhci/disk@g5000c500204c9482
       5. c5t5000CCA35DE9C580d0 <ATA-Hitachi HDS72101-A3EA...>
          /scsi_vhci/disk@g5000cca35de9c580
       …
       8. c8t0d0 <ATA-ST3320620AS-D cyl 38910 alt 2 hd 255 sec 63>
          /pci@0,0/pci10de,376@a/pci1000,3150@0/sd@0,0
       9. c8t15d0 <ATA-ST3320620AS-K cyl 38910 alt 2 hd 255 sec 63>
          /pci@0,0/pci10de,376@a/pci1000,3150@0/sd@f,0
     • SATA, SAS and FC storage ordered by WWN (port) and SCSI INQUIRY response data if attached to a supported HBA
     • MPxIO on by default for more HBAs and devices
     (Slide callouts: Nvidia onboard SATA; WWN-capable SATA disks; non-WWN-capable SATA disks)
  24. What do I see? (2)
     $ cfgadm -la sata6
     Ap_Id                Type       Receptacle  Occupant      Condition
     sata6/0::dsk/c3t0d0  disk       connected   configured    ok
     sata6/1::dsk/c3t1d0  cd/dvd     connected   configured    ok
     sata6/2::dsk/c3t2d0  disk       connected   configured    ok
     sata6/3              sata-port  empty       unconfigured  ok
     $ cfgadm -la c14
     Ap_Id                     Type       Receptacle  Occupant    Condition
     c14                       scsi-sas   connected   configured  unknown
     c14::w5000cca0125dde45,0  disk-path  connected   configured  unknown
  25. What do I see? (3)
     New command: diskinfo(1M)
     SPARC T3-1:
     $ diskinfo
     D:devchassis-path            c:occupant-compdev
     ---------------------------  ---------------------
     /dev/chassis//SYS/HDD0/disk  c4t5000CCA0153966B8d0
     /dev/chassis//SYS/HDD1/disk  c4t5000CCA012633C24d0
     /dev/chassis//SYS/HDD2/disk  c4t5000CCA01263C824d0
     /dev/chassis//SYS/HDD3/disk  c4t5000CCA0125DC1A8d0
     /dev/chassis//SYS/HDD4/disk  c4t5000CCA0153A0B7Cd0
     /dev/chassis//SYS/HDD5/disk  c4t5000CCA0125DDE44d0
     /dev/chassis//SYS/HDD6/disk  c4t5000CCA0125BF590d0
     /dev/chassis//SYS/HDD7/disk  c4t50015179594F5DD2d0
     J4200 (JBOD):
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__0/disk  c11t43d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__1/disk  c11t26d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__2/disk  c11t38d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__3/disk  c11t42d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__4/disk  c11t29d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__5/disk  c11t30d0
     /dev/chassis/SUN-Storage-J4200.0848QAJ001/SCSI_Device__6/disk  c11t41d0
  26. Example: SPARC T3-1 server
     [Photos: SPARC T3-1 front view and rear view]
  27. How does Fault Management help me?
     • Work started in Solaris 10 to bring intelligence and analysis to fault diagnosis
       – FMA, the Fault Management Architecture
     • Continued and extended in Solaris 11 to support more hardware; this requires knowledge of system topology
     • Topology maps required to maximise FMA benefits
     • Some maps static, some generated on the fly
  28. FMA useful reason #1
     • Uptime depends on availability
     • Availability depends on resilience to failures
     • Improved diagnosis and response lets downtime be deferred until convenient or absolutely necessary
     • Memory pages can be remapped if they fail ECC
     • Redundant zpools continue operating until a replacement can be arranged
     • VNICs can be migrated (live) to a new physical link
  29. FMA useful reason #2
     • FMA topology maps enable a new online perspective on your hardware
     • CRO (Chassis/Receptacle/Occupant) notation enables remote interaction with onsite hardware support
     • dladm(1M) and diskinfo(1M) use CRO notation
       – Consistent naming scheme
  30. That SPARC T3-1…
     • SPARC T3-1 is a relatively simple system
       – 1 CPU
       – 2 RU high
       – Up to 128 GB RAM
       – Up to 16 internal (front-mounted) disks
       – 6 low-profile PCI Express 2.0 slots
       – 4 onboard GigE NICs
     • 307 elements in the topology tree
  31. How do we put FMA and Topology together?
     $ fmdump -v -u b690b4e7-3e2b-4f1f-fb04-d7e71b1a395f
     TIME                 UUID                                 SUNW-MSG-ID EVENT
     Jan 23 09:56:56.8578 b690b4e7-3e2b-4f1f-fb04-d7e71b1a395f ZFS-8000-QJ Diagnosed
       100% fault.fs.zfs.vdev.dtl
       Problem in: zfs://pool=6e954c71737e8931/vdev=2f2241570cbeccc2/pool_name=soundandvision/vdev_name=id1,sd@n5000c5004e87d24f/a
     …
     Jan 24 02:48:35.8271 b690b4e7-3e2b-4f1f-fb04-d7e71b1a395f FMD-8000-6U Resolved
       100% fault.fs.zfs.vdev.dtl Repair Attempted
       Problem in: zfs://pool=6e954c71737e8931/vdev=2f2241570cbeccc2/pool_name=soundandvision/vdev_name=id1,sd@n5000c5004e87d24f/a
     …
     $ zpool status soundandvision
       pool: soundandvision
      state: ONLINE
       scan: resilvered 674G in 16h55m with 0 errors on Thu Jan 24 02:48:33 2013
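     The companion command for acting on such diagnoses is fmadm(1M); a brief sketch of the usual workflow (the FMRI argument is a placeholder):
     $ fmadm faulty             # list resources currently diagnosed as faulty, with SUNW-MSG-ID references
     $ fmdump -e                # inspect the raw error telemetry behind a diagnosis
     # fmadm repaired <fmri>    # tell the diagnosis engine a resource has been fixed or replaced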
  32. If you can see something …
     • Solaris has had /proc tools (pargs, pstack, pfiles etc.) since Solaris 7
     • Solaris 10 introduced DTrace, the Solaris dynamic tracing feature
     • Solaris 11 extended that work with more probes in more areas
       – drivers, userland code, Python and Ruby
       – rewriting some *stat tools to use DTrace
       – the ZFS Storage Appliance (same kernel as Solaris 11) makes heavy use of DTrace
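     A taste of what that buys you in practice; two stock DTrace one-liners and a /proc tool (run the dtrace lines as root, stop them with Ctrl-C):
     # dtrace -n 'syscall:::entry { @[execname] = count(); }'             # system calls, counted by process name
     # dtrace -n 'io:::start { @[execname] = sum(args[0]->b_bcount); }'   # bytes of physical I/O issued, by process
     $ pfiles $$                                                          # open files of the current shell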
  33. Dear UFS, your time is up
     • UFS is a 30+ year old filesystem
     • UFS was designed when large disks were 100 megabytes
     • UFS has lots of knobs to twiddle, lots of bug fixes
     • UFS doesn't protect you from dodgy devices
     • UFS doesn't protect you from data corruption
     • UFS DOES. NOT. SCALE. (to where we need it to, which is exascale and larger)
  34. ZFS: heal my pain
     • Solaris 11 brought ZFS (backported to Solaris 10 Update 2)
     • ZFS gives you blocks
     • ZFS virtualises your physical disks into pools
     • ZFS takes away the pain of administering storage
     • ZFS is a 128-bit filesystem with virtually limitless capacity
     • ZFS heals your data and lets you know about it via FMA
     • ZFS is integrated with the NFS, CIFS and iSCSI stacks
  35. ZFS: Universal Storage
     • The DMU is a general-purpose transactional object store
       – ZFS dataset = up to 2^48 objects, each up to 2^64 bytes
     • Key features common to all datasets: snapshots, compression, encryption, end-to-end data integrity
     • Any flavour you want: file, block, object, network
     [Diagram: the Storage Pool Allocator (SPA) sits beneath the Data Management Unit (DMU); consumers above the DMU include the ZFS POSIX Layer (local, CIFS, NFS), pNFS, Lustre, DB, and the ZFS volume emulator (iSCSI, raw, dump, swap, UFS(!))]
  36. There's no going back to UFS
     • Solaris 11 installs to ZFS, and only ZFS
     • ZFS is plugged in to the boot system
       – You boot from Boot Environments (BEs)
       – Rollback to an earlier version is easy: beadm activate ; reboot (see the sketch below)
     • The packaging system (IPS) leverages ZFS snapshots, clones and BEs so that you always have a bootable system
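     The BE workflow around an update looks roughly like this; a hedged sketch (the BE name pre-update is a placeholder):
     # beadm create pre-update      # capture the current boot environment first
     # pkg update                   # IPS clones the active BE and updates the clone
     # beadm list                   # check which BE will be active at next boot
     # beadm activate pre-update    # if the update disappoints, make the old BE the default again
     # reboot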
  37. Brief example
     • Given two disks, create filesystems for Ann, Bob and Sue
     • Later … add more space
  38. Brief example: SVM/UFS
     # format
     ... (long interactive session omitted)
     # metadb -a -f disk1:slice0 disk2:slice0
     # metainit d10 1 1 disk1:slice1
     d10: Concat/Stripe is setup
     # metainit d11 1 1 disk2:slice1
     d11: Concat/Stripe is setup
     # metainit d20 -m d10
     d20: Mirror is setup
     # metattach d20 d11
     d20: submirror d11 is attached
     # newfs /dev/md/rdsk/d20
     newfs: construct a new file system /dev/md/rdsk/d20: (y/n)? y
     ... (many pages of 'superblock backup' output omitted)
     # mount /dev/md/dsk/d20 /export/home/ann
     # vi /etc/vfstab
     ... repeat as needed for each filesystem
     Six commands to repeat for each filesystem, plus setting up the partition table, creating metadbs and editing /etc/vfstab. Creating 3 mirrored filesystems means around 20 commands to run. Tedious at best!
  39. Brief example: ZFS
     Create a storage pool named “home”:
     # zpool create home mirror disk1 disk2
     Create filesystems “ann”, “bob”, “sue” (each mounted automatically, e.g. at /export/home/ann):
     # zfs create home/ann
     # zfs create home/bob
     # zfs create home/sue
     Add more space to the “home” pool:
     # zpool add home mirror disk3 disk4
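     Once the filesystems exist, per-user policy is a single property away; a small sketch of the follow-up administration ZFS makes trivial (values chosen purely for illustration):
     # zfs set quota=20g home/ann      # cap Ann's filesystem at 20 GB
     # zfs set compression=on home     # children inherit the property automatically
     # zfs snapshot home/ann@friday    # constant-time snapshot
     # zfs rollback home/ann@friday    # ... and an easy way back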
  40. Where to from here?
     • Solaris 11 is the conclusion of nearly 8 years of development by more than 1000 engineers across the world
     • Needless complexity removed
     • Obsolete assumptions challenged
     • Solaris 11 is the performance platform of choice for scalable, general-purpose multi-core operation, from the silicon to the application
  41. Why aren't you running your application on Solaris 11?
  42. The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  43. (Re)Designed for High Performance
     James C. McPherson, Principal Software Engineer, Solaris Modernisation Group, Solaris Core Technologies