
An Analysis of Linux Scalability to Many Cores

Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich MIT CSAIL
Presented by
Chandra Sekhar Sarma Akella and Rutvij Talavdekar

Introduction
This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.

Analyzes the scaling of a number of applications (MOSBENCH) on Linux running on a 48-core machine:
Exim, Memcached, Apache, PostgreSQL, gmake, the Psearchy file indexer, and the Metis MapReduce library

What is Scalability?
Ideally, an application does N times as much work on N cores as it could on 1 core. In practice that is not the case, due to serial parts of the code.
Scalability is better understood through Amdahl's Law.

Amdahl's Law
Amdahl's Law identifies the performance gain from adding cores to an application that has both serial and parallel components. The serial portion of an application has a disproportionate effect on the performance gained by adding cores.

As N approaches infinity, the speedup approaches 1 / S (e.g., with S = 10% the maximum speedup is 10x). If 25% of a program is serial, adding any number of cores cannot provide a speedup of more than 4.
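In equation form (S is the serial fraction, N the number of cores), Amdahl's Law gives:

\mathrm{Speedup}(N) = \frac{1}{S + \frac{1-S}{N}}, \qquad \lim_{N \to \infty} \mathrm{Speedup}(N) = \frac{1}{S}

With S = 0.25 the limit is 1/0.25 = 4, matching the example above.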

Why look at the OS kernel?


Application    Time spent in kernel    Description of services
Exim           70%                     Mail server; forks a process for each incoming SMTP connection and message delivery
Memcached      80%                     Distributed in-memory caching system; stresses the network stack
Apache         60%                     Web server with a process per instance; stresses the network stack and file system
PostgreSQL     82%                     Open-source SQL database; uses shared data structures, kernel locking interfaces, and TCP sockets

Many applications spend a considerable amount of their CPU time in the kernel. These applications should scale with more cores, but if the OS kernel doesn't scale, the applications won't scale either.

Premise of this Work


The common view is that traditional kernel designs do not scale well on multicore processors: they are rooted in uniprocessor designs, and the speculation is that fixing them is hard.

Points to consider:
How serious are the scaling problems? Do they have alternatives? How hard is it to fix them?

These questions are answered by analysing Linux scalability on a multicore machine.

Analysing Scalability of Linux


Use an off-the-shelf 48-core x86 machine and run a recent version of Linux:
A traditional OS that is popular and widely used, whose scalability is competitive with that of other OSs, and which has a large community constantly working to improve it.

Applications chosen for benchmarking:


Applications previously known not to scale well on Linux, with good existing parallel implementations, and that are system intensive. Seven applications were chosen, together known as MOSBENCH.

Contribution
Analysis of Linux scalability for 7 real, system-intensive applications: stock Linux limits their scalability. Analysis of the bottlenecks.

Fixes: 3002 lines of code across 16 patches. Most fixes improve the scalability of multiple applications. Fixes were made in the kernel, with minor fixes in the applications and some changes in the way applications use kernel services. The remaining bottlenecks were either due to shared hardware resources or within the applications.
Result: with the fixes applied, the patched kernel shows no kernel scalability problems up to 48 cores. Except for sloppy counters, most fixes were applications of standard parallel programming techniques.

Iterative Method To Test Scalability and fix Bottlenecks


Run the application, using the in-memory temporary file system (tmpfs) to avoid disk I/O bottlenecks, since the focus is on identifying kernel-related bottlenecks. Find the bottlenecks. Fix them and re-run the application. Stop when a non-trivial application fix is required, or when the bottleneck is caused by shared hardware (e.g. DRAM or network cards).

MOSBENCH
Application                   Time spent in kernel    Bottleneck
Exim (mail server)            69%                     Process creation; small-file creation and deletion
memcached (object cache)      80%                     Packet processing in the network stack
Apache (web server)           60%                     Network stack; file system (directory name lookup)
PostgreSQL (database)         up to 82%               Kernel locking interfaces, network interfaces, application's internal shared data structures
gmake (parallel build)        up to 7.6%              File system reads/writes to multiple files
Psearchy (file indexer)       up to 23%               CPU intensive; file system reads/writes
Metis (mapreduce library)     up to 16%               Memory allocator; soft page-fault code

48 Core Server
Comprises 8 AMD Opteron chips with 6 cores on each chip.

Each core has a private 64 KB L1 cache (3-cycle access) and a private 512 KB L2 cache (14 cycles). The 6 cores on each chip share a 6 MB L3 cache (28 cycles).

Poor scaling on Stock Linux kernel

Exim collapses on Stock Linux

Bottleneck: Reading mount table


The kernel calls the function lookup_mnt on behalf of the Exim application to get metadata about a mount point:

struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

Bottleneck: Reading mount table


struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);      /* the critical section is short */
    spin_unlock(&vfsmount_lock);
    return mnt;
}

The critical section is short. Why does it cause a scalability bottleneck?

Under contention, spin_lock and spin_unlock consume many more cycles (~400-5000) than the critical section itself (tens of cycles) on a multi-core system.

Linux spin lock implementation

Scalability collapse caused by Non-scalable Ticket based Spin Locks

With more cores, each core spends more time contending for the lock, congesting the interconnect with lock-acquisition requests and cache-line invalidations.


In a ticket spin lock, the previous lock holder must respond to at least N/2 cache requests from waiting cores before the lock is transferred to the next holder, where N is the number of cores waiting to acquire the lock. The time required to acquire the lock therefore grows with N, and as this pattern repeats, some cores end up waiting longer and longer.
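For illustration, here is a minimal sketch of a ticket-style spin lock in C (using GCC atomic builtins; this is not the kernel's actual implementation, and the names are illustrative). Every waiter polls the same cache line, which is why each release triggers invalidations in all waiting cores and the hand-off cost grows with the number of waiters.

/* Minimal ticket spin lock sketch (illustrative; not the Linux implementation). */
struct ticket_lock {
    volatile unsigned int next_ticket;   /* ticket handed to the next arriving core */
    volatile unsigned int now_serving;   /* ticket of the core allowed to hold the lock */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Atomically take the next ticket number. */
    unsigned int my_ticket = __sync_fetch_and_add(&l->next_ticket, 1);

    /* Spin until our ticket comes up; all waiters poll the same cache line. */
    while (l->now_serving != my_ticket)
        ;   /* busy-wait */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* This store invalidates the polled cache line in every waiting core,
     * so the cost of handing the lock over grows with the number of waiters. */
    l->now_serving++;
}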

Bottleneck: Reading mount table


struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    return mnt;
}

Well known problem, many solutions:


Use scalable locks, use high-speed message passing, or avoid locks.

Solution: per-core mount caches


Observation: the global mount table is rarely modified. Solution: a per-core data structure that stores a per-core mount cache.
struct vfsmount *lookup_mnt(struct path *path)
{
    struct vfsmount *mnt;
    if ((mnt = hash_get(percore_mnts[cpu()], path)))
        return mnt;
    spin_lock(&vfsmount_lock);
    mnt = hash_get(mnts, path);
    spin_unlock(&vfsmount_lock);
    hash_put(percore_mnts[cpu()], path, mnt);
    return mnt;
}

Common case: cores look up mount-point metadata in their per-core tables. When the mount table is modified, all per-core tables are invalidated (see the sketch below).
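A hedged sketch of the write path, using the same simplified naming as the lookup code above (hash_clear, percore_mnts, and NCPU are illustrative helpers, not the actual patch):

/* Illustrative only: clear every core's mount cache after the global
 * mount table is modified, forcing the slow path on the next lookup. */
static void invalidate_percore_mnt_caches(void)
{
    int cpu;
    for (cpu = 0; cpu < NCPU; cpu++)    /* NCPU: number of cores (illustrative) */
        hash_clear(percore_mnts[cpu]);
}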

Per-core Lookup: Scalability is better

Bottleneck: Reference counting


A reference count indicates whether the kernel can free an object, e.g. file name cache entries (dentries) and physical pages:
void dput(struct dentry *dentry)
{
    if (!atomic_dec_and_test(&dentry->ref))
        return;
    dentry_free(dentry);
}

A single atomic instruction limits scalability?!

Bottleneck: Reference counting


A reference count indicates whether the kernel can free an object, e.g. file name cache entries (dentries) and physical pages:
void dput(struct dentry *dentry)
{
    if (!atomic_dec_and_test(&dentry->ref))
        return;
    dentry_free(dentry);
}

A single atomic instruction limits scalability?!

Reading the reference count is slow, and these counters become bottlenecks when many cores update them. Reading the count delays memory operations from other cores: a central reference count means waiting, lock contention, and cache-coherence serialization.

Reading reference count is slow

Reading the reference count delays Memory Operations from other cores

Hardware Cache Line Lock

Solution: Sloppy counters


Observation: the kernel rarely needs the true value of a reference count. Solution: sloppy counters.

Each core holds a few spare references to an object

Solution: Sloppy counters


Observation: the kernel rarely needs the true value of a reference count. Solution: sloppy counters. Benefits:
Use of the shared central reference counter is infrequent or avoided entirely, because each core holds a few spare references to an object. A core usually updates a sloppy counter by modifying its per-core counter, an operation that typically only touches data in the core's local cache, so there is no waiting for locks, no lock contention, and no cache-coherence serialization. Increments and decrements are sped up by the per-core counters, making object referencing faster. Sloppy counters were used to count references to dentries, vfsmounts, and dst entries, and to track network-protocol-related parameters.

Solution: Sloppy counters


Observation: the kernel rarely needs the true value of a reference count. Solution: sloppy counters, which avoid frequent use of the shared central reference counter because each core can hold a few spare references to an object.
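A minimal user-space sketch of the idea, assuming a fixed core count and batch size (NCPU, BATCH, and the function names are illustrative; this simplifies the actual kernel patch):

/* Sloppy counter sketch: a shared central count plus per-core spare references.
 * The true value is (central minus the sum of all per-core spares). */
#define NCPU  48
#define BATCH 8    /* how many spare references a core may hold (assumed) */

struct sloppy_counter {
    long central;          /* shared; updated only with atomic operations */
    long local[NCPU];      /* spare references; each entry touched only by its own core */
};

/* Take one reference on core 'cpu'. Only touches the shared counter when
 * the core has run out of locally cached spare references. */
static void sloppy_inc(struct sloppy_counter *c, int cpu)
{
    if (c->local[cpu] == 0) {
        __sync_fetch_and_add(&c->central, BATCH);   /* grab a batch of references */
        c->local[cpu] = BATCH;
    }
    c->local[cpu]--;        /* consume one spare reference from the local pool */
}

/* Drop one reference on core 'cpu'. Returns it to the local pool and spills
 * a batch back to the shared counter only if the pool grows too large. */
static void sloppy_dec(struct sloppy_counter *c, int cpu)
{
    c->local[cpu]++;
    if (c->local[cpu] > BATCH) {
        __sync_fetch_and_sub(&c->central, BATCH);
        c->local[cpu] -= BATCH;
    }
}

The sketch omits the freeing logic: in the kernel, an object can only be released once the central count reaches zero and no core holds spare references to it.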

Sloppy counters: More scalability

Scalability Issues
Five scalability issues cause most of the bottlenecks:
1) A global lock used for a shared data structure: more cores → longer lock wait times
2) A shared memory location: more cores → more overhead from the cache coherency protocol
3) Tasks competing for a limited-size shared hardware cache: more cores → higher cache miss rates
4) Tasks competing for shared hardware resources (interconnects, DRAM interfaces): more cores → more time wasted waiting
5) Too few available tasks: more cores → lower efficiency
These issues can often be avoided (or limited) using popular parallel programming techniques.

Kernel Optimization Benchmarking Results: Apache

Other Kernel Optimizations


Fine-grained locking
A modern kernel can contain thousands of locks, each protecting one small resource
Allows each processor to work on its specific task without contending for locks used by other processors

Lock-free algorithms
Ensure that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion (see the sketch below).
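As a concrete illustration of the lock-free idea, here is a minimal sketch using the GCC compare-and-swap builtin (not code from the Linux kernel; the names are illustrative):

/* Lock-free push onto a singly linked stack. A thread that loses the race
 * simply retries; no thread is ever blocked behind a lock holder. */
struct node {
    struct node *next;
    int value;
};

static void lockfree_push(struct node **head, struct node *n)
{
    struct node *old;

    do {
        old = *head;        /* snapshot the current head */
        n->next = old;
    } while (!__sync_bool_compare_and_swap(head, old, n));   /* retry if head moved */
}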

Other Kernel Optimizations


Per-core data structures
Kernel data structures caused scaling bottlenecks due to lock contention, cache-coherence serialization, and protocol delays. These bottlenecks would remain even if the locks were replaced with finer-grained locks, because multiple cores still update the shared data structures. Solution: split the data structures into per-core data structures
(e.g. the per-core mount cache for the central vfsmount table)

This avoids lock contention and cache-coherence serialization and slashes lock-acquisition wait times: cores query their per-core data structure rather than looking up the central one, so they no longer contend or serialize over the shared data structure.

Summary of Changes

3002 lines of changes to the kernel and 60 lines of changes to the applications. Per-core data structures and sloppy counters provide across-the-board improvements for 3 of the 7 applications.

Popular Parallel Programming Techniques


Lock-free algorithms, per-core data structures, fine-grained locking, cache-alignment (sketched below)

New Technique: Sloppy Counters
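Cache-alignment, mentioned in the list above, is the one technique not illustrated elsewhere in these slides. A minimal sketch, assuming a 64-byte cache line (the size and names are illustrative):

/* Pad per-core counters so each lives on its own cache line; otherwise an
 * update by one core would invalidate the line holding its neighbours'
 * counters (false sharing). */
#define CACHE_LINE 64
#define NCPU       48

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* fill the rest of the line */
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[NCPU];   /* one counter per core */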

Better Scaling with Modifications


Total Throughput comparison of Patched Kernel and Stock Kernel

Better Scaling with Modifications


Per-core throughput comparison of Patched Kernel and Stock Kernel

Limitations
Results are limited to 48 cores and a small set of applications; they may differ with more cores or a different application mix. Concurrent modifications to a shared address space were not explored. The in-memory temporary file system (tmpfs) was used instead of disk I/O. The 8-chip, 48-core AMD machine may also behave differently from a future single 48-core chip.

Current bottlenecks
With the fixes applied, kernel code is no longer the bottleneck. Further kernel changes might still help future applications or hardware.

Conclusion
Stock Linux has scalability problems, but they are easy to fix or avoid up to 48 cores. The bottlenecks can be fixed to improve scalability, and the Linux community can provide further support in this regard. In the context of 48 cores, there is no need to abandon traditional operating systems and explore entirely new kernel designs.

References
Original paper (pdos.csail.mit.edu/papers/linux:osdi10.pdf)
Original presentation (usenix.org)
VFSMount (http://lxr.free-electrons.com/ident?i=vfsmount)
MOSBENCH (pdos.csail.mit.edu/mosbench/)
USENIX slides (https://www.usenix.org/events/osdi10/tech/slides/boyd-wickizer.pdf)
ACM Digital Library (http://dl.acm.org/citation.cfm?id=1924944)
InformationWeek (http://www.informationweek.com)
Wikipedia (tmpfs, sloppy counters)
University of Illinois (sloppy counters)
University College London (per-core data structures)

THANK YOU !
