Claudio Freire - Efficient shared memory data structures

Multithreading makes shared memory easy, but true parallelism next to impossible. Multiprocessing gives us true parallelism, but it makes sharing memory very difficult and high-overhead. In this talk, we'll explore techniques to share memory between processes efficiently, with a focus on sharing massive read-only data structures.

https://us.pycon.org/2018/schedule/presentation/140/

PyCon 2018

May 11, 2018

Transcript

  1. Sharing memory... what for?
     • Cache too big to fit in RAM...
       – N times with N processors
     • Multiprocessing: input data
     • Tornado / mod_wsgi: slow-evolving caches
       – When only a small fraction of it is frequently accessed
     • The “working set” fits in memory, but not the whole dataset
     • When disk access for infrequently needed data is acceptable
  2. Sharing memory... what for?
     • When the serialization cost becomes prohibitive
       – Big and complex data structures: you can’t afford the CPU cycles spent (de)serializing lots of objects
       – Objects with inefficient serialization, e.g. SQLAlchemy
  3. Sharing memory... what for?
     • The transition to a “shared buffer”
       – Split the cache into two layers:
         • A read-only layer of slow-evolving data
         • A regular read-write layer with continuous (but infrequent) updates
       – Most data must reside in the static layer
  4. Why not multiprocessing?
     • Lock contention
     • Poor support for complex structures:
       “Note: Although it is possible to store a pointer in shared memory remember that this will refer to a location in the address space of a specific process. However, the pointer is quite likely to be invalid in the context of a second process and trying to dereference the pointer from the second process may cause a crash.” (multiprocessing docs)
  5. The shared buffer
     • Getting the best of both worlds
       – Compact and efficient shared-memory representation of static or slow-changing data
       – Dynamic and fast-updateable structure for the rest
     [Diagram: User, Old stuff, New stuff, Buffer]
  6. How? As simple as:

     import mmap

     fileobj = open("buf", "r+b")
     buf = mmap.mmap(
         fileobj.fileno(), 0,
         access=mmap.ACCESS_READ)
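     A minimal usage sketch (not from the slides), assuming the mapped file holds at least 16 bytes; the offsets are made up for illustration:

        import struct

        # Read one 4-byte little-endian int at offset 12 of the mapping,
        # without deserializing anything else in the file.
        value, = struct.unpack_from("<i", buf, 12)

        # memoryview slices of the mapping avoid copies as well.
        header = memoryview(buf)[:16]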
  7. How do I get objects into this thing? Slightly more complex
     • Define a schema that is:
       – Easily manipulated without serialization
       – Efficient in space and access time
     • Build the machinery that allows accessing it...
       – …as if it were an object
       – …without copying it to process-private memory
  8. How do I get objects into this thing? Slightly more complex
     • Define a schema that is: (structs)
       – Easily manipulated without serialization
       – Efficient in space and access time
     • Build the machinery that allows accessing it... (proxies)
       – …as if it were an object
       – …without copying it to process-private memory
  9. Structs

     In C:

       struct { int a; float b; bool c; };

     In Python:

       import struct
       struct.pack("if?", 1, 2.0, True)
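     A small round-trip sketch (not from the slides): packing a record into a writable buffer in place and reading it back with unpack_from, so nothing is serialized or copied wholesale. The fixed "=if?" layout is an assumption for illustration:

        from struct import Struct

        RECORD = Struct("=if?")                  # int32, float32, bool; standard sizes, no alignment padding

        buf = bytearray(RECORD.size * 100)       # stands in for an mmap'ed region
        RECORD.pack_into(buf, 0, 1, 2.0, True)   # write record 0 in place

        a, b, c = RECORD.unpack_from(buf, 0)     # read it back without copying the buffer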
  10. Structs
      • Why on earth get C into this?
        – Native machine code can access struct elements directly
        – Widely portable (almost every language can parse C structs in some way or another)
        – Cython
  11. Proxies
      • Classes that know where a struct lies within a buffer
      • They convert attribute access into struct access:

        x = Proxy(buf, offset=10)
        x.a  # reads the int
        x.b  # reads the float
        x.c  # reads the bool
  12. Proxies
      • Don’t require serialization
        – It’s enough to know where the struct is (i.e., have a pointer)
      • They can easily be “repointed”
        – Change the offset to switch the proxy to another object
        – Avoids Python object creation overhead
      • Relatively transparent
        – They look quite like the original object
        – They can even quack like the original as well
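      A self-contained sketch of the repointing idea; the bind name, the single a field and the record size are invented for illustration:

        from struct import unpack_from

        class Proxy:
            def __init__(self, buf, offset=0):
                self.buf = buf
                self.offset = offset

            def bind(self, offset):
                # Repoint this proxy at another struct: no new Python object is created.
                self.offset = offset
                return self

            @property
            def a(self):
                return unpack_from("=i", self.buf, self.offset)[0]

        RECORD_SIZE = 4                           # made-up fixed record size
        buf = bytearray(RECORD_SIZE * 1000)       # stands in for the shared mapping

        p = Proxy(buf)
        total = 0
        for off in range(0, len(buf), RECORD_SIZE):
            total += p.bind(off).a                # one reusable proxy scans every record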
  13. Proxies – adding complexity

      struct ComplexProxy {
          int value;
          int child_left_offset;
          int child_right_offset;
      };

      class ComplexObj:
          def __init__(self, l=None, r=None):
              self.value = 3
              self.left = l
              self.right = r
  14. Proxies – adding complexity

      from struct import unpack_from

      class IntProperty:
          def __init__(self, offset):
              self.offset = offset
          def __get__(self, obj, kls):
              return unpack_from("i", obj.buf, obj.pos + self.offset)[0]

      class ProxyProperty:
          def __init__(self, offset):
              self.offset = offset
          def __get__(self, obj, kls):
              voffset = unpack_from("i", obj.buf, obj.pos + self.offset)[0]
              return ComplexProxy(obj.buf, voffset)   # resolved at access time

      class ComplexProxy:
          def __init__(self, buf, pos):
              self.buf = buf
              self.pos = pos

          a = IntProperty(offset=0)
          b = ProxyProperty(offset=4)
          c = ProxyProperty(offset=8)
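      A hypothetical usage sketch, assuming buf is the mapping from slide 6 and a tree of these structs was packed with its root at offset 0:

        root = ComplexProxy(buf, 0)
        root.a          # the int, read in place from the shared buffer
        root.b.a        # follows child_left_offset and reads the left child's int
        root.b.c        # chains proxies; nothing is copied or deserialized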
  15. Proxies – cyclic references – OOPS!
      • It gets tricky when you add cyclic references
        – They need to be recognized when building the buffer
        – They require care, as always
      • A few options available:
        – Forbid them
        – Allow them
  16. Proxies – cyclic references – OOPS! Identity maps
      • id(object) → offset
      • When an object is packed, update the identity map
        – Check it also to detect already-packed objects
      • Compresses the file
        – Unifies repeated references to the same object
      • Breaks cycles
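      A toy sketch of the idea (not the talk's actual format), packing the ComplexObj tree from slide 13 with an identity map; the NULL sentinel and the 12-byte record layout are assumptions:

        from struct import pack

        NULL = 0xFFFFFFFF                  # assumed sentinel offset meaning "no child"

        def pack_tree(node, out, idmap):
            """Append node to out (a bytearray) and return its offset."""
            if node is None:
                return NULL
            key = id(node)
            if key in idmap:
                return idmap[key]                        # already packed: just reference it
            offset = len(out)
            idmap[key] = offset                          # register *before* recursing: breaks cycles
            out += pack("=iII", node.value, 0, 0)        # value + two reserved child slots
            left = pack_tree(node.left, out, idmap)
            right = pack_tree(node.right, out, idmap)
            out[offset + 4:offset + 12] = pack("=II", left, right)   # patch the slots
            return offset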
  17. Proxies – cyclic references – OOPS! Identity maps
      • Tricky points
        – If the buffer is built by iterating a generator, you will probably get different objects with the same id()
          • The identity map has to be synchronized with the lifetime of in-memory objects at all times: if an object is destroyed, its entry in the identity map must be removed as well
        – The identity map can get quite big
          • In particular when packing millions of objects into large buffers
  18. Manipulation without serialization
      • Building a buffer is expensive
        – Kinda like serializing, sure
      • But... using it isn’t
        – Open
        – Read
        – Search
        – Even write (up to a point)
  19. Manipulation without serialization – Structure of an object

      Attribute bitmap    present: 11010000    nulls: 11010000
      a  : 4 bytes : int
      b  : 4 bytes : float
      *c : 8 bytes : uint
  20. Manipulation without serialization – Structure of an object

      Attribute bitmap    present: 11010000    nulls: 11010000
      a  : 4 bytes  : int
      b  : 4 bytes  : float
      *c : 8 bytes  : uint
      c  : 12 bytes : str    (the data *c points to)
  21. Manipulation without serialization – Nesting objects

      Attribute bitmap    present: 11010000    nulls: 11010000
      a  : 4 bytes : int
      b  : 4 bytes : float
      *c : 8 bytes : uint
      c  : N bytes : object    (itself another bitmap-prefixed struct of the same shape)
  22. Manipulation without serialization – Dynamic typing

      Attribute bitmap    present: 11010000    nulls: 11010000
      a  : 4 bytes : int
      b  : 4 bytes : float
      *c : 8 bytes : uint
      c  : N bytes : any
           typecode : 4 bytes : int
           value    : 8 bytes : double
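      Not from the slides: a rough sketch of decoding such a bitmap-prefixed record in place, assuming one byte each for the present and nulls bitmaps, MSB-first bits, and that a null attribute still occupies its slot:

        from struct import unpack_from

        FIELDS = [("a", "=i", 4), ("b", "=f", 4), ("c", "=Q", 8)]   # name, format, size

        def read_field(buf, base, name):
            present, nulls = buf[base], buf[base + 1]
            offset = base + 2
            for bit, (fname, fmt, size) in enumerate(FIELDS):
                mask = 0x80 >> bit
                if not (present & mask):
                    continue                     # absent attribute: takes no space
                if fname == name:
                    if nulls & mask:
                        return None              # present but null (assumed semantics)
                    return unpack_from(fmt, buf, offset)[0]
                offset += size
            raise AttributeError(name)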
  23. Manipulation without serialization – Writing

      Index: *i1  *i2  *i3  *i4
      Data:  v1 : 4b    v2 : 4b    v3 : 10b    v4 : 40b

  24. Manipulation without serialization – Writing
      (same index/data diagram)
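      A rough sketch (not the talk's implementation) of what the picture implies: values live in an append-only data region and a fixed-size index stores their offsets, so replacing a value means appending a new copy and overwriting one index slot. Slot size, offset format and buffer type are assumptions:

        from struct import pack_into, unpack_from

        INDEX_SLOTS = 4
        DATA_START = INDEX_SLOTS * 8          # one 8-byte offset per index slot

        def write_value(buf, data_end, slot, payload):
            """Append payload at data_end, then point index slot at it."""
            buf[data_end:data_end + len(payload)] = payload
            pack_into("<q", buf, slot * 8, data_end)      # only the index entry changes
            return data_end + len(payload)

        def read_value(buf, slot, size):
            offset, = unpack_from("<q", buf, slot * 8)
            return bytes(buf[offset:offset + size])

        buf = bytearray(4096)                 # stands in for a writable mapping
        end = write_value(buf, DATA_START, 0, b"hello")
        read_value(buf, 0, 5)                 # b'hello'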
  25. Associative maps
      • Compact hash table:
        – Sorted array of <hash, key, value> tuples
        – Binary search optimized for uniform distributions (sketched below):
          • One prediction given the known key distribution (hash)
          • One iteration of exponential search to adjust the prediction
          • Finalize with a regular binary search
      • Approximate hash table:
        – Throw away the key, accept hash collisions as acceptable error
        – Particularly efficient with long string keys
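      A hedged sketch of that lookup, assuming hashes is a sorted sequence of roughly uniformly distributed unsigned hash values (the real index would live inside the shared buffer):

        import bisect

        def find(hashes, h, max_hash=2 ** 64):
            """Locate h in sorted hashes: predict, exponential-search, then bisect."""
            n = len(hashes)
            if n == 0:
                return -1
            # 1. Prediction: uniform hashes land near their proportional position.
            pos = min(n - 1, (h * n) // max_hash)
            # 2. Exponential search around the prediction to bracket h.
            lo, hi, step = pos, pos + 1, 1
            while lo > 0 and hashes[lo] > h:
                lo, step = max(0, lo - step), step * 2
            step = 1
            while hi < n and hashes[hi - 1] < h:
                hi, step = min(n, hi + step), step * 2
            # 3. Regular binary search on the (hopefully tiny) bracketed range.
            i = bisect.bisect_left(hashes, h, lo, hi)
            return i if i < n and hashes[i] == h else -1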
  26. Associative maps
      [Diagram: an Index of <hash, *key, *value> entries (h1..h4, *k1..*k4, *v1..*v4), a Keys region holding the key data (“pedro”, ...) and a Values region holding the values (2324, 4141, ...)]
  27. Associative maps
      [Diagram: keys “alice”, “bob”, “cloe”, “pedro” with hashes 1, 7, 7, 15 and value pointers *v1..*v4 in the Index; the Keys and Values regions as before]
      m['bob'] == v2
      m['cloe'] == v3
  28. Approximate associative maps
      [Diagram: the Keys region is gone; the Index keeps only the hashes 1, 7, 7, 15 and the value pointers *v1..*v4 into the Values region (2324, 4141, ...)]
      m['bob'] == m['cloe'] == [v2, v3]
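      Not from the slides: a minimal sketch of the approximate lookup, with hashes sorted and values parallel to it; hash_fn stands in for whatever stable hash built the index. Since keys were thrown away, every value sharing the hash comes back:

        import bisect

        def approx_get(hashes, values, key, hash_fn):
            h = hash_fn(key)
            lo = bisect.bisect_left(hashes, h)
            hi = bisect.bisect_right(hashes, h)
            return values[lo:hi]        # colliding keys ('bob', 'cloe') come back together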
  29. Speed
      • Performance:
        – Only the “hot” data set (most used) needs to fit in RAM
        – Optimized search in 2·log(ε)
          • ε being the error between the predicted and the actual position
          • ε < n
      • Approximate hash table:
        – Fixed size even with big keys (long strings)
        – Even more efficient access (no need to verify and store keys)
  30. Speed
      • Performance:
        – Good disk access pattern even if it won’t fit in RAM:
          • Exponential search is mostly sequential access
          • Good locality with good predictions – O(1) seeks on average
          • Possibility to preload the index into RAM
            – Much more likely to fit than values or keys
  31. Speed
      • Cython magic:
        – Instead of using struct everywhere
        – Avoids building Python objects for temporary operations
      • Proxy reuse:
        – Instead of building new proxies, repoint a reusable one
        – Type transmutation to change the shape of a proxy:
          proxy.__class__ = new_cls
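      A self-contained sketch of type transmutation (the classes and layout are invented here): the same buffer position is reinterpreted through a different proxy class without allocating a new object:

        from struct import unpack_from

        class PointProxy:
            def __init__(self, buf, pos):
                self.buf = buf
                self.pos = pos
            @property
            def x(self):
                return unpack_from("=i", self.buf, self.pos)[0]
            @property
            def y(self):
                return unpack_from("=i", self.buf, self.pos + 4)[0]

        class SizeProxy(PointProxy):
            # Same memory layout, different vocabulary.
            width, height = PointProxy.x, PointProxy.y

        buf = bytearray(1024)           # stands in for the mmap'ed shared buffer
        p = PointProxy(buf, 128)
        p.x, p.y                        # read as a point
        p.__class__ = SizeProxy         # transmute in place: no new object
        p.width, p.height               # the same 8 bytes, read through the new shape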