
expect("💩".length).toBe(1)

pilif
June 03, 2012

Slides for my talk at Swissjeese 2012 in Bern, Switzerland

Transcript

  1. Others (PHP) already fail here • strlen("") == 0 •

    strlen("a") == 1 • strlen("ä") == 2 • I cheated: my editor was in UTF-8 mode. I can also make strlen("ä") be 1 (or 3 or 4).
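A minimal sketch of why the byte length of "ä" is not a fixed number. It uses the Node.js `Buffer` API. The particular encodings and normalization forms chosen here are an assumption about how one could arrive at 1, 2, 3 or 4; the slide does not say which ones were meant.

```javascript
// Byte length of "ä" depends on the encoding and on the normalization form.
// Buffer is Node.js-specific; the selection of encodings is an assumption.
const precomposed = "\u00E4"; // LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308"; // "a" followed by COMBINING DIAERESIS

const byteLengths = {
  latin1: Buffer.from(precomposed, "latin1").length,            // 1
  utf8: Buffer.from(precomposed, "utf8").length,                // 2
  utf8Decomposed: Buffer.from(decomposed, "utf8").length,       // 3
  utf16leDecomposed: Buffer.from(decomposed, "utf16le").length, // 4
};

console.log(byteLengths);
```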
  2. Ah. But PHP sucks! Let’s use Ruby. • Yes, it’s unfair

    to use an outdated version of Ruby. 1.9 has (generally) fixed this.
  3. What is a string? • Compound type • Array of

    characters • C says char* • char is defined as the “smallest addressable unit that can contain basic character set”. Integer type. Might be signed or unsigned • Ends up being a byte
  4. Traditional string APIs • Length of a string? count bytes

    until the end (\0) and divide by sizeof(char) • Accessing the n-th character? Add n*sizeof(char) to the pointer • Remember: sizeof(char) usually is 1 and guess how people “optimized”
  5. Interacting with the world • Just dump the contents of

    the memory into a file • Read back the same contents and put it in memory • Problem solved. • Until you need to do this across machines
  6. Interoperability • char is inherently implementation dependent • So is

    by definition the file you dump your char* into • Can’t move files between machines
  7. ASCII • “American Standard Code for Information Interchange” • Published

    1963 • Uses 7 bits per character (circumventing the signedness-issue) • Perfectly fine for what everybody is using (English)
  8. But I need ümläüte • Machines were used where people

    speak strange languages (i.e. not English) • ASCII is 7 bits. Adding a bit gives us another 128 characters! • Depending on your country, these upper 128 characters had different meanings • No problem, as texts usually don’t leave their country
  9. Unicode 1.0 • 16 bits per character • Published in

    1991, revised in 1992 • Jumped on by everybody who wanted “to do it right” • APIs were made Unicode compliant by extending the size of a character to 16 bits. Algorithms stayed the same
  10. Still just dumping memory • wchar is 16 bits •

    Endianness? See if we care! • To save to a file: Dump memory contents. • To load from a file: Read file into memory • Note they didn’t dare extending char to 16 bits • Let’s call this “Unicode”
  11. 16 bits everywhere • Windows API (XxxxXxxW uses wchar which

    is 16 bit wide) • Java uses 16 bits • Objective C uses 16 bits • And of course, JavaScript uses 16 bits • C and by extension Unix stayed away from this.
  12. It didn’t work out so well • By just dumping

    memory, there’s no way to know how to read it back • Heuristics suck (try typing “Bush hid the facts” in Windows Notepad, saving, reloading) • Most protocols on the internet let you specify a character set
  13. BOM

  14. No. Really • Implementations lie. • Legacy software had (well.

    has.) huge problems with wide characters • Issues with updating old file formats • 65K characters are not nearly enough
  15. We learned • UTF has happened • specifically UTF-8 happened

    • Unicode 2.0 happened • Programming environments learned
  16. Unicode 2.0+ • Theoretically unlimited code space • Doesn’t talk

    about bits any more • The terminology is code point. • Currently 1.1M code points • The old characters (0000 - FFFF) are on the BMP
  17. Unicode Transformation Format • Specifies how to store Unicode on

    disk • Specifies exact byte encoding for every Unicode code point • Available with 8-, 16- and 32-bit code units • Not every byte sequence is a valid UTF byte sequence (finally!)
  18. UTF-8 • Uses an 8bit encoding to store code points

    • Is the same as ASCII for whatever’s in ASCII • Uses multiple bytes to encode code points outside of ASCII • The old algorithms don’t work any more
  19. UTF-16 • Combines the worst of both worlds • Uses

    16 bits to encode a code point in the BMP • Uses two 16-bit units (a surrogate pair) to encode a code point outside of the BMP • Wastes memory for ASCII, has byte-ordering issues and still breaks the old algorithms. • Is the only way for these 16-bit bandwagon jumpers to support Unicode 2.0 and later
  20. UTF-32 • 4 bytes per character • Byte ordering issues

    • Still breaking the old algorithms due to combining marks
  21. Strings are not bytes • A string is a sequence

    of characters • A byte array is a sequence of bytes • Both are incompatible with each other • You can encode a string into a byte array • You can decode a byte array into a string
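The encode/decode distinction can be sketched in JavaScript with `TextEncoder` and `TextDecoder`. These APIs are not mentioned in the talk; they are used here only as one way to illustrate the point, assuming a runtime that provides them (current browsers and Node.js do).

```javascript
// A string and its byte representation are different things.
const text = "\u00E4\uD83D\uDCA9"; // "ä" followed by U+1F4A9 PILE OF POO

// Encode: string -> byte array (TextEncoder always produces UTF-8).
const bytes = new TextEncoder().encode(text);

// Decode: byte array -> string (the encoding must be known).
const roundTrip = new TextDecoder("utf-8").decode(bytes);

console.log(text.length);        // 3 UTF-16 code units
console.log(bytes.length);       // 6 UTF-8 bytes (2 for ä, 4 for the pile of poo)
console.log(roundTrip === text); // true
```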
  22. Which brings us back to JS • Lives back in

    1996 • Strings specified as being stored in UCS-2 (Fixed 16 bits per character) • Leaks its implementation in the API • Doesn’t know about Unicode 2.0
  23. Browsers cheat • Browsers of course support Unicode 2.0 •

    We need to display these piles of poo! • Browsers expose Unicode strings to JS using UTF-16 • The JS API doesn’t know about UTF-16 (or Unicode 2.0)
  24. String methods are leaky • String.length counts UTF-16 code units,

    which is neither the byte length nor the character length for strings outside the BMP • substr() can break strings • charAt() can return non-existing code points • and let’s not talk about to*Case
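A short sketch of the leaks listed on this slide, using the pile of poo from the talk title. The variable names are made up for illustration.

```javascript
const poo = "\uD83D\uDCA9"; // U+1F4A9 PILE OF POO, one character outside the BMP

// length counts 16-bit code units, not characters.
const lengthInCodeUnits = poo.length; // 2

// charAt() returns a lone high surrogate, which is not a valid character.
const firstUnitHex = poo.charAt(0).charCodeAt(0).toString(16); // "d83d"

// substr() can cut a surrogate pair in half.
const broken = ("a" + poo).substr(0, 2); // "a" plus a lone high surrogate

console.log(lengthInCodeUnits, firstUnitHex, broken.length);
```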
  25. Samples • That D8 3D is half of the UTF-16 encoding

    of U+1F4A9: the high surrogate D83D. The full encoding is D83D DCA9, stored as the bytes 3d d8 a9 dc in little-endian order
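The byte sequence quoted on this slide can be reproduced with the standard UTF-16 surrogate arithmetic. The helper name `toSurrogatePair` is hypothetical, and the sketch assumes little-endian byte order, which is what the bytes on the slide suggest.

```javascript
// Split a code point above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F4A9);

// Write both code units in little-endian order and dump the bytes as hex.
const view = new DataView(new ArrayBuffer(4));
view.setUint16(0, high, true);
view.setUint16(2, low, true);
const hex = [0, 1, 2, 3]
  .map((i) => view.getUint8(i).toString(16).padStart(2, "0"))
  .join(" ");

console.log(high.toString(16), low.toString(16)); // d83d dca9
console.log(hex);                                 // 3d d8 a9 dc
```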
  26. Et tu RegEx? • Character classes don’t work right •

    Counting characters doesn’t work right • Can break strings
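The three regular expression problems can be sketched as follows. The `u` flag used for comparison comes from ES2015, which postdates this talk, so the contrast assumes a modern engine.

```javascript
const poo = "\uD83D\uDCA9"; // U+1F4A9 PILE OF POO

// Counting characters: "." matches a single 16-bit code unit without the u flag.
const dotWithoutU = /^.$/.test(poo); // false
const dotWithU = /^.$/u.test(poo);   // true

// Character classes: the class contains two separate code units, not one character.
const classWithoutU = /^[\uD83D\uDCA9]$/.test(poo); // false
const classWithU = /^[\uD83D\uDCA9]$/u.test(poo);   // true

// Breaking strings: a pattern can match half of a surrogate pair.
const stripped = poo.replace(/\uDCA9/, ""); // leaves a lone high surrogate

console.log(dotWithoutU, dotWithU, classWithoutU, classWithU, stripped.length);
```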
  27. Intermission: Digraphs • ä is not the same as ä

    • ä can be “LATIN SMALL LETTER A WITH DIAERESIS” • it can also be “LATIN SMALL LETTER A” followed by “COMBINING DIAERESIS” • both look exactly the same
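The two spellings of ä can be compared directly. This sketch relies on `String.prototype.normalize`, which arrived with ES2015 and was not available when the talk was given.

```javascript
const precomposed = "\u00E4"; // LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308"; // LATIN SMALL LETTER A + COMBINING DIAERESIS

// They render the same but are different strings.
const sameAsIs = precomposed === decomposed; // false

// Normalizing both to the same form makes them comparable.
const sameAfterNFC = precomposed.normalize("NFC") === decomposed.normalize("NFC"); // true
const sameAfterNFD = precomposed.normalize("NFD") === decomposed.normalize("NFD"); // true

console.log(precomposed.length, decomposed.length); // 1 2
console.log(sameAsIs, sameAfterNFC, sameAfterNFD);
```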
  28. PHP • At least you get to choose the internal

    encoding. • PHP only does bytes by default. strlen() means bytelen() • Forget a /u in preg_match and you’ll destroy strings. \s can match part of a UTF-8 encoded ä (U+00E4 is 0xC3 0xA4 in UTF-8, and without /u the pattern sees single bytes) • Use any non mb_* function on a UTF-8 string to break it
  29. Python < 3.3 • They do clearly separate bytes and

    strings • Use str.encode() to create bytes and bytes.decode() to go back to strings • Unfortunately, UCS2 (mostly)
  30. Some did it ok • Python 3.3 (PEP 393) •

    Ruby 1.9 (avoids political issues by giving a lot of freedom) • Perl (awesome libraries since forever) • ICU, ICU4C (http://icu-project.org/)
  31. Solutions for JS • Discussions happening for ES6 • Usable by 2040 or

    later I guess • On the server: Use ICU • Only normalization currently available at https://github.com/astro/node-stringprep • Manual bit-twiddling • Regular expressions will still be broken • Problem safe to ignore?
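As an illustration of the "manual bit-twiddling" option, here is a minimal sketch that counts code points by pairing up surrogates by hand. The function name `countCodePoints` is hypothetical. It counts code points, not user-perceived characters, so combining marks are still counted separately.

```javascript
// Count code points in a string by treating each valid surrogate pair as one.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      // A high surrogate followed by a low surrogate is a single code point.
      if (next >= 0xDC00 && next <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}

console.log(countCodePoints("\uD83D\uDCA9"));   // 1
console.log(countCodePoints("a\uD83D\uDCA9b")); // 3
console.log(countCodePoints("a\u0308"));        // 2, combining mark counted on its own
```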
  32. This was just the tip of the iceberg! • Localization

    issues (Collation, Case change) • Security issues (Encoding, Homographs) • Broken Software (including “US UTF-8”)
  33. Thank you! • @pilif on twitter • https://github.com/pilif/

    Also: We are looking for a front-end designer with CSS skills. Send them to me if you know them (or are one)