UTF8: The last encoding you will ever need

What charsets and encodings are, the advantages and disadvantages of each of the available options, and, finally, why UTF-8 is the last encoding you will ever need.

Ricardo Coelho

October 20, 2016

Transcript

  1. UNICODE: LET THERE BE

    (“Let there be a unique code system”, in several languages)
    יהי מערכת קוד ייחודי • fiat unum codice • एक अद्वितीय कोड सिस्टम हुन त्यहाँ गरौं • hãy có một hệ thống mã độc đáo • ας υπάρχει ένα μοναδικό σύστημα κωδικού
  2. THE UNICODE PENTALOGUE

    • Thou shalt support every alphabet known to men
    • Thou shalt have ASCII compatibility
    • Thou shalt depend upon no code page
    • Thy symbol shalt have its own code point (sometimes more)
    • Thou shalt support bi-directional languages
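The "sometimes more" in the fourth commandment can be illustrated with Python's standard `unicodedata` module: one visible symbol may exist both as a single precomposed code point and as a base letter plus a combining mark. This is only a minimal sketch; the sample character "é" is chosen here for illustration and does not come from the slides.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# Both strings render the same, but they are different code point sequences.
same_text = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalization maps both spellings onto one canonical form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

print(same_text, lengths, nfc_equal, nfd_equal)
```

Comparing user-visible text therefore usually calls for normalizing both sides first.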
  3. UCS-2: 16 BITS TO RULE THEM ALL. OR, IS IT?

    2^16 = 65,536
    “640K ought to be enough for anybody.” – Gates, William (64K)
    This encoding later evolved into UTF-16
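Sixteen bits give only 65,536 distinct values, so UTF-16 reaches the remaining code points by combining two 16-bit units into a surrogate pair. The sketch below shows that arithmetic; the helper name `to_surrogate_pair` and the sample code point U+1F600 are assumptions made for illustration, not part of the deck.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper, written for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


# A single 16-bit unit can only address this many values.
bmp_size = 2 ** 16

high, low = to_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(bmp_size, f"{high:04X} {low:04X}")
```

The result can be cross-checked against Python's built-in codec by encoding the same character with `"utf-16-be"`.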
  4. 32 BITS TO RULE THEM ALL. SURE, THAT WILL WORK!

    This encoding later evolved into UTF-32
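A fixed 32-bit width makes every code point the same size, at the cost of space. The following sketch, using Python's built-in `utf-32-be` codec and sample characters picked for illustration, shows that each code point occupies exactly four bytes regardless of its value.

```python
# Sample characters are illustrative; each string holds one code point.
samples = ["A", "é", "€", "\U0001F600"]

for ch in samples:
    # An explicit byte order ("-be") means no byte order mark is prepended.
    utf32 = ch.encode("utf-32-be")
    print(f"U+{ord(ch):04X} -> {utf32.hex()} ({len(utf32)} bytes)")

widths = [len(ch.encode("utf-32-be")) for ch in samples]
```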
  5. PLANES, NOT PAGES ;)

    Basic Multilingual Plane (BMP): U+0000 – U+FFFF
    Control ASCII, ASCII, Extended ASCII (including Box Drawing), Latin, Greek, Cyrillic (including Supplement), Hebrew, Arabic, Bengali, Runic, Thai, Cherokee, Phonetic, Balinese, Hiragana, Katakana, Vedic, Braille… (ISO-8859-x)
  6. PLANES, NOT PAGES ;)

    Supplementary Multilingual Plane (SMP): U+10000 – U+1FFFF
    Cuneiform, Hieroglyphs, Ancient Persian, Musical Symbols (Modern and Ancient), Mahjong, Domino and Cards Symbols, Emoticons, Alchemical Symbols
  7. PLANES, NOT PAGES ;)

    Supplementary Ideographic Plane (SIP): U+20000 – U+2FFFF
    Rarely Used Chinese/Japanese/Korean Ideographs
    Han Unification: Hanzi, Kanji & Hanja
  8. PLANES, NOT PAGES ;)

    11 Unassigned Planes (3-13): U+30000 – U+DFFFF
    Plane 3 will probably be called the Tertiary Ideographic Plane (TIP)
    Reserved for: Oracle Bone script, Bronze Script, Small Seal Script
    No glyphs assigned yet (11 of 17 planes, 64.7% of the total)
  9. PLANES, NOT PAGES ;)

    Supplementary Special-purpose Plane: U+E0000 – U+EFFFF
    Invisible Text Tags (which are not recommended) and CJK Variations (99.5% unassigned)
  10. PLANES, NOT PAGES ;)

    Private Use Area Planes (15 & 16): U+F0000 – U+10FFFF
    Ligatures, auxiliary glyphs and glyph building blocks
    Limited interoperability, assignment from outside ISO and the Unicode Consortium
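Because each plane spans exactly 0x10000 code points, the plane of any code point is simply its value shifted right by 16 bits. The sketch below applies that to the ranges listed on the planes slides; the helper names `plane_of` and `plane_name` are hypothetical, and the plane labels follow the slides as written in 2016.

```python
# Plane labels taken from the slides; planes 3-13 are treated as unassigned,
# as the deck describes them.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane",
    15: "Private Use Area Plane",
    16: "Private Use Area Plane",
}


def plane_of(code_point):
    """Return the plane number (0-16) of a code point. Hypothetical helper."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16   # each plane holds 0x10000 code points


def plane_name(code_point):
    """Return the slide's label for the plane containing code_point."""
    return PLANE_NAMES.get(plane_of(code_point), "Unassigned")


total_planes = plane_of(0x10FFFF) + 1                  # planes 0 through 16
unassigned_share = round(11 / total_planes * 100, 1)   # planes 3-13

for cp in (0x0041, 0x1F600, 0x20000, 0x30000, 0xE0001, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane_of(cp)}, {plane_name(cp)}")
print(total_planes, unassigned_share)
```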
  11. SECURITY CONCERNS FOR UNICODE TRANSFORMATION FORMATS

    • UTF-16 APIs
    • UTF-32 APIs
    • UTF-EBCDIC
    • Endianness
    • UTF-16LE / UTF-16BE
    • UTF-32LE / UTF-32BE
    • Efficiency
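The endianness concern can be made concrete with Python's built-in codecs: the same 16-bit unit serializes to different byte sequences depending on byte order, and decoding with the wrong order silently produces a different character instead of an error. This is a minimal sketch with a sample character chosen for illustration.

```python
import codecs

text = "A"   # U+0041, an illustrative sample

le = text.encode("utf-16-le")   # low byte first
be = text.encode("utf-16-be")   # high byte first

# Decoding little-endian bytes as big-endian does not fail;
# it yields U+4100, an unrelated character.
wrong = le.decode("utf-16-be")

# A byte order mark (BOM) lets the generic "utf-16" decoder pick the order.
with_bom = codecs.BOM_UTF16_LE + le
recovered = with_bom.decode("utf-16")

# UTF-8 works on single bytes, so it has no byte order to get wrong.
utf8 = text.encode("utf-8")

print(le, be, f"U+{ord(wrong):04X}", recovered, utf8)
```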
  12. UNICODE IN THE REAL WORLD

    HOW TO FIT 32 BITS INSIDE AN 8-BIT CHANNEL
    Credit: Chima by LEGO™
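One way to see how a wide code point fits through an 8-bit channel is to write the UTF-8 bit packing by hand. The function below is a sketch for illustration (the name `utf8_encode_code_point` is hypothetical, and real code should use the language's built-in encoder): code points below U+0080 stay one ASCII byte, and larger ones are split into a lead byte plus continuation bytes that each carry six bits.

```python
def utf8_encode_code_point(code_point):
    """Encode one code point as UTF-8 bytes. Hand-rolled for illustration."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 0xxxxxxx: plain ASCII, unchanged
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare with Python's own encoder on a few sample code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode_code_point(cp)
    print(f"U+{cp:04X} -> {encoded.hex()}", encoded == chr(cp).encode("utf-8"))
```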
  13. UTF-8

    • No conversion ever needed again!
    • Your OS supports it
    • Your application server supports it
    • Your application supports it
    • Your database server supports it
    • Your API supports it
    • Your friends support it
    Let’s U+!
  14. PRACTICAL ADVICE FOR PHP PROGRAMMERS

    • Forget about utf8_encode / utf8_decode
    • Try iconv. You can thank me later
    • Find an mb_ version of your string function
    • mb_detect_encoding does the best it can (not its fault)
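The ideas behind this advice are not PHP-specific, so they can be sketched in Python as an analogue (this is not PHP code, and the sample word is an assumption for illustration): byte counts differ from character counts, which is why multibyte-aware string functions matter; conversion should name both the source and target encoding explicitly, as iconv does; and detection is guesswork because the same bytes decode without error under more than one encoding.

```python
text = "café"   # illustrative sample
utf8_bytes = text.encode("utf-8")

# Byte-oriented and character-oriented lengths disagree for non-ASCII text.
byte_count = len(utf8_bytes)
char_count = len(text)

# The same bytes are also valid ISO-8859-1, so a detector cannot be sure;
# decoding with the wrong guess yields mojibake rather than an error.
misread = utf8_bytes.decode("iso-8859-1")

# Explicit conversion: state the source encoding, then the target encoding.
latin1_bytes = text.encode("iso-8859-1")
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(byte_count, char_count, misread, converted == utf8_bytes)
```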
  15. REFERENCES

    • http://www.babelstone.co.uk/Unicode/unicode.html
    • http://en.wikipedia.org/wiki/%D0%AF#Computing_codes
    • http://en.wikipedia.org/wiki/UTF-8#Examples
    • http://www.fileformat.info/info/unicode/char/2F80/index.htm
    • http://dev.mysql.com/doc/refman/5.5/en/charset-general.html
    • http://blog.tremend.ro/2006/09/26/mysql-php-and-utf8/
    • http://www.php.net/manual/en/function.mb-detect-encoding.php
    • http://www.php.net/manual/en/function.iconv.php
    • http://php.net/manual/en/function.utf8-decode.php
    • https://en.wikipedia.org/wiki/Plane_(Unicode)
    • https://en.wikipedia.org/wiki/Unicode
    • https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
    • http://en.wikipedia.org/wiki/ASCII
    • http://www.asciitable.com/
    • http://en.wikipedia.org/wiki/Teleprinter
    • http://en.wikipedia.org/wiki/Microprocessor#8-bit_designs
    • http://en.wikipedia.org/wiki/Code_page_437
    • http://en.wikipedia.org/wiki/Code_pages#IBM_PC_.28OEM.29_code_pages
    • http://en.wikipedia.org/wiki/Iso-8859