Pythons in the tower of Babel

Pythons within the tower of Babel
Or how to decode those Unicode errors
Mario Corchero, News Automation @Bloomberg

History of text - ASCII
• American Standard Code for Information Interchange
• 0-127

ASCII – Like EST for text
• Accents?
• Chinese?
• Arabic?
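To make the limitation concrete, here is a minimal Python 3 sketch. The sample strings are illustrative choices, not taken from the talk. It simply tries to encode each one as ASCII and records whether that works.

```python
# ASCII only defines code points 0-127, so accented, Chinese and Arabic
# text cannot be represented with it.
samples = ["hello", "caf\u00e9", "\u6c49\u8bed", "\u0639\u0631\u0628\u064a"]

results = {}
for text in samples:
    try:
        text.encode("ascii")
        results[text] = True
    except UnicodeEncodeError:
        results[text] = False

for text, fits in results.items():
    print(text, "fits in ASCII" if fits else "does not fit in ASCII")
```

Only the plain English sample survives; the other three raise UnicodeEncodeError.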

Unicode
• 0 to 0x10FFFF
• A table that maps all possible characters to a “code point”
• A Unicode string is a sequence of code points
Glyph: A
Code point: U+0041
Character: Latin capital letter A
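The glyph / code point / character triple can be inspected from Python 3 with the standard library. This is a small sketch; the variable names are only for illustration.

```python
import sys
import unicodedata

letter = "A"
code_point = ord(letter)                # the integer code point
label = "U+{:04X}".format(code_point)   # conventional U+XXXX notation
name = unicodedata.name(letter)         # official character name

print(letter, label, name)

# A text string is just a sequence of code points.
points = [ord(ch) for ch in "\u6c49\u8bed"]

# The highest code point Python can represent.
largest = sys.maxunicode
```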

Unicode - Save that to disk
• A rule to translate a Unicode string to bytes is called an encoding.
The same way we encode/decode letters into sound, the Unicode encodings allow us to encode/decode text into bytes.
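As a minimal Python 3 sketch of that rule, the same two-character string used in the JSON slides can be turned into UTF-8 bytes and back. Choosing UTF-8 here is an assumption for the example; any encoding that covers the characters would do.

```python
text = "\u6c49\u8bed"            # two code points: U+6C49 U+8BED

data = text.encode("utf-8")      # text -> bytes, ready to be written to disk
restored = data.decode("utf-8")  # bytes -> text, after reading it back

print(len(text), "code points became", len(data), "bytes:", data)
```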

Some encodings
• UTF-16
• UTF-32
• UTF-8 (variable length)
• ASCII (Python 2 default)
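The difference between these encodings is easiest to see by counting bytes. The sketch below (Python 3, sample characters chosen for illustration) uses the little-endian variants so that no byte order mark is added to the output.

```python
# One ASCII letter, one accented letter, one CJK character, one emoji.
samples = ["A", "\u00e9", "\u6c49", "\U0001F600"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, per_encoding in sizes.items():
    print("U+{:04X}".format(ord(ch)), per_encoding)
```

UTF-8 grows from 1 to 4 bytes per character and UTF-32 always uses 4. UTF-16 uses 2 bytes for most characters but needs 4 for code points above U+FFFF, such as the emoji.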

Python
Python 2: Default encoding: ascii, implicit conversions
Python 3: Default encoding: utf8
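A short sketch of the Python 3 side of this comparison, with made-up sample values: the default encoding is UTF-8, and mixing bytes with text is refused instead of being converted implicitly.

```python
import sys

default = sys.getdefaultencoding()   # 'utf-8' on Python 3

# Python 2 would try an implicit ASCII conversion here; Python 3 refuses.
try:
    combined = b"caf\xc3\xa9" + " au lait"
except TypeError:
    combined = None

# The conversion has to be spelled out.
explicit = b"caf\xc3\xa9".decode("utf-8") + " au lait"
print(default, explicit)
```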

Conversions
• Unicode to Bytes: encode
• Bytes to Unicode: decode
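In Python 3 terms the two directions look like the sketch below. The sample name is illustrative, and the `errors="replace"` call is shown only as one possible way to handle bytes that do not match the codec.

```python
text = "Zo\u00eb"

raw = text.encode("utf-8")    # Unicode -> bytes
back = raw.decode("utf-8")    # bytes -> Unicode

# Decoding with a codec that does not match the data fails loudly...
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ...unless an error handler is requested, which loses information.
lenient = raw.decode("ascii", errors="replace")
print(type(raw).__name__, type(back).__name__, lenient)
```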

Encodings

Take home Bytes vs Unicode – There is no “string”

Take home Encodings – Know your stuff

Take home Python2 vs Python3

Take home Unicode Sandwich
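The sandwich means bytes on the outside and Unicode on the inside: decode as soon as data enters the program, work only with text, and encode just before data leaves. A minimal sketch, where `shout` is a hypothetical helper invented for this example:

```python
def shout(raw_input, encoding="utf-8"):
    """Hypothetical helper: bytes in, bytes out, text in the middle."""
    text = raw_input.decode(encoding)    # bottom slice: decode on input
    processed = text.upper()             # filling: operate on text only
    return processed.encode(encoding)    # top slice: encode on output


result = shout("ni\u00f1o".encode("utf-8"))
print(result.decode("utf-8"))
```

Keeping the encode and decode calls at the edges means the rest of the code never has to guess what encoding a value is in.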

Take home
• Bytes vs Unicode – There is no “string”
• Encodings – Know your stuff
• Python2 vs Python3
• Unicode Sandwich
• Test with unicode

Common errors

Interpreter
“SyntaxError: Non-ASCII character '\xeb' in file test.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details”
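PEP 263 fixes this with an encoding declaration in a comment on the first or second line of the file. The sketch below is Python 3, where UTF-8 is already the default source encoding, so the declaration matters mostly for Python 2 or for files saved in another encoding. The `tokenize.detect_encoding` call is used here only to show which encoding the interpreter would pick up; the sample source is made up.

```python
# -*- coding: utf-8 -*-
import io
import tokenize

name = "Zo\u00eb"

# A tiny source file, as bytes, carrying its own encoding declaration.
source = "# -*- coding: utf-8 -*-\nname = 'Zo\u00eb'\n".encode("utf-8")

# Ask the tokenizer which encoding it detects for that source.
declared, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
print(declared)
```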

Terminal setup
“UnicodeEncodeError: 'charmap' codec can't encode character u'\u1234' in position 0: character maps to <undefined>”
This means your console is not set up to display Unicode.
See https://wiki.python.org/moin/PrintFails
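The same failure can be reproduced without touching the console by encoding to a legacy code page by hand. This Python 3 sketch assumes cp1252 as the console encoding purely for illustration; the real value is whatever `sys.stdout.encoding` reports on a given machine.

```python
import sys

char = "\u1234"

# What the interpreter believes the console expects (may be None).
console_encoding = getattr(sys.stdout, "encoding", None)

# Simulate a console limited to the cp1252 code page.
try:
    char.encode("cp1252")
    encodable = True
    message = ""
except UnicodeEncodeError as exc:
    encodable = False
    message = str(exc)

# An error handler avoids the crash at the cost of an escaped character.
fallback = char.encode("cp1252", errors="backslashreplace")
print(console_encoding, message, fallback)
```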

Operations on str objects

JSON
JSON module is broken (yes and no)
When parsing in JavaScript I get: "æ± è¯"
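Output like this is typical of UTF-8 bytes being read with a single-byte encoding. The sketch below reproduces it in Python 3, assuming the consumer decoded the bytes as Latin-1. That assumption is not stated in the slides.

```python
original = "\u6c49\u8bed"

wire_bytes = original.encode("utf-8")      # what was actually sent
garbled = wire_bytes.decode("latin-1")     # what a Latin-1 reader sees

# Because Latin-1 maps every byte to a character, the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repr(garbled), repaired)
```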

JSON
Input                                   ensure_ascii   Output
u"汉语"                                  True           '"\\u6c49\\u8bed"'
u"汉语"                                  False          u'"\u6c49\u8bed"'
u"汉语".encode("utf-8")                  True           '"\\u6c49\\u8bed"'
u"汉语".encode("utf-8")                  False          '"\xe6\xb1\x89\xe8\xaf\xad"'
[u"汉语", u"汉语".encode("utf-8")]        True           '["\\u6c49\\u8bed", "\\u6c49\\u8bed"]'
[u"汉语", u"汉语".encode("utf-8")]        False          UnicodeDecodeError
There is no binary in the JSON format!
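The table describes Python 2. As a hedged comparison, the sketch below shows the Python 3 `json` module, where the text rows behave the same way but bytes are rejected outright, which matches the point that JSON has no binary type.

```python
import json

text = "\u6c49\u8bed"

escaped = json.dumps(text)                       # ensure_ascii=True is the default
literal = json.dumps(text, ensure_ascii=False)   # keeps the characters as they are

# Python 3 refuses to serialize bytes instead of guessing an encoding.
try:
    json.dumps(text.encode("utf-8"))
    bytes_ok = True
except TypeError:
    bytes_ok = False

print(escaped, literal, bytes_ok)
```

Both outputs parse back to the same text; the only difference is whether non-ASCII characters are escaped in the serialized form.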

Formatting with str
from __future__ import unicode_literals
Change the default encoding? No thanks
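Formatting is where bytes and text get mixed by accident. In Python 2, interpolating a non-ASCII byte string into a unicode template triggers an implicit ASCII decode and a UnicodeDecodeError. Python 3 does not raise, but inserts the repr of the bytes instead, as this small sketch with a made-up name shows.

```python
raw_name = b"Zo\xc3\xab"   # bytes, e.g. as read from a socket or file

# Python 3 silently formats the repr of the bytes object.
wrong = "Hello %s" % raw_name

# Decoding first gives the intended result.
right = "Hello %s" % raw_name.decode("utf-8")
print(wrong)
print(right)
```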

Opening files
• Python2 open works with bytes
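One common way around this is `io.open` with an explicit encoding. It exists in Python 2.6+ and is the same function as the built-in `open` in Python 3. The sketch below is written for Python 3, and the temporary file name is an arbitrary choice for the example.

```python
import io
import os
import tempfile

text = "\u6c49\u8bed"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Text mode with an explicit encoding: the file object encodes for us.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Binary mode shows the bytes that actually landed on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Text mode again: the file object decodes for us.
with io.open(path, "r", encoding="utf-8") as handle:
    decoded = handle.read()

print(raw, decoded)
```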

Take home

Take home
I ♥ Unicode
And utf-8 is my encoding
Move to Py3

Questions?
