by 97.9% of all the websites whose character encoding they know.
  • https://w3techs.com/technologies/overview/character_encoding
• Many programming languages support UTF-8
  • Ruby, Python, Rust, C++[^1], PHP, Java[^2]

[^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
[^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400
at the first n bits of a byte
• UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes
• The leading bits of a byte mark it as the start of a 1-, 2-, 3-, or 4-byte sequence, or as an ongoing (continuation) byte
  • 1 byte: 0xxxxxxx
  • 2 bytes: 110xxxxx
  • 3 bytes: 1110xxxx
  • 4 bytes: 11110xxx
  • ongoing: 10xxxxxx
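The bit patterns above can be checked with simple masks. The following is a minimal sketch, not a full validator: the function name `classify_byte` and its return labels are made up for illustration, and it only inspects the leading bits of a single byte (it does not reject bytes such as 0xC0 that match a pattern but never occur in valid UTF-8).

```python
def classify_byte(byte: int) -> str:
    """Classify one byte by its leading bits (hypothetical helper)."""
    if byte & 0b1000_0000 == 0b0000_0000:
        return "1-byte"          # 0xxxxxxx
    if byte & 0b1100_0000 == 0b1000_0000:
        return "ongoing"         # 10xxxxxx
    if byte & 0b1110_0000 == 0b1100_0000:
        return "2-byte lead"     # 110xxxxx
    if byte & 0b1111_0000 == 0b1110_0000:
        return "3-byte lead"     # 1110xxxx
    if byte & 0b1111_1000 == 0b1111_0000:
        return "4-byte lead"     # 11110xxx
    return "invalid"             # 11111xxx never appears in UTF-8


# Show each byte of a short string in binary next to its classification.
for b in "aé".encode("utf-8"):
    print(f"{b:08b}", classify_byte(b))
```

Because an ongoing byte always starts with `10`, a reader that lands in the middle of a sequence can tell it is not at a character boundary.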
a value that uniquely defines a Unicode character
• UTF-8 encodes Unicode code points
  • UTF-16, UTF-32
• One-to-one correspondence between valid UTF-8 byte sequences and Unicode code points
  • Byte sequences can be converted to Unicode code points
  • Unicode code points can be converted to byte sequences
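The two-way conversion can be sketched in a few lines. This is an illustrative example under stated assumptions: `encode_code_point` and `decode_code_point` are hypothetical names, the encoder packs the bits by hand to show the layout, and the decoder simply relies on Python's built-in `bytes.decode` and `ord`.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not encodable in UTF-8: {cp:#x}")
    if cp <= 0x7F:       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_code_point(data: bytes) -> int:
    """Decode a UTF-8 sequence holding exactly one character."""
    return ord(data.decode("utf-8"))


# Round trip for U+3042 (HIRAGANA LETTER A).
encoded = encode_code_point(0x3042)
print(encoded.hex(" "), hex(decode_code_point(encoded)))
```

The hand-written encoder agrees with `chr(cp).encode("utf-8")` for the code points it accepts, which is a quick way to sanity-check the bit packing.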