String usage: so many tools are already in your hands!

SymfonyCon Brussels 2023 - Marion Hurteau @marionherisson / 📧 [email protected] 1

Hello World 👋 Marion Hurteau @MarionHerisson /MarionLeHerisson 📧 [email protected] 2

Hello World 👋 JoliCode since 2019 ➡ Consulting, production, audit, expertise and training 🏰 Pony club, castle & home-made beer. Drinking alcohol is dangerous for your health. Drink in moderation (and in good company) 3

What is a string? 4

“What is a string?” is a string. 💁 💬 Anything writable, printable, or audible really 💻 💬 0 and 1 ❓ 5

Glyph ~ is a glyph and so are n or
ñ → an image, an abstract form 6

Grapheme ~ is a grapheme and n is another grapheme
→ a minimally distinctive unit of writing →linked to the context of a particular writing system 7

Grapheme cluster ñ is a grapheme cluster → think of
it as a character 8

Grapheme n is a grapheme cluster • a minimally distinctive
unit of writing in the context of a particular writing system • think of it as a character Diacritic ~ 9

A bit of history 🔎 10

The first encodings 💻 💬 « 01100001 » means « a » 💁 💬 Or does it? 11

The first encodings 1963 : ANSI : ASCII ! 7 bits → 128 characters 12

ASCII 13

Not cool for the rest of the world German :
Schildkröte 🐢 Swedish : Skål! 🍻 French : Éléphant 🐘 14

The first encodings 1963 : ASCII ! 7 bits → 128 characters 1972 : 8-bit CPUs are here ! 8 bits → 256 characters 15

Mojibake 18

Mojibake 19

Mojibake 20

The first encodings 1963 : ASCII ! 7 bits → 128 characters 1972 : 8-bit CPUs are here ! 8 bits → 256 characters 1991 : Unicode V1.0 ! • universal • uniform • unique 21

The first encodings 1963 : ASCII ! 7 bits → 128 characters 1972 : 8-bit CPUs are here ! 8 bits → 256 characters 1991 : Unicode V1.0 ! 16 bits → 65 536 characters • universal • uniform • unique 22

A’s code point is U+0041 B’s is U+0042 … Ω’s
is U+2126 Code points 23

The first encodings 1963 : ASCII ! 7 bits → 128 characters 1972 : 8-bit CPUs are here ! 8 bits → 256 characters 1991 : Unicode V1.0 ! 16 bits → 65 536 characters 1996 : Unicode V2.0 presents UTF ! Encoding ≠ Code Point - UTF-32 → 32-bit code units, 4 bytes - UTF-16 → 16-bit code units, 2 or 4 bytes - UTF-8 → 8-bit code units, 1 to 4 bytes 24

BOM • Encoded as FE FF (or FF FE if
you are swapping high and low bytes) • Indicates that the text’s encoding is Unicode ◦ and in which Unicode encoding • Byte order (endianness) of the text’s stream for 16-bits & 32-bits encodings 26
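The BOM rules above can be sketched in plain PHP. This is a minimal, hedged example: `detectBom()` is a hypothetical helper name, not a PHP or Symfony function, and it only recognises the five common Unicode BOMs. The UTF-32 signatures are tested first because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE one.

```php
<?php
// Hypothetical helper: returns the encoding announced by a BOM, or null.
function detectBom(string $content): ?string
{
    // Order matters: FF FE 00 00 (UTF-32LE) must be tested before FF FE (UTF-16LE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];

    foreach ($boms as $encoding => $bom) {
        // strncmp() is binary safe, so the NUL bytes are compared too.
        if (strncmp($content, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return null; // no BOM: the encoding has to be known some other way
}

var_dump(detectBom("\xEF\xBB\xBFWEBVTT"));
var_dump(detectBom('WEBVTT'));
```

Note that a UTF-16LE text whose first character is U+0000 would be misread as UTF-32LE by this sketch; that ambiguity is inherent to BOM sniffing.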

BOM if (strpos($signature, 'WEBVTT') !== 0) { $parsing_errors[] = 'Missing
"WEBVTT" at the beginning of the file'; } 27

BOM // Strip UTF-8 BOM $bom = pack('CCC', 0xEF, 0xBB, 0xBF); if (substr($content, 0, 3) === $bom) { $content = substr($content, 3); } 28

BOM Handled in the Serializer component : • stripped when
decoding csv $csv = "\xEF\xBB\xBF".<<<'CSV' foo,bar hello,"hey ho" CSV; $this->encoder->decode($csv, 'csv', [CsvEncoder::AS_COLLECTION_KEY => false]); // ['foo' => 'hello', 'bar' => 'hey ho'] 29

BOM Handled in the Serializer component : • stripped when decoding csv • can be added on demand in the output $this->encoder->encode($value, 'csv', [CsvEncoder::OUTPUT_UTF8_BOM_KEY => true]); 30

✨ UTF-8 ✨ • 8-bit code units • 0-127 : 1 byte → Backward compatibility with ASCII • 128+ : 2 to 4 bytes (the original design allowed up to 6) 31
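To make the byte ranges concrete, here is a hand-written UTF-8 encoder. It is a sketch for illustration only (`utf8Encode()` is a hypothetical name; in real code `mb_chr()` does this job). It follows the current four-byte definition of UTF-8 and rejects surrogates and values above U+10FFFF.

```php
<?php
// Hypothetical helper: encodes one Unicode scalar value as UTF-8 bytes.
function utf8Encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a Unicode scalar value');
    }
    if ($cp < 0x80) {          // 0xxxxxxx: identical to ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {         // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8Encode(0x41)), "\n";    // 41
echo bin2hex(utf8Encode(0xE9)), "\n";    // c3a9
echo bin2hex(utf8Encode(0x20AC)), "\n";  // e282ac
echo bin2hex(utf8Encode(0x1F926)), "\n"; // f09fa4a6
```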

Trivia: Replacement character ≠ 32

Standards 33

The first encodings 1963 : ASCII ! 7 bits → 128 characters 1972 : 8-bit CPUs are here ! 8 bits → 256 characters 1991 : Unicode V1.0 ! 16 bits → 65 536 characters 1996 : Unicode V2.0 presents UTF ! Encoding ≠ Code Point 2023 : Unicode V15.0 → 149 186 characters and around 245 000 code points assigned in a space that can contain up to 1 114 112 different code points 35

To sum it up
Graphemes     C  a  f  é
Glyphs        C  a  f  e  ́
Code points   U+0043 U+0061 U+0066 U+0065 U+0301
UTF-32 bytes  00 00 00 43  00 00 00 61  00 00 00 66  00 00 00 65  00 00 03 01
UTF-16 bytes  00 43  00 61  00 66  00 65  03 01
UTF-8 bytes   43  61  66  65  CC 81
36
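The byte rows of this summary can be reproduced with the mbstring extension. A small sketch, assuming mbstring is enabled; `hexBytes()` is a hypothetical helper written for this example.

```php
<?php
// Decomposed "Café": C, a, f, e, then U+0301 COMBINING ACUTE ACCENT.
$cafe = "Cafe\u{0301}";

// Hypothetical helper: hex dump of a UTF-8 string re-encoded in $encoding.
function hexBytes(string $utf8, string $encoding): string
{
    $bytes = mb_convert_encoding($utf8, $encoding, 'UTF-8');

    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

echo hexBytes($cafe, 'UTF-32BE'), "\n";
echo hexBytes($cafe, 'UTF-16BE'), "\n"; // 00 43 00 61 00 66 00 65 03 01
echo hexBytes($cafe, 'UTF-8'), "\n";    // 43 61 66 65 CC 81
```

The big-endian variants are used so that the output matches the byte order shown on the slide.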

Symfony’s String component 🧶 40

Symfony’s string component Provides an object oriented approach to strings
manipulation. new UnicodeString('Å'); new ByteString('Å'); new CodePointString('Å'); 41 u("Å"); b("Å"); s("Å");

Symfony’s string component $text = (new UnicodeString('This is a déjà-vu
situation.')) ->trimEnd('.') ->replace('déjà-vu', 'jamais-vu') ->append('!'); // $text = 'This is a jamais-vu situation!' 42

Symfony’s string component u('foo BAR')->upper(); // 'FOO BAR' u('FOO Bar')->lower();
// 'foo bar' u('Die O\'Brian Straße')->folded(); // "die o'brian strasse" u('Foo: Bar-baz.')->camel(); // 'fooBarBaz' u('Foo: Bar-baz.')->snake(); // 'foo_bar_baz' u('Foo: Bar-baz.')->camel()->title(); // 'FooBarBaz' 43

Symfony’s string component u('abc')->indexOf('B'); // null u('abc')->ignoreCase()->indexOf('B'); // 1 u('hello')->append('world');
// 'helloworld' u('hello')->append(' ', 'world'); // 'hello world' u('User')->ensureEnd('Controller'); // 'UserController' u('UserController')->ensureEnd('Controller'); // 'UserController' 44

Symfony’s string component u(' Lorem Ipsum ')->padBoth(20, '-'); // '---
Lorem Ipsum ----' u('_.')->repeat(10); // '_._._._._._._._._._.' u(' Lorem Ipsum ')->trim(); // 'Lorem Ipsum' u('http://symfony.com')->replace('http://', 'https://'); u('Symfony is great')->slice(0, 7); // 'Symfony' u('template_name.html.twig')->split('.'); // ['template_name', 'html', 'twig'] 45

ByteString specific methods // returns TRUE if the string contents
// are valid UTF-8 contents b('Lorem Ipsum')->isUtf8(); // true b("\xc3\x28")->isUtf8(); // false 46
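Without the String component, a comparable check can be done with native functions. A minimal sketch, assuming the mbstring extension is enabled; `isValidUtf8()` is a hypothetical wrapper name.

```php
<?php
// Hypothetical wrapper around mb_check_encoding().
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(isValidUtf8('Lorem Ipsum')); // bool(true)
var_dump(isValidUtf8("\xc3\x28"));    // bool(false): 0x28 is not a continuation byte

// Alternative without mbstring: with the /u modifier, PCRE refuses
// invalid UTF-8 subjects and preg_match() returns false.
var_dump(preg_match('//u', "\xc3\x28") === 1); // bool(false)
```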

CodePointString and UnicodeString specific methods u('नमस्ते')->ascii(); // 'namaste' u('さよなら')->ascii(); //
'sayonara' u('спасибо')->ascii(); // 'spasibo' u('नमस्ते')->codePointsAt(0); // न [2344] u('नमस्ते')->codePointsAt(1); // म [2350] u('नमस्ते')->codePointsAt(2); // स्ते [2360, 2381, 2340, 2375] 47

Why is "Å" !== "Å" !== "Å"? Combination of A
(U+0041) and ̊ (U+030A) The codepoint U+00C5 which gives Å, or “Latin Capital Letter A with Ring Above” The codepoint U+212B for “Angstrom Sign Å” 48
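The three spellings look identical on screen, but their bytes differ. The following sketch only uses core PHP; the variable names are chosen for this example.

```php
<?php
$decomposed  = "A\u{030A}"; // A followed by COMBINING RING ABOVE
$precomposed = "\u{00C5}";  // LATIN CAPITAL LETTER A WITH RING ABOVE
$angstrom    = "\u{212B}";  // ANGSTROM SIGN

echo bin2hex($decomposed), "\n";  // 41cc8a
echo bin2hex($precomposed), "\n"; // c385
echo bin2hex($angstrom), "\n";    // e284ab

// A byte-wise comparison therefore sees three different strings.
var_dump($decomposed === $precomposed); // bool(false)
var_dump($precomposed === $angstrom);   // bool(false)
```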

Normalization Canonical normalization : NFD : Canonical Decomposition Å =>
A + ̊ NFC : Canonical Composition A + ̊ => Å Compatibility normalization : NFKD : Compatibility Decomposition Å => A + ̊ NFKC : Compatibility Composition A + ̊ => Å 49

Normalization 50 // these encode the letter as a single
code point: U+00E5 u('å')->normalize(UnicodeString::NFC); u('å')->normalize(UnicodeString::NFKC); // these encode the letter as two code points: U+0061 + U+030A // a + ◌̊ u('å')->normalize(UnicodeString::NFD); u('å')->normalize(UnicodeString::NFKD);

Normalization $ARing = "\xC3\x85"; // Å (U+00C5) $ARingComposed = "A"."\xCC\x8A";
// A◌̊ (U+030A) $norm1 = Normalizer::normalize($ARing, Normalizer::FORM_C); $norm2 = Normalizer::normalize($ARingComposed, Normalizer::FORM_C); var_dump($ARing === $ARingComposed); // FALSE var_dump($norm1 === $norm2); // TRUE 51

AsciiSlugger $slugger = new AsciiSlugger(); $slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~', '/');
// $slug = 'Workspace/settings' 52

AsciiSlugger $slugger = new AsciiSlugger(); $slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~', '/'); // $slug = 'Workspace/settings' $slugger = $slugger->withEmoji(); $slug = $slugger->slug('a 😺, and a 🦁 go to 🏞', '-', 'en'); // $slug = 'a-grinning-cat-and-a-lion-go-to-national-park'; $slug = $slugger->slug('un 😺, et un 🦁 vont au 🏞', '-', 'fr'); // $slug = 'un-chat-qui-sourit-et-un-tete-de-lion-vont-au-parc-national'; 53

Inflector $inflector = new EnglishInflector(); $result = $inflector->singularize('teeth'); // ['tooth']
$result = $inflector->singularize('radii'); // ['radius'] $result = $inflector->singularize('leaves'); // ['leaf', 'leave', 'leaff'] $result = $inflector->pluralize('bacterium'); // ['bacteria'] $result = $inflector->pluralize('news'); // ['news'] $result = $inflector->pluralize('person'); // ['persons', 'people'] 54

Symfony’s string component (new ByteString('🤦🏼‍♂️'))->length(); // 17 (new CodePointString('🤦🏼‍♂️'))->length(); // 5 (new UnicodeString('🤦🏼‍♂️'))->length(); // 1 55

CodePointString and UnicodeString specific methods u('👩‍👩‍👧‍👦')->codePointsAt(0); [ 0 => 128105 1 => 8205 2 => 128105 3 => 8205 4 => 128103 5 => 8205 6 => 128102 ] U+1F469 👩 WOMAN U+200D ZERO WIDTH JOINER U+1F469 👩 WOMAN U+200D ZERO WIDTH JOINER U+1F467 👧 GIRL U+200D ZERO WIDTH JOINER U+1F466 👦 BOY 56

What is the length of “🤦🏼‍♂️”? 57

What is the length of “🤦🏼‍♂️”? python3 : >>> len("🤦🏼‍♂️") == 5 True JavaScript : "🤦🏼‍♂️".length == 7 // true Rust : "🤦🏼‍♂️".len() == 17 // true Elixir : String.length("🤦🏼‍♂️") # 1 Swift : var s = "🤦🏼‍♂️" print(s.count) // 1 print(s.unicodeScalars.count) // 5 print(s.utf16.count) // 7 print(s.utf8.count) // 17 58

What is the length of “🤦🏼‍♂️”? PHP : strlen('🤦🏼‍♂️'); // 17 mb_strlen('🤦🏼‍♂️', 'UTF-8'); // 5 59

What is the length of “🤦🏼‍♂️”? PHP : strlen('🤦🏼‍♂️'); // 17 mb_strlen('🤦🏼‍♂️', 'UTF-8'); // 5 Symfony : u('🤦🏼‍♂️')->length(); // 1 60

What is the length of “🤦🏼‍♂️”?
Unicode scalar                                  UTF-32 code units  UTF-16 code units  UTF-8 code units  UTF-32 bytes  UTF-16 bytes  UTF-8 bytes
U+1F926 FACE PALM 🤦                            1                  2                  4                 4             4             4
U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 🏼    1                  2                  4                 4             4             4
U+200D ZERO WIDTH JOINER                        1                  1                  3                 4             2             3
U+2642 MALE SIGN ♂                              1                  1                  3                 4             2             3
U+FE0F VARIATION SELECTOR-16                    1                  1                  3                 4             2             3
Total                                           5                  7                  17                20            14            17
61
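The totals of this table can be checked with native PHP functions. A hedged sketch, assuming the mbstring extension; the emoji is spelled with code point escapes so the result does not depend on how this file is saved.

```php
<?php
// FACE PALM + skin tone modifier + ZWJ + MALE SIGN + VARIATION SELECTOR-16
$facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";

$utf8Bytes  = strlen($facepalm);                                           // UTF-8 code units
$codePoints = mb_strlen($facepalm, 'UTF-8');                               // Unicode scalars
$utf16Bytes = strlen(mb_convert_encoding($facepalm, 'UTF-16LE', 'UTF-8'));
$utf16Units = intdiv($utf16Bytes, 2);                                      // 2 bytes per code unit
$utf32Bytes = strlen(mb_convert_encoding($facepalm, 'UTF-32LE', 'UTF-8'));

printf(
    "UTF-8 bytes: %d, code points: %d, UTF-16 units: %d, UTF-16 bytes: %d, UTF-32 bytes: %d\n",
    $utf8Bytes, $codePoints, $utf16Units, $utf16Bytes, $utf32Bytes
);
// UTF-8 bytes: 17, code points: 5, UTF-16 units: 7, UTF-16 bytes: 14, UTF-32 bytes: 20
```

Counting grapheme clusters (the "1" answer) needs the intl extension or a polyfill and is deliberately left out of this sketch.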

What is the length of “🤦🏼‍♂️”? // Symfony\Polyfill\Intl\Grapheme\Grapheme.php public static function grapheme_strlen($s) { preg_replace('/'.SYMFONY_GRAPHEME_CLUSTER_RX.'/u', '', $s, -1, $len); return 0 === $len && '' !== $s ? null : $len; } 62

Wrong encoding kills 📱 63

Case of the Turkish i u('i')->upper()->toString(); // I u('ı')->upper()->toString(); // I u('I')->lower()->toString(); // i u('İ')->lower()->toString(); // i̇ (i followed by U+0307 COMBINING DOT ABOVE) u('i')->upper()->codePointsAt(0); // [73] u('ı')->upper()->codePointsAt(0); // [73] u('I')->lower()->codePointsAt(0); // [105] u('İ')->lower()->codePointsAt(0); // [105, 775] 65

Case of the Turkish i // Symfony\Component\String\AbstractUnicodeString.php public function lower(): static { $str = clone $this; $str->string = mb_strtolower(str_replace('İ', 'i̇', $str->string), 'UTF-8'); return $str; } 66

Case of the Turkish i > So no, you can’t
convert string to lowercase without knowing what language that string is written in. var en_US = Locale.of("en", "US"); var tr = Locale.of("tr"); "I".toLowerCase(en_US); // => "i" "I".toLowerCase(tr); // => "ı" "i".toUpperCase(en_US); // => "I" "i".toUpperCase(tr); // => "İ" 67

Trivia: the flags in Unicode - No fixed code point - Something once in Unicode stays in it forever - Flags might become obsolete - The ISO (International Organization for Standardization) is the reference, with its list of flags recognised by the UN 🇧 🇪 (without font support) => 🇧🇪 (flag of Belgium, with the right font) 69

Symfony’s Intl Component 🌍 70

Symfony’s Intl Component Provides access to the ICU data: •
Language and Script Names • Country Names • Locales • Currencies • Timezones • … 71

EmojiTransliterator use Symfony\Component\Intl\Transliterator\EmojiTransliterator; // describe emojis in English $transliterator =
EmojiTransliterator::create('en'); $transliterator->transliterate('Menus with 🍕 or 🍝'); // => 'Menus with pizza or spaghetti' // describe emojis in Ukrainian $transliterator = EmojiTransliterator::create('uk'); $transliterator->transliterate('Menus with 🍕 or 🍝'); // => 'Menus with піца or спагеті' 72

EmojiTransliterator use Symfony\Component\Intl\Transliterator\EmojiTransliterator; // describe emojis in Slack short code
$transliterator = EmojiTransliterator::create('slack'); $transliterator->transliterate('Menus with 🥗 or 🧆'); // => 'Menus with :green_salad: or :falafel:' // use this to describe emojis in Github short code $transliterator = EmojiTransliterator::create('github'); 73

EmojiTransliterator $reverseTransliterator = EmojiTransliterator::create('github', EmojiTransliterator::REVERSE); $reverseTransliterator ->transliterate('Menus with :green_salad: or
:falafel:'); // => 'Menus with 🥗 or 🧆' 74

PHP’s native functions 🐘 76

Transliteration? • from a script to another • Αθήνα →
Athena • You might want to transliterate data before indexing it 78 [ Antwerpen Brussel Cannes // … Zurich Αθήνα ]

Transliteration? transliterator_transliterate( 'Any-Latin; Latin-ASCII; Lower()', "A æ Übérmensch på høyeste
nivå! И я люблю PHP! ﬁ" ); // "a ae ubermensch pa hoyeste niva! i a lublu php! fi" 79

ASCII: chr() and ord() /** * Generate a single-byte string
from a number * @param int $codepoint : The ascii code. * @return string the specified character. */ #[Pure] function chr(int $codepoint): string {} /** * Convert the first byte of a string to a value between 0 and 255 * @param string $character : A character. * @return int<0, 255> the ASCII value as an integer. */ #[Pure] function ord(string $character): int {} 80
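Because `chr()` and `ord()` work on single bytes, they only agree with Unicode for ASCII. The sketch below contrasts them with `mb_chr()` and `mb_ord()` (mbstring, PHP 7.2+); the variable names are chosen for this example.

```php
<?php
$eAcute = "\u{E9}"; // "é", stored as the two UTF-8 bytes C3 A9

$firstByte = ord($eAcute);             // 195 (0xC3): only the first byte is read
$codePoint = mb_ord($eAcute, 'UTF-8'); // 233 (0xE9): the actual code point

$singleByte = chr(233);                // "\xE9": one byte, not valid UTF-8 on its own
$character  = mb_chr(233, 'UTF-8');    // "é" encoded as C3 A9

var_dump($firstByte, $codePoint);
var_dump(mb_check_encoding($singleByte, 'UTF-8')); // bool(false)
var_dump($character === $eAcute);                  // bool(true)
```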

utf8_ functions utf8_encode() and utf8_decode() 81

utf8_ functions utf8_encode() and utf8_decode() Only from and to ISO-8859-1
! Deprecated since PHP 8.2.0 82
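Since both functions only ever converted between ISO-8859-1 and UTF-8, one possible replacement is `mb_convert_encoding()` with explicit encodings. A minimal sketch, assuming mbstring is enabled; the two wrapper names are hypothetical.

```php
<?php
// Hypothetical replacement for utf8_encode(): ISO-8859-1 bytes to UTF-8.
function latin1ToUtf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical replacement for utf8_decode(): UTF-8 to ISO-8859-1 bytes.
function utf8ToLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

echo bin2hex(latin1ToUtf8("\xE9")), "\n";     // c3a9
echo bin2hex(utf8ToLatin1("\xC3\xA9")), "\n"; // e9
```

Spelling out both encodings makes the ISO-8859-1 assumption visible, which the old function names hid.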

mb_ functions • Like PHP’s string functions, but working on characters made of more than one byte • str_replace works just fine if needle and haystack have the same encoding • You have to manually enable the mbstring extension in PHP 83
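A short comparison of byte-oriented functions and their mb_ counterparts, as a sketch that assumes mbstring is enabled. The sample word is written with code point escapes so it is UTF-8 regardless of the file encoding.

```php
<?php
$word = "\u{C9}l\u{E9}phant"; // "Éléphant"

$byteLength = strlen($word);             // 10: É and é take two bytes each
$charLength = mb_strlen($word, 'UTF-8'); // 8

$firstByte = substr($word, 0, 1);             // "\xC3": half of É, invalid on its own
$firstChar = mb_substr($word, 0, 1, 'UTF-8'); // "É"

$upper = mb_strtoupper("\u{E9}l\u{E9}phant", 'UTF-8'); // "ÉLÉPHANT"

// As the slide says, str_replace() is safe when both sides share an encoding.
$replaced = str_replace("\u{E9}", 'e', $word); // "Élephant"

var_dump($byteLength, $charLength, bin2hex($firstByte), $firstChar, $upper, $replaced);
```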

Emojis as class names… interface 🍚 {} interface 🐟 {}
class 🍣 implements 🍚, 🐟 { } 84

…or any other Unicode character interface ┻━┻ {} class （╯°□°）╯︵┻━┻
extends Exception implements ┻━┻ { public function __construct($message = __CLASS__, $code = 0, Exception $previous = null) { parent::__construct($message, $code, $previous); } } class （ノ゜Д゜）ノ︵┻━┻ extends Exception implements ┻━┻ { public function __construct($message = __CLASS__, $code = 0, Exception $previous = null) { parent::__construct($message, $code, $previous); } } 85

Trivia: kaomojis ¯\_(ツ)_/¯ (╯￣Д￣)╯╘═╛ ┬─┬ノ(ಠ_ಠノ) ( ͡° ʖ ̯ ͡°)
(~‾▽‾)~ (凸ಠ益ಠ)凸 86

More readable mathematics ! $√2π = sqrt(2 * $π); $⟮z
＋ g ＋½⟯ᶻ⁺½ = pow($z + $g + 0.5, $z + 0.5); $ℯ＾−⟮z ＋ g ＋½⟯ = exp(-($z + $g + 0.5)); /** * Put it all together: * __ / 1 \ z+½ * √2π | z + g + - | e^-(z+g+½) A(z) * \ 2 / */ return $√2π * $⟮z ＋ g ＋½⟯ᶻ⁺½ * $ℯ＾−⟮z ＋ g ＋½⟯ * $A⟮z⟯; 87

Spaces in method names 88

Homoglyphs ( U+0028 LEFT PARENTHESIS （ U+FF08 FULLWIDTH LEFT PARENTHESIS
﹙ U+FE59 SMALL LEFT PARENTHESIS ⁽ U+207D SUPERSCRIPT LEFT PARENTHESIS ₍ U+208D SUBSCRIPT LEFT PARENTHESIS ❨ U+2768 MEDIUM LEFT PARENTHESIS ORNAMENT 89
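Homoglyphs are hard to spot by eye, so dumping code points helps. A hedged sketch using mbstring (`mb_str_split()` needs PHP 7.4+); `codePoints()` and `isPureAscii()` are hypothetical helper names.

```php
<?php
// Hypothetical helper: lists the code points of a UTF-8 string as U+XXXX.
function codePoints(string $s): array
{
    $out = [];
    foreach (mb_str_split($s, 1, 'UTF-8') as $char) {
        $out[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }

    return $out;
}

// Hypothetical helper: true when the string contains only ASCII bytes.
function isPureAscii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

print_r(codePoints('('));        // U+0028
print_r(codePoints("\u{FF08}")); // U+FF08 FULLWIDTH LEFT PARENTHESIS
var_dump(isPureAscii('foo()'));                   // bool(true)
var_dump(isPureAscii("foo\u{FF08}\u{FF09}"));     // bool(false)
```

An ASCII-only check is a blunt tool: it suits identifiers or hostnames, not user-facing text.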

Homoglyphs 90

Inside your database 💾 91

Set names, Charset, Collate? SET NAMES utf8mb4 COLLATE utf8mb4_general_ci; CREATE
DATABASE awesome_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; 92

Charsets & Collations glyph encoding Character set 93 my_alphabet =
[ 0 = A, 1 = B, 2 = a, 3 = b ]

Charsets & Collations 94 my_alphabet = [ 0 = A,
1 = B, 2 = a, 3 = b ] A ? B

Charsets & Collations 95 my_alphabet = [ 0 = A,
1 = B, 2 = a, 3 = b ] A < B

Charsets & Collations my_alphabet = [ 0 = A, 1
= B, 2 = a, 3 = b ] a < b Collation 96
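The toy alphabet above can be turned into a small experiment: the same four letters sorted by their numeric codes, then by a case-insensitive rule. This is only an illustration of what a collation does, not how MySQL implements one; the comparison callback is an assumption made for this sketch.

```php
<?php
$letters = ['b', 'A', 'a', 'B'];

// "Binary" order: compare the character codes, like my_alphabet (A, B, a, b).
$binary = $letters;
sort($binary, SORT_STRING);

// Toy case-insensitive collation: compare letters ignoring case,
// and fall back to the binary order only to break ties.
$caseInsensitive = $letters;
usort(
    $caseInsensitive,
    static fn (string $x, string $y): int => strcasecmp($x, $y) ?: strcmp($x, $y)
);

echo implode(' ', $binary), "\n";          // A B a b
echo implode(' ', $caseInsensitive), "\n"; // A a B b
```

Same character set, two different orderings: that is the gap between a charset and a collation.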

Example of collations • utf8mb4_unicode_ci : ◦ sorts "ß" like "ss" ◦ "Œ" like "OE" ◦ Ignorable characters are skipped • utf8mb4_general_ci : compares as single characters ◦ "ß" like "s" ◦ "Œ" like "e" 97

SET NAMES? SET character_set_client = 'utf8mb4' SET character_set_connection = 'utf8mb4'
SET character_set_results = 'utf8mb4' 98

Naming of UTF-8 • PostgreSQL : UTF8 ◦ 8 bits
◦ 1 to 4 bytes • Oracle : AL32UTF8 (Real UTF-8, Unicode 9.0) ◦ Not UTF8 (Actually CESU-8, Unicode 3.0) • MySQL : utf8mb4 ◦ Not utf8 (UTF-8 on 3 bytes) 99

MySQL’s “utf8” 2002 : UTF-8 standard would allow up to
6 bytes per character Speed boost if all rows are the same number of bytes in a table People would use CHAR because it has a defined number of characters, no matter which value is stored CHAR(1) = 6 bytes, CHAR(2) = 12 bytes, … 2003 : The old UTF-8 standard is declared obsolete by Unicode to make room to the new one Will people try to encode their CHAR columns into UTF-8? Let’s change the size! 100

Security issues 🔓 103

Homoglyph attack Password forgotten? Enter a fake email address, looking
like the one you’re attacking [email protected] != mｉᎬᎬ@example.org The mail will be normalized before looking it up in the database A token for [email protected] is generated then sent to mｉᎬᎬ@example.org who can now connect as [email protected] 104

Phabricator > On inserting Unicode characters with code points greater than 0xFFFF into columns that have a utf8 charset. MySQL then truncates a string as soon as it reaches such a character. Domain-restricted subscription: enter “[email protected]🍕@allowed-domain.com”. If the domain check passes, only “[email protected]” is stored in the DB! 105
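One possible guard against this truncation is to refuse input that a 3-byte `utf8` column cannot hold before it reaches the database. A minimal sketch: `fitsInUtf8mb3()` is a hypothetical helper name and the addresses are made-up examples. Switching the column to utf8mb4 remains the real fix.

```php
<?php
// Hypothetical guard: true only for valid UTF-8 made of code points
// up to U+FFFF, i.e. characters that need at most 3 bytes.
function fitsInUtf8mb3(string $value): bool
{
    // With /u, preg_match() returns false on invalid UTF-8,
    // so malformed input is rejected as well.
    return preg_match('/\A[\x{0}-\x{FFFF}]*\z/u', $value) === 1;
}

// Made-up addresses, for illustration only.
var_dump(fitsInUtf8mb3('user@allowed-domain.com'));                         // bool(true)
var_dump(fitsInUtf8mb3("admin@example.org\u{1F355}@allowed-domain.com"));   // bool(false)
```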

Happy encoding! 🔧 106

Thank you! Questions? SymfonyCon Brussels 2023 - Marion Hurteau @marionherisson
[email protected] 107

SOURCES - Webpages
https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
https://decodeunicode.org/
https://deliciousbrains.com/how-unicode-works/
https://dev.mysql.com/doc/refman/8.0/en/charset-general.html
https://github.com/brefphp/bref/blob/f4df37277181dc76b6f644663de236eae7a793d2/src/functions.php#L11
https://github.com/captioning/captioning/issues/86
https://github.com/jolicode/emoji-search
https://github.com/markrogoyski/math-php
https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0
https://github.com/PHP-CS-Fixer/PHP-CS-Fixer/blob/master/src/Fixer/Basic/EncodingFixer.php
https://github.com/sgolemon/table-flip/blob/master/src/TableFlip.php
https://github.com/symfony/symfony/blob/85b97226def5e4a50c1e3805a6c31bb6642efb70/src/Symfony/Component/Intl/Tests/Transliterator/EmojiTransliteratorTest.php
https://github.com/symfony/symfony/pull/33896/files
https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-382026
https://hackerone.com/reports/2233
https://hsivonen.fi/string-length/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://jolicode.github.io/unicode-conf/index.html#/
https://kunststube.net/encoding/
https://news.ycombinator.com/item?id=8892157
108

https://www.php.net/manual/en/function.utf8-decode.php
https://www.php.net/manual/en/mbstring.supported-encodings.php
https://www.php.net/manual/fr/refs.international.php
https://www.postgresql.org/docs/current/multibyte.html
https://pyrech.github.io/php-wtf/#/15?_k=dyazd4
https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci/
https://symfony.com/blog/new-in-symfony-6-2-better-emoji-support
https://symfony.com/doc/current/components/intl.html
https://symfony.com/doc/current/components/string.html
https://tonsky.me/blog/unicode/
http://www.unicode.org/charts/
http://unicode.org/emoji/charts/emoji-variants.html
https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
https://unicode.org/glossary/
https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Unicode/Test
https://en.wikipedia.org/wiki/Character_encoding
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/Mojibake
https://en.wikipedia.org/wiki/Byte_order_mark
https://www.youtube.com/watch?v=kaucJce8hhE&t=19s&ab_channel=TheUnicodeConsortium
109

SOURCES - Images https://unsplash.com/photos/a-plate-of-cheese-jntQPBIK_sE (Christina Deravedisian) https://unsplash.com/photos/command-computer-keyboard-key-46T6nVjRc2w (Hannah Joshua) https://unsplash.com/photos/crt-monitor-turned-off-aiqKc07b5PA
(Federica Galli) 110

SOURCES - Books Unicode à gogo ! by Design Brouhaha
111
