String usage: so many tools are already in your hands!

SymfonyCon Brussels 2023

Marion Hurteau

December 07, 2023

Transcript

  1. Strings usage: so many tools are already in your hands!

    SymfonyCon Brussels 2023 - Marion Hurteau @marionherisson / 📧 [email protected]
  2. Hello World 👋

    JoliCode since 2019 ➡ Consulting, production, audit, expertise and training
    🏰 Pony club, castle & home-made beer
    Drinking alcohol is dangerous for your health. Drink in moderation (and in good company)
  3. “What is a string?” is a string.

    💁 💬 Anything writable, printable, or hearable really
    💻 💬 0 and 1 ❓
  4. Glyph

    ~ is a glyph, and so are n and ñ
    → an image, an abstract form
  5. Grapheme

    ~ is a grapheme and n is another grapheme
    → a minimally distinctive unit of writing
    → linked to the context of a particular writing system
  6. Grapheme

    ñ is a grapheme cluster (n plus the diacritic ~)
    • a minimally distinctive unit of writing in the context of a particular writing system
    • think of it as a character
  7. The first encodings

    💻 💬 « 01100001 » means « a »
    💁 💬 Or does it?
  8. The first encodings

    1963 : ANSI : ASCII ! 7 bits → 128 characters
  9. Not cool for the rest of the world

    German : Schildkröte 🐢
    Swedish : Skål! 🍻
    French : Éléphant 🐘
  10. The first encodings

    1963 : ANSI : ASCII ! 7 bits → 128 characters
    1972 : 8-bit CPUs are here ! 8 bits → 256 characters
  11. 16

  12. 17

  13. The first encodings

    1963 : ANSI : ASCII ! 7 bits → 128 characters
    1972 : 8-bit CPUs are here ! 8 bits → 256 characters
    1991 : Unicode V1.0 ! • universal • uniform • unique
  14. The first encodings

    1963 : ANSI : ASCII ! 7 bits → 128 characters
    1972 : 8-bit CPUs are here ! 8 bits → 256 characters
    1991 : Unicode V1.0 ! 16 bits → 65 536 characters • universal • uniform • unique
  15. The first encodings

    1963 : ANSI : ASCII ! 7 bits → 128 characters
    1972 : 8-bit CPUs are here ! 8 bits → 256 characters
    1991 : Unicode V1.0 ! 16 bits → 65 536 characters
    1996 : Unicode V2.0 presents UTF ! Encoding ≠ Code Point
    - UTF-32 → 32-bit code units, always 4 bytes
    - UTF-16 → 16-bit code units, 2 or 4 bytes
    - UTF-8 → 8-bit code units, 1 to 4 bytes
  16. 25

  17. BOM

    • Encoded as FE FF in UTF-16 (or FF FE if you are swapping high and low bytes)
    • Indicates that the text’s encoding is Unicode
      ◦ and in which Unicode encoding
    • Indicates the byte order (endianness) of the text stream for 16-bit & 32-bit encodings
  18. BOM

    // Strip UTF-8 BOM
    $bom = pack('CCC', 0xEF, 0xBB, 0xBF);
    if (substr($content, 0, 3) === $bom) {
        $content = substr($content, 3);
    }
  19. BOM

    Handled in the Serializer component :
    • stripped when decoding csv

    $csv = "\xEF\xBB\xBF".<<<'CSV'
    foo,bar
    hello,"hey ho"
    CSV;
    $this->encoder->decode($csv, 'csv', [CsvEncoder::AS_COLLECTION_KEY => false]);
    // ['foo' => 'hello', 'bar' => 'hey ho']
  20. BOM

    Handled in the Serializer component :
    • stripped when decoding csv
    • can be added on demand in the output

    $this->encoder->encode($value, 'csv', [CsvEncoder::OUTPUT_UTF8_BOM_KEY => true]);
  21. ✨ UTF-8 ✨

    • 8 bits
    • 0-127 : 1 byte → Backward compatibility with ASCII
    • 128+ : 2 to 4 bytes (the original design allowed up to 6)
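The byte ranges above can be sketched in plain PHP. This is only an illustration: `utf8ByteLength()` is a hypothetical helper, not a PHP built-in, and it assumes the mbstring extension is available for `mb_chr()`.

```php
<?php
// Hypothetical helper: number of bytes UTF-8 needs for one code point.
function utf8ByteLength(int $codePoint): int
{
    if ($codePoint < 0 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Not a Unicode code point');
    }
    if ($codePoint <= 0x7F) {
        return 1; // ASCII range: same single byte as in ASCII
    }
    if ($codePoint <= 0x7FF) {
        return 2;
    }
    if ($codePoint <= 0xFFFF) {
        return 3;
    }

    return 4;
}

// a, é, € and 🐘
foreach ([0x61, 0xE9, 0x20AC, 0x1F418] as $codePoint) {
    // mb_chr() builds the UTF-8 string, strlen() counts its bytes
    $actual = strlen(mb_chr($codePoint, 'UTF-8'));
    printf("U+%04X: %d byte(s), helper says %d\n", $codePoint, $actual, utf8ByteLength($codePoint));
}
```

The helper only mirrors the range boundaries; it does not reject surrogate code points, which a stricter version probably should.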
  22. 34

  23. The first encodings

    1963 : ASCII ! 7 bits → 128 characters
    1972 : 8-bit CPUs are here ! → 256 characters
    1991 : Unicode V1.0 ! 16 bits → 65 536 characters
    1996 : Unicode V2.0 presents UTF ! Encoding ≠ Code Point
    2023 : Unicode V15.0 → 149 186 characters and around 245 000 code points assigned in a space that can contain up to 1 114 112 different code points
  24. To sum it up

    Graphemes    | C           | a           | f           | é
    Glyphs       | C           | a           | f           | e           | ◌́
    Code points  | U+0043      | U+0061      | U+0066      | U+0065      | U+0301
    UTF-32 bytes | 00 00 00 43 | 00 00 00 61 | 00 00 00 66 | 00 00 00 65 | 00 00 03 01
    UTF-16 bytes | 00 43       | 00 61       | 00 66       | 00 65       | 03 01
    UTF-8 bytes  | 43          | 61          | 66          | 65          | CC 81
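The byte rows of this summary can be reproduced with mbstring. A minimal sketch, assuming the mbstring extension is enabled; `hexBytes()` is a hypothetical helper introduced only for this illustration.

```php
<?php
// "Café" written with a combining accent: C a f e + U+0301
$cafe = "Cafe\u{0301}";

// Hypothetical helper: convert a UTF-8 string to the target encoding
// and return its bytes as space-separated uppercase hex.
function hexBytes(string $utf8, string $encoding): string
{
    $bytes = mb_convert_encoding($utf8, $encoding, 'UTF-8');

    return strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
}

echo 'UTF-8:  ', hexBytes($cafe, 'UTF-8'), "\n";
echo 'UTF-16: ', hexBytes($cafe, 'UTF-16BE'), "\n";
echo 'UTF-32: ', hexBytes($cafe, 'UTF-32BE'), "\n";
```

The big-endian variants are used so that the output matches the byte order shown on the slide and no BOM is added.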
  28. Symfony’s string component

    Provides an object-oriented approach to string manipulation.

    new UnicodeString('Å');   // shortcut: u('Å')
    new ByteString('Å');      // shortcut: b('Å')
    new CodePointString('Å');
    s('Å'); // returns a UnicodeString or a ByteString, depending on the content
  29. Symfony’s string component

    $text = (new UnicodeString('This is a déjà-vu situation.'))
        ->trimEnd('.')
        ->replace('déjà-vu', 'jamais-vu')
        ->append('!');
    // $text = 'This is a jamais-vu situation!'
  30. Symfony’s string component

    u('foo BAR')->upper(); // 'FOO BAR'
    u('FOO Bar')->lower(); // 'foo bar'
    u('Die O\'Brian Straße')->folded(); // "die o'brian strasse"
    u('Foo: Bar-baz.')->camel(); // 'fooBarBaz'
    u('Foo: Bar-baz.')->snake(); // 'foo_bar_baz'
    u('Foo: Bar-baz.')->camel()->title(); // 'FooBarBaz'
  31. Symfony’s string component

    u('abc')->indexOf('B'); // null
    u('abc')->ignoreCase()->indexOf('B'); // 1
    u('hello')->append('world'); // 'helloworld'
    u('hello')->append(' ', 'world'); // 'hello world'
    u('User')->ensureEnd('Controller'); // 'UserController'
    u('UserController')->ensureEnd('Controller'); // 'UserController'
  32. Symfony’s string component

    u(' Lorem Ipsum ')->padBoth(20, '-'); // '--- Lorem Ipsum ----'
    u('_.')->repeat(10); // '_._._._._._._._._._.'
    u(' Lorem Ipsum ')->trim(); // 'Lorem Ipsum'
    u('http://symfony.com')->replace('http://', 'https://');
    u('Symfony is great')->slice(0, 7); // 'Symfony'
    u('template_name.html.twig')->split('.'); // ['template_name', 'html', 'twig']
  33. ByteString specific methods

    // returns TRUE if the string contents
    // are valid UTF-8 contents
    b('Lorem Ipsum')->isUtf8(); // true
    b("\xc3\x28")->isUtf8(); // false
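Outside of Symfony, a comparable check can be sketched with mbstring. This is not how `isUtf8()` is implemented; `isValidUtf8()` is a hypothetical helper, and the mbstring extension is assumed to be enabled.

```php
<?php
// Hypothetical helper: true when the bytes form valid UTF-8.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(isValidUtf8('Lorem Ipsum')); // valid: plain ASCII is valid UTF-8
var_dump(isValidUtf8("\xc3\x28"));    // invalid: 0xC3 must be followed by a continuation byte
```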
  34. CodePointString and UnicodeString specific methods

    u('नमस्ते')->ascii(); // 'namaste'
    u('さよなら')->ascii(); // 'sayonara'
    u('спасибо')->ascii(); // 'spasibo'
    u('नमस्ते')->codePointsAt(0); // न [2344]
    u('नमस्ते')->codePointsAt(1); // म [2350]
    u('नमस्ते')->codePointsAt(2); // स्ते [2360, 2381, 2340, 2375]
  35. Why is "Å" !== "Å" !== "Å"?

    • The combination of A (U+0041) and ◌̊ (U+030A)
    • The codepoint U+00C5, which gives Å, or “Latin Capital Letter A with Ring Above”
    • The codepoint U+212B for “Angstrom Sign Å”
  36. Normalization

    Canonical normalization :
    • NFD : Canonical Decomposition, Å => A + ◌̊
    • NFC : Canonical Composition, A + ◌̊ => Å
    Compatibility normalization :
    • NFKD : Compatibility Decomposition, Å => A + ◌̊
    • NFKC : Compatibility Composition, A + ◌̊ => Å
  37. Normalization

    // these encode the letter as a single code point: U+00E5
    u('å')->normalize(UnicodeString::NFC);
    u('å')->normalize(UnicodeString::NFKC);
    // these encode the letter as two code points: U+0061 + U+030A
    // a + ◌̊
    u('å')->normalize(UnicodeString::NFD);
    u('å')->normalize(UnicodeString::NFKD);
  38. Normalization

    $ARing = "\xC3\x85";               // Å (U+00C5)
    $ARingDecomposed = "A"."\xCC\x8A"; // A + ◌̊ (U+0041 U+030A)
    $norm1 = Normalizer::normalize($ARing, Normalizer::FORM_C);
    $norm2 = Normalizer::normalize($ARingDecomposed, Normalizer::FORM_C);
    var_dump($ARing === $ARingDecomposed); // FALSE
    var_dump($norm1 === $norm2); // TRUE
  39. AsciiSlugger

    $slugger = new AsciiSlugger();
    $slug = $slugger->slug('Wôrķšƥáçè ~~sèťtïñğš~~', '/');
    // $slug = 'Workspace/settings'
    $slugger = $slugger->withEmoji();
    $slug = $slugger->slug('a 😺, and a 🦁 go to 🏞', '-', 'en');
    // $slug = 'a-grinning-cat-and-a-lion-go-to-national-park';
    $slug = $slugger->slug('un 😺, et un 🦁 vont au 🏞', '-', 'fr');
    // $slug = 'un-chat-qui-sourit-et-un-tete-de-lion-vont-au-parc-national';
  40. Inflector

    $inflector = new EnglishInflector();
    $result = $inflector->singularize('teeth'); // ['tooth']
    $result = $inflector->singularize('radii'); // ['radius']
    $result = $inflector->singularize('leaves'); // ['leaf', 'leave', 'leaff']
    $result = $inflector->pluralize('bacterium'); // ['bacteria']
    $result = $inflector->pluralize('news'); // ['news']
    $result = $inflector->pluralize('person'); // ['persons', 'people']
  41. CodePointString and UnicodeString specific methods

    u('👩‍👩‍👧‍👦')->codePointsAt(0);
    [
        0 => 128105, // U+1F469 👩 WOMAN
        1 => 8205,   // U+200D ZERO WIDTH JOINER
        2 => 128105, // U+1F469 👩 WOMAN
        3 => 8205,   // U+200D ZERO WIDTH JOINER
        4 => 128103, // U+1F467 👧 GIRL
        5 => 8205,   // U+200D ZERO WIDTH JOINER
        6 => 128102, // U+1F466 👦 BOY
    ]
  42. What is the length of “🤦🏼‍♂️”?

    python3 :
        >>> len("🤦🏼‍♂️") == 5
        True
    JavaScript :
        "🤦🏼‍♂️".length == 7
        True
    Rust :
        "🤦🏼‍♂️".len() == 17
        true
    Elixir :
        String.length("🤦🏼‍♂️") // 1
    Swift :
        var s = "🤦🏼‍♂️"
        print(s.count) // 1
        print(s.unicodeScalars.count) // 5
        print(s.utf16.count) // 7
        print(s.utf8.count) // 17
  43. What is the length of “🤦🏼‍♂️”?

    PHP :
        strlen('🤦🏼‍♂️'); // 17
        mb_strlen('🤦🏼‍♂️', 'UTF-8'); // 5
    Symfony :
        u('🤦🏼‍♂️')->length(); // 1
  44. What is the length of “🤦🏼‍♂️”?

    Unicode scalar                                 | UTF-32 code units | UTF-16 code units | UTF-8 code units | UTF-32 bytes | UTF-16 bytes | UTF-8 bytes
    U+1F926 FACE PALM 🤦                           | 1 | 2 | 4  | 4  | 4  | 4
    U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3 🏼   | 1 | 2 | 4  | 4  | 4  | 4
    U+200D ZERO WIDTH JOINER                       | 1 | 1 | 3  | 4  | 2  | 3
    U+2642 MALE SIGN ♂                             | 1 | 1 | 3  | 4  | 2  | 3
    U+FE0F VARIATION SELECTOR-16                   | 1 | 1 | 3  | 4  | 2  | 3
    Total                                          | 5 | 7 | 17 | 20 | 14 | 17
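The totals of that table can be recomputed in plain PHP. A minimal sketch, assuming the mbstring extension is enabled; the emoji is written with escape sequences so that the five code points stay visible.

```php
<?php
// 🤦🏼‍♂️ spelled out code point by code point:
// FACE PALM, skin tone modifier, ZERO WIDTH JOINER, MALE SIGN, VARIATION SELECTOR-16
$facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";

$utf8Bytes  = strlen($facepalm);              // strlen() counts bytes
$codePoints = mb_strlen($facepalm, 'UTF-8');  // mb_strlen() counts code points
$utf16Units = intdiv(strlen(mb_convert_encoding($facepalm, 'UTF-16BE', 'UTF-8')), 2);
$utf32Bytes = strlen(mb_convert_encoding($facepalm, 'UTF-32BE', 'UTF-8'));

printf(
    "UTF-8 bytes: %d, code points: %d, UTF-16 code units: %d, UTF-32 bytes: %d\n",
    $utf8Bytes,
    $codePoints,
    $utf16Units,
    $utf32Bytes
);
```

None of these counts is the number of graphemes a reader sees, which is what `u()->length()` reports.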
  45. 64

  46. Case of the Turkish i

    u('i')->upper()->toString(); // I
    u('ı')->upper()->toString(); // I
    u('I')->lower()->toString(); // i
    u('İ')->lower()->toString(); // i̇
    u('i')->upper()->codePointsAt(0); // [73]
    u('ı')->upper()->codePointsAt(0); // [73]
    u('I')->lower()->codePointsAt(0); // [105]
    u('İ')->lower()->codePointsAt(0); // [105, 775]
  47. Case of the Turkish i

    // Symfony\Component\String\AbstractUnicodeString.php
    public function lower(): static
    {
        $str = clone $this;
        // 'i̇' is i followed by U+0307 COMBINING DOT ABOVE
        $str->string = mb_strtolower(str_replace('İ', 'i̇', $str->string), 'UTF-8');

        return $str;
    }
  48. Case of the Turkish i

    > So no, you can’t convert string to lowercase without knowing what language that string is written in.

    var en_US = Locale.of("en", "US");
    var tr = Locale.of("tr");
    "I".toLowerCase(en_US); // => "i"
    "I".toLowerCase(tr); // => "ı"
    "i".toUpperCase(en_US); // => "I"
    "i".toUpperCase(tr); // => "İ"
  49. 68

  50. Trivia: the flags in Unicode

    - No fixed codepoint
    - Something once in Unicode stays in it forever
    - Flags might become obsolete
    - The ISO (International Organization for Standardization) is the reference with its list of flags recognised by the UN

    🇧 🇪 (without font) => 🇧🇪 (with the right font), flag Belgium
  51. Symfony’s Intl Component

    Provides access to the ICU data:
    • Language and Script Names
    • Country Names
    • Locales
    • Currencies
    • Timezones
    • …
  52. EmojiTransliterator

    use Symfony\Component\Intl\Transliterator\EmojiTransliterator;

    // describe emojis in English
    $transliterator = EmojiTransliterator::create('en');
    $transliterator->transliterate('Menus with 🍕 or 🍝');
    // => 'Menus with pizza or spaghetti'

    // describe emojis in Ukrainian
    $transliterator = EmojiTransliterator::create('uk');
    $transliterator->transliterate('Menus with 🍕 or 🍝');
    // => 'Menus with піца or спагеті'
  53. EmojiTransliterator

    use Symfony\Component\Intl\Transliterator\EmojiTransliterator;

    // describe emojis in Slack short code
    $transliterator = EmojiTransliterator::create('slack');
    $transliterator->transliterate('Menus with 🥗 or 🧆');
    // => 'Menus with :green_salad: or :falafel:'

    // use this to describe emojis in GitHub short code
    $transliterator = EmojiTransliterator::create('github');
  54. 75

  55. 77

  56. Transliteration?

    • from one script to another
    • Αθήνα → Athena
    • You might want to transliterate data before indexing it

    [
        Antwerpen
        Brussel
        Cannes
        // …
        Zurich
        Αθήνα
    ]
  57. Transliteration?

    transliterator_transliterate(
        'Any-Latin; Latin-ASCII; Lower()',
        "A æ Übérmensch på høyeste nivå! И я люблю PHP! ﬁ"
    );
    // "a ae ubermensch pa hoyeste niva! i a lublu php! fi"
  58. ASCII: chr() and ord()

    /**
     * Generate a single-byte string from a number
     * @param int $codepoint : The ascii code.
     * @return string the specified character.
     */
    #[Pure]
    function chr(int $codepoint): string {}

    /**
     * Convert the first byte of a string to a value between 0 and 255
     * @param string $character : A character.
     * @return int<0, 255> the ASCII value as an integer.
     */
    #[Pure]
    function ord(string $character): int {}
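Because `chr()` and `ord()` work on a single byte, they disagree with their mbstring counterparts as soon as a character leaves the ASCII range. A short sketch, assuming the mbstring extension is enabled:

```php
<?php
$e = "\u{E9}"; // é, stored in UTF-8 as the two bytes C3 A9

$firstByte = ord($e);             // 195 (0xC3): only the first byte is read
$codePoint = mb_ord($e, 'UTF-8'); // 233 (0xE9): the actual code point

$singleByte = chr(233);             // one byte, E9, which is not valid UTF-8 on its own
$utf8Char   = mb_chr(233, 'UTF-8'); // the two bytes C3 A9, i.e. é

var_dump($firstByte, $codePoint, bin2hex($singleByte), bin2hex($utf8Char));
```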
  59. mb_ functions

    • Like PHP’s string functions, but working on characters of more than one byte
    • str_replace works just fine if needle and haystack have the same encoding
    • You have to manually enable the mbstring extension in PHP
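The difference shows up as soon as a string contains a multi-byte character. A small sketch, assuming the mbstring extension is enabled and the source string is UTF-8:

```php
<?php
$text = "d\u{E9}j\u{E0}-vu"; // "déjà-vu" in UTF-8

$byteLength = strlen($text);             // counts bytes: é and à take two each
$charLength = mb_strlen($text, 'UTF-8'); // counts code points

$brokenCut = substr($text, 0, 2);             // cuts é in half: "d" + a lone 0xC3 byte
$cleanCut  = mb_substr($text, 0, 2, 'UTF-8'); // "dé"

// str_replace() is byte-based but safe here: needle and haystack are both UTF-8
$replaced = str_replace("\u{E9}", 'e', $text);

var_dump($byteLength, $charLength, mb_check_encoding($brokenCut, 'UTF-8'), $cleanCut, $replaced);
```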
  60. Emojis as class names…

    interface 🍚 {}
    interface 🐟 {}
    class 🍣 implements 🍚, 🐟
    {
    }
  61. …or any other Unicode character

    // the parentheses in the class names are FULLWIDTH ones (U+FF08, U+FF09)
    interface ┻━┻ {}

    class （╯°□°）╯︵┻━┻ extends Exception implements ┻━┻
    {
        public function __construct($message = __CLASS__, $code = 0, Exception $previous = null)
        {
            parent::__construct($message, $code, $previous);
        }
    }

    class （ノ゜Д゜）ノ︵┻━┻ extends Exception implements ┻━┻
    {
        public function __construct($message = __CLASS__, $code = 0, Exception $previous = null)
        {
            parent::__construct($message, $code, $previous);
        }
    }
  62. More readable mathematics !

    $√2π = sqrt(2 * $π);
    $⟮z + g +½⟯ᶻ⁺½ = pow($z + $g + 0.5, $z + 0.5);
    $ℯ^−⟮z + g +½⟯ = exp(-($z + $g + 0.5));

    /**
     * Put it all together:
     *  __  /         1 \ z+½
     * √2π | z + g + -  |     e^-(z+g+½) A(z)
     *      \         2 /
     */
    return $√2π * $⟮z + g +½⟯ᶻ⁺½ * $ℯ^−⟮z + g +½⟯ * $A⟮z⟯;
  63. Homoglyphs

    ( U+0028 LEFT PARENTHESIS
    （ U+FF08 FULLWIDTH LEFT PARENTHESIS
    ﹙ U+FE59 SMALL LEFT PARENTHESIS
    ⁽ U+207D SUPERSCRIPT LEFT PARENTHESIS
    ₍ U+208D SUBSCRIPT LEFT PARENTHESIS
    ❨ U+2768 MEDIUM LEFT PARENTHESIS ORNAMENT
  64. Set names, Charset, Collate?

    SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
    CREATE DATABASE awesome_database CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
  65. Charsets & Collations

    Charset : my_alphabet = [ 0 = A, 1 = B, 2 = a, 3 = b ]
    Collation : a < b
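The same distinction can be sketched in PHP: the charset assigns a number to each character, the collation decides how characters compare. This toy example only imitates a case-insensitive collation with `strcasecmp()`; it is not how MySQL collations are implemented.

```php
<?php
$words = ['b', 'A', 'a', 'B'];

// Binary order: compares the code values given by the charset,
// so every uppercase letter sorts first (A=65, B=66, a=97, b=98).
$binary = $words;
sort($binary, SORT_STRING);

// Toy case-insensitive "collation": A and a compare as equal.
$caseInsensitive = $words;
usort($caseInsensitive, fn (string $x, string $y): int => strcasecmp($x, $y));

echo implode(' ', $binary), "\n";
echo implode(' ', $caseInsensitive), "\n";
```

With the second rule, both forms of "a" come before both forms of "b"; the relative order of "A" and "a" is left to the sort implementation.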
  66. Example of collations

    • utf8mb4_unicode_ci :
      ◦ sorts "ß" like "ss"
      ◦ "Œ" like "OE"
      ◦ Ignorable characters are skipped
    • utf8mb4_general_ci : compares as single characters
      ◦ "ß" like "s"
      ◦ "Œ" like "e"
  67. Naming of UTF-8

    • PostgreSQL : UTF8
      ◦ 8 bits
      ◦ 1 to 4 bytes
    • Oracle : AL32UTF8 (Real UTF-8, Unicode 9.0)
      ◦ Not UTF8 (Actually CESU-8, Unicode 3.0)
    • MySQL : utf8mb4
      ◦ Not utf8 (UTF-8 on 3 bytes)
  68. MySQL’s “utf8”

    2002 : the UTF-8 standard would allow up to 6 bytes per character
    • Speed boost if all rows in a table have the same number of bytes
    • People would use CHAR because it has a defined number of characters, no matter which value is stored
    • CHAR(1) = 6 bytes, CHAR(2) = 12 bytes, …
    2003 : the old UTF-8 standard is declared obsolete by Unicode to make room for the new one
    • Will people try to encode their CHAR columns into UTF-8? Let’s change the size!
  69. 101

  70. 102

  71. Homoglyph attack

    Password forgotten?
    • Enter a fake email address, looking like the one you’re attacking: [email protected] != miᎬᎬ@example.org
    • The mail will be normalized before looking it up in the database
    • A token for [email protected] is generated, then sent to miᎬᎬ@example.org, who can now connect as [email protected]
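One possible first line of defence is to flag identifiers that mix Latin letters with letters from another script. The sketch below is only an illustration of that idea, not something the talk prescribes: `usesMixedScripts()` is a hypothetical helper, the sample local parts are made up, and the check relies on PCRE's Unicode script properties.

```php
<?php
// Hypothetical helper: true when the text contains Latin letters
// together with letters from at least one other script.
function usesMixedScripts(string $text): bool
{
    $hasLatin = preg_match('/\p{Latin}/u', $text) === 1;
    // a letter (\p{L}) that is not a Latin letter
    $hasOtherLetter = preg_match('/(?!\p{Latin})\p{L}/u', $text) === 1;

    return $hasLatin && $hasOtherLetter;
}

var_dump(usesMixedScripts('miee'));               // Latin only
var_dump(usesMixedScripts("mi\u{13AC}\u{13AC}")); // Latin + U+13AC CHEROKEE LETTER GV (Ꭼ)
```

A check like this would also reject legitimate mixed-script names, so it is better suited to raising a warning than to silently refusing input.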
  72. Phabricator

    > On inserting Unicode characters with code points greater than 0xFFFF into columns that have a utf8 charset, MySQL then truncates a string as soon as it reaches such a character.

    Domain-restricted subscription:
    • Enter “[email protected]🍕@allowed-domain.com”
    • The check on the domain passes
    • Only “[email protected]” is stored in the DB !
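A value can be checked for such characters before it reaches a 3-byte `utf8` column. A minimal sketch relying on PCRE in UTF-8 mode; `hasSupplementaryCodePoint()` is a hypothetical helper and the addresses are made up for the example.

```php
<?php
// Hypothetical helper: true when the string contains a code point above
// U+FFFF, i.e. one that needs 4 bytes in UTF-8.
function hasSupplementaryCodePoint(string $utf8): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8) === 1;
}

var_dump(hasSupplementaryCodePoint('user@example.com'));                  // no 4-byte character
var_dump(hasSupplementaryCodePoint("user@example.com\u{1F355}@ok.test")); // contains 🍕 (U+1F355)
```

Validating the value that will actually be stored, rather than the raw input, avoids the mismatch described on the slide; switching the column to utf8mb4 removes the truncation itself.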
  73. SOURCES - Webpages

    https://adamhooper.medium.com/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
    https://decodeunicode.org/
    https://deliciousbrains.com/how-unicode-works/
    https://dev.mysql.com/doc/refman/8.0/en/charset-general.html
    https://github.com/brefphp/bref/blob/f4df37277181dc76b6f644663de236eae7a793d2/src/functions.php#L11
    https://github.com/captioning/captioning/issues/86
    https://github.com/jolicode/emoji-search
    https://github.com/markrogoyski/math-php
    https://github.com/mysql/mysql-server/commit/43a506c0ced0e6ea101d3ab8b4b423ce3fa327d0
    https://github.com/PHP-CS-Fixer/PHP-CS-Fixer/blob/master/src/Fixer/Basic/EncodingFixer.php
    https://github.com/sgolemon/table-flip/blob/master/src/TableFlip.php
    https://github.com/symfony/symfony/blob/85b97226def5e4a50c1e3805a6c31bb6642efb70/src/Symfony/Component/Intl/Tests/Transliterator/EmojiTransliteratorTest.php
    https://github.com/symfony/symfony/pull/33896/files
    https://gizmodo.com/a-cellphones-missing-dot-kills-two-people-puts-three-m-382026
    https://hackerone.com/reports/2233
    https://hsivonen.fi/string-length/
    https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
    https://jolicode.github.io/unicode-conf/index.html#/
    https://kunststube.net/encoding/
    https://news.ycombinator.com/item?id=8892157
  74. https://www.php.net/manual/en/function.utf8-decode.php

    https://www.php.net/manual/en/mbstring.supported-encodings.php
    https://www.php.net/manual/fr/refs.international.php
    https://www.postgresql.org/docs/current/multibyte.html
    https://pyrech.github.io/php-wtf/#/15?_k=dyazd4
    https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci/
    https://symfony.com/blog/new-in-symfony-6-2-better-emoji-support
    https://symfony.com/doc/current/components/intl.html
    https://symfony.com/doc/current/components/string.html
    https://tonsky.me/blog/unicode/
    http://www.unicode.org/charts/
    http://unicode.org/emoji/charts/emoji-variants.html
    https://unicode-org.github.io/icu/userguide/transforms/general/#script-transliteration
    https://unicode.org/glossary/
    https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Unicode/Test
    https://en.wikipedia.org/wiki/Character_encoding
    https://en.wikipedia.org/wiki/UTF-8
    https://en.wikipedia.org/wiki/Mojibake
    https://en.wikipedia.org/wiki/Byte_order_mark
    https://www.youtube.com/watch?v=kaucJce8hhE&t=19s&ab_channel=TheUnicodeConsortium