Special characters and where to find them

There are 23 official languages within the European Union and many, if not all, of them have special characters. In German, for example, there are the umlauts (“ü”, “ö”, “ä”) and the “ß”; other languages have more. Many characters exist both in a precomposed form and as a combination of two characters. Using the two-character form can break search, spell checking, and transliteration for the slug. When it occurs in a filename, it can also break images under some server configurations and browsers.
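The difference between the precomposed and the two-character form can be made visible with a few lines of JavaScript. This is a minimal sketch using only the standard `String.prototype.normalize()` method; the variable names are illustrative and not taken from the talk.

```javascript
// "Bär" written two ways. Both render identically on screen.
const precomposed = "B\u00E4r"; // ä as one code point, U+00E4 (decimal 228)
const decomposed = "Ba\u0308r"; // a (U+0061) + combining diaeresis (U+0308)

// The strings are not equal and do not even have the same length.
console.log(precomposed === decomposed);            // false
console.log(precomposed.length, decomposed.length); // 3 4

// Normalizing both sides to the same form makes them comparable.
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true

// The compatibility forms (NFKC/NFKD) also fold compatibility
// characters, e.g. the "fi" ligature U+FB01 becomes the letters "fi".
console.log("\uFB01".normalize("NFKC"));             // fi
console.log("\uFB01".normalize("NFC") === "\uFB01"); // true
```

NFC is used as the comparison form here because it matches what most keyboards produce; either form works as long as both sides use the same one.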

Presentation from WordCamp Europe 2019.

Torsten Landsiedel

June 21, 2019

Transcript

  1. SPECIAL CHARACTERS AND WHERE TO FIND THEM. WORDCAMP EUROPE 2019 // BERLIN. Ümlaut, Baby! Torsten Landsiedel // @zodiac1978
  2. WHO AM I? @zodiac1978. 01. SINCE 2012: Full-time WordPress Freelancer. 02. 2009–2017: General Translation Editor (for German). 03. 2013–2018: Support forum moderator (for German).
  3. WHY ARE WE HERE? There are 23 official languages WITHIN THE EUROPEAN UNION. Do we really want to support them?
  4. NON-ENGLISH INSTALLATIONS: 48.5%
  5. WHAT AM I TALKING ABOUT? Normalization. SUBSETS, FONTS, LOCALES, COLLATION, ENCODING, CHARACTER SETS
  6. THIS IS A LITTLE STORY ABOUT … WAPUU THE BEAR
  7. THIS IS A LITTLE STORY ABOUT … WAPUU THE BEAR. bear = Bär (in German)
  8. THIS IS A LITTLE STORY ABOUT … WAPUU THE BEAR. ä is an Umlaut
  9. THIS IS A LITTLE STORY ABOUT … WAPUU THE BEAR. ä = ASCII 228
  10. THIS IS A LITTLE STORY ABOUT … WAPUU THE BEAR. ä ≠ a + ¨
  11. “Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing, searching, sorting, regular expression matching and selection are involved.” RFC 3629
  12. UNICODE NORMALIZATION FORMS (Normalization Form: Description). D (NFD): Canonical Decomposition; C (NFC): Canonical Decomposition, followed by Canonical Composition; KD (NFKD): Compatibility Decomposition; KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
  13. “However, it can be difficult for users to assure that a given resource or set of resources uses a consistent textual representation because the differences are usually not visible when viewed as text.” www.w3.org/TR/charmod-norm/#unicodeNormalization
  14. “Tools [like WordPress] and implementations thus need to consider the difficulties experienced by users when visually or logically equivalent strings that "ought to" match (in the user's mind) are considered to be distinct values.” www.w3.org/TR/charmod-norm/#unicodeNormalization
  15. “Providing a means for users to see these differences and/or normalize them as appropriate makes it possible for end users to avoid failures that spring from invisible differences in their source documents. For example, the W3C Validator warns when an HTML document is not fully in Unicode Normalization Form C.” www.w3.org/TR/charmod-norm/#unicodeNormalization
  16. “And then picking NFD normalization – and making it visible, and actively converting correct unicode into that absolutely horrible format, that’s just inexcusable. Even the people who think normalization is a good thing admit that NFD is a bad format, and certainly not for data exchange. It’s not even "paste-eater" quality thinking. It’s actually actively corrupting user data. By design. Christ.” LINUS TORVALDS
  17. “HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode. This isn't a problem as long as you use system-provided APIs to process text. Apple's APIs correctly handle both precomposed and decomposed Unicode.” TECHNICAL Q&A QA1235 (APPLE)
  18. PHP: Normalizer::normalize. Normalizes the input provided and returns the normalized string. PHP 5 ≥ 5.3.0, PHP 7, PECL intl ≥ 1.0.0
  19. JAVASCRIPT: String.prototype.normalize(). The normalize() method returns the specified Unicode Normalization Form of the string. It does not affect the value of the string itself. ECMAScript 2015 (6th Edition, ECMA-262)
  20. BLOCK EDITOR/GUTENBERG (JS): HTML = HTML.normalize(); BUT, only for the content (#14178) and no polyfill for older browsers (#13157). (GitHub issue numbers)
  21. BLOCK EDITOR/GUTENBERG (JS): LET’S TRY THAT … ✴ Dictionary/proofreading fails ✴ Transliteration (Bär -> Baer) does not work: slug is showing %cc%88
  22. BLOCK EDITOR/GUTENBERG (JS): MAYBE AFTER SAVE? ✴ Editor still shows “wrong” permalink
  23. BLOCK EDITOR/GUTENBERG (JS): INTERNAL SEARCH BROKEN ✴ Search for “Bär” should have found the post with the title “Bär” but is showing “No posts found”
  24. BLOCK EDITOR/GUTENBERG (JS): BROWSER SEARCH BROKEN (FIREFOX)
  25. BLOCK EDITOR/GUTENBERG (JS): BROWSER SEARCH WORKS (CHROME)
  26. BLOCK EDITOR/GUTENBERG (JS): BROWSER SEARCH BROKEN! (SAFARI)
  27. BLOCK EDITOR/GUTENBERG (JS): WEIRD BEHAVIOUR ✴ This happens on first pasting and publishing
  28. BLOCK EDITOR/GUTENBERG (JS): WEIRD BEHAVIOUR ✴ This happens after deleting the slug and re-generating it
  29. TRAC IDEAS SO FAR. 01. ADDITIONAL FILTERS: Using the PHP normalizer function (if available). 02. REGEX POLYFILLS/FALLBACKS: If PHP normalizer function not available. 03. GLOBAL JS/PASTE SOLUTION: Current solution is limited to the editor.
  30. PICTURE CREDITS. #2 Photo by Harrison Moore on Unsplash; #2 Photo by Jonas Jacobsson on Unsplash; #4 https://wordpress.org/about/stats/; #3 Photo by Adi Goldstein on Unsplash; #5+6 Photo by Nad X on Unsplash; #8 Thanks to Michael Schäfer!; #9 Screenshot from https://walktowc.eu/; #11 Photo by Darius Bashar on Unsplash; #12-16 Wapuu the BER; #22 Photo by Jon Tyson on Unsplash; #26 Photo by Javier García on Unsplash; #27 Wikimedia: Krd CC BY-SA 3.0; #29 Photo by Nick Fewings on Unsplash; #33 Photo by Conor Samuel on Unsplash; #43 Photo by Ivana Milakovic on Unsplash; #46 Photo by Aarón Blanco Tejedor on Unsplash; #47 Photo by Bruce Warrington on Unsplash; #48 Photo by Mr TT on Unsplash; #48 Photo by Chris Barbalis on Unsplash; #48 Photo by Ashim D’Silva on Unsplash; #49 Photo by Cris DiNoto on Unsplash; #50 Photo by Chris Barbalis on Unsplash
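
The failures listed on slides 21 to 23 (slug showing %cc%88, transliteration not working, search finding nothing) can be reproduced outside WordPress. The following is a rough sketch in plain JavaScript, not WordPress code: `transliterate` and `germanMap` are hypothetical stand-ins for a table-based transliteration step, and the search is reduced to a simple substring check.

```javascript
// A title in decomposed form, as it might arrive via paste or a macOS filename.
const title = "Ba\u0308r"; // renders as "Bär"
const query = "B\u00E4r";  // what a user typically types (precomposed)

// Percent-encoding exposes the combining diaeresis as UTF-8 bytes CC 88.
// (encodeURIComponent emits uppercase hex; the slide shows lowercase.)
console.log(encodeURIComponent(title));                  // Ba%CC%88r
console.log(encodeURIComponent(title.normalize("NFC"))); // B%C3%A4r

// Hypothetical table-based transliteration, keyed on precomposed letters.
const germanMap = { "\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue", "\u00DF": "ss" };
function transliterate(text) {
  return text.replace(/[\u00E4\u00F6\u00FC\u00DF]/g, (ch) => germanMap[ch]);
}
console.log(transliterate(title) === title);        // true (nothing matched)
console.log(transliterate(title.normalize("NFC"))); // Baer

// A plain substring search misses the decomposed title.
console.log(title.includes(query));                                   // false
console.log(title.normalize("NFC").includes(query.normalize("NFC"))); // true
```

Normalizing to NFC before generating the slug, transliterating, or searching removes all three symptoms in this sketch, which is the direction the Trac ideas on slide 29 point in.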