CJK and Unicode From a PHP Committer

てきめん (tekimen)

August 30, 2025

Transcript

  1. Self-introduction
    • てきめん (tekimen)
    • https://tekitoh-memdhoi.info
    • X (Twitter): @youkidearitai
    • https://github.com/youkidearitai
    • https://mstdn.jp/tekimen
    • https://phpc.social/@youkidearitai
    • A PHP committer working on mbstring/Unicode
  2. Agenda
    • Unicode is good
    • About CJK
      – Ideographic characters and phonographic characters
    • About grapheme functions
    • What's new on the Unicode side of PHP
  3. Unicode is good
    • Unicode is good
    • Freedom from mojibake (garbled characters)
    • We can also use emoji 🎉🎉
  4. About 漢字 (Kanji)
    • Kanji originated in China and are called “Hanzi” in Chinese.
    • 漢字 are ideographic characters (表意文字).
      – That is, each character carries a meaning.
      – By comparison, the English alphabet is phonographic (表音文字): each letter represents a sound.
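  5. How to learn Kanji (or ideographic characters)?
    • Although not historically accurate, story-based mnemonics make ideographic characters easy to learn.
    • ストーリーで覚える漢字300 英語・韓国語・ポルトガル語・スペイン語訳版 (300 Kanji Learned through Stories: English, Korean, Portuguese, and Spanish translation edition)
    • https://www.9640.jp/nihongo/ja/detail/?402
    • https://www.amazon.co.jp/dp/4874244025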
  6. Nit: the (probably) correct history of “駅”
    • A look at the history of 駅 (station) shows that the character was built from pictograms.
    • The left side is a horse (馬); the right side is an eye (目) plus a handcuffs symbol (幸) → 驛 (old form) → 駅
    • 「駅/驛」という漢字の意味・成り立ち・読み方・画数・部首を学習 (Learn the meaning, origin, readings, stroke count, and radical of the kanji 駅/驛) https://okjiten.jp/kanji462.html
    • 駅とは (エキとは) [単語記事] - ニコニコ大百科 (What is 駅 (eki)? [word article], Niconico Pedia) https://dic.nicovideo.jp/a/%E9%A7%85
  7. About CJK
    • Kanji originated in China and spread throughout East Asia, where many characters are now shared.
    • However, the shapes of kanji vary by country or region.
    • In Unicode, these regional forms were unified under shared code points.
      – This is called Han unification (漢字統合).
    • The countries involved are referred to as “CJK”, after the initial letters of China, Japan, and Korea.
      – Vietnam may also be included, in which case the term is CJKV.
  8. Example: 化 U+5316
    • Chinese (Noto Sans SC): 化
    • Japanese (Noto Sans JP): 化
    • Korean (Noto Sans KR): 化
    • Slide annotation: “Not protruding”
  9. Example: 乳 U+4E73
    • Chinese (Noto Sans SC): 乳
    • Japanese (Noto Sans JP): 乳
    • Korean (Noto Sans KR): 乳
    • Slide annotation: “Different point”
  10. Example: 誤 U+8AA4
    • Chinese (Noto Sans SC): 誤
    • Japanese (Noto Sans JP): 誤
    • Korean (Noto Sans KR): 誤
    • Slide annotation: “Same code point, but all three glyphs differ!”
  11. Unicode has unified the world. But...
    • On the other hand, there are also regional differences. CJK is one example.
    • Unicode has a concept called locale.
      – In the root locale, the Turkish letter “İ” (U+0130) is not matched with its lowercase form “i”.
    • Likewise, “ss” is not matched by default with the German letter “ẞ” (U+1E9E, capital sharp s; the lowercase letter ß is U+00DF), even though “ss” is its case-folded form.
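    A minimal sketch of the İ case, assuming the mbstring extension is loaded. mbstring's case functions take no locale, so they can only apply locale-independent rules. Unicode's default full case mapping lowercases İ (U+0130) to “i” followed by U+0307 (combining dot above), not to the plain “i” that Turkish expects. The helper name codePoints is made up for this illustration.

    ```php
    <?php
    // Hypothetical helper: list the code points of a UTF-8 string as "U+XXXX".
    function codePoints(string $s): array
    {
        return array_map(
            fn (string $c): string => sprintf('U+%04X', mb_ord($c, 'UTF-8')),
            mb_str_split($s, 1, 'UTF-8')
        );
    }

    $dottedI = "\u{0130}"; // İ, LATIN CAPITAL LETTER I WITH DOT ABOVE

    // mb_strtolower() has no locale parameter; it applies locale-independent rules.
    $lower = mb_strtolower($dottedI, 'UTF-8');

    // Show which code points the lowercased string contains.
    echo implode(' ', codePoints($lower)), PHP_EOL;

    // Compare the result with a plain ASCII "i".
    var_dump($lower === 'i');
    ```

    If the comparison prints false, a simple lowercase-and-compare search for “i” will miss İ, which is the kind of gap a locale-aware comparison is meant to close.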
  12. Grapheme cluster
    • 🇯🇵 consists of “🇯” and “🇵”.
      – Some emoji are made of multiple code points.
    • Japanese kanji can also be made of multiple code points. For example, 「邉」
      – 邉 邉 邉󠄁 邉󠄂 邉󠄃 邉󠄄 邉󠄅 邉󠄆 邉󠄈 邉󠄉 邉󠄊 邉󠄋 邉󠄌 邉󠄍 邉󠄎
      – Code points: U+9089 U+E0101
    • Variation selectors: U+E0100 through U+E01EF
      – Such a sequence is called an IVS (Ideographic Variation Sequence).
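    The two examples on this slide can be checked with a short sketch, assuming the mbstring and intl extensions are both loaded. It compares a code point count (mb_strlen) with a grapheme cluster count (grapheme_strlen) for the flag and for the U+9089 U+E0101 sequence named above.

    ```php
    <?php
    // Requires the mbstring and intl extensions.
    $flag = "\u{1F1EF}\u{1F1F5}"; // 🇯🇵 = regional indicator J + regional indicator P
    $ivs  = "\u{9089}\u{E0101}";  // 邉 followed by a variation selector

    foreach (['flag' => $flag, 'ivs' => $ivs] as $name => $text) {
        printf(
            "%s: %d code points, %d grapheme cluster(s)\n",
            $name,
            mb_strlen($text, 'UTF-8'), // counts code points
            grapheme_strlen($text)     // counts grapheme clusters
        );
    }
    ```

    With a reasonably recent ICU, both strings should report two code points and one grapheme cluster.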
  13. Grapheme cluster
    • A sequence of code points that appears as a single character is called a grapheme cluster.
      – Kanji (漢字)
      – Emoji (✌️)
      – Probably covers characters from around the world
  14. PHP Unicode Support
    • PHP provides the grapheme functions, which operate on grapheme clusters.
    • In PHP 8.5, a locale parameter is added to the grapheme functions.
      – This addresses the problems from the earlier slide (İ (U+0130) matches its lowercase form, and ẞ (U+1E9E) matches “ss”).
    • For example, pass “de_DE-u-ks-primary” as $locale.
  15. Grapheme functions
    • They are part of the intl extension.
      – Requires the --enable-intl configure option.
    • Length is counted in characters as the reader sees them: each grapheme cluster counts as one.
      – Emoji make a useful example 🙆‍♂️🙆‍♀️
  16. grapheme functions and mbstring functions
    • mb_str_split splits by Unicode code point.
    • grapheme_str_split splits by grapheme cluster.
    • Split by Unicode code point:
      – U+1F646
      – U+200D
      – U+2640
      – U+FE0F
    • Split by grapheme cluster: the four code points above stay together as one cluster.
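    The comparison on this slide can be sketched as follows, assuming mbstring and intl are loaded and that grapheme_str_split() is available (PHP 8.4 or later). The string is built from the four code points listed above.

    ```php
    <?php
    // Requires mbstring and intl; grapheme_str_split() needs PHP 8.4 or later.
    $emoji = "\u{1F646}\u{200D}\u{2640}\u{FE0F}"; // the four code points from the slide

    // Split per code point and format each one as U+XXXX.
    $codePoints = array_map(
        fn (string $c): string => sprintf('U+%04X', mb_ord($c, 'UTF-8')),
        mb_str_split($emoji, 1, 'UTF-8')
    );
    echo implode(' ', $codePoints), PHP_EOL; // U+1F646 U+200D U+2640 U+FE0F

    // Split per grapheme cluster and report how many pieces come back.
    $clusters = grapheme_str_split($emoji);
    echo count($clusters), PHP_EOL;
    ```

    With a reasonably recent ICU the second split should return a single element, because the ZWJ sequence is treated as one cluster.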
  17. Add locale for grapheme functions
    • https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
      – I added a $locale parameter to the grapheme functions.
      – Based on LDML (Locale Data Markup Language)
    • https://www.unicode.org/reports/tr35/
  18. How to use the $locale parameter
    • To make İ (U+0130) match i, use “tr_TR”.
    • To make ẞ (U+1E9E) match ss, use “u-ks-level1”.
    • To keep 邉󠄅 from matching 邊, use “u-ks-identic”.
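    The strength keywords can be illustrated with intl's long-standing Collator class. This sketch is not the new $locale parameter itself: it sets the strength through Collator::setStrength(), on the assumption that “ks-level1” corresponds to primary strength and “ks-identic” to identical strength, as LDML describes them. The helper name sameAt is invented for the example, and the variation-selector pair is chosen for illustration rather than taken from the slide.

    ```php
    <?php
    // Requires the intl extension.
    // Hypothetical helper: do $a and $b compare as equal at the given collation strength?
    function sameAt(int $strength, string $a, string $b, string $locale = 'de_DE'): bool
    {
        $collator = new Collator($locale);
        $collator->setStrength($strength);
        return $collator->compare($a, $b) === 0;
    }

    $sharpS = "\u{1E9E}";          // ẞ, LATIN CAPITAL LETTER SHARP S
    $plain  = "\u{9089}";          // 邉
    $ivs    = "\u{9089}\u{E0101}"; // 邉 plus a variation selector

    var_dump(sameAt(Collator::PRIMARY, $sharpS, 'ss'));  // primary: base letters only
    var_dump(sameAt(Collator::TERTIARY, $sharpS, 'ss')); // tertiary: case and variant differences count
    var_dump(sameAt(Collator::TERTIARY, $plain, $ivs));  // variation selectors are ignorable at this level
    var_dump(sameAt(Collator::IDENTICAL, $plain, $ivs)); // identical: the code points must agree
    ```

    With ICU's standard collation data this is expected to print true, false, true, false: a weaker strength makes more strings match, and identical strength distinguishes even a variation selector.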
  19. Conclusion
    • Enhancements to the grapheme functions enable further Unicode support.
    • Unicode is difficult, but it lets us communicate across the world.
      – I went into detail using kanji, but I think the differences between glyphs are acceptable.
    • I expect the discomfort over CJK glyphs looking different to fade within a few generations.
  20. Appendix: references
    • https://maidonanews.jp/article/13142228
    • https://www.unicode.org/reports/tr35/
    • https://heistak.github.io/your-code-displays-japanese-wrong/
    • https://www.9640.jp/nihongo/ja/detail/?402
    • https://speakerdeck.com/youkidearitai/wen-zi-tohananika-phpnowen-zi-kodochu-li-nituite-php-lovers-meetup-number-5
    • https://ken-lunde.medium.com/genuine-han-unification-redux-3912b561ecae
      – JP: https://medium.com/@takagi.yuusuke/genuine-han-unification-%E6%97%A5%E6%9C%AC%E8%AA%9E%E8%A8%B3-24a705d77f9b