CJK and Unicode From a PHP Committer

Self-introduction
• てきめん (tekimen)
• https://tekitoh-memdhoi.info
• X (Twitter): @youkidearitai
• https://github.com/youkidearitai
• https://mstdn.jp/tekimen
• https://phpc.social/@youkidearitai
• A PHP committer working on mbstring/Unicode

Agenda
• Unicode is good
• About CJK
  – Ideographic characters and phonographic characters
• About grapheme functions
• What's new on the Unicode side of PHP

Unicode is good
• Unicode is good
• Freedom from mojibake (garbled characters)
• We can also use emoji 🎉🎉

About 漢字 (Kanji)
• Kanji originated in China, where the word is pronounced “Hanzi”.
• 漢字 are ideographic characters (表意文字).
  – That is, each character carries a meaning.
  – By contrast, the alphabet used for English is phonographic (表音文字): each character represents a sound.

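How to learn Kanji (or ideographic characters)?
• Although not historically accurate, story-based explanations make ideographic characters easy to understand when learning them.
• ストーリーで覚える漢字300 英語・韓国語・ポルトガル語・スペイン語訳版 (“Learning 300 Kanji through Stories”, edition with English, Korean, Portuguese, and Spanish translations)
  – https://www.9640.jp/nihongo/ja/detail/?402
  – https://www.amazon.co.jp/dp/4874244025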

Nit: the (probably) correct history of “駅”
• A look at the history of the character 駅 (station) shows that it was formed from pictograms.
• The left side is horse (馬); the right side is eye (目) + handcuff symbol (幸) → 驛 (old character) → 駅
• 「駅/驛」という漢字の意味・成り立ち・読み方・画数・部首を学習 (Learning the meaning, origin, readings, stroke count, and radical of the kanji 駅/驛) https://okjiten.jp/kanji462.html
• 駅とは (エキとは) [単語記事] - ニコニコ大百科 (What is 駅 (eki)? [word article] - Niconico Pedia) https://dic.nicovideo.jp/a/%E9%A7%85

About CJK
• Kanji originated in China and spread throughout East Asia, where many characters are now shared.
• However, the shape of a kanji character varies depending on the country or region.
• When Unicode was standardized, these regional forms were unified into shared code points.
  – This is called Han unification (漢字統合).
• These countries are referred to as “CJK”, from the initials of China, Japan, and Korea.
  – Vietnam may also be included, in which case it is called CJKV.

Example: 化 U+5316
• 化 Chinese (Noto Sans SC)
• 化 Japanese (Noto Sans JP)
• 化 Korean (Noto Sans KR)
• Slide annotation: not protruding

Example: 乳 U+4E73
• 乳 Chinese (Noto Sans SC)
• 乳 Japanese (Noto Sans JP)
• 乳 Korean (Noto Sans KR)
• Slide annotation: different point

Example: 誤 U+8AA4
• 誤 Chinese (Noto Sans SC)
• 誤 Japanese (Noto Sans JP)
• 誤 Korean (Noto Sans KR)
• Same code point, but all three glyphs are different!

Further information
• Your Code Displays Japanese Wrong
  – How to fix: specify a font for the intended country or region.
• https://heistak.github.io/your-code-displays-japanese-wrong/

Unicode has unified the world. But...
• On the other hand, there are also regional differences. CJK is one example.
• Unicode has a concept called locale.
  – In the root locale, the Turkish letter “İ” (U+0130) is not matched with the lowercase letter “i”.
  – Likewise, “ss” is not matched with the German sharp s (ß, U+00DF, or capital ẞ, U+1E9E), even though the sharp s case-folds to “ss”.
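The locale dependence described on this slide can be observed with the Intl extension's Collator class. Collator is not mentioned in the slides and is used here only as an illustration; this is a minimal sketch assuming ext-intl is loaded, and the helper name equalAt() is hypothetical.

```php
<?php
// Sketch only: requires ext-intl. equalAt() is a hypothetical helper.
function equalAt(string $locale, int $strength, string $a, string $b): bool
{
    $collator = new Collator($locale);
    $collator->setStrength($strength);
    // Collator::compare() returns 0 when both strings are equal at this strength.
    return $collator->compare($a, $b) === 0;
}

// "ß" (U+00DF) and "ss", comparing base letters only (primary strength).
var_dump(equalAt('de_DE', Collator::PRIMARY, "\u{00DF}", 'ss'));

// "İ" (U+0130) and "i": the answer is expected to depend on the locale.
var_dump(equalAt('root', Collator::SECONDARY, "\u{0130}", 'i'));
var_dump(equalAt('tr_TR', Collator::SECONDARY, "\u{0130}", 'i'));
```

Comparing the last two results shows whether the Turkish tailoring changes how “İ” relates to “i” on a given ICU version.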

Grapheme cluster
• 🇯🇵 consists of “🇯” and “🇵”.
  – An emoji is sometimes made of multiple code points.
• Japanese Kanji also has characters made of multiple code points. For example, 「邉」
  – 邉邉邉󠄁 邉󠄂 邉󠄃 邉󠄄 邉󠄅 邉󠄆 邉󠄈 邉󠄉 邉󠄊 邉󠄋 邉󠄌 邉󠄍 邉󠄎
  – One of these variants is the code point sequence U+9089 U+E0101.
• Variation selectors: U+E0100 to U+E01EF
  – Such a sequence is called an IVS (Ideographic Variation Sequence).
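The two multi-code-point cases above can be inspected from PHP. This is a minimal sketch assuming ext-mbstring is available; codePoints() is a hypothetical helper, not a function from the talk.

```php
<?php
// Sketch only: requires ext-mbstring. codePoints() is a hypothetical helper.
function codePoints(string $text): array
{
    $result = [];
    // mb_str_split() yields one array element per Unicode code point.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $result[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $result;
}

$flag = "\u{1F1EF}\u{1F1F5}"; // 🇯🇵: two regional indicator symbols
$ivs  = "\u{9089}\u{E0101}";  // 邉 followed by a variation selector

echo implode(' ', codePoints($flag)), PHP_EOL; // U+1F1EF U+1F1F5
echo implode(' ', codePoints($ivs)), PHP_EOL;  // U+9089 U+E0101
```

Both strings are displayed as one character, yet each is made of two code points.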

Grapheme cluster
• A sequence of code points that appears as a single character is called a grapheme cluster.
  – Kanji (漢字)
  – Emoji (✌️)
  – Probably covers characters from scripts around the world

PHP Unicode Support
• The grapheme functions are supported. They operate on grapheme clusters.
• In PHP 8.5, a locale parameter is added to the grapheme functions.
  – This addresses the problems from the earlier slide: İ (U+0130) can match the lowercase “i”, and the German sharp s (ß, U+00DF, or capital ẞ, U+1E9E) can match “ss”.
• For example, pass “de_DE-u-ks-primary” as $locale.

Grapheme functions
• They are located in the Intl extension.
  – Requires the --enable-intl configure option.
• Each cluster is counted as one character, so the count matches the number of characters a reader sees.
  – Useful for emoji 🙆‍♂️🙆‍♀️
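The counting behaviour above can be compared across byte, code point, and grapheme cluster units. This is a minimal sketch assuming ext-mbstring and ext-intl are loaded; the grapheme count depends on the ICU version PHP was built against.

```php
<?php
// Sketch only: requires ext-mbstring and ext-intl.
// 🙆‍♂️🙆‍♀️ written with escapes so the individual code points are visible.
$text = "\u{1F646}\u{200D}\u{2642}\u{FE0F}"   // 🙆‍♂️
      . "\u{1F646}\u{200D}\u{2640}\u{FE0F}";  // 🙆‍♀️

$bytes      = strlen($text);             // UTF-8 bytes
$codePoints = mb_strlen($text, 'UTF-8'); // Unicode code points
$clusters   = grapheme_strlen($text);    // grapheme clusters

printf(
    "bytes=%d code points=%d grapheme clusters=%d\n",
    $bytes,
    $codePoints,
    $clusters
);
```

The string is 26 bytes and 8 code points; with a recent ICU, grapheme_strlen() reports 2, which is the number of emoji a reader sees.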

grapheme functions and mbstring functions
• mb_str_split splits a string into Unicode code points.
• grapheme_str_split splits a string into grapheme clusters.
• Split by Unicode code point:
  – U+1F646
  – U+200D
  – U+2640
  – U+FE0F
• Split by grapheme cluster: the same four code points stay together as one element.
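The difference between the two split functions can be reproduced with the code points listed above. This is a minimal sketch assuming ext-mbstring is loaded; grapheme_str_split() needs ext-intl on PHP 8.4 or later, so the call is guarded.

```php
<?php
// Sketch only: requires ext-mbstring; grapheme_str_split() needs ext-intl on PHP 8.4+.
$emoji = "\u{1F646}\u{200D}\u{2640}\u{FE0F}"; // 🙆‍♀️

// One element per Unicode code point.
$byCodePoint = mb_str_split($emoji, 1, 'UTF-8');
echo 'code point units: ', count($byCodePoint), PHP_EOL; // 4

// One element per grapheme cluster, when the function is available.
$byCluster = function_exists('grapheme_str_split')
    ? grapheme_str_split($emoji)
    : null;

if ($byCluster !== null) {
    echo 'grapheme cluster units: ', count($byCluster), PHP_EOL;
}
```

With a recent ICU the grapheme split returns a single element holding the whole emoji.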

Add locale for grapheme functions
• https://wiki.php.net/rfc/grapheme_add_locale_for_case_insensitive
  – I added a $locale parameter to the grapheme functions.
  – Based on LDML (Locale Data Markup Language)
• https://www.unicode.org/reports/tr35/

How to use the $locale parameter
• To make İ (U+0130) match i, use “tr_TR”.
• To make the German sharp s (ß, U+00DF, or capital ẞ, U+1E9E) match ss, use “u-ks-level1”.
• To keep 邉󠄅 from matching 邊, use “u-ks-identic”.
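The locale values above could be passed to a case-insensitive grapheme search roughly as follows. This is a sketch under assumptions: ext-intl is loaded, $locale is the fourth parameter of grapheme_stripos() on PHP 8.5 as the RFC describes, and the locale strings are copied verbatim from the slide. findIgnoringCase() is a hypothetical helper, and the results are not asserted here because they depend on the PHP and ICU versions.

```php
<?php
// Sketch only: requires ext-intl. findIgnoringCase() is a hypothetical helper.
function findIgnoringCase(string $haystack, string $needle, string $locale): int|false
{
    if (PHP_VERSION_ID >= 80500) {
        // Assumption: $locale is the fourth parameter, as described in the RFC.
        return grapheme_stripos($haystack, $needle, 0, $locale);
    }
    // Older PHP versions have no $locale parameter, so it is ignored here.
    return grapheme_stripos($haystack, $needle, 0);
}

// İ (U+0130) searched for with "i" under the Turkish locale.
var_dump(findIgnoringCase("\u{0130}", 'i', 'tr_TR'));

// ẞ (U+1E9E, the code point given on the slide) searched for with "ss".
var_dump(findIgnoringCase("\u{1E9E}", 'ss', 'u-ks-level1'));
```

On PHP versions before 8.5 the helper falls back to the locale-independent search, which makes the difference easy to compare side by side.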

Conclusion
• Enhancements to the grapheme functions enable further Unicode support.
• Unicode is difficult, but it lets us communicate across the world.
  – I explained this in detail using Kanji, but I think the differences between characters are acceptable.
• I think the discomfort over CJK characters looking different will disappear within a few generations.

Thank you!

Appendix: references
• https://maidonanews.jp/article/13142228
• https://www.unicode.org/reports/tr35/
• https://heistak.github.io/your-code-displays-japanese-wrong/
• https://www.9640.jp/nihongo/ja/detail/?402
• https://speakerdeck.com/youkidearitai/wen-zi-tohananika-phpnowen-zi-kodochu-li-nituite-php-lovers-meetup-number-5
• https://ken-lunde.medium.com/genuine-han-unification-redux-3912b561ecae
  – JP: https://medium.com/@takagi.yuusuke/genuine-han-unification-%E6%97%A5%E6%9C%AC%E8%AA%9E%E8%A8%B3-24a705d77f9b
