README.md in unicode-scripts-1.10.0 vs README.md in unicode-scripts-1.11.0

- old
+ new

@@ -1,22 +1,22 @@ # Unicode::Scripts [![[version]](https://badge.fury.io/rb/unicode-scripts.svg)](https://badge.fury.io/rb/unicode-scripts) [![[ci]](https://github.com/janlelis/unicode-scripts/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-scripts/actions?query=workflow%3ATest) -Retrieve the [Unicode script(s)](https://en.wikipedia.org/wiki/Script_%28Unicode%29) a string belongs to. Can also return the *Script_Extension* property which is defined as characters which are "commonly used with more than one script, but with a limited number of scripts". +Retrieve all [Unicode script(s)](https://en.wikipedia.org/wiki/Script_%28Unicode%29) a string belongs to. Can also return the *Script_Extension* property (scx) which is defined as characters which are "commonly used with more than one script, but with a limited number of scripts". +Based on the *Script_Extension*, this library can also return the [augmented script set](https://www.unicode.org/reports/tr39/#def-augmented-script-set) to figure out if a string is **mixed-script** or **single-script**. Mixed scripts can be an indicator of suspicious user inputs. 
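The mixed-script check mentioned above follows UTS #39: the resolved script set of a string is the intersection of the augmented script sets of its characters, and the string is mixed-script when that intersection is empty. Below is a minimal sketch of that rule in plain Ruby. It does not use this gem. The `SAMPLE_SETS` table, the `ALL` marker and both method names are hypothetical and hand-written for illustration; real data comes from the Unicode Character Database.

```ruby
require "set"

# Stand-in for "every script", used for Common/Inherited characters.
ALL = :all

# Hypothetical, hand-written augmented script sets for a few characters.
SAMPLE_SETS = {
  "\u{0421}" => Set["Cyrl"], # CYRILLIC CAPITAL LETTER ES
  "\u{0435}" => Set["Cyrl"], # CYRILLIC SMALL LETTER IE
  "i"        => Set["Latn"],
  "r"        => Set["Latn"],
  "l"        => Set["Latn"],
  "1"        => ALL,         # digits are Common, so they match all scripts
}.freeze

# Intersect the per-character sets. ALL acts as the identity element,
# so Common characters never narrow the result.
def resolved_script_set(string)
  string.each_char.reduce(ALL) do |resolved, char|
    current = SAMPLE_SETS.fetch(char)
    next resolved if current == ALL
    resolved == ALL ? current : resolved & current
  end
end

# A string is mixed-script when the resolved script set is empty.
def mixed_script?(string)
  resolved = resolved_script_set(string)
  resolved != ALL && resolved.empty?
end

p mixed_script?("\u{0421}\u{0435}") # Cyrillic only
p mixed_script?("\u{0421}irl")      # Cyrillic capital followed by Latin letters
p mixed_script?("1")                # Common only
```

Treating `ALL` as a separate marker avoids listing every script code for Common characters; a real implementation may represent this differently.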
+ Unicode version: **16.0.0** (September 2024) -Supported Rubies: **3.3**, **3.2**, **3.1**, **3.0** +Supported Rubies: **3.x** (might work: **2.x**) -Old Rubies that might still work: **2.7**, **2.6**, **2.5**, **2.4**, **2.3**, **2.X** - ## Gemfile ```ruby gem "unicode-scripts" ``` -## Usage +## Usage - Scripts and Script Extensions ```ruby require "unicode/scripts" Unicode::Scripts.scripts("СC") # => ["Cyrillic", "Latin"] @@ -32,390 +32,100 @@ # => ["Bengali", "Devanagari", "Dogra", "Grantha", "Gujarati", "Gunjala_Gondi", "Gurmukhi","Gurung_Khema", "Kannada","Khudawadi", "Limbu", "Mahajani", "Malayalam", "Masaram_Gondi", "Nandinagari", "Ol_Onal", "Oriya", "Sinhala", "Syloti_Nagri", "Takri", "Tamil", "Telugu", "Tirhuta"] ``` -## Hints -### Regex Matching +## Usage - Augmented Scripts -If you have a string and want to match a substring/character from a specific Unicode script, you actually won't need this gem. Instead, you can use the [Regexp Unicode Property Syntax `\p{}`](https://ruby-doc.org/core/Regexp.html#class-Regexp-label-Character+Properties): +Like script extensions, but adds meta scripts for Asian languages and treats _Common_/_Inherited_ values as ALL scripts. ```ruby -"Coptic letter: ⲁ".scan(/\p{Coptic}/) # => ["ⲁ"] +require "unicode/scripts" + +Unicode::Scripts.augmented_scripts("ねガ") # => ['Hira', 'Kana', 'Jpan'] +Unicode::Scripts.augmented_scripts("1") # => ["Adlm", "Aghb", "Ahom", … ] ``` -See [Idiosyncratic Ruby: Proper Unicoding](https://idiosyncratic-ruby.com/41-proper-unicoding.html) for more info. +## Usage - Resolved Script -### Script Names +Intersection of all augmented scripts per character. 
+```ruby +require "unicode/scripts" + +Unicode::Scripts.resolved_scripts("СігсӀе") # => [ 'Cyrl' ] +Unicode::Scripts.resolved_scripts("Сirсlе") # => [] +Unicode::Scripts.resolved_scripts("𝖢𝗂𝗋𝖼𝗅𝖾") # => ['Adlm', 'Aghb', 'Ahom', … ] +Unicode::Scripts.resolved_scripts("1") # => ['Adlm','Aghb', 'Ahom', … ] +Unicode::Scripts.resolved_scripts("ねガ") # => ['Hira', 'Kana', 'Jpan'] +``` + +Please note that the **resolved script** can contain multiple scripts, as per standard. + +## Usage - Mixed-Script Detection + +Mixed-script if resolved script set is empty, single-script otherwise. + +```ruby +require "unicode/scripts" + +Unicode::Scripts.mixed?("СігсӀе"); # => false +Unicode::Scripts.mixed?("Сirсlе"); # => true +Unicode::Scripts.mixed?("𝖢𝗂𝗋𝖼𝗅𝖾"); # => false +Unicode::Scripts.mixed?("1"); # => false +Unicode::Scripts.mixed?("ねガ"); # => false + +Unicode::Scripts.single?("СігсӀе"); # => true +Unicode::Scripts.single?("Сirсlе"); # => false +Unicode::Scripts.single?("𝖢𝗂𝗋𝖼𝗅𝖾"); # => true +Unicode::Scripts.single?("1"); # => true +Unicode::Scripts.single?("ねガ"); # => true +``` + +Please note that a **single-script** string might actually contain multiple scripts, as per standard (e.g. 
for Asian languages) + +### List of All Scripts + You can extract all script names from the gem like this: ```ruby require "unicode/scripts" -puts Unicode::Scripts.names +puts Unicode::Scripts.names # list of scripts +``` -# # # Output # # # +To get all 4 letter script codes (ISO 15924): -Adlam -Ahom -Anatolian_Hieroglyphs -Arabic -Armenian -Avestan -Balinese -Bamum -Bassa_Vah -Batak -Bengali -Bhaiksuki -Bopomofo -Brahmi -Braille -Buginese -Buhid -Canadian_Aboriginal -Carian -Caucasian_Albanian -Chakma -Cham -Cherokee -Chorasmian -Common -Coptic -Cuneiform -Cypriot -Cypro_Minoan -Cyrillic -Deseret -Devanagari -Dives_Akuru -Dogra -Duployan -Egyptian_Hieroglyphs -Elbasan -Elymaic -Ethiopic -Garay -Georgian -Glagolitic -Gothic -Grantha -Greek -Gujarati -Gunjala_Gondi -Gurmukhi -Gurung_Khema -Han -Hangul -Hanifi_Rohingya -Hanunoo -Hatran -Hebrew -Hiragana -Imperial_Aramaic -Inherited -Inscriptional_Pahlavi -Inscriptional_Parthian -Javanese -Kaithi -Kannada -Katakana -Katakana_Or_Hiragana -Kawi -Kayah_Li -Kharoshthi -Khitan_Small_Script -Khmer -Khojki -Khudawadi -Kirat_Rai -Lao -Latin -Lepcha -Limbu -Linear_A -Linear_B -Lisu -Lycian -Lydian -Mahajani -Makasar -Malayalam -Mandaic -Manichaean -Marchen -Masaram_Gondi -Medefaidrin -Meetei_Mayek -Mende_Kikakui -Meroitic_Cursive -Meroitic_Hieroglyphs -Miao -Modi -Mongolian -Mro -Multani -Myanmar -Nabataean -Nag_Mundari -Nandinagari -New_Tai_Lue -Newa -Nko -Nushu -Nyiakeng_Puachue_Hmong -Ogham -Ol_Chiki -Ol_Onal -Old_Hungarian -Old_Italic -Old_North_Arabian -Old_Permic -Old_Persian -Old_Sogdian -Old_South_Arabian -Old_Turkic -Old_Uyghur -Oriya -Osage -Osmanya -Pahawh_Hmong -Palmyrene -Pau_Cin_Hau -Phags_Pa -Phoenician -Psalter_Pahlavi -Rejang -Runic -Samaritan -Saurashtra -Sharada -Shavian -Siddham -SignWriting -Sinhala -Sogdian -Sora_Sompeng -Soyombo -Sundanese -Sunuwar -Syloti_Nagri -Syriac -Tagalog -Tagbanwa -Tai_Le -Tai_Tham -Tai_Viet -Takri -Tamil -Tangsa -Tangut -Telugu -Thaana -Thai -Tibetan -Tifinagh -Tirhuta -Todhri 
-Toto -Tulu_Tigalari -Ugaritic -Unknown -Vai -Vithkuqi -Wancho -Warang_Citi -Yezidi -Yi -Zanabazar_Square +```ruby +require "unicode/scripts" +puts Unicode::Scripts.names(format: :short) # list of scripts ``` -### Short Script Names -You can extract all 4 letter script names from the gem like this: +Augmented scripts: ```ruby require "unicode/scripts" -puts Unicode::Scripts.names(format: :short) +puts Unicode::Scripts.names(format: :short, augmented: :only) +``` -# # # Output # # # +You can find a list of all scripts in Unicode, with links to Wikipedia on [character.construction/scripts](https://character.construction/scripts) -Adlm -Aghb -Ahom -Arab -Armi -Armn -Avst -Bali -Bamu -Bass -Batk -Beng -Bhks -Bopo -Brah -Brai -Bugi -Buhd -Cakm -Cans -Cari -Cham -Cher -Chrs -Copt -Cpmn -Cprt -Cyrl -Deva -Diak -Dogr -Dsrt -Dupl -Egyp -Elba -Elym -Ethi -Gara -Geor -Glag -Gong -Gonm -Goth -Gran -Grek -Gujr -Gukh -Guru -Hang -Hani -Hano -Hatr -Hebr -Hira -Hluw -Hmng -Hmnp -Hrkt -Hung -Ital -Java -Kali -Kana -Kawi -Khar -Khmr -Khoj -Kits -Knda -Krai -Kthi -Lana -Laoo -Latn -Lepc -Limb -Lina -Linb -Lisu -Lyci -Lydi -Mahj -Maka -Mand -Mani -Marc -Medf -Mend -Merc -Mero -Mlym -Modi -Mong -Mroo -Mtei -Mult -Mymr -Nagm -Nand -Narb -Nbat -Newa -Nkoo -Nshu -Ogam -Olck -Onao -Orkh -Orya -Osge -Osma -Ougr -Palm -Pauc -Perm -Phag -Phli -Phlp -Phnx -Plrd -Prti -Qaac -Qaai -Rjng -Rohg -Runr -Samr -Sarb -Saur -Sgnw -Shaw -Shrd -Sidd -Sind -Sinh -Sogd -Sogo -Sora -Soyo -Sund -Sunu -Sylo -Syrc -Tagb -Takr -Tale -Talu -Taml -Tang -Tavt -Telu -Tfng -Tglg -Thaa -Thai -Tibt -Tirh -Tnsa -Todr -Toto -Tutg -Ugar -Vaii -Vith -Wara -Wcho -Xpeo -Xsux -Yezi -Yiii -Zanb -Zinh -Zyyy -Zzzz +## Hints +### Regex Matching + +If you have a string and want to match a substring/character from a specific Unicode script, you actually won't need this gem. 
Instead, you can use the [Regexp Unicode Property Syntax `\p{}`](https://ruby-doc.org/core/Regexp.html#class-Regexp-label-Character+Properties): + +```ruby +"Coptic letter: ⲁ".scan(/\p{Coptic}/) # => ["ⲁ"] ``` -See [unicode-x](https://github.com/janlelis/unicode-x) for more Unicode related micro libraries. +See [Idiosyncratic Ruby: Proper Unicoding](https://idiosyncratic-ruby.com/41-proper-unicoding.html) for more info. + +## Also See + +- JavaScript implementation (same data & algorithms): [unicode-script.js](https://github.com/janlelis/unicode-script.js) +- Index created with: [unicoder](https://github.com/janlelis/unicoder) +- Get the Unicode blocks of a string: [unicode-blocks gem](https://github.com/janlelis/unicode-blocks) +- See [unicode-x](https://github.com/janlelis/unicode-x) for more Unicode related micro libraries for Ruby. ## MIT License - Copyright (C) 2016-2024 Jan Lelis <https://janlelis.com>. Released under the MIT license. - Unicode data: https://www.unicode.org/copyright.html#Exhibit1
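As a follow-up to the regex hint above, the `\p{}` syntax can also give a rough answer to "which of these scripts appear in a string?" without the gem. This is only a sketch: the `SCRIPT_PATTERNS` table and the `scripts_seen` helper are hypothetical names, the selection of scripts is arbitrary, and `\p{}` matches the plain *Script* property, so it ignores script extensions and augmented scripts.

```ruby
# Script properties checked by this sketch; extend the table as needed.
SCRIPT_PATTERNS = {
  "Latin"    => /\p{Latin}/,
  "Cyrillic" => /\p{Cyrillic}/,
  "Greek"    => /\p{Greek}/,
  "Coptic"   => /\p{Coptic}/,
}.freeze

# Return the names of all listed scripts that occur at least once.
def scripts_seen(string)
  SCRIPT_PATTERNS.select { |_name, pattern| string.match?(pattern) }.keys
end

# U+2C81 is COPTIC SMALL LETTER ALFA, the character from the example above.
p scripts_seen("Coptic letter: \u{2C81}")
```

For anything beyond a fixed handful of scripts, the gem's own lookup methods are the better fit.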