README.md in unicode-emoji-3.7.0 vs README.md in unicode-emoji-3.8.0

- old
+ new

@@ -1,120 +1,163 @@
 # Unicode::Emoji [![[version]](https://badge.fury.io/rb/unicode-emoji.svg)](https://badge.fury.io/rb/unicode-emoji) [![[ci]](https://github.com/janlelis/unicode-emoji/workflows/Test/badge.svg)](https://github.com/janlelis/unicode-emoji/actions?query=workflow%3ATest)
-Provides regular expressions to find Emoji in strings, incorporating the latest Unicode and Emoji standards.
+Provides regular expressions to find Emoji in strings, incorporating the latest Unicode / Emoji standards.
 Additional features:
-- A categorized list of recommended Emoji
+- A categorized list of Emoji (RGI: Recommended for General Interchange)
 - Retrieve Emoji properties info about specific codepoints (Emoji_Modifier, Emoji_Presentation, etc.)
 Emoji version: **16.0** (September 2024)
-CLDR version (used for sub-region flags): **45** (April 2024)
+CLDR version (used for sub-region flags): **46** (October 2024)
 ## Gemfile
 ```ruby
 gem "unicode-emoji"
 ```
-## Usage
+## Usage – Regex Matching
-### Regex
-
 The gem includes multiple Emoji regexes, which are compiled out of various Emoji Unicode data sources.
```ruby require "unicode/emoji" -string = "String which contains all kinds of emoji: +string = "String which contains all types of Emoji sequences: - Singleton Emoji: ๐Ÿ˜ด - Textual singleton Emoji with Emoji variation: โ–ถ๏ธ - Emoji with skin tone modifier: ๐Ÿ›Œ๐Ÿฝ - Region flag: ๐Ÿ‡ต๐Ÿ‡น - Sub-Region flag: ๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ - Keycap sequence: 2๏ธโƒฃ - Sequence using ZWJ (zero width joiner): ๐Ÿคพ๐Ÿฝโ€โ™€๏ธ - " string.scan(Unicode::Emoji::REGEX) # => ["๐Ÿ˜ด", "โ–ถ๏ธ", "๐Ÿ›Œ๐Ÿฝ", "๐Ÿ‡ต๐Ÿ‡น", "๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ", "2๏ธโƒฃ", "๐Ÿคพ๐Ÿฝโ€โ™€๏ธ"] ``` -#### Main Regexes +Depending on your exact usecase, you can choose between multiple levels of Emoji detection: -There are multiple levels of Emoji detection: +### Main Regexes Regex | Description | Example Matches | Example Non-Matches ------------------------------|-------------|-----------------|-------------------- -`Unicode::Emoji::REGEX` | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข`, `1` -`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `1` -`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ 
ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `1` -`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `1` | +`Unicode::Emoji::REGEX` | **Use this one if unsure!** Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *recommended* Emoji sequences (RGI/FQE) | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ` | `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข`, `1`, `1โƒฃ` +`Unicode::Emoji::REGEX_VALID` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *valid* Emoji sequences | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€` ,`๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `1`, `1โƒฃ` +`Unicode::Emoji::REGEX_WELL_FORMED` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji) and all kinds of *well-formed* Emoji sequences | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`,`๐ŸŒโ€โ™‚๏ธ` , `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `1`, `1โƒฃ` +`Unicode::Emoji::REGEX_POSSIBLE` | Matches all singleton Emoji, singleton components, all kinds of Emoji sequences, and even single digits (except for: unqualified keycap sequences) | `๐Ÿ˜ด`, 
`โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `1` | `1โƒฃ` -##### Picking the Right Emoji Regex +#### Include Text Emoji -- Usually you just want `REGEX` (RGI set) -- If you want broader matching (any ZJW sequences, more sub-region flags), choose `REGEX_VALID` -- Even brolader is `REGEX_WELL_FORMED`, which will also match any region flag and any tag sequence -- And then there is `REGEX_POSSIBLE` , which is a quick check for possible Emoji, which might contain false positives, [suggested in the Unicode Standard](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) +By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix: -Property | Escaped | `REGEX` (RGI / Recommended) | `REGEX_VALID` (Valid) | `REGEX_WELL_FORMED` (Well-formed) | `REGEX_POSSIBLE` ----------|---------|-----------------------------|-----------------------|-----------------------------------|----------------- -Region "๐Ÿ‡ต๐Ÿ‡น" | `\u{1F1F5 1F1F9}` | Yes | Yes | Yes | Yes -Region "๐Ÿ‡ต๐Ÿ‡ต" | `\u{1F1F5 1F1F5}` | No | No | Yes | Yes -Tag Sequence "๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ" | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | Yes | Yes | Yes | Yes -Tag Sequence "๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ" | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | No | Yes | Yes | Yes -Tag Sequence "๐Ÿ˜ด๓ ง๓ ข๓ ก๓ ก๓ ก๓ ฟ" | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | No | No | Yes | Yes -ZWJ Sequence "๐Ÿคพ๐Ÿฝโ€โ™€๏ธ" | `\u{1F93E 1F3FD 200D 2640 FE0F}` | Yes | Yes | Yes | Yes -ZWJ Sequence "๐Ÿค โ€๐Ÿคข" | `\u{1F920 200D 1F922}` | No | Yes | Yes | Yes +Regex | Description | Example 
Matches | Example Non-Matches +------------------------------|-------------|-----------------|-------------------- +`Unicode::Emoji::REGEX_INCLUDE_TEXT` | `REGEX` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `1โƒฃ` | `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข`, `1` +`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `1โƒฃ` | `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `1` +`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `1โƒฃ` | `๐Ÿป`, `1` -Please see [the standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations. +#### Minimally-qualified and Unqualified Sequences -More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/). 
+Regex | Description | Example Matches | Example Non-Matches +------------------------------|-------------|-----------------|-------------------- +`Unicode::Emoji::REGEX_INCLUDE_MQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors, where the first partial Emoji has all required Variation Selectors | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€` | `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข`, `1`, `1โƒฃ` +`Unicode::Emoji::REGEX_INCLUDE_MQE_UQE` | Like `REGEX`, but additionally includes Emoji with missing Emoji Presentation Variation Selectors | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข`, `1`, `1โƒฃ` +[List of MQE and UQE Emoji sequences](https://character.construction/unqualified-emoji) + #### Singleton Regexes Matches only simple one-codepoint (+ optional variation selector) Emoji: Regex | Description | Example Matches | Example Non-Matches ------------------------------|-------------|-----------------|-------------------- -`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `๐Ÿ˜ด`, `โ–ถ๏ธ` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `๐Ÿ‡ต๐Ÿ‡ต`,`2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `1` -`Unicode::Emoji::REGEX_TEXT` | Matches only textual singleton Emoji (except for singleton components, like digits) | `๐Ÿ˜ด๏ธŽ`, `โ–ถ` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿป`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `๐Ÿ‡ต๐Ÿ‡ต`,`2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `1` 
+`Unicode::Emoji::REGEX_BASIC` | Matches (non-textual) singleton Emoji (except for singleton components, like a skin tone modifier without base Emoji), but no sequences at all | `๐Ÿ˜ด`, `โ–ถ๏ธ` | `๐Ÿ˜ด๏ธŽ`, `โ–ถ`, `๐Ÿป`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `๐Ÿ‡ต๐Ÿ‡ต`,`2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข`, `1` +`Unicode::Emoji::REGEX_TEXT` | Matches only textual singleton Emoji (except for singleton components, like digits) | `๐Ÿ˜ด๏ธŽ`, `โ–ถ` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿป`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `๐Ÿ‡ต๐Ÿ‡ต`,`2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿคพ๐Ÿฝโ€โ™€`, `๐ŸŒโ€โ™‚๏ธ`, `๐Ÿค โ€๐Ÿคข`, `1` -#### Include Textual Emoji +Here is a list of all Emoji that can be matched using the two regexes: [character.construction/emoji-vs-text](https://character.construction/emoji-vs-text) -By default, textual Emoji (emoji characters with text variation selector or those that have a default text presentation) will not be included in the default regexes (except in `REGEX_POSSIBLE`). However, if you wish to match for them too, you can include them in your regex by appending the `_INCLUDE_TEXT` suffix: +While `REGEX_BASIC` is part of the above regexes, `REGEX_TEXT` is only included in the `*_INCLUDE_TEXT` or `*_UQE` variants. 
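The split between `REGEX_BASIC` and `REGEX_TEXT` described above comes down to the variation selector that follows a character (U+FE0F for Emoji presentation, U+FE0E for text presentation) and, when there is none, to the character's default presentation. A minimal sketch in plain Ruby that does not use this gem: the `presentation` helper is hypothetical, and it assumes a Ruby version whose regex engine knows the `\p{Emoji_Presentation}` property and an input that is a single Emoji character with an optional selector.

```ruby
# Hypothetical helper (not part of unicode-emoji): reports how a single Emoji
# character, optionally followed by a variation selector, would be presented.
def presentation(char)
  last = char.codepoints.last
  return :emoji if last == 0xFE0F # explicit Emoji presentation selector
  return :text  if last == 0xFE0E # explicit text presentation selector

  # No selector: fall back to the character's default presentation
  char.match?(/\A\p{Emoji_Presentation}\z/) ? :emoji : :text
end

p presentation("\u{1F634}")      # => :emoji
p presentation("\u{1F634 FE0E}") # => :text
p presentation("\u{25B6}")       # => :text
p presentation("\u{25B6 FE0F}")  # => :emoji
```

In this sketch, the two `:emoji` inputs are the kind of strings the table lists as `REGEX_BASIC` matches, and the two `:text` inputs are the kind listed as `REGEX_TEXT` matches.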
-Regex | Description | Example Matches | Example Non-Matches -------------------------------|-------------|-----------------|-------------------- -`Unicode::Emoji::REGEX_INCLUDE_TEXT` | `REGEX` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ` | `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿค โ€๐Ÿคข` -`Unicode::Emoji::REGEX_VALID_INCLUDE_TEXT` | `REGEX_VALID` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ` | `๐Ÿป`, `๐Ÿ‡ต๐Ÿ‡ต` -`Unicode::Emoji::REGEX_WELL_FORMED_INCLUDE_TEXT` | `REGEX_WELL_FORMED` + `REGEX_TEXT` | `๐Ÿ˜ด`, `โ–ถ๏ธ`, `๐Ÿ›Œ๐Ÿฝ`, `๐Ÿ‡ต๐Ÿ‡น`, `2๏ธโƒฃ`, `๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ`, `๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ`, `๐Ÿคพ๐Ÿฝโ€โ™€๏ธ`, `๐Ÿค โ€๐Ÿคข`, `๐Ÿ‡ต๐Ÿ‡ต`, `๐Ÿ˜ด๏ธŽ`, `โ–ถ` | `๐Ÿป` +### Comparison -#### Extended Pictographic Regex +1) Fully-qualified RGI Emoji ZWJ sequence +2) Minimally-qualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selectors, but not in the first Emoji character) +3) Unqualified RGI Emoji ZWJ sequence (lacks Emoji Presentation Selector, including in the first Emoji character). Unqualified Emoji include all basic Emoji in Text Presentation (see column 11/12). 
+4) Non-RGI Emoji ZWJ sequence
+5) Valid Region made from a pair of Regional Indicators
+6) Any Region made from a pair of Regional Indicators
+7) RGI Flag Emoji Tag Sequences (England, Scotland, Wales)
+8) Valid Flag Emoji Tag Sequences (any known subdivision)
+9) Any Emoji Tag Sequences (any tag sequence with any base)
+10) Basic Default Emoji Presentation Characters or Text characters with Emoji Presentation Selector
+11) Basic Default Text Presentation Characters or Basic Emoji with Text Presentation Selector
+12) Non-Emoji (unqualified) keycap
+Regex | 1 RGI/FQE | 2 RGI/MQE | 3 RGI/UQE | 4 Non-RGI | 5 Valid Region | 6 Any Region | 7 RGI Tag | 8 Valid Tag | 9 Any Tag | 10 Basic Emoji | 11 Basic Text | 12 Text Keycap
+-|-|-|-|-|-|-|-|-|-|-|-|-
+REGEX | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE TEXT | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX INCLUDE MQE | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX INCLUDE MQE UQE | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅
+REGEX VALID | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌
+REGEX VALID INCLUDE TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅
+REGEX WELL FORMED | ✅ | ✅ | (✅)¹ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌
+REGEX WELL FORMED INCLUDE TEXT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
+REGEX POSSIBLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌
+REGEX BASIC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌
+REGEX TEXT | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅
+
+¹ Matches all unqualified Emoji, except for textual singleton Emoji (see columns 11, 12)
+
+See [spec files](/spec) for detailed examples about which regex matches which kind of Emoji.
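Columns 1 to 3 of the comparison differ only in which Emoji presentation selectors (U+FE0F) are present in a sequence. The plain-Ruby sketch below illustrates that distinction for a known fully-qualified sequence. `qualification` is a hypothetical helper, not part of this gem, and it only inspects the selector directly after the first character, which is a simplification of the rules in the Unicode standard.

```ruby
# Hypothetical helper (not part of unicode-emoji): compares a candidate with a
# known fully-qualified sequence and names the qualification level.
def qualification(candidate, fully_qualified)
  vs16 = 0xFE0F # Emoji presentation selector
  cand = candidate.codepoints
  full = fully_qualified.codepoints

  return :fully_qualified if cand == full
  # Anything that differs by more than dropped selectors is out of scope here
  return :different_sequence unless cand - [vs16] == full - [vs16]

  # Simplification: the first character counts as qualified if it needs no
  # selector, or if the candidate kept the selector right after it
  first_qualified = full[1] != vs16 || cand[1] == vs16
  first_qualified ? :minimally_qualified : :unqualified
end

handball = "\u{1F93E 1F3FD 200D 2640 FE0F}" # fully-qualified, as in the README
golfer   = "\u{1F3CC FE0F 200D 2642 FE0F}"  # fully-qualified "man golfing"

p qualification(handball, handball)                    # => :fully_qualified
p qualification("\u{1F93E 1F3FD 200D 2640}", handball) # => :minimally_qualified
p qualification("\u{1F3CC 200D 2642 FE0F}", golfer)    # => :unqualified
```

The second call drops only the trailing selector, so the first character is unaffected (column 2). The third call drops the selector that belongs to the first character (column 3).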
+ +### Picking the Right Emoji Regex + +- Usually you just want `REGEX` (recommended Emoji set, RGI) +- Use `REGEX_INCLUDE_MQE` or `REGEX_INCLUDE_MQE_UQE` if you want to catch Emoji sequences with missing Variation Selectors. +- If you want broader matching (any ZWJ sequences, more sub-region flags), choose `REGEX_VALID` +- If you need to match any region flag and any tag sequence, choose `REGEX_WELL_FORMED` +- Use the `_INCLUDE_TEXT` suffix with any of the above base regexes, if you want to also match basic textual Emoji +- And finally, there is also the option to use `REGEX_POSSIBLE`, which is a simplified test for possible Emoji, comparable to `REGEX_WELL_FORMED*`. It might contain false positives, however, the regex is less complex and [suggested in the Unicode standard itself](https://www.unicode.org/reports/tr51/#EBNF_and_Regex) as a first check. + +### Examples + +Desc | Emoji | Escaped | `REGEX` (RGI/FQE) | `REGEX_INCLUDE_MQE` (RGI/MQE) | `REGEX_VALID` | `REGEX_WELL_FORMED` / `REGEX_POSSIBLE` +-----|-------|---------|---------------|-----------------------|-----------------------------------|----------------- +RGI ZWJ Sequence | ๐Ÿคพ๐Ÿฝโ€โ™€๏ธ | `\u{1F93E 1F3FD 200D 2640 FE0F}` | โœ… | โœ… | โœ… | โœ… +RGI ZWJ Sequence MQE | ๐Ÿคพ๐Ÿฝโ€โ™€ | `\u{1F93E 1F3FD 200D 2640}` | โŒ | โœ… | โœ… | โœ… +Valid ZWJ Sequence, Non-RGI | ๐Ÿค โ€๐Ÿคข | `\u{1F920 200D 1F922}` | โŒ | โŒ | โœ… | โœ… +Known Region | ๐Ÿ‡ต๐Ÿ‡น | `\u{1F1F5 1F1F9}` | โœ… | โœ… | โœ… | โœ… +Unknown Region | ๐Ÿ‡ต๐Ÿ‡ต | `\u{1F1F5 1F1F5}` | โŒ | โŒ | โŒ | โœ… +RGI Tag Sequence | ๐Ÿด๓ ง๓ ข๓ ณ๓ ฃ๓ ด๓ ฟ | `\u{1F3F4 E0067 E0062 E0073 E0063 E0074 E007F}` | โœ… | โœ… | โœ… | โœ… +Valid Tag Sequence | ๐Ÿด๓ ง๓ ข๓ ก๓ ง๓ ข๓ ฟ | `\u{1F3F4 E0067 E0062 E0061 E0067 E0062 E007F}` | โŒ | โŒ | โœ… | โœ… +Well-formed Tag Sequence | ๐Ÿ˜ด๓ ง๓ ข๓ ก๓ ก๓ ก๓ ฟ | `\u{1F634 E0067 E0062 E0061 E0061 E0061 E007F}` | โŒ | โŒ | โŒ | โœ… + +Please see [the 
standard](https://www.unicode.org/reports/tr51/#Emoji_Sets) for more details, examples, explanations.
+
+More info about valid vs. recommended Emoji can also be found in this [blog article on Emojipedia](https://blog.emojipedia.org/unicode-behind-the-curtain/).
+
+### Extended Pictographic Regex
+
 `Unicode::Emoji::REGEX_PICTO` matches single codepoints with the **Extended_Pictographic** property. For example, it will match `✀` BLACK SAFETY SCISSORS.
 `Unicode::Emoji::REGEX_PICTO_NO_EMOJI` matches single codepoints with the **Extended_Pictographic** property, but excludes Emoji characters.
 See [character.construction/picto](https://character.construction/picto) for a list of all non-Emoji pictographic characters.
-#### Partial Regexes
+### Partial Regexes
-Matches potential Emoji parts (often, this is not what you want):
+`Unicode::Emoji::REGEX_ANY`, same as `\p{Emoji}`. Deprecated: Will be removed or renamed in the future.
-Regex | Description | Example Matches | Example Non-Matches
-------------------------------|-------------|-----------------|--------------------
-`Unicode::Emoji::REGEX_ANY` | Matches any Emoji-related codepoint (but no variation selectors, tags, or zero-width joiners). Please not that this will match Emoji-parts rather than complete Emoji, for example, single digits!
| `😴`, `▶`, `🏻`, `🛌`, `🏽`, `🇵`, `🇹`, `2`, `🏴`, `🤾`, `♀`, `🤠`, `🤢` |
-
+## Usage – List
+Use `Unicode::Emoji::LIST` or the **list** method to get a ordered and categorized list of Emoji:
-### List
-
-Use `Unicode::Emoji::LIST` or the list method to get a grouped (and ordered) list of Emoji:
-
 ```ruby
 Unicode::Emoji.list.keys
 # => ["Smileys & Emotion", "People & Body", "Component", "Animals & Nature", "Food & Drink", "Travel & Places", "Activities", "Objects", "Symbols", "Flags"]
 Unicode::Emoji.list("Food & Drink").keys
@@ -122,16 +165,16 @@
 Unicode::Emoji.list("Food & Drink", "food-asian")
 => ["🍱", "🍘", "🍙", "🍚", "🍛", "🍜", "🍝", "🍠", "🍢", "🍣", "🍤", "🍥", "🥮", "🍡", "🥟", "🥠", "🥡"]
 ```
-Please note that categories might change with future versions of the Emoji standard. This gem will issue warnings when attempting to retrieve old categories using the `#list` method.
+Please note that categories might change with future versions of the Emoji standard, although this has not happened often.
 A list of all Emoji (generated from this gem) can be found at [character.construction/emoji](https://character.construction/emoji).
-### Properties
+## Usage – Properties Data
-Allows you to access the codepoint data form Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
+Allows you to access the codepoint data for a single character form Unicode's [emoji-data.txt](https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt) file:
 ```ruby
 require "unicode/emoji"
 Unicode::Emoji.properties "☝" # => ["Emoji", "Emoji_Modifier_Base"]