# CharacterSet [![Gem Version](https://badge.fury.io/rb/character_set.svg)](http://badge.fury.io/rb/character_set) [![Build Status](https://github.com/jaynetics/character_set/workflows/tests/badge.svg)](https://github.com/jaynetics/character_set/actions) [![Build Status](https://github.com/jaynetics/character_set/workflows/gouteur/badge.svg)](https://github.com/jaynetics/character_set/actions) [![codecov](https://codecov.io/gh/jaynetics/character_set/branch/master/graph/badge.svg)](https://codecov.io/gh/jaynetics/character_set) This is a C-extended Ruby gem to work with sets of Unicode codepoints. It can [read](#parseinitialize) and [write](#write) sets of codepoints in various formats and it implements the stdlib `Set` interface for them. It also offers a [way of scrubbing and scanning characters in Strings](#interact-with-strings) that is more semantic and consistently offers better performance than `Regexp` and `String` methods from the stdlib for this (see [benchmarks](./BENCHMARK.md)). Many parts can be used independently, e.g.: - `CharacterSet::Character` - `CharacterSet::ExpressionConverter` - `CharacterSet::Parser` - `CharacterSet::Writer` ## Usage ### Usage examples ```ruby CharacterSet.url_query.cover?('?a=(b$c;)') # => true CharacterSet.non_ascii.delete_in!(string) CharacterSet.emoji.sample(5) # => ["⛷", "👈", "🌞", "♑", "⛈"] ``` ### Parse/Initialize These all produce a `CharacterSet` containing `a`, `b` and `c`: ```ruby CharacterSet['a', 'b', 'c'] CharacterSet[97, 98, 99] CharacterSet.new('a'..'c') CharacterSet.new(0x61..0x63) CharacterSet.of('abacababa') CharacterSet.parse('[a-c]') CharacterSet.parse('\U00000061-\U00000063') ``` If the gems [`regexp_parser`](https://github.com/ammar/regexp_parser) and [`regexp_property_values`](https://github.com/jaynetics/regexp_property_values) are installed, `Regexp` and unicode property names can also be read. Regexp intersections, negations, and set nesting are covered, but the `i`-flag is ignored; call `#case_insensitive` on the result if needed. ```ruby CharacterSet.of(/./) # => # CharacterSet.of_property('Thai') # => # require 'character_set/core_ext/regexp_ext' /[\D&&[:ascii:]&&\p{emoji}]/.character_set.size # => 2 ``` ### Predefined utility sets `ascii`, `ascii_alnum`, `ascii_letter`, `assigned`, `bmp`, `crypt`, `emoji`, `newline`, `surrogate`, `unicode`, `url_fragment`, `url_host`, `url_path`, `url_query`, `whitespace` ```ruby CharacterSet.ascii # => # # all can be prefixed with `non_`, e.g. CharacterSet.non_ascii ``` ### Interact with Strings `CharacterSet` can replace some types of `String` handling with better performance than the stdlib. `#used_by?` and `#cover?` can replace some `Regexp#match?` calls: ```ruby CharacterSet.ascii.used_by?('Tüür') # => true CharacterSet.ascii.cover?('Tüür') # => false CharacterSet.ascii.cover?('Tr') # => true ``` `#delete_in(!)` and `#keep_in(!)` can replace `String#gsub(!)` and the like: ```ruby string = 'Tüür' CharacterSet.ascii.delete_in(string) # => 'üü' CharacterSet.ascii.keep_in(string) # => 'Tr' string # => 'Tüür' CharacterSet.ascii.delete_in!(string) # => 'üü' string # => 'üü' CharacterSet.ascii.keep_in!(string) # => '' string # => '' ``` `#count_in` and `#scan` can replace `String#count` and `String#scan`: ```ruby CharacterSet.non_ascii.count_in('Tüür') # => 2 CharacterSet.non_ascii.scan('Tüür') # => ['ü', 'ü'] ``` There is also a core extension for String interaction. ```ruby require 'character_set/core_ext/string_ext' "a\rb".character_set & CharacterSet.newline # => CharacterSet["\r"] "a\rb".uses_character_set?(CharacterSet['ä', 'ö', 'ü']) # => false "a\rb".covered_by_character_set?(CharacterSet.newline) # => false # predefined sets can also be referenced via Symbols "a\rb".covered_by_character_set?(:ascii) # => true "a\rb".delete_character_set(:newline) # => 'ab' # etc. ``` ### Manipulate Use [any Ruby Set method](https://ruby-doc.org/stdlib-2.5.1/libdoc/set/rdoc/Set.html), e.g. `#+`, `#-`, `#&`, `#^`, `#intersect?`, `#<`, `#>` etc. to interact with other sets. Use `#add`, `#delete`, `#include?` etc. to change or check for members. Where appropriate, methods take both chars and codepoints, e.g.: ```ruby CharacterSet['a'].add('b') # => CharacterSet['a', 'b'] CharacterSet['a'].add(98) # => CharacterSet['a', 'b'] CharacterSet['a'].include?('a') # => true CharacterSet['a'].include?(0x61) # => true ``` `#inversion` can be used to create a `CharacterSet` with all valid Unicode codepoints that are not in the current set: ```ruby non_a = CharacterSet['a'].inversion # => # non_a.include?('a') # => false non_a.include?('ü') # => true # surrogate pair halves are not included by default CharacterSet['a'].inversion(include_surrogates: true) # => # ``` `#case_insensitive` can be used to create a `CharacterSet` where upper/lower case codepoints are supplemented: ```ruby CharacterSet['1', 'A'].case_insensitive # => CharacterSet['1', 'A', 'a'] ``` ### Write ```ruby set = CharacterSet['a', 'b', 'c', 'j', '-'] # safely printable ASCII chars are not escaped by default set.to_s # => 'a-cj\x2D' set.to_s(escape_all: true) # => '\x61-\x63\x6A\x2D' # brackets may be added set.to_s(in_brackets: true) # => '[a-cj\x2D]' # the default escape format is Ruby/ES6 compatible, others are available set = CharacterSet['a', 'b', 'c', 'ɘ', '🤩'] set.to_s # => 'a-c\u0258\u{1F929}' set.to_s(format: 'U+') # => 'a-cU+0258U+1F929' set.to_s(format: 'Python') # => "a-c\u0258\U0001F929" set.to_s(format: 'raw') # => 'a-cɘ🤩' # or pass a block set.to_s { |char| "[#{char.codepoint}]" } # => "a-c[600][129321]" set.to_s(escape_all: true) { |c| "<#{c.hex}>" } # => "<61>-<63><258><1F929>" # disable abbreviation (grouping of codepoints in ranges) set.to_s(abbreviate: false) # => "abc\u0258\u{1F929}" # astral members require some trickery if we want to target environments # that are based on UTF-16 or "UCS-2 with surrogates", such as JavaScript. set = CharacterSet['a', 'b', '🤩', '🤪', '🤫'] # Use #to_s_with_surrogate_ranges e.g. for JavaScript: set.to_s_with_surrogate_ranges # => '(?:[ab]|\uD83E[\uDD29-\uDD2B])' # Or use #to_s_with_surrogate_alternation if such surrogate set pairs # don't work in your target environment: set.to_s_with_surrogate_alternation # => '(?:[ab]|\uD83E\uDD29|\uD83E\uDD2A|\uD83E\uDD2B)' ``` ### Other features #### Secure tokens Generate secure random strings of characters from a set: ```ruby CharacterSet.new('a'..'z').secure_token(8) # => "ugwpujmt" CharacterSet.crypt.secure_token # => "8.1w7aBT737/pMfcMoO4y2y8/=0xtmo:" ``` #### Unicode planes There are some methods to check for planes and to handle ASCII, [BMP](https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane) and astral parts: ```Ruby CharacterSet['a', 'ü', '🤩'].ascii_part # => CharacterSet['a'] CharacterSet['a', 'ü', '🤩'].ascii_part? # => true CharacterSet['a', 'ü', '🤩'].ascii_only? # => false CharacterSet['a', 'ü', '🤩'].ascii_ratio # => 0.3333333 CharacterSet['a', 'ü', '🤩'].bmp_part # => CharacterSet['a', 'ü'] CharacterSet['a', 'ü', '🤩'].astral_part # => CharacterSet['🤩'] CharacterSet['a', 'ü', '🤩'].bmp_ratio # => 0.6666666 CharacterSet['a', 'ü', '🤩'].planes # => [0, 1] CharacterSet['a', 'ü', '🤩'].plane(1) # => CharacterSet['🤩'] CharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # => false CharacterSet::Character.new('a').plane # => 0 ``` ## Contributions Feel free to send suggestions, point out issues, or submit pull requests.