This is an extension and modification of the standard String class. We do a lot of UTF-8 character processing in the parser. Ruby 1.8 does not have good enough UTF-8 support, and Ruby 1.9 handles UTF-8 characters only as Strings, which is very inefficient compared to representing them as Fixnum objects. Some of these hacks can be removed once we support only Ruby 1.9.
Replacement for the built-in << operator that also accepts Fixnum values above 255 (multi-byte UTF-8 characters whose bytes are packed into a single Fixnum).
```ruby
# File lib/UTF8String.rb, line 59
def <<(obj)
  if obj.is_a?(String) || (obj < 256)
    # In this case we can use the built-in concat.
    concat(obj)
  else
    # UTF-8 characters have a maximum length of 4 bytes and no byte is 0.
    mask = 0xFF000000
    pos = 3
    while pos >= 0
      # Use the built-in concat operator for each byte.
      concat((obj & mask) >> (8 * pos)) if (obj & mask) != 0
      # Move mask and position to the next byte.
      mask = mask >> 8
      pos -= 1
    end
    # Return the receiver, like the built-in <<, so that calls can be chained.
    self
  end
end
```
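The mask-and-shift loop can be tried in isolation. The following is a minimal sketch, assuming that a multi-byte character is stored with its UTF-8 bytes packed into one Integer, most significant byte first, as the loop above implies. The helper names `pack_utf8_char` and `unpack_utf8_char` are hypothetical and are not part of UTF8String.rb.

```ruby
# Hypothetical helpers for illustration; not part of UTF8String.rb.

# Pack the UTF-8 bytes of a single character into one Integer,
# most significant byte first (assumed representation).
def pack_utf8_char(char)
  char.bytes.inject(0) { |acc, b| (acc << 8) | b }
end

# Split a packed character back into its bytes. Like the loop in <<,
# walk from the highest of four bytes down and skip empty (zero) bytes.
def unpack_utf8_char(packed)
  bytes = []
  3.downto(0) do |pos|
    byte = (packed >> (8 * pos)) & 0xFF
    bytes << byte unless byte == 0
  end
  bytes
end

packed = pack_utf8_char("\u00E9")    # U+00E9 is encoded as the bytes C3 A9
bytes  = unpack_utf8_char(packed)    # the same two bytes, highest first
puts bytes.pack('C*').force_encoding('UTF-8')
```

Skipping zero bytes is safe here because a multi-byte UTF-8 sequence never contains a zero byte, so a zero can only be unused padding above a shorter character.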
Iterate over the String, calling the block once for each UTF-8 character. This implementation looks more awkward, but it is noticeably faster than the frequently recommended regexp-based implementations.
```ruby
# File lib/UTF8String.rb, line 28
def each_utf8_char
  c = ''
  length = 0
  each_byte do |b|
    c << b
    if length > 0
      # subsequent unicode byte
      if (length -= 1) == 0
        # end of unicode character reached
        yield c
        c = ''
      end
    elsif (b & 0xC0) == 0xC0
      # first unicode byte: the number of leading 1 bits is the total byte
      # count, so start at -1 to end up with the number of bytes to follow.
      length = -1
      while (b & 0x80) != 0
        length += 1
        b = b << 1
      end
    else
      # ASCII character
      yield c
      c = ''
    end
  end
end
```
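The lead-byte logic can be checked against a regexp-based split. This is a minimal standalone sketch, not the library's method: the name `utf8_chars` is hypothetical, it collects characters into an Array instead of yielding them, and it assumes well-formed UTF-8 input (stray continuation bytes are not detected).

```ruby
# Hypothetical standalone variant for illustration; assumes valid UTF-8 input.
def utf8_chars(str)
  chars = []
  current = []     # bytes of the character being assembled
  remaining = 0    # continuation bytes still expected
  str.each_byte do |b|
    current << b
    if remaining > 0
      # Continuation byte of a multi-byte character.
      remaining -= 1
    elsif (b & 0xC0) == 0xC0
      # Lead byte: the number of leading 1 bits is the total byte count,
      # so the number of continuation bytes is one less.
      remaining = -1
      while (b & 0x80) != 0
        remaining += 1
        b <<= 1
      end
    end
    # An ASCII byte or the last continuation byte completes a character.
    if remaining == 0
      chars << current.pack('C*').force_encoding('UTF-8')
      current = []
    end
  end
  chars
end

text = "a\u00E9\u20AC"               # one-, two- and three-byte characters
puts utf8_chars(text) == text.scan(/./mu)
```

The `scan(/./mu)` call stands in for the regexp-based approach mentioned above; both produce the same list of characters for valid input.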
Return the number of UTF-8 characters in the String. We don’t override the built-in length() method here because we don’t know which other code relies on it, or for what purpose.
```ruby
# File lib/UTF8String.rb, line 80
def length_utf8
  len = 0
  each_utf8_char { |c| len += 1 }
  len
end
```
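To illustrate the difference between byte count and character count, here is a small sketch that uses a different technique from the method above: instead of iterating over whole characters, it counts every byte that is not a continuation byte (bit pattern 10xxxxxx). The helper name `utf8_length` is hypothetical, and valid UTF-8 input is assumed.

```ruby
# Hypothetical helper for illustration; assumes valid UTF-8 input.
# Every character contributes exactly one byte that is not a continuation byte.
def utf8_length(str)
  str.each_byte.count { |b| (b & 0xC0) != 0x80 }
end

text = "na\u00EFve"         # five characters, one of which takes two bytes
puts text.bytesize          # number of bytes
puts utf8_length(text)      # number of characters
```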
UTF-8 aware version of reverse that replaces the built-in one.
```ruby
# File lib/UTF8String.rb, line 89
def reverse
  a = []
  each_utf8_char { |c| a << c }
  a.reverse.join
end
```
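A short sketch of why a byte-wise reverse is not enough for UTF-8 text. It does not use the method above; the character split is done with `scan(/./mu)` purely for illustration, and the variable names are hypothetical.

```ruby
text = "a\u00E9"   # bytes 61 C3 A9

# Reversing raw bytes puts the continuation byte A9 before its lead byte C3,
# which is no longer a valid UTF-8 sequence.
byte_reversed = text.bytes.to_a.reverse.pack('C*').force_encoding('UTF-8')

# Reversing whole characters keeps each multi-byte sequence intact.
char_reversed = text.scan(/./mu).reverse.join

puts byte_reversed.valid_encoding?
puts char_reversed == "\u00E9a"
```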