RubygemsResearch

Sha256: 6a050a72d5e148fd46f82cc74b0c1307ff7bb5ee67383aea17c02466cc5d9633

Contents?: true

Size: 1.59 KB

Versions: 1

Compression:

Stored size: 1.59 KB

# encoding: binary

# jruby 1.6.7 (ruby-1.9.2-p312) (2012-02-22 3e82bc8) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_20) [linux-amd64-java]

# There is a letter in cyrillic that looks kind of like a capital
# H.  In the cp866 encoding (http://en.wikipedia.org/wiki/Code_page_866)
# it's represented by "\x8D" which is decimal 141. 
#
# In ruby 1.9, it _ought_ to be possible to have those bytes
# in a string, and tell ruby it's cp866. 

cp866 = "\x8D".force_encoding("IBM866")

# in MRI 1.9.3, if we inspect that, we get "\x8D", just like we expect.
# and if we look at #bytes.to_a, we get [141], just like we expect. 
puts cp866.inspect
puts cp866.bytes.to_a.inspect
# However, in jruby if we #inspect instead of getting "\x8D", 
# we get "\u008D" -- this is wrong, it's NOT that unicode codepoint.
# In jruby, bytes.to_a.inspect is still [141], it hasn't changed
# the bytes, but it's confused about what's going on. 

# We see this encoding confusion demonstrated if we try
# a String#encode. 
#
# MRI 1.9.3 is perfectly capable of transcoding this to UTF-8

utf8 = cp866.encode("UTF-8")
puts utf8.inspect # =>  in MRI displays cyrillic in terminal no prob
puts utf8.bytes.to_a.inspect # => in MRI [208, 157], proper bytes for utf8

# In jruby, puts utf8.inspect displays "\u008D", and
# utf8.bytes.to_a.inspect is [194, 141]. I don't know where the
# 191 came from, but it has NOT succesfully transcoded to utf8. 

# In other cases, the #encode will actually raise an illegal byte
# exception if the original bytes were not legal for UTF8 (or UTF16?) --
# but the original bytes were not meant to be considered unicode at all.

Version data entries

1 entries across 1 versions & 1 rubygems

Version	Path
marc-0.5.0	test/jruby_just_string.rb