Sha256: 11f76e241e53f00fd965140962a3d93fca3d1eedc628ece90451a887492214eb
Contents?: true
Size: 1.71 KB
Versions: 5
Compression:
Stored size: 1.71 KB
Contents
class String # Levenshtein distance algorithm implementation for Ruby, with UTF-8 support. # # The Levenshtein distance is a measure of how similar two strings s and t are, # calculated as the number of deletions/insertions/substitutions needed to # transform s into t. The greater the distance, the more the strings differ. # # The Levenshtein distance is also sometimes referred to as the # easier-to-pronounce-and-spell 'edit distance'. # # Calculate the Levenshtein distance between two strings +self+ and +str2+. # +self+ and +str2+ should be ASCII, UTF-8, or a one-byte-per character encoding # such as ISO-8859-*. # # The strings will be treated as UTF-8 if $KCODE is set appropriately (i.e. 'u'). # Otherwise, the comparison will be performed byte-by-byte. There is no specific support # for Shift-JIS or EUC strings. # # When using Unicode text, be aware that this algorithm does not perform normalisation. # If there is a possibility of different normalised forms being used, normalisation # should be performed beforehand. # # CREDIT: Paul Battley # def edit_distance(str2) str1 = self if $KCODE =~ /^U/i unpack_rule = 'U*' else unpack_rule = 'C*' end s = str1.unpack(unpack_rule) t = str2.unpack(unpack_rule) n = s.length m = t.length return m if (0 == n) return n if (0 == m) d = (0..m).to_a x = nil (0...n).each do |i| e = i+1 (0...m).each do |j| cost = (s[i] == t[j]) ? 0 : 1 x = [ d[j+1] + 1, # insertion e + 1, # deletion d[j] + cost # substitution ].min d[j] = e e = x end d[m] = x end return x end end
Version data entries
5 entries across 5 versions & 1 rubygems