= loose_tight_dictionary Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions. == Quickstart >> require 'loose_tight_dictionary' => true >> LooseTightDictionary.new(%w{seamus andy ben}).find('Shamus') => "seamus" == String similarity matching Uses {Dice's Coefficient}[http://en.wikipedia.org/wiki/Dice's_coefficient] algorithm (aka Pair Distance). If that judges two strings to be be equally similar to a third string, then Levenshtein distance is used. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ" so we invoke Levenshtein. >> require 'amatch' => true >> 'RITZ'.pair_distance_similar 'RATZ' => 0.3333333333333333 >> 'RITZ'.pair_distance_similar 'CATZ' # <-- pair distance can't tell the difference, so we fall back to levenshtein... => 0.3333333333333333 >> 'RITZ'.levenshtein_similar 'RATZ' => 0.75 >> 'RITZ'.levenshtein_similar 'CATZ' # <-- which properly shows that RATZ should win => 0.5 == Production use Over 2 years in {Brighter Planet's environmental impact API}[http://impact.brighterplanet.com] and {reference data service}[http://data.brighterplanet.com]. == Haystacks and how to read them The (admittedly imperfect) metaphor is "look for a needle in a haystack" * needle - the search term * haystack - the records you are searching (your result will be an object from here) So, what if your needle is a string like youruguay and your haystack is full of Country objects like ? >> LooseTightDictionary.new(countries, :read => :name).find('youruguay') => == Regular expressions You can improve the default matchings with regular expressions. * Emphasize important words using blockings and tighteners * Filter out stop words with tighteners * Prevent impossible matches with blockings and identities * Ignore words with stop words === Blockings Setting a blocking of /Airbus/ ensures that strings containing "Airbus" will only be scored against to other strings containing "Airbus". A better blocking in this case would probably be /airbus/i. === Tighteners Adding a tightener like /(boeing).*(7\d\d)/i will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below. === Identities Adding an identity like /(F)\-?(\d50)/ ensures that "Ford F-150" and "Ford F-250" never match. === Stop words Adding a stop word like THE ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT" == Case sensitivity Scoring is case-insensitive. Everything is downcased before scoring. This is a change from previous versions. Your regexps may still be case-sensitive, though. == Examples Check out the tests. == Speed (and who to thank for the algorithms) If you add the amatch[http://flori.github.com/amatch/] gem to your Gemfile, it will use that, which is much faster (but {segfaults have been seen in the wild}[https://github.com/flori/amatch/issues/3]). Thanks {Flori}[https://github.com/flori]! Otherwise, pure ruby versions of the string similarity algorithms derived from the {answer to a StackOverflow question}[http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings] and {the text gem}[https://github.com/threedaymonk/text/blob/master/lib/text/levenshtein.rb] are used. Thanks {marzagao}[http://stackoverflow.com/users/10997/marzagao] and {threedaymonk}[https://github.com/threedaymonk]! == Authors * Seamus Abshere * Ian Hough * Andy Rossmeissl == Copyright Copyright 2011 Brighter Planet, Inc.