= loose_tight_dictionary
Match things based on string similarity (using the Pair Distance algorithm) and regular expressions.
== Quickstart
>> require 'loose_tight_dictionary'
=> true
>> LooseTightDictionary.new(%w{seamus andy ben}).find('Shamus')
=> "seamus"
== String similarity matching
Exclusively uses {Dice's Coefficient}[http://en.wikipedia.org/wiki/Dice's_coefficient] algorithm (aka Pair Distance).
== Production use
Over 2 years in {Brighter Planet's environmental impact API}[http://impact.brighterplanet.com] and {reference data service}[http://data.brighterplanet.com].
== Haystacks and how to read them
The (admittedly imperfect) metaphor is "look for a needle in a haystack"
* needle - the search term
* haystack - the records you are searching (your result will be an object from here)
So, what if your needle is a string like youruguay and your haystack is full of Country objects like ?
>> LooseTightDictionary.new(countries, :read => :name).find('youruguay')
=>
== Regular expressions
You can improve the default matchings with regular expressions.
* Emphasize important words using blockings and tighteners
* Filter out stop words with tighteners
* Prevent impossible matches with blockings and identities
=== Blockings
Setting a blocking of /Airbus/ ensures that strings containing "Airbus" will only be scored against to other strings containing "Airbus". A better blocking in this case would probably be /airbus/i.
=== Tighteners
Adding a tightener like /(boeing).*(7\d\d)/i will cause "BOEING COMPANY 747" and "boeing747" to be scored as if they were "BOEING 747" and "boeing 747", respectively. See also "Case sensitivity" below.
=== Identities
Adding an identity like /(F)\-?(\d50)/ ensures that "Ford F-150" and "Ford F-250" never match.
== Case sensitivity
Scoring is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.
== Examples
Check out the tests.
== Speed
If you add the amatch[http://flori.github.com/amatch/] gem to your Gemfile, it will use that, which is much faster (but {segfaults have been seen in the wild}[https://github.com/flori/amatch/issues/3]). Thanks Flori!
Otherwise, a pure ruby version derived from the {answer to a StackOverflow question}[http://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings] is used. Thanks {marzagao}[http://stackoverflow.com/users/10997/marzagao]!
== Authors
* Seamus Abshere
* Ian Hough
* Andy Rossmeissl
== Copyright
Copyright 2011 Brighter Planet, Inc.