= XapianDb
== What's in the box?
XapianDb is a Ruby gem that combines features of NoSQL databases and fulltext indexing into one piece. The result: Rich documents and very fast queries. It is based on {Xapian}[http://xapian.org/], an efficient and powerful indexing library.
XapianDb is inspired by {xapian-fu}[https://github.com/johnl/xapian-fu] and {xapit}[https://github.com/ryanb/xapit].
Thank you John and Ryan for your great work. It helped me learn to understand the Xapian library, and I borrowed an idea
or two from you ;-)
== Why yet another indexing gem?
In the good old days I used {ferret}[https://github.com/dbalmain/ferret] and {acts_as_ferret}[https://github.com/jkraemer/acts_as_ferret]
as my fulltext indexing solution and everything was fine. But time moved on and Ferret didn't.
So I started to rethink fulltext indexing again. I looked for something that
* is under active development
* is fast
* is lightweight and easy to install / deploy
* is framework and database agnostic and works with pure POROs (plain old Ruby objects)
* is configurable anywhere, not just inside the model classes; I think that index configurations should not be part of the domain model
* supports document configuration at the class level, not the database level; each class has its own document structure
* integrates with popular Ruby / Rails ORMs like ActiveRecord or Datamapper through a plugin architecture
* returns rich document objects that do not necessarily need a database roundtrip to render the search results (but know how to get the underlying object, if needed)
* updates the index in real time (no scheduled reindexing jobs)
* supports all major features of a full text indexer, namely wildcards!!
I tried hard but I couldn't find such a thing so I decided to write it, based on the Xapian library.
== Getting started
If you want to use xapian_db in a Rails app, you need Rails 3 or newer.
=== Install Xapian if not already installed
To use xapian_db, make sure you have the Xapian library and Ruby bindings installed. At the time of this writing, the newest release of Xapian was 1.2.3. You might
want to adjust the URLs below to load the most current release of Xapian.
The example code works for OSX. On Linux you might want to use wget instead of curl.
A future release of xapian_db might include the Xapian binaries and make this step obsolete.
==== Install Xapian
curl -O http://oligarchy.co.uk/xapian/1.2.3/xapian-core-1.2.3.tar.gz
tar xzvf xapian-core-1.2.3.tar.gz
cd xapian-core-1.2.3
./configure --prefix=/usr/local
make
sudo make install
==== Install ruby bindings for Xapian
curl -O http://oligarchy.co.uk/xapian/1.2.3/xapian-bindings-1.2.3.tar.gz
tar xzvf xapian-bindings-1.2.3.tar.gz
cd xapian-bindings-1.2.3
./configure --prefix=/usr/local XAPIAN_CONFIG=/usr/local/bin/xapian-config
make
sudo make install
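After both installs, a quick check from Ruby can confirm that the bindings are on the load path. The following is a minimal sketch; the helper name xapian_bindings_version is made up for this example, and the call to Xapian.version_string is guarded because the sketch does not assume a particular bindings release.

```ruby
# Hypothetical helper (not part of xapian_db): returns a version string
# if the Xapian Ruby bindings can be loaded, nil otherwise.
def xapian_bindings_version
  require "xapian"
  # Guarded lookup: fall back to "unknown" if the bindings do not expose it.
  Xapian.respond_to?(:version_string) ? Xapian.version_string : "unknown"
rescue LoadError
  nil
end

version = xapian_bindings_version
if version
  puts "Xapian bindings loaded (version: #{version})"
else
  puts "Xapian bindings not found; check the install steps above"
end
```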
For a first look, see the examples in the examples folder. The simple Ruby script basic.rb shows the basic
usage of XapianDb without Rails. In the basic_rails folder you'll find a very simple Rails app using XapianDb.
The following steps assume that you are using xapian_db within a Rails app.
=== Configure your databases
Without a config file, xapian_db creates the database in the db folder for development and production
environments. If you are in the test environment, xapian_db creates an in-memory database.
It assumes you are using ActiveRecord.
You can override these defaults by placing a config file named 'xapian_db.yml' into your config folder. Here's an example:
# XapianDb configuration
defaults: &defaults
  adapter: datamapper # Available adapters: :active_record, :datamapper
  language: de        # Global language; can be overridden for specific blueprints
development:
  database: db/xapian_db/development
  <<: *defaults
test:
  database: ":memory:" # Use an in-memory database for tests
  <<: *defaults
production:
  database: db/xapian_db/production
  <<: *defaults
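To see what the anchor (&defaults) and the merge key (<<) in such a file resolve to, you can load the YAML with Ruby's standard library. This is only a sketch: the helper settings_for and the constant XAPIAN_DB_YAML are invented for the example, and it says nothing about how xapian_db itself reads the file.

```ruby
require "yaml"

# Same structure as the example file, inlined so the sketch is self-contained.
XAPIAN_DB_YAML = <<~YAML
  defaults: &defaults
    adapter: datamapper
    language: de
  development:
    database: db/xapian_db/development
    <<: *defaults
  test:
    database: ":memory:"
    <<: *defaults
YAML

# Hypothetical helper (not part of xapian_db): settings for one environment.
def settings_for(environment, yaml = XAPIAN_DB_YAML)
  # aliases: true is needed so that &defaults / *defaults are accepted.
  YAML.safe_load(yaml, aliases: true).fetch(environment)
end

# The test environment gets its own database plus the shared defaults.
p settings_for("test")
```

Each environment ends up with its own database entry plus the adapter and language taken from the defaults section.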
=== Configure an index blueprint
In order to get your models indexed, you must configure a document blueprint for each class you want to index:
XapianDb::DocumentBlueprint.setup(Person) do |blueprint|
  blueprint.attribute :name, :weight => 10
  blueprint.attribute :first_name
end
The example above assumes that you have a class Person with the methods name and first_name.
Attributes will get indexed and are stored in the documents. You will be able to access the name and the first name in your search results.
If you want to index additional data but do not need access to it from a search result, use the index method:
blueprint.index :remarks, :weight => 5
If you configure a class that has a language property, XapianDb uses the language of the object when indexing it, e.g.
class Person
  attr_reader :language
end
The method must return the ISO code for the language (:en, :de, ...) as a symbol or a string. Don't worry if you have languages in your database that are not supported by Xapian. If the language is not supported, XapianDb will fall back to the global language configuration or none, if you haven't configured one.
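The fallback order described above (object language, then global language, then none) can be sketched in plain Ruby. Everything here is hypothetical: resolve_language is not a XapianDb method, and SUPPORTED_LANGUAGES is a made-up subset, since the real list depends on your Xapian build.

```ruby
# Hypothetical subset of language codes, used only for this sketch.
SUPPORTED_LANGUAGES = %w[en de fr].freeze

# Pick the language for one object: its own language if supported,
# else the global language if supported, else nil (no language).
def resolve_language(object_language, global_language = nil)
  [object_language, global_language].each do |candidate|
    code = candidate.to_s
    return code.to_sym if SUPPORTED_LANGUAGES.include?(code)
  end
  nil
end

p resolve_language(:de, :en)   # the object's own language is supported
p resolve_language("xx", :en)  # unsupported, falls back to the global language
p resolve_language("xx")       # no global language configured, so none
```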
If you want to declare multiple attributes or indexes with default options, you can do this in one statement:
XapianDb::DocumentBlueprint.setup(Person) do |blueprint|
  blueprint.attributes :name, :first_name, :profession
  blueprint.index :notes, :remarks, :cv
end
Note that you cannot add options using this mass declaration syntax (e.g. blueprint.attributes :name, :weight => 10, :first_name is not valid).
Use blocks for complex evaluations of attributes or indexed values:
XapianDb::DocumentBlueprint.setup(IndexedObject) do |blueprint|
  blueprint.attribute :complex do
    if @id == 1
      "One"
    else
      "Not one"
    end
  end
end
Place these configurations either into the corresponding class or - I prefer to have the index configurations outside
the models - into the file config/xapian_blueprints.rb.
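The block example above reads @id directly, which suggests that the block runs in the context of the object being indexed. Assuming that is the case, the mechanism can be illustrated in plain Ruby with instance_eval. This is a sketch of the Ruby technique only, not XapianDb's actual code; the names COMPLEX_VALUE and evaluate_for are made up.

```ruby
# Stand-in for an indexed class with an instance variable @id.
class IndexedObject
  def initialize(id)
    @id = id
  end
end

# A block like the one passed to blueprint.attribute :complex.
COMPLEX_VALUE = proc { @id == 1 ? "One" : "Not one" }

# Hypothetical helper: run the block with self set to the given object,
# so instance variables such as @id resolve against that object.
def evaluate_for(object, block)
  object.instance_eval(&block)
end

puts evaluate_for(IndexedObject.new(1), COMPLEX_VALUE)  # One
puts evaluate_for(IndexedObject.new(2), COMPLEX_VALUE)  # Not one
```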
=== Update the index
xapian_db injects some helper methods into your configured model classes that update the index automatically
for you when you create, save or destroy models. If you already have models that should now go into the index,
use the method rebuild_xapian_index:
Person.rebuild_xapian_index
To get info about the reindex process, use the verbose option:
Person.rebuild_xapian_index :verbose => true
In verbose mode, XapianDb will use the progressbar gem if available.
=== Query the index
A simple query looks like this:
results = XapianDb.search "Foo"
You can use wildcards and boolean operators:
results = XapianDb.search "fo* or baz"
You can query attributes:
results = XapianDb.search "name:Foo"
You can query objects of a specific class:
results = Person.search "name:Foo"
If you want to override the default of 10 docs per page, pass the :per_page argument:
results = Person.search "name:Foo", :per_page => 20
On class queries you can specify order options:
results = Person.search "name:Foo", :order => :first_name
results = Person.search "Fo*", :order => [:name, :first_name], :sort_decending => true
Please note that the order option is not available for global searches (XapianDb.search...).
=== Process the results
XapianDb.search returns a resultset object. You can access the number of hits directly:
results.size # Very fast, does not load the resulting documents
If you use a persistent database, the resultset may contain a spelling correction:
# Assuming you have at least one document containing "mouse"
results = XapianDb.search("moose")
results.spelling_suggestion # "mouse"
To access the found documents, get a page from the resultset:
page = results.paginate # Get the first page
page = results.paginate :page => 2 # Get the second page
Now you can access the documents:
doc = page.first
puts doc.indexed_class # Get the type of the indexed object as a string, e.g. "Person"
puts doc.name # We can access the configured attributes
person = doc.indexed_object # Access the object behind this doc (lazy loaded)
Use a search result with will_paginate in a view:
<%= will_paginate @results %>
=== Facets
If you want to implement a simple drilldown for your searches, you can use a global facets query:
search_expression = "Foo"
facets = XapianDb.facets(search_expression)
facets.each do |klass, count|
  puts "#{klass.name}: #{count} hits"
  # This is how you would get all documents for the facet
  # doc = klass.search search_expression
end
A global facet search always groups the results by the class of the indexed objects. There is a class level facet query syntax available, too:
search_expression = "Foo"
facets = Person.facets(:name, search_expression)
facets.each do |name, count|
  puts "#{name}: #{count} hits"
end
At the class level, any attribute can be used for a facet query.
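To make the shape of a facet result concrete, here is a toy model in plain Ruby that groups a list of hits and counts them, once by class and once by attribute. It only illustrates the kind of key/count pairs the examples above iterate over. The structs and helper names are invented, and the sketch makes no claim about how XapianDb computes facets internally.

```ruby
# Stand-ins for indexed classes; purely illustrative.
Person = Struct.new(:name)
Company = Struct.new(:name)

# Global-style facets: count hits per class.
def facets_by_class(hits)
  hits.group_by(&:class).transform_values(&:size)
end

# Class-level-style facets: count hits per value of one attribute.
def facets_by_attribute(hits, attribute)
  hits.group_by { |hit| hit.public_send(attribute) }.transform_values(&:size)
end

hits = [Person.new("Foo"), Person.new("Foo"), Person.new("Bar"), Company.new("Foo")]

facets_by_class(hits).each do |klass, count|
  puts "#{klass.name}: #{count} hits"
end

people = hits.select { |hit| hit.is_a?(Person) }
facets_by_attribute(people, :name).each do |name, count|
  puts "#{name}: #{count} hits"
end
```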
== Production setup
Since Xapian allows only one database instance to write to the index, the default setup of XapianDb will not work
with multiple app instances trying to write to the same database (you will get lock errors).
Therefore, XapianDb provides a solution based on beanstalk to overcome this.
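The idea behind a queue-based setup is that many processes may request index updates, but only one worker ever writes. The following sketch shows that single-writer pattern with Ruby's built-in Queue and threads instead of beanstalk. The class SingleWriterIndex and the method index_through_queue are hypothetical stand-ins, not XapianDb code.

```ruby
# Toy stand-in for an index that tolerates only one writer.
class SingleWriterIndex
  attr_reader :documents

  def initialize
    @documents = []
  end

  def write(doc)
    @documents << doc
  end
end

def index_through_queue(index, producers: 3, docs_per_producer: 5)
  queue = Queue.new

  # The only thread that ever touches the index.
  worker = Thread.new do
    while (doc = queue.pop) != :stop
      index.write(doc)
    end
  end

  # Several "app instances" only enqueue work; they never write directly.
  producers.times.map do |n|
    Thread.new { docs_per_producer.times { |i| queue << "doc-#{n}-#{i}" } }
  end.each(&:join)

  # All producers are done, so the stop marker is the last item in the queue.
  queue << :stop
  worker.join
  index
end

index = index_through_queue(SingleWriterIndex.new)
puts index.documents.size  # 15
```

Because every write goes through one worker, no two writers compete for the database lock.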
=== 1. Install beanstalkd
Make sure you have the {beanstalk daemon}[http://kr.github.com/beanstalkd/] installed.
==== OSX
The easiest way is to use macports or homebrew:
port install beanstalkd
brew install beanstalkd
==== Debian (Lenny)
# Add backports to /etc/apt/sources.list:
deb http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free
deb-src http://ftp.de.debian.org/debian-backports lenny-backports main contrib non-free
sudo apt-get update
sudo apt-get -t lenny-backports install libevent-1.4-2
sudo apt-get -t lenny-backports install libevent-dev
cd /tmp
curl http://xph.us/dist/beanstalkd/beanstalkd-1.4.6.tar.gz | tar zx
cd beanstalkd-1.4.6/
./configure
make
sudo make install
=== 2. Add the beanstalk-client gem to your config
gem 'beanstalk-client' # Add this to your Gemfile
bundle install
=== 3. Configure your production environment in config/xapian_db.yml
production:
  database: db/xapian_db/production
  writer: beanstalk
  beanstalk_daemon: localhost:11300
=== 4. Start the beanstalk daemon
beanstalkd -d
=== 5. Start the beanstalk worker from within your Rails app root directory
rake RAILS_ENV=production xapian_db:beanstalk_worker
Important: Do not start multiple instances of this worker task!