= Harvestdor::Indexer
{Build Status}[https://travis-ci.org/sul-dlss/harvestdor-indexer]
{Coverage Status}[https://coveralls.io/r/sul-dlss/harvestdor-indexer]
{Dependency Status}[https://gemnasium.com/sul-dlss/harvestdor-indexer]
{Gem Version}[http://badge.fury.io/rb/harvestdor-indexer]
A gem to harvest meta/data from DOR, with skeleton code to index it and write it to Solr.
== Installation
Add this line to your application's Gemfile:

  gem 'harvestdor-indexer'

And then execute:

  $ bundle

Or install it yourself as:

  $ gem install harvestdor-indexer
== Usage
You must override the index method and provide configuration options. It is also recommended that you write a script to run the indexer; an example is given below.
=== Configuration / Set up
Create a YAML config file for your collection that points at your Solr index.

Note: because of an update to the underlying HTTP libraries, versions of this gem > 0.0.12 require an updated config syntax. Errors like "unknown method timeout" usually mean you are using an older version of a config file. The new configuration looks like this:
  http_options:
    ssl:
      verify: false
    # timeouts are in seconds; timeout -> open/read, open_timeout -> connection open
    request:
      timeout: 180
      open_timeout: 180
See spec/config/ap.yml for an example.
You will want to copy that file and change the following settings:
1. log_name
2. default_set
3. blacklist or whitelist if you are using them
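For orientation, a collection config combining the settings above might look roughly like the sketch below. The key names come from the list above and from the http_options example; the values and file paths are placeholders, so treat spec/config/ap.yml as the authoritative example.

  log_name: my_collection_indexer.log             # placeholder log file name
  default_set: my_collection_set                  # placeholder set of druids to harvest
  blacklist: config/my_collection_blacklist.txt   # optional; omit if you aren't using a blacklist
  http_options:
    ssl:
      verify: false
    request:
      timeout: 180
      open_timeout: 180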
Update the dor-fetcher-client.yml file in the config directory with the URL of the dor-fetcher-service provider. The default value points to port 3000 on localhost:

  dor_fetcher_service_url: http://127.0.0.1:3000
=== Override the Harvestdor::Indexer.index method
In your code, override this method from the Harvestdor::Indexer class:
  # create Solr doc for the druid and add it to Solr, unless it is on the blacklist.
  # NOTE: don't forget to send commit to Solr, either once at the end (already in harvest_and_index), or for each add, or ...
  def index druid
    if blacklist.include?(druid)
      logger.info("Druid #{druid} is on the blacklist and will have no Solr doc created")
    else
      logger.error("You must override the index method to transform druids into Solr docs and add them to Solr")
      doc_hash = {}
      doc_hash[:id] = druid
      # doc_hash[:title_tsim] = smods_rec(druid).short_title
      # you might add things from Indexer level class here
      #  (e.g. things that are the same across all documents in the harvest)
      solr_client.add(doc_hash)
      # logger.info("Just created Solr doc for #{druid}")
      # TODO: provide call to code to update DOR object's workflow datastream??
    end
  end
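The example runner script below simply does require 'your_indexer' and then instantiates Harvestdor::Indexer directly, so one way to package the override is to reopen the class in that file. The sketch below assumes that layout; the file name and the commented-out title field are illustrative only.

  # your_indexer.rb -- a minimal sketch that reopens Harvestdor::Indexer to supply an index method
  require 'harvestdor-indexer'

  class Harvestdor::Indexer
    def index druid
      if blacklist.include?(druid)
        logger.info("Druid #{druid} is on the blacklist and will have no Solr doc created")
      else
        doc_hash = {}
        doc_hash[:id] = druid
        # doc_hash[:title_tsim] = smods_rec(druid).short_title
        solr_client.add(doc_hash)
      end
    end
  end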
=== Run it
(Run bundle install first, if you haven't already.)

I suggest you write a script to run the code. Your script might look like this:
  #!/usr/bin/env ruby
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..'))
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))

  require 'rubygems'
  begin
    require 'your_indexer'
  rescue LoadError
    require 'bundler/setup'
    require 'your_indexer'
  end

  config_yml_path = ARGV.pop
  if config_yml_path.nil?
    puts "** You must provide the full path to a collection config yml file **"
    exit
  end
  client_config_path = ARGV.pop
  if client_config_path.nil?
    puts "** You must provide the full path to the dor-fetcher-client config yml file **"
    exit
  end

  opts = {}  # any additional options for Harvestdor::Indexer
  indexer = Harvestdor::Indexer.new(config_yml_path, client_config_path, opts)
  indexer.harvest_and_index
Then run the script, passing the dor-fetcher-client config file followed by your collection config file:

  ./bin/indexer config/dor-fetcher-client.yml config/(your coll).yml
I suggest you run your code on harvestdor-dev, as it is already set up to harvest from the DorFetcher.
== Contributing
1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Write code and tests.
4. Commit your changes (`git commit -am 'Added some feature'`)
5. Push to the branch (`git push origin my-new-feature`)
6. Create new Pull Request
== Releases
* 2.0.0 Complete refactor to update APIs, merge configuration yml files, update to rspec 3
* 1.0.4 Set skip_heartbeat to true in the initialization of the DorFetcher::Client for ease of testing
* 1.0.3 Implemented class level config so anything that inherits from Harvestdor::Indexer can share configuration settings
* 1.0.0 Replaced OAI harvesting mechanism with dor-fetcher
* 0.0.13 Upgrade to latest faraday HTTP client syntax; Use retries gem (https://github.com/ooyala/retries) to make retrying of index process more robust
* 0.0.12 fix total_object nil error
* 0.0.11 fix error_count and success_count, allow setting of max-tries (retry solr add if error)
* 0.0.7 adding additional logging of error, success counts, and time to index and harvest
* 0.0.6 tweak error handling for public xml pieces
* 0.0.5 make rake release a no-op
* 0.0.4 add confstruct runtime dependency
* 0.0.3 add methods for public_xml, content_metadata, identity_metadata ...
* 0.0.2 better model code for index method (thanks, Bess!)
* 0.0.1 initial commit