= Harvestdor::Indexer {Build Status}[https://travis-ci.org/sul-dlss/harvestdor-indexer] {Coverage Status}[https://coveralls.io/r/sul-dlss/harvestdor-indexer] {Dependency Status}[https://gemnasium.com/sul-dlss/harvestdor-indexer] {Gem Version}[http://badge.fury.io/rb/harvestdor-indexer]

A Gem to harvest meta/data from DOR and the skeleton code to index it and write to Solr.

== Installation

Add this line to your application's Gemfile:

  gem 'harvestdor-indexer'

And then execute:

  $ bundle

Or install it yourself as:

  $ gem install harvestdor-indexer

== Usage

You must override the index method and provide configuration options. It is also recommended to write a script to run it; an example is below.

=== Configuration / Set up

Create a yml config file for your collection going to a Solr index. See spec/config/ap.yml for an example. You will want to copy that file and change the following settings (a minimal sketch of such a file appears at the end of this section):

* whitelist
* dor fetcher service_url
* solr url
* harvestdor log_dir, log_name

Note: Because of an update to underlying HTTP libraries, versions of this gem above 0.0.12 require an updated syntax. Errors like "unknown method timeout" may mean you are using an older version of the config file. The new configuration looks like this:

  http_options:
    ssl:
      verify: false
    # timeouts are in seconds; timeout -> open/read, open_timeout -> connection open
    request:
      timeout: 180
      open_timeout: 180

==== Whitelist

The whitelist is how you specify which objects to index. The whitelist can be

* an Array of druids inline in the config yml file
* a filename containing a list of druids (one per line)

Based on the object's identityMetadata (per its purl page), if a druid in the whitelist is for a

* collection record: we process all the item druids in that collection (as if they were included individually in the whitelist)
* non-collection record: we process the druid as an individual item
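For the file form, the whitelist is a plain-text file with one druid per line. The druids below are hypothetical placeholders; match the druid format used in spec/config/ap.yml:

  druid:oo000oo0001
  druid:oo000oo0002
  druid:oo000oo0003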
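Putting the settings above together, a minimal config yml might look something like the sketch below. The key nesting (dor_fetcher, solr, harvestdor) and all of the values shown are illustrative assumptions; spec/config/ap.yml remains the authoritative example:

  # hypothetical collection config - all urls, druids, and file names are placeholders
  whitelist:
    - druid:oo000oo0001          # or the path to a file of druids, one per line
  dor_fetcher:
    service_url: http://your-dor-fetcher-service.example.org
  solr:
    url: http://your-solr.example.org/solr/your_collection
  harvestdor:
    log_dir: logs
    log_name: your_collection.log
  http_options:
    ssl:
      verify: false
    request:
      timeout: 180
      open_timeout: 180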
=== Override the Harvestdor::Indexer.index method

In your code, override this method from the Harvestdor::Indexer class:

  # create Solr doc for the druid and add it to Solr
  #  NOTE: don't forget to send commit to Solr, either once at end (already in harvest_and_index), or for each add, or ...
  def index resource
    benchmark "Indexing #{resource.druid}" do
      logger.debug "About to index #{resource.druid}"
      doc_hash = {}
      doc_hash[:id] = resource.druid
      # you might add things from Indexer level class here
      #  (e.g. things that are the same across all documents in the harvest)
      solr.add doc_hash
      # TODO: provide call to code to update DOR object's workflow datastream??
    end
  end

=== Run it

Run `bundle install` first. You may want to write a script to run the code; your script might look like this:

  #!/usr/bin/env ruby
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..'))
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
  require 'rubygems'
  begin
    require 'your_indexer'
  rescue LoadError
    require 'bundler/setup'
    require 'your_indexer'
  end
  config_yml_path = ARGV.pop
  if config_yml_path.nil?
    puts "** You must provide the full path to a collection config yml file **"
    exit
  end
  indexer = Harvestdor::Indexer.new(config_yml_path)
  indexer.harvest_and_index

Then you run the script like so:

  ./bin/indexer config/(your coll).yml

Run it from a deployed instance, as that box is already set up to talk to the DOR Fetcher service and to SUL Solr indexes.

== Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Write code and tests.
4. Commit your changes (`git commit -am 'Added some feature'`)
5. Push to the branch (`git push origin my-new-feature`)
6. Create new Pull Request

== Releases

* 2.0.0 Complete refactor to update APIs, merge configuration yml files, update to rspec 3
* 1.0.4 Set skip_heartbeat to true in the initialization of the DorFetcher::Client for ease of testing
* 1.0.3 Implemented class level config so anything that inherits from Harvestdor::Indexer can share configuration settings
* 1.0.0 Replaced OAI harvesting mechanism with dor-fetcher
* 0.0.13 Upgrade to latest faraday HTTP client syntax; use the retries gem (https://github.com/ooyala/retries) to make retrying of the index process more robust
* 0.0.12 Fix total_object nil error
* 0.0.11 Fix error_count and success_count; allow setting of max-tries (retry solr add on error)
* 0.0.7 Add additional logging of error and success counts, and time to index and harvest
* 0.0.6 Tweak error handling for public xml pieces
* 0.0.5 Make rake release a no-op
* 0.0.4 Add confstruct runtime dependency
* 0.0.3 Add methods for public_xml, content_metadata, identity_metadata ...
* 0.0.2 Better model code for index method (thanks, Bess!)
* 0.0.1 Initial commit