README.md in traject-0.9.1 vs README.md in traject-0.10.0

- old
+ new

@@ -5,27 +5,33 @@
 Generalizable to tools for configuring mapping records to associative array data structures, and sending them somewhere.
 **Currently under development, not production ready**
+[![Gem Version](https://badge.fury.io/rb/traject.png)](http://badge.fury.io/rb/traject)
+[![Build Status](https://travis-ci.org/jrochkind/traject.png)](https://travis-ci.org/jrochkind/traject)
+
+
 ## Background/Goals
-Existing tools for indexing Marc to Solr exist, and have served many of us for many years. But I was having more and more difficulty working with the existing tools, and difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
+Existing tools for indexing Marc to Solr exist, and have served us well for many years, and have many useful things about them -- which I've tried to preserve in traject. But I was having more and more difficulty working with the existing tools, including difficulty providing the custom logic I needed in a maintainable way. I realized that for me, to create a tool with the flexibility, maintainability, and performance I wanted, I would need to do it in jruby (ruby on the JVM).
 Some goals:
 * Aim to be accessible even to non-rubyists
 * Concise and maintainable local configuration -- including an only gradual increase in difficulty to write your own simple logic.
 * Support reusable and shareable mapping logic routines.
 * Built of modular and composable elements: If you want to change part of what traject does, you should be able to do so without having to reimplement other things you don't want to change.
 * A maintainable internal architecture, well-factored with separated concerns and DRY logic. Aim to be comprehensible to newcomer developers, and well-covered by tests.
 * High performance, using multi-threaded concurrency where appropriate to maximize throughput.
 Actual throughput can depend on complexity of your mapping rules and capacity of your server(s), but I am getting throughput 2-5x greater than previous solutions.
+* Cooperate well in unix batch/pipeline, with control over output/logging of errors, proper exit codes, use of stdin/stdout, etc.
 ## Installation
-Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations.
+Traject runs under jruby (ruby on the JVM). I recommend [chruby](https://github.com/postmodern/chruby) and [ruby-install](https://github.com/postmodern/ruby-install#readme) for installing and managing ruby installations. (traject is tested
+and supported for ruby 1.9 -- recent versions of jruby should run under 1.9 mode by default).
 Then just `gem install traject`.
 ( **Note**: We may later provide an all-in-one .jar distribution, which does not require you to install jruby or use on your system. This is hypothetically possible. Is it a good idea?)
@@ -149,10 +155,15 @@
     # Can limit to certain indicators with || chars.
     # "*" is a wildcard in indicator spec. So
     # 856 with first indicator '0', subfield u.
     to_field "email_addresses", extract_marc("856|0*|u")
+
+    # Instead of joining subfields from the same field
+    # into one string, joined by spaces, leave them
+    # each in separate strings:
+    to_field "isbn", extract_marc("020az", :seperator => nil)
 ~~~
 The `extract_marc` function *by default* includes any linked MARC `880` fields with alternate-script versions. Another reason to use the `:first` option if you really only want one.
@@ -212,13 +223,18 @@
     # To make use of marc extraction by specification, just like
     # marc_extract does, you may want to use the Traject::MarcExtractor
     # class
     to_field "weirdo" do |record, accumulator, context|
-      list = MarcExtractor.extract_by_spec(record, "700a")
+      # use MarcExtractor.cached for performance, globally
+      # caching the MarcExtractor we create. See docs
+      # at MarcExtractor.
+      list = MarcExtractor.cached("700a").extract(record)
+
       # combine all the 700a's in ONE string, cause we're weird
       list = list.join(" ")
+
       accumulator << list
     end
 ~~~
 You can also *combine* a macro and a direct block for some
@@ -262,10 +278,14 @@
     to_field("foo") {...} # will be called first on each record
     each_record {...} # will always be called AFTER above has potentially added values
     to_field("foo") {...} # and will be called after each of the preceding for each record
 ~~~
+#### Sample config
+
+A fairly complex sample config file can be found at [./test/test_support/demo_config.rb](./test/test_support/demo_config.rb)
+
 #### Built-in MARC21 Semantics
 There is another package of 'macros' that comes with Traject for extracting semantics from Marc21. These are sometimes 'opinionated', using heuristics or algorithms that are not inherently part of Marc21, but have proven useful in actual practice.
@@ -290,11 +310,11 @@
 The simplest invocation is:
     traject -c conf_file.rb marc_file.mrc
 Traject assumes marc files are in ISO 2709 binary format; it is not
-currently able to buess marc format type. If you are reading
+currently able to guess marc format type from filenames. If you are reading
 marc files in another format, you need to tell traject either with the `marc_source.type` or the command-line shortcut:
     traject -c conf.rb -t xml marc_file.xml
 You can supply more than one conf file with repeated `-c` arguments.
@@ -321,27 +341,51 @@
 Use `-u` as a shortcut for `-s solr.url=X`
     traject -c conf_file.rb -u http://example.com/solr marc_file.mrc
-Also see `-I load_path` and `-g Gemfile` options under Extending Logic
+Also see `-I load_path` and `-g Gemfile` options under Extending With Your Own Code.
-## Extending Logic
+See also [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
-TODO fill out nicer.
+## Extending With Your Own Code
-Basically:
+Traject config files are full live ruby files, where you can do anything,
+including declaring new classes, etc.
-command line `-I` can be used to append to the ruby $LOAD_PATH, and then you can simply `require` your local files, and then use them for
-whatever. Macros, utility functions, translation maps, whatever.
+However, beyond limited trivial logic, you'll want to organize your
+code reasonably into separate files, not jam everything into config
+files.
-If you want to use logic from other gems in your configuration mapping, you can do that too. This works for traject-specific
-functionality like translation maps and macros, or for anything else.
-To use gems, you can _either_ use straight rubygems, simply by
-installing gems in your system and using `require` or `gem` commands... **or** you can use Bundler for dependency locking and other dependency management. To have traject use Bundler, create a `Gemfile` and then call traject command line with the `-g` option. With the `-g` option alone, Bundler will look in the CWD and parents for the first `Gemfile` it finds. Or supply `-g ./somewhere/MyGemfile` to anywhere.
+Traject wants to make sure it makes it convenient for you to do so,
+whether project-specific logic in files local to the traject project,
+or in ruby gems that can be shared between projects.
+There are standard ruby mechanisms you can use to do this, and
+traject provides a couple features to make sure this remains
+convenient with the traject command line.
+For more information, see documentation page on [Extending With Your
+Own Code](./doc/extending.md)
+
+**Expert summary** :
+* Traject `-I` argument command line can be used to list directories to
+  add to the load path, similar to the `ruby -I` argument. You
+  can then 'require' local project files from the load path.
+  * translation map files found on the load path or in a
+    "./translation_maps" subdir on the load path will be found
+    for Traject translation maps.
+* Traject `-g` command line can be used to tell traject to use
+  bundler with a `Gemfile` located at current working directory
+  (or give an argument to `-g ./some/myGemfile`)
+
+## More
+
+* [Other traject commands](./doc/other_commands.md) including `marcout`, and `commit`
+* [Hints for batch and cronjob use](./doc/batch_execution.md) of traject.
+
+
 # Development
 Run tests with `rake test` or just `rake`. Tests are written using Minitest (please, no rspec). We use the spec-style describe/it to list the tests -- but generally prefer unit-style "assert_*" methods to make actual assertions, for clarity.
@@ -349,10 +393,13 @@
 Some tests need to run against a solr instance. Currently no solr instance is baked in. You can provide your own solr instance to test against and set shell ENV variable "solr_url", and the tests will use it. Otherwise, tests will use a mocked up Solr instance.
+To make a pull request, please make a feature branch *created from the master branch*, not from an existing feature branch. (If you need to do a feature branch dependent on an existing not-yet merged feature branch... discuss
+this with other developers first!)
+
 Pull requests should come with tests, as well as docs where applicable. Docs can be inline rdoc-style, edits to this README, and/or extra files in ./docs -- as appropriate for what needs to be docs.
 ## TODO
@@ -362,10 +409,11 @@
 * Should it normalize to NFC on the way in, to make sure translation maps and other string comparisons match properly?
 * Either way, all optional/configurable of course. based on Settings.
-* Command line code. It's only 150 lines, but it's kind of messy
-jammed into one file *and lacks tests*. I couldn't figure out
-what to do with it or how to test it. Needs a bit of love.
+* CommandLine class isn't covered by tests -- it's written using functionality
+from Indexer and other classes that are well-covered, but the CommandLine itself
+probably needs some tests -- especially covering error handling, which probably
+needs a bit more attention and using exceptions instead of exits, etc.
 * Optional built-in jetty stop/start to allow indexing to Solr that wasn't running before. maybe https://github.com/projecthydra/jettywrapper ?