README.md in traject-0.15.0 vs README.md in traject-0.16.0

- old
+ new

@@ -47,12 +47,13 @@
 The traject command-line utility requires you to supply it with a configuration file. So let's start by describing the configuration file.
 Configuration files are actually just ruby -- so by convention they end in `.rb`.
-Don't worry, you don't neccesarily need to know ruby well to write them, they give you a subset of ruby to work with. But the full power
-of ruby is available to you.
+We hope you can write basic useful configuration files without being a ruby expert,
+they give you a subset of ruby to work with. But the full power
+of ruby is available to you if needed.
 **rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can call ordinary ruby `require` in config files, etc., too, to load external functionality. See more at Extending Logic below.
@@ -82,13 +83,10 @@
  # default source type is binary, traject can't guess
  # you have to tell it.
  provide "marc_source.type", "xml"
- # settings can be set on command line instead of
- # config file too.
- # various others...
  provide "solrj_writer.commit_on_close", "true"
  # By default, we use the Traject::Marc4JReader, which
  # can read marc8 and ISO8859_1 -- if your records are all in UTF8,
@@ -161,44 +159,49 @@
  # Can limit to certain indicators with || chars.
  # "*" is a wildcard in indicator spec. So
  # 856 with first indicator '0', subfield u.
  to_field "email_addresses", extract_marc("856|0*|u")
-
- # Instead of joining subfields from the same field
- # into one string, joined by spaces, leave them
- # each in separate strings:
- to_field "isbn", extract_marc("020az", :separator => nil)
- # Same thing, but more explicit
- to_field "isbn", extract_marc("020a:020z")
-
-
- # Make sure that you don't get any duplicates
- # by passing in ":deduplicate => true"
- to_field 'language008', extract_marc('008[35-37]', :deduplicate=>true)
+ # Can list tag twice with different field combinations
+ # to extract separately
+ to_field "isbn", extract_marc("245a:245abcde")
 ~~~
 The `extract_marc` function *by default* includes any linked MARC `880` fields with alternate-script versions. Another reason to use the `:first` option if you really only want one.
+By default, specifications with multiple subfields (like "240abc") will produce
+one single string of output for each matching field. Specifications
+with single subfields (like "020a") will split subfields and produce
+an output string for each matching subfield.
+
 For MARC control (aka 'fixed') fields, you can use square brackets to take a slice by byte offset.
+~~~ruby
 to_field "langauge_code", extract_marc("008[35-37]")
+~~~
+For more information on extraction specifications, see
+the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
+
 `extract_marc` also supports `translation maps` similar
-to SolrMarc's. There will be some translation maps built in,
-and you can provide your own. translation maps can be supplied
+to SolrMarc's. There are some translation maps provided by traject,
+and you can also define your own. translation maps can be supplied
 in yaml or ruby. Translation maps are especially useful
-for mapping form MARC codes to user-displayable strings. See Traject::TranslationMap for more info:
+for mapping form MARC codes to user-displayable strings:
+~~~ruby
 # "translation_map" will be passed to Traject::TranslationMap.new
 # and the created map used to translate all values
 to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
+~~~
+See [Traject::TranslationMap](./lib/traject/translation_map.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/TranslationMap)) for more info on translation mapping.
+
 #### Direct indexing logic vs. Macros
 It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values`, and `extract_marc` -- are what Traject calls 'macros'. They are all actually built based upon a more basic element of
@@ -346,14 +349,14 @@
  traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
 There are some built-in command-line option shortcuts for useful settings:
-Use `-j` to output as pretty-printed JSON
-hashes, instead of sending to solr. Useful for debugging or sanity
-checking.
+Use `--debug-mode` to output in a human-readable format, instead of sending to solr.
+Also turns on debug logging and restricts processing to single-threaded. Useful for
+debugging or sanity checking.
- traject -j -c conf_file.rb marc_file
+ traject --debug-mode -c conf_file.rb marc_file
 Use `-u` as a shortcut for `s solr.url=X`
  traject -c conf_file.rb -u http://example.com/solr marc_file.mrc