README.md in traject-0.15.0 vs README.md in traject-0.16.0
- old
+ new
@@ -47,12 +47,13 @@
The traject command-line utility requires you to supply it with a configuration file. So let's start by describing the configuration file.
Configuration files are actually just ruby -- so by convention they end in `.rb`.
-Don't worry, you don't neccesarily need to know ruby well to write them, they give you a subset of ruby to work with. But the full power
-of ruby is available to you.
+We hope you can write basic, useful configuration files without being a ruby expert:
+they give you a subset of ruby to work with. But the full power
+of ruby is available to you if needed.
**rubyist tip**: Technically, config files are executed with `instance_eval` in a Traject::Indexer instance, so the special commands you see are just methods on Traject::Indexer (or mixed into it). But you can
call ordinary ruby `require` in config files, etc., too, to load
external functionality. See more at Extending Logic below.
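
To make the `instance_eval` point concrete, here is a minimal sketch in plain ruby. `ToyIndexer` and its simplified `provide` and `to_field` methods are hypothetical stand-ins, not Traject's actual implementation; the sketch only shows how bare method calls in a config end up as calls on one object.

```ruby
# Hypothetical stand-in for an indexer; NOT Traject::Indexer itself.
class ToyIndexer
  attr_reader :settings, :fields

  def initialize
    @settings = {}
    @fields   = []
  end

  # Simplified for illustration: just record the setting.
  def provide(key, value)
    @settings[key] = value
  end

  # Simplified for illustration: just record the field name.
  def to_field(name, *_args)
    @fields << name
  end

  # Evaluate config source with self set to this object, so bare
  # calls like `provide` resolve to the methods defined above.
  def load_config(source)
    instance_eval(source)
    self
  end
end

config = <<~RUBY
  provide "marc_source.type", "xml"
  to_field "id"
RUBY

indexer = ToyIndexer.new.load_config(config)
p indexer.settings
p indexer.fields
```

Because the config is ordinary ruby evaluated against an object, anything else ruby allows (such as `require`) works inside it too.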
@@ -82,13 +83,10 @@
# default source type is binary, traject can't guess
# you have to tell it.
provide "marc_source.type", "xml"
- # settings can be set on command line instead of
- # config file too.
-
# various others...
provide "solrj_writer.commit_on_close", "true"
# By default, we use the Traject::Marc4JReader, which
# can read marc8 and ISO8859_1 -- if your records are all in UTF8,
@@ -161,44 +159,49 @@
# Can limit to certain indicators with || chars.
# "*" is a wildcard in indicator spec. So
# 856 with first indicator '0', subfield u.
to_field "email_addresses", extract_marc("856|0*|u")
-
- # Instead of joining subfields from the same field
- # into one string, joined by spaces, leave them
- # each in separate strings:
- to_field "isbn", extract_marc("020az", :separator => nil)
- # Same thing, but more explicit
- to_field "isbn", extract_marc("020a:020z")
-
-
- # Make sure that you don't get any duplicates
- # by passing in ":deduplicate => true"
- to_field 'language008', extract_marc('008[35-37]', :deduplicate=>true)
+ # Can list tag twice with different field combinations
+ # to extract separately
+ to_field "isbn", extract_marc("245a:245abcde")
~~~
The `extract_marc` function *by default* includes any linked
MARC `880` fields with alternate-script versions. This is another reason
to use the `:first` option if you really only want one value.
+By default, specifications with multiple subfields (like "240abc") will produce
+one single string of output for each matching field. Specifications
+with single subfields (like "020a") will split subfields and produce
+an output string for each matching subfield.
+
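
The default joining rule above can be illustrated with plain ruby arrays. This is only a sketch of the described behavior, not the MarcExtractor code: `extract_values` is a hypothetical helper, a field is modeled as an array of `[code, value]` pairs, and joining with a single space is an assumption.

```ruby
# Hypothetical helper illustrating the default rule described above.
# field:    array of [subfield_code, value] pairs for ONE matching field
# subcodes: the subfield letters from the spec, e.g. "abc" or "a"
def extract_values(field, subcodes)
  matches = field.select { |code, _| subcodes.include?(code) }.map { |_, v| v }
  return [] if matches.empty?

  if subcodes.length > 1
    [matches.join(" ")]   # several subfields in the spec: one joined string
  else
    matches               # single subfield in the spec: one string each
  end
end

title = [["a", "Main title"], ["b", "subtitle"], ["c", "author"]]
isbns = [["a", "0123456789"], ["a", "9876543210"]]

p extract_values(title, "abc")
p extract_values(isbns, "a")
```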
For MARC control (aka 'fixed') fields, you can use square
brackets to take a slice by byte offset.
+~~~ruby
to_field "language_code", extract_marc("008[35-37]")
+~~~
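
As a rough picture of what a slice like `008[35-37]` selects, here is a sketch in plain ruby. `control_slice` is a hypothetical helper, and the zero-based, inclusive reading of the offsets is an assumption based on the usual MARC convention, not something taken from the traject source.

```ruby
# Hypothetical helper: take an inclusive, zero-based byte slice of a
# control field's value, as "008[35-37]" asks for positions 35..37.
def control_slice(value, first, last)
  value.byteslice(first, last - first + 1)
end

# A made-up 40-character 008 value with "eng" at positions 35-37.
field_008 = ("x" * 35) + "eng" + "  "
p control_slice(field_008, 35, 37)
```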
+For more information on extraction specifications, see
+the [MarcExtractor class](./lib/traject/marc_extractor.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/MarcExtractor)).
+
`extract_marc` also supports `translation maps` similar
-to SolrMarc's. There will be some translation maps built in,
-and you can provide your own. translation maps can be supplied
+to SolrMarc's. There are some translation maps provided by traject,
+and you can also define your own. Translation maps can be supplied
in yaml or ruby. Translation maps are especially useful
-for mapping form MARC codes to user-displayable strings. See Traject::TranslationMap for more info:
+for mapping from MARC codes to user-displayable strings:
+~~~ruby
# "translation_map" will be passed to Traject::TranslationMap.new
# and the created map used to translate all values
to_field "language", extract_marc("008[35-37]:041a:041d", :translation_map => "marc_language_code")
+~~~
+See [Traject::TranslationMap](./lib/traject/translation_map.rb) ([rdoc](http://rdoc.info/gems/traject/Traject/TranslationMap)) for more info on translation mapping.
+
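
Conceptually, a translation map is a lookup from codes to display strings. The sketch below uses a plain ruby Hash to show the idea; `LANGUAGE_MAP` and `translate_all` are hypothetical names, the three entries are hand-written samples and not the contents of the `marc_language_code` map, and dropping unmapped values is an assumption made for this sketch.

```ruby
# Hypothetical, hand-written map; the real "marc_language_code" map
# is not reproduced here.
LANGUAGE_MAP = {
  "eng" => "English",
  "fre" => "French",
  "ger" => "German"
}.freeze

# Translate each extracted value, dropping codes with no mapping.
# (Dropping unmapped values is an assumption made for this sketch.)
def translate_all(values, map)
  values.map { |v| map[v] }.compact
end

p translate_all(["eng", "fre", "zzz"], LANGUAGE_MAP)
```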
#### Direct indexing logic vs. Macros
It turns out all those functions we saw above used with `to_field` -- `literal`, `serialized_marc`, `extract_all_marc_values`, and `extract_marc` -- are what Traject calls 'macros'.
They are all actually built based upon a more basic element of
@@ -346,14 +349,14 @@
traject -c conf_file.rb marc_file -s solr.url=http://somehere/solr -s solr.url=http://example.com/solr -s solrj_writer.commit_on_close=true
There are some built-in command-line option shortcuts for useful
settings:
-Use `-j` to output as pretty-printed JSON
-hashes, instead of sending to solr. Useful for debugging or sanity
-checking.
+Use `--debug-mode` to output in a human-readable format, instead of sending to solr.
+Also turns on debug logging and restricts processing to a single thread. Useful for
+debugging or sanity checking.
- traject -j -c conf_file.rb marc_file
+ traject --debug-mode -c conf_file.rb marc_file
Use `-u` as a shortcut for `-s solr.url=X`
traject -c conf_file.rb -u http://example.com/solr marc_file.mrc