README.md in biblicit-1.0 vs README.md in biblicit-2.0.3
- old
+ new
@@ -1,37 +1,134 @@
biblicit
=============
Extract citations from PDFs.
-## Usage
+Note: The version is 2.x, but really should be 0.2.x.
+
+# Usage
+
```ruby
- # Extract metadata from a file using the code from CiteSeerX
- Biblicit.extract(file: "myfile.pdf", tool: :citeseer)
+ # Extract metadata from the contents of a PDF file using the default tools and settings
+ result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")
- # Extract metadata from the contents of a PDF using cb2bib
- Biblicit.extract(contents: IO.read("myfile.pdf"), tool: :cb2bib, remote: true)
+ # Extract metadata from a file using all available tools
+ result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:citeseer, :parshed, :cb2bib], remote: true, token: false)
+
+ # See reference information for "myfile.pdf"
+ result[:citeseer][:title]
+ result[:parshed][:title]
+ result[:citeseer][:authors]
+ # etc
```
-## Algorithms
+# Algorithms
+
### CiteSeer (default)
Wrapper around Perl code extracted from [CiteSeerX](http://citeseer.ist.psu.edu/).
-Uses [Apache PDFBox](http://pdfbox.apache.org/) to extract text from the PDF, uses a model trained with the [svm-light](http://svmlight.joachims.org/) Support Vector Machine library to extract citation data for the PDF itself, and then uses [ParsCit](http://aye.comp.nus.edu.sg/parsCit/)'s model trained with the [CRF++](http://code.google.com/p/crfpp/) Conditional Random Fields library to parse citations from the PDF's bibliography, if any.
+Uses a model trained with the [svm-light](http://svmlight.joachims.org/) Support Vector Machine library.
+### ParsCit (default)
+
+Wrapper around Perl & Ruby code from [ParsCit](http://aye.comp.nus.edu.sg/parsCit/), which is included as a Git submodule.
+
+Uses a model trained with the [CRF++](http://code.google.com/p/crfpp/) Conditional Random Fields library.
+
### cb2Bib
Wrapper around [cb2Bib](http://www.molspaces.com/cb2bib/) in command-line mode.
-Uses pdf2text from [Xpdf](http://www.foolabs.com/xpdf/download.html) to extract text from the PDF, uses an apparently less-sophisticated parsing algorithm than the CiteSeerX code to parse metadata, but then, if :remote=true, scrapes one of a large number of journal or public repository websites for a structured version of the citation data.
+Parses metadata with an apparently less sophisticated algorithm than the others, but if `remote: true` is set, it then scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: it sometimes finds the wrong work!
-## Requirements
-### CRF++
+# Requirements
+
+There are a lot, but you may not need all of them, depending on your use case.
+
+
+## Required to support various input file formats
+
+Different tools are used for different input file formats.
+
+#### PDF - [Poppler](http://poppler.freedesktop.org/)
+
+This provides `pdftotext`. You could install `xpdf` instead.
+
+##### From source
+
+Requires fontconfig.
+
+ wget http://poppler.freedesktop.org/poppler-0.22.1.tar.gz
+ tar -xzf poppler-0.22.1.tar.gz
+ cd poppler-0.22.1
+ ./configure
+ make
+ sudo make install
+
+##### On Debian/Ubuntu
+
+ sudo apt-get install poppler-utils
+
+##### On OS X with Homebrew
+
+ brew install poppler
+
+#### PostScript - [Ghostscript](http://www.ghostscript.com/)
+
+This provides `ps2ascii`.
+
+##### From source
+
+ wget http://downloads.ghostscript.com/public/ghostscript-9.06.tar.gz
+ tar -xzf ghostscript-9.06.tar.gz
+ cd ghostscript-9.06
+ make
+ sudo make install
+
+##### On Debian/Ubuntu
+
+ sudo apt-get install ghostscript
+
+##### On OS X with Homebrew
+
+ brew install ghostscript
+
+#### Other (e.g. docx) - [AbiWord](http://www.abisource.com/)
+
+This provides `abiword`.
+
+##### On Debian/Ubuntu
+
+ sudo apt-get install abiword
+
+##### On OS X
+
+At the time of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
+
+
+## Required to use either the ParsCit or CiteSeer algorithms
+
+#### Perl modules
+
+More than these might be required; this is what I had to add to my default installation.
+
+##### From CPAN
+
+ sudo cpan install Digest::SHA1
+ sudo cpan install String::Approx
+ sudo cpan install XML::Writer::String
+ sudo cpan install XML::Twig
+
+## Required to use the ParsCit algorithm
+
+#### CRF++
+
+You can specify where you have installed CRF++ by setting the `CRFPP_HOME` environment variable.
##### From source
wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.57.tar.gz
tar xvzf CRF++-0.57.tar.gz
@@ -42,44 +139,38 @@
##### On Debian/Ubuntu
sudo apt-add-repository 'deb http://cl.naist.jp/~eric-n/ubuntu-nlp oneiric all'
sudo apt-get update
- sudo apt-get install libcrf++
+ sudo apt-get install libcrf++ crf++
##### On OS X with Homebrew
brew install crf++
-### svm-light
+## Required to use the CiteSeer algorithm
-The included model requires version 5, not the current version.
+#### svm-light
+Required for header extraction (reference information for the input work itself).
+
+The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the `SVM_LIGHT_HOME` environment variable.
+
##### From source
mkdir svm_light5
cd svm_light5
wget http://download.joachims.org/svm_light/v5.00/svm_light.tar.gz
tar -xzf svm_light.tar.gz
make
- sudo ln -s $(readlink -f "$(dirname svm_classify)/$(basename svm_classify)") /usr/bin/svm_classify5
- sudo ln -s $(readlink -f "$(dirname svm_learn)/$(basename svm_learn)") /usr/bin/svm_learn5
+ echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
+ source ~/.profile
-Note: On OS X you'll need to use greadlink instead of readlink if you have coreutils installed, or another workaround for the absence of `-f`.
+## Required to use the cb2bib algorithm
-### Perl modules
+#### cb2Bib
-##### From CPAN
-
- sudo cpan install DBI
- sudo cpan install Digest::SHA1
- sudo cpan install Log::Log4perl
- sudo cpan install Log::Dispatch
- sudo cpan install String::Approx
-
-### cb2bib
-
##### From source (Linux)
wget http://www.molspaces.com/dl/progs/cb2bib-1.4.9.tar.gz
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
@@ -103,18 +194,22 @@
##### On Debian/Ubuntu
sudo apt-get install cb2bib
-### Other
+## Other
+
+(I'm not currently sure what this was required for; TODO figure it out!)
+
##### On Debian/Ubuntu
sudo apt-get install libicu-dev
-## Copying
-Copyright Academia.edu or the original author(s).
+# Copying
+
+Copyright Academia.edu or the original author(s) - see documentation in the included parscit and svm-header-parse directories.
Apache licensed (see LICENSE.TXT).
Please note svm-light is in general free only for non-commercial use, but can be used in this gem by permission of the author. For conditions on additional uses see [the website](http://svmlight.joachims.org/).
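
The new README lists several external commands (`pdftotext`, `ps2ascii`, `abiword`) and two optional environment variables (`CRFPP_HOME`, `SVM_LIGHT_HOME`). As a rough sketch only, a script along the following lines could report which of them are missing before you run an extraction. Nothing here is part of biblicit: the helper names `find_executable`, `missing_commands`, and `unset_home_variables` are hypothetical, and the README says the two variables *can* be set, so an unset variable is not necessarily an error.

```ruby
# Hypothetical pre-flight check; not part of the biblicit gem.
# Command names and environment variables are taken from the README above.
REQUIRED_COMMANDS = {
  "pdftotext" => "PDF input (Poppler or Xpdf)",
  "ps2ascii"  => "PostScript input (Ghostscript)",
  "abiword"   => "other input formats such as docx (AbiWord)"
}.freeze

HOME_VARIABLES = %w[CRFPP_HOME SVM_LIGHT_HOME].freeze

# Return the full path of the first executable called `name` found in
# `path` (a PATH-style string), or nil if there is none.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Return the subset of `commands` (name => purpose) not found on `path`.
def missing_commands(commands = REQUIRED_COMMANDS, path = ENV.fetch("PATH", ""))
  commands.reject { |name, _purpose| find_executable(name, path) }
end

# Return the names in `names` that are unset or empty in `env`.
def unset_home_variables(env = ENV, names = HOME_VARIABLES)
  names.select { |name| env[name].to_s.empty? }
end

if __FILE__ == $PROGRAM_NAME
  missing_commands.each do |name, purpose|
    puts "missing #{name}: needed for #{purpose}"
  end
  unset_home_variables.each do |name|
    puts "#{name} is not set"
  end
end
```

Because the README notes that you may not need every dependency, treat the output as a checklist for your own use case rather than a list of hard failures.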