in biblicit-1.0 vs in biblicit-2.0.3
- old
+ new
@@ -1,37 +1,134 @@
Extract citations from PDFs.
-## Usage
+Note: The version is 2.x, but really should be 0.2.x.
+# Usage
- # Extract metadata from a file using the code from CiteSeerX
- Biblicit.extract(file: "myfile.pdf", tool: :citeseer)
+ # Extract metadata from the content of a PDF (passed as a string) using default tools and settings
+ result = Biblicit::Extractor.extract(content: "a string containing the content of a PDF file")
- # Extract metadata from the contents of a PDF using cb2bib
- Biblicit.extract(contents: "myfile.pdf", tool: :cb2bib, remote: true)
+ # Extract metadata from a file using all available tools
+ result = Biblicit::Extractor.extract(file: "myfile.pdf", tools: [:citeseer, :parshed, :cb2bib], remote: true, token: false)
+ # See reference information for "myfile.pdf"
+ result[:citeseer][:title]
+ result[:parshed][:title]
+ result[:citeseer][:authors]
+ # etc
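Since each tool reports its own fields, you may want a single "best" value per field. The following is a minimal sketch, not part of Biblicit: `first_available` is a hypothetical helper, and the `sample` hash is made-up data that only mimics the `result[tool][field]` shape shown above.

```ruby
# Hypothetical helper (not part of Biblicit): return the first non-empty value
# for a field, trying the tools in order of preference.
def first_available(result, field, tools = [:citeseer, :parshed, :cb2bib])
  tools.each do |tool|
    value = result.dig(tool, field) # nil if the tool or the field is absent
    next if value.nil?
    next if value.respond_to?(:empty?) && value.empty?
    return value
  end
  nil
end

# Made-up sample data shaped like the result hash in the usage example.
sample = {
  citeseer: { title: "", authors: ["A. Author"] },
  parshed:  { title: "An Example Title" }
}

title   = first_available(sample, :title)
authors = first_available(sample, :authors)
puts "#{title} / #{authors.join(', ')}"
```

The order of the `tools` argument encodes which tool you trust most; it is a choice, not something the gem prescribes.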
-## Algorithms
+# Algorithms
### CiteSeer (default)
Wrapper around Perl code extracted from [CiteSeerX](
-Uses [Apache PDFBox]( to extract text from the PDF, uses a model trained with the [svm-light]( Support Vector Machine library to extract citation data for the PDF itself, and then uses [ParsCit]('s model trained with the [CRF++]( Conditional Random Fields library to parse citations from the PDF's bibliography, if any.
+Uses a model trained with the [svm-light]( Support Vector Machine library.
+### ParsCit (default)
+Wrapper around Perl & Ruby code from [ParsCit](, which is included as a Git submodule.
+Uses a model trained with the [CRF++]( Conditional Random Fields library.
### cb2Bib
Wrapper around [cb2Bib]( in command-line mode.
-Uses pdf2text from [Xpdf]( to extract text from the PDF, uses an apparently less-sophisticated parsing algorithm than the CiteSeerX code to parse metadata, but then, if :remote=true, scrapes one of a large number of journal or public repository websites for a structured version of the citation data.
+Parses metadata with an apparently less sophisticated algorithm than the other tools, but, if `remote: true` is set, it then scrapes one of a large number of journal or public repository websites for a structured version of the citation data. Warning: it sometimes finds the wrong work!
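One way to guard against the wrong-work problem is to compare the title returned by the remote lookup with a title extracted locally by another tool. This is only a sketch under assumptions: `normalize_title` and `titles_agree?` are hypothetical helpers, the titles are made up, and it assumes the cb2Bib result exposes a `:title` field like the other tools do.

```ruby
# Hypothetical helper: reduce a title to lowercase ASCII words for a loose comparison.
# Note that non-ASCII characters are dropped, so this is crude for non-English titles.
def normalize_title(title)
  title.to_s.downcase.gsub(/[^a-z0-9]+/, " ").strip
end

# Hypothetical helper: true when two titles match, or one contains the other,
# after normalization. Empty titles never agree with anything.
def titles_agree?(a, b)
  na = normalize_title(a)
  nb = normalize_title(b)
  return false if na.empty? || nb.empty?
  na == nb || na.include?(nb) || nb.include?(na)
end

# Made-up titles standing in for e.g. result[:citeseer][:title] and result[:cb2bib][:title].
local  = "Citation Parsing: A Survey"
remote = "citation parsing - a survey"
other  = "Deep Learning for Protein Folding"

puts titles_agree?(local, remote)
puts titles_agree?(local, other)
```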
-## Requirements
-### CRF++
+# Requirements
+There are a lot, but you may not need all of them, depending on your use case.
+## Required to support various input file formats
+Different tools are used for different input file formats.
+#### PDF - [Poppler](
+This provides `pdftotext`. You could install `xpdf` instead.
+##### From source
+Requires fontconfig.
+ wget
+ tar -xzf poppler-0.22.1.tar.gz
+ cd poppler-0.22.1
+ ./configure
+ make
+ sudo make install
+##### On Debian/Ubuntu
+ sudo apt-get install poppler-utils
+##### On OS X with Homebrew
+ brew install poppler
+#### Postscript - [Ghostscript](
+This provides `ps2ascii`.
+##### From source
+ wget
+ tar -xzf ghostscript-9.06.tar.gz
+ cd ghostscript-9.06
+ make
+ sudo make install
+##### On Debian/Ubuntu
+ sudo apt-get install ghostscript
+##### On OS X with Homebrew
+ brew install ghostscript
+#### Other (e.g. docx) - [AbiWord](
+This provides `abiword`.
+##### On Debian/Ubuntu
+ sudo apt-get install abiword
+##### On OS X
+At the time of writing, you're out of luck, because AbiWord doesn't compile on recent versions of OS X. According to their website, however, this is being actively worked on.
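To see which of the converters above you actually have, something like the following sketch may help. It assumes that the format is chosen by file extension, which is an illustration rather than a description of how Biblicit dispatches internally; `CONVERTERS`, `converter_for` and `on_path?` are hypothetical names.

```ruby
# Hypothetical mapping from file extension to the converters named in this section.
CONVERTERS = {
  ".pdf" => "pdftotext", # from Poppler (or xpdf)
  ".ps"  => "ps2ascii"   # from Ghostscript
}.freeze
FALLBACK_CONVERTER = "abiword" # everything else, e.g. docx

# Pick the converter for a path based on its (case-insensitive) extension.
def converter_for(path)
  CONVERTERS.fetch(File.extname(path).downcase, FALLBACK_CONVERTER)
end

# Check whether an executable with this name exists in any PATH directory.
def on_path?(command)
  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, command)
    File.file?(candidate) && File.executable?(candidate)
  end
end

%w[paper.pdf paper.ps paper.docx].each do |name|
  tool = converter_for(name)
  status = on_path?(tool) ? "found" : "missing"
  puts "#{name}: #{tool} (#{status})"
end
```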
+## Required to use either the ParsCit or CiteSeer algorithms
+#### Perl modules
+More than these might be required; this is what I had to add to my default installation.
+##### From CPAN
+ sudo cpan install Digest::SHA1
+ sudo cpan install String::Approx
+ sudo cpan install XML::Writer::String
+ sudo cpan install XML::Twig
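To check which of these modules are still missing before running the extractors, you could ask Perl to load each one. This is a rough sketch: `perl_module_check_command` and `missing_perl_modules` are hypothetical helpers, and it relies only on the standard `perl -MModule -e 1` idiom, which exits non-zero when a module cannot be loaded.

```ruby
# The modules listed above; extend this if your installation needs more.
REQUIRED_PERL_MODULES = %w[Digest::SHA1 String::Approx XML::Writer::String XML::Twig].freeze

# Build the command that asks Perl to load one module and do nothing else.
def perl_module_check_command(mod)
  ["perl", "-M#{mod}", "-e", "1"]
end

# Return the modules Perl could not load. If perl itself is not installed,
# system returns nil and every module is reported as missing.
def missing_perl_modules(modules = REQUIRED_PERL_MODULES)
  modules.reject do |mod|
    system(*perl_module_check_command(mod), out: File::NULL, err: File::NULL)
  end
end

missing = missing_perl_modules
puts missing.empty? ? "All listed Perl modules load." : "Missing: #{missing.join(', ')}"
```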
+## Required to use the ParsCit algorithm
+#### CRF++
+You can specify where you have installed CRF++ by setting the CRFPP_HOME environment variable.
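If you want to confirm what the variable resolves to from Ruby, a small check along these lines could be used. It is a sketch only: `tool_home` is a hypothetical helper, and how Biblicit itself reads `CRFPP_HOME` is not shown here.

```ruby
# Hypothetical helper: read an install directory from an environment variable.
# Returns an absolute path, or nil when the variable is unset or blank.
# The env parameter defaults to the real environment but accepts any hash-like object.
def tool_home(var_name, env = ENV)
  value = env[var_name]
  return nil if value.nil? || value.strip.empty?
  File.expand_path(value.strip)
end

crfpp = tool_home("CRFPP_HOME")
puts crfpp ? "CRF++ home: #{crfpp}" : "CRFPP_HOME is not set"
```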
##### From source
tar xvzf CRF++-0.57.tar.gz
@@ -42,44 +139,38 @@
##### On Debian/Ubuntu
sudo apt-add-repository 'deb oneiric all'
sudo apt-get update
- sudo apt-get install libcrf++
+ sudo apt-get install libcrf++ crf++
##### On OS X with Homebrew
brew install crf++
-### svm-light
+## Required to use the CiteSeer algorithm
-The included model requires version 5, not the current version.
+#### svm-light
+Required for header extraction (reference information for the input work itself).
+The included model requires version 5, not the current version. You can specify where you have installed svm-light by setting the SVM_LIGHT_HOME environment variable.
##### From source
mkdir svm_light5
cd svm_light5
tar -xzf svm_light.tar.gz
- sudo ln -s $(readlink -f "$(dirname svm_classify)/$(basename svm_classify)") /usr/bin/svm_classify5
- sudo ln -s $(readlink -f "$(dirname svm_learn)/$(basename svm_learn)") /usr/bin/svm_learn5
+ echo "export SVM_LIGHT_HOME=`pwd`" >> ~/.profile # or .bashrc or whatever
+ source ~/.profile
-Note: On OS X you'll need to use greadlink instead of readlink if you have coreutils installed, or another workaround for the absence of `-f`.
+## Required to use the cb2bib algorithm
-### Perl modules
+#### cb2Bib
-##### From CPAN
- sudo cpan install DBI
- sudo cpan install Digest::SHA1
- sudo cpan install Log::Log4perl
- sudo cpan install Log::Dispatch
- sudo cpan install String::Approx
-### cb2bib
##### From source (Linux)
tar -xzvf cb2bib-1.4.9.tar.gz
cd cb2bib-1.4.9
@@ -103,18 +194,22 @@
##### On Debian/Ubuntu
sudo apt-get install cb2bib
-### Other
+## Other
+(I'm not currently sure what this was required for; TODO figure it out!)
##### On Debian/Ubuntu
sudo apt-get install libicu-dev
-## Copying
-Copyright or the original author(s).
+# Copying
+Copyright or the original author(s) - see documentation in the included parscit and svm-header-parse directories.
Apache licensed (see LICENSE.TXT).
Please note svm-light is in general free only for non-commercial use, but can be used in this gem by permission of the author. For conditions on additional uses see [the website](