README for sectLabel module (v100401)

CONTENTS
[0] Directory structure
[1] Command line usage
    [1.1] SectLabel
    [1.2] GenericSect
[3] Known issues

------------------------------------------------------------
[0] DIRECTORY STRUCTURE

* processOmniXML.pl: Process Omnipage XML output (concatenated results from all pages of a PDF file) and extract text lines together with other XML information.
  Note: the current script is complicated because it mixes two tasks: processing Omnipage XML and extracting XML features. We plan to split it into two scripts:
    1) simplifyOmniXML.pl (done) -- converts Omnipage output into an internal format, and
    2) extractXMLFeatures.pl (TODO) -- takes the internal results produced by simplifyOmniXML.pl as input and generates XML features.

* redo.sectLabel.pl: Perform stratified cross-validation for SectLabel

* tr2crfpp.pl: Generate SectLabel features for CRF++

* single2multi.pl: Convert a SectLabel training file (e.g. doc/sectLabel.tagged.txt) from single-line to multi-line format. This script is called by tr2crfpp.pl

* genericSectExtract.rb: Given an input file listing the section headers of a scientific document, assign a generic header to each section header.
* genericSect/

------------------------------------------------------------
[1] COMMAND LINE USAGE

------------------------------
[1.1] SectLabel

* Process Omnipage XML output
** Usage:
   processOmniXML.pl -h    [invokes help]
   processOmniXML.pl -in xmlFile -out outFile [-xmlFeature -decode -markup -para] [-tag tagFile -allowEmptyLine -log]

   Options:
     -q: quiet mode (don't echo license)
     -xmlFeature: append XML features to the extracted text
     -decode: decode HTML entities before output, to avoid double entity encoding later
     -tag tagFile: count XML tags/values for statistics
     -markup: add factor information (bold, italic, etc.) per word using the format "word|||(b|nb)|||(i|ni)", useful for extracting bold/italic phrases

* Perform stratified cross-validation
** Usage:
   redo.sectLabel.pl -h    [invokes help]
   redo.sectLabel.pl -in trainFile -dir outDir -n folds -c configFile [-p numCpus -iter numIter -f freqCutoff]

   Options:
     -in: training file, in the same format as doc/sectLabel.tagged.txt
     -dir: output directory, containing all intermediate files and outputs
     -n: number of cross-validation folds
     -c: config file used to extract features and automatically generate the CRF++ template
     -p: CRF++ number of CPUs (default = 6)
     -iter: CRF++ maximum number of iterations (default = 100)
     -f: CRF++ frequency cut-off (default = 3)

** E.g.:
   ./bin/sectLabel/redo.sectLabel.pl -in ./doc/sectLabelXml.tagged.txt -dir testRedoDir -n 10 -c ./resources/sectLabel/sectLabel.configXml

* Extract features
** Usage:
   tr2crfpp.pl -h    [invokes help]
   tr2crfpp.pl -in inFile -c configFile -out outFile [-template -single]

   Options:
     -q: quiet mode (don't echo license)
     -in inFile: labeled input file
     -c configFile: specifies which feature set to use
     -out outFile: output file for CRF++ training
     -template: output a template used by CRF++ according to the config file
     -single: indicates that each input document is in single-line format (e.g., ./doc/sectLabel.tagged.txt)

------------------------------
[1.2] GenericSect

* Create feature file
** Usage:
   ruby extractFeature.rb filePath

   filePath: path to the labeled data file, which lists the actual section headers and their corresponding manually assigned generic section headers (if they exist)
   syntax: generic_header ||| actual_header

* Generate generic section headers for a document
** Usage:
   ruby genericSectExtract.rb filePath

   where filePath is a file listing the actual headers of a document (automatically extracted by another module of SectLabel)

* Perform stratified cross-validation
** Usage:
   ruby crossValidation.rb dataFile numFold

   Note that the data file has the same format as doc/genericSect.tagged.txt

------------------------------------------------------------
[3] KNOWN ISSUES
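
------------------------------
Example for [1.2]: reading the labeled data syntax

The "generic_header ||| actual_header" syntax used by extractFeature.rb could be read with a small helper like the sketch below. This is only an illustration and not code from the module: the names LabeledHeader and parse_labeled_header are hypothetical, the sample headers are made up, and the handling of lines without "|||" (treated as an actual header with no manual label yet) is an assumption, since this README does not say how a missing generic header is written.

```ruby
# Hypothetical helper, not part of the SectLabel distribution.
# Parses one line in the "generic_header ||| actual_header" syntax from [1.2].
LabeledHeader = Struct.new(:generic, :actual)

def parse_labeled_header(line)
  text = line.strip
  return nil if text.empty? # skip blank lines

  # Split on the first "|||" only, so the actual header is kept whole
  # even if it happens to contain "|||" itself.
  first, second = text.split(/\s*\|\|\|\s*/, 2)

  if second.nil?
    # Assumption: no "|||" means the generic header has not been assigned.
    LabeledHeader.new(nil, first)
  else
    LabeledHeader.new(first, second)
  end
end

# Usage with made-up sample lines (the labels are illustrative only).
sample = [
  "introduction ||| 1. Introduction",
  "2 Related Work"
]
sample.each do |line|
  header = parse_labeled_header(line)
  puts "generic=#{header.generic.inspect} actual=#{header.actual.inspect}"
end
```

Splitting with a limit of 2 keeps the parser tolerant of headers that contain the separator, at the cost of assuming the generic header itself never does.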