org.apache.poi.hwpf.extractor
Class WordExtractor

java.lang.Object
  extended by org.apache.poi.POITextExtractor
      extended by org.apache.poi.POIOLE2TextExtractor
          extended by org.apache.poi.hwpf.extractor.WordExtractor

public final class WordExtractor
extends POIOLE2TextExtractor

Extracts the text from a Word document. Use either getParagraphText() or getText() unless you have a strong reason to do otherwise.

Author:
Nick Burch

Field Summary
 
Fields inherited from class org.apache.poi.POITextExtractor
document
 
Constructor Summary
WordExtractor(DirectoryNode dir, POIFSFileSystem fs)
           
WordExtractor(HWPFDocument doc)
          Create a new Word Extractor
WordExtractor(java.io.InputStream is)
          Create a new Word Extractor
WordExtractor(POIFSFileSystem fs)
          Create a new Word Extractor
 
Method Summary
 java.lang.String[] getCommentsText()
           
 java.lang.String[] getEndnoteText()
           
 java.lang.String getFooterText()
          Grab the text from the footers
 java.lang.String[] getFootnoteText()
           
 java.lang.String getHeaderText()
          Grab the text from the headers
 java.lang.String[] getParagraphText()
          Get the text from the word file, as an array with one String per paragraph
protected static java.lang.String[] getParagraphText(Range r)
           
 java.lang.String getText()
          Grab the text, based on the paragraphs.
 java.lang.String getTextFromPieces()
          Grab the text out of the text pieces.
static void main(java.lang.String[] args)
          Command-line extractor, so that the class can be run directly.
static java.lang.String stripFields(java.lang.String text)
          Removes any fields (e.g. macros, page markers) from the string.
 
Methods inherited from class org.apache.poi.POIOLE2TextExtractor
getDocSummaryInformation, getFileSystem, getMetadataTextExtractor, getSummaryInformation
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordExtractor

public WordExtractor(java.io.InputStream is)
              throws java.io.IOException
Create a new Word Extractor

Parameters:
is - InputStream containing the word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(POIFSFileSystem fs)
              throws java.io.IOException
Create a new Word Extractor

Parameters:
fs - POIFSFileSystem containing the word file
Throws:
java.io.IOException

WordExtractor

public WordExtractor(DirectoryNode dir,
                     POIFSFileSystem fs)
              throws java.io.IOException
Throws:
java.io.IOException

WordExtractor

public WordExtractor(HWPFDocument doc)
Create a new Word Extractor

Parameters:
doc - The HWPFDocument to extract from

Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
Command-line extractor, so that the class can be run directly.

Throws:
java.io.IOException

getParagraphText

public java.lang.String[] getParagraphText()
Get the text from the word file, as an array with one String per paragraph


getFootnoteText

public java.lang.String[] getFootnoteText()

getEndnoteText

public java.lang.String[] getEndnoteText()

getCommentsText

public java.lang.String[] getCommentsText()

getParagraphText

protected static java.lang.String[] getParagraphText(Range r)

getHeaderText

public java.lang.String getHeaderText()
Grab the text from the headers


getFooterText

public java.lang.String getFooterText()
Grab the text from the footers


getTextFromPieces

public java.lang.String getTextFromPieces()
Grab the text out of the text pieces. The result may include stray non-text content, but this method still works when the text piece to paragraph mapping is broken. It is also fast.


getText

public java.lang.String getText()
Grab the text, based on the paragraphs. The result should not include stray non-text content, but this method is slightly slower than getTextFromPieces().

Specified by:
getText in class POITextExtractor
Returns:
All the text from the document

stripFields

public static java.lang.String stripFields(java.lang.String text)
Removes any fields (e.g. macros, page markers) from the string.
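
To illustrate what field stripping involves, here is a minimal standalone sketch. It is not the POI implementation. It assumes the Word binary format convention that a field is delimited by the control characters 0x13 (field begin), 0x14 (separator between the field code and its displayed result) and 0x15 (field end). The class FieldStripSketch and its strip method are hypothetical names used only for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; this is NOT the code behind WordExtractor.stripFields.
final class FieldStripSketch {
    // Assumed field marker characters from the Word binary format.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /**
     * Drops field codes and field markers, keeping ordinary text and the
     * displayed result of each field. Nested fields are handled with a stack.
     * Text after an unterminated field begin marker is dropped.
     */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        int hidden = 0; // number of open fields that are currently in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inCode.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR && !inCode.isEmpty()) {
                // Switch the innermost field from its code part to its result part.
                if (inCode.pop()) {
                    hidden--;
                }
                inCode.push(Boolean.FALSE);
            } else if (c == FIELD_END && !inCode.isEmpty()) {
                if (inCode.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Visible only when no enclosing field is in its code part.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page \u0013 PAGE \u00141\u0015 of 3";
        System.out.println(strip(raw)); // prints "Page 1 of 3"
    }
}
```

In this sketch a field with no separator (begin marker, contents, end marker) is removed entirely, while a field with a separator keeps only the text between the separator and the end marker. The real stripFields method may differ in how it treats nested or malformed fields.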



Copyright 2010 The Apache Software Foundation or its licensors, as applicable.