org.apache.pdfbox.util
Class PDFTextStripperByArea

java.lang.Object
  extended by org.apache.pdfbox.util.PDFStreamEngine
      extended by org.apache.pdfbox.util.PDFTextStripper
          extended by org.apache.pdfbox.util.PDFTextStripperByArea

public class PDFTextStripperByArea
extends PDFTextStripper

Extracts text from one or more named rectangular regions of a PDF page.

Version:
$Revision: 1.5 $
Author:
Ben Litchfield

Field Summary
 
Fields inherited from class org.apache.pdfbox.util.PDFTextStripper
charactersByArticle, document, lineSeparator, output, outputEncoding
 
Constructor Summary
PDFTextStripperByArea()
          Constructor.
 
Method Summary
 void addRegion(String regionName, Rectangle2D rect)
          Add a new named region; text found inside it is grouped under that name.
 void extractRegions(PDPage page)
          Process the page to extract the region text.
 List getRegions()
          Get the list of regions that have been set up.
 String getTextForRegion(String regionName)
          Get the text for the region. This should be called after extractRegions().
protected  void processTextPosition(TextPosition text)
          This will process a TextPosition object and add the text to the list of characters on a page.
protected  void writePage()
          This will print the processed page text to the output stream.
 
Methods inherited from class org.apache.pdfbox.util.PDFTextStripper
endArticle, endDocument, endPage, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getEndBookmark, getEndPage, getLineSeparator, getOutput, getPageSeparator, getSpacingTolerance, getStartBookmark, getStartPage, getText, getText, getWordSeparator, inspectFontEncoding, processPage, processPages, resetEngine, setAverageCharTolerance, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageSeperator, writeString, writeText, writeText, writeWordSeparator
 
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFTextStripperByArea

public PDFTextStripperByArea()
                      throws IOException
Constructor.

Throws:
IOException - If there is an error loading properties.

Method Detail

addRegion

public void addRegion(String regionName,
                      Rectangle2D rect)
Add a new named region; text found inside it is grouped under that name.

Parameters:
regionName - The name of the region.
rect - The rectangle area to retrieve the text from.
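
The rect parameter is a java.awt.geom.Rectangle2D. As a minimal sketch of how such a rectangle delimits an area, the following uses only the standard library. The class name RegionBoundsSketch, the coordinates, and the idea that a glyph belongs to a region when its (x, y) position lies inside the rectangle are illustrative assumptions, not statements taken from this page.

```java
import java.awt.geom.Rectangle2D;

public class RegionBoundsSketch {
    // Hypothetical region: x=10, y=280, width=275, height=60.
    public static final Rectangle2D HEADER = new Rectangle2D.Double(10, 280, 275, 60);

    // Assumed membership rule: a point is in the region if the rectangle contains it.
    public static boolean inRegion(Rectangle2D region, double x, double y) {
        return region.contains(x, y);
    }

    public static void main(String[] args) {
        System.out.println(inRegion(HEADER, 50, 300));  // true: well inside
        System.out.println(inRegion(HEADER, 285, 300)); // false: on the right edge
    }
}
```

Rectangle2D.contains(x, y) treats the left and top edges as inside and the right and bottom edges as outside, so a region sized exactly to the text may miss a glyph that sits on its far edge.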

getRegions

public List getRegions()
Get the list of regions that have been set up.

Returns:
A list of java.lang.String objects, one for each region name.

getTextForRegion

public String getTextForRegion(String regionName)
Get the text for the region. This should be called after extractRegions().

Parameters:
regionName - The name of the region to get the text from.
Returns:
The text that was identified in that region.

extractRegions

public void extractRegions(PDPage page)
                    throws IOException
Process the page to extract the region text.

Parameters:
page - The page to extract the regions from.
Throws:
IOException - If there is an error while extracting text.

processTextPosition

protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.

Overrides:
processTextPosition in class PDFTextStripper
Parameters:
text - The text to process.
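
To illustrate the idea of grouping text by region, here is a simplified, standard-library-only model. It is not the library's code: RegionGroupingSketch and its process method are hypothetical stand-ins, glyphs are plain strings with an (x, y) position instead of TextPosition objects, and the sketch assumes a glyph is added to every region whose rectangle contains its position. Handling of overlapping text and position sorting is left out.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class RegionGroupingSketch {
    // Region name -> rectangle, in insertion order.
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<String, Rectangle2D>();
    // Region name -> text collected so far.
    private final Map<String, StringBuilder> text = new LinkedHashMap<String, StringBuilder>();

    public void addRegion(String name, Rectangle2D rect) {
        regions.put(name, rect);
        text.put(name, new StringBuilder());
    }

    // Hypothetical stand-in for processTextPosition: append the glyph
    // to each region whose rectangle contains the glyph's position.
    public void process(String glyph, double x, double y) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                text.get(e.getKey()).append(glyph);
            }
        }
    }

    public String getTextForRegion(String name) {
        return text.get(name).toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch s = new RegionGroupingSketch();
        s.addRegion("left", new Rectangle2D.Double(0, 0, 100, 50));
        s.addRegion("right", new Rectangle2D.Double(100, 0, 100, 50));
        s.process("A", 10, 20);
        s.process("B", 150, 20);
        s.process("C", 300, 20); // outside both regions, so it is dropped
        System.out.println(s.getTextForRegion("left"));  // prints "A"
        System.out.println(s.getTextForRegion("right")); // prints "B"
    }
}
```

In this model, text that falls outside every region is discarded, and text inside two overlapping regions would appear in both.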

writePage

protected void writePage()
                  throws IOException
This will print the processed page text to the output stream.

Overrides:
writePage in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.


Copyright © 2002-2010 The Apache Software Foundation. All Rights Reserved.