public class PDFTextStripperByArea extends PDFTextStripper
Fields inherited from class PDFTextStripper: charactersByArticle, document, output, outputEncoding, systemLineSeparator
Constructor and Description |
---|
PDFTextStripperByArea(): Constructor. |
PDFTextStripperByArea(Properties props): Instantiate a new PDFTextStripperByArea object. |
PDFTextStripperByArea(String encoding): Instantiate a new PDFTextStripperByArea object. |
Modifier and Type | Method and Description |
---|---|
void | addRegion(String regionName, Rectangle2D rect): Add a new region to group text by. |
void | extractRegions(PDPage page): Process the page to extract the region text. |
List<String> | getRegions(): Get the list of regions that have been set up. |
String | getTextForRegion(String regionName): Get the text for the region. Call this after extractRegions(). |
protected void | processTextPosition(TextPosition text): Process a TextPosition object and add its text to the list of characters on a page. |
void | removeRegion(String regionName): Delete a region to group text by. |
protected void | writePage(): Print the processed page text to the output stream. |
Methods inherited from class PDFTextStripper:
endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageSeparator, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getText, getWordSeparator, handleLineSeparation, inspectFontEncoding, isParagraphSeparation, matchListItemPattern, matchPattern, processPage, processPages, resetEngine, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageSeparator, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePageEnd, writePageSeperator, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeString, writeText, writeText, writeWordSeparator
Methods inherited from class PDFStreamEngine:
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
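The region methods summarized above (addRegion, removeRegion, getRegions, getTextForRegion) suggest a simple bookkeeping model: named rectangles are registered first, text is routed into them while a page is processed, and the collected text is read back afterwards. The following is a minimal, standalone sketch of that idea using only the JDK. The class `RegionGroupingSketch`, its `processGlyph` method, and the rule that a glyph belongs to every region whose rectangle contains its (x, y) origin are all assumptions made for illustration; they are not taken from the PDFBox source and may differ from how the library actually assigns text to regions.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for region bookkeeping; this is not PDFBox code. */
final class RegionGroupingSketch {
    // Region name -> rectangle, kept in insertion order.
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();
    // Region name -> text collected so far.
    private final Map<String, StringBuilder> collected = new LinkedHashMap<>();

    void addRegion(String regionName, Rectangle2D rect) {
        regions.put(regionName, rect);
    }

    void removeRegion(String regionName) {
        regions.remove(regionName);
        collected.remove(regionName);
    }

    List<String> getRegions() {
        return new ArrayList<>(regions.keySet());
    }

    /** Assumed routing rule: a glyph joins every region containing its origin (x, y). */
    void processGlyph(String glyph, double x, double y) {
        for (Map.Entry<String, Rectangle2D> entry : regions.entrySet()) {
            if (entry.getValue().contains(x, y)) {
                collected.computeIfAbsent(entry.getKey(), k -> new StringBuilder())
                         .append(glyph);
            }
        }
    }

    /** Returns an empty string when nothing has been collected for the region. */
    String getTextForRegion(String regionName) {
        StringBuilder sb = collected.get(regionName);
        return sb == null ? "" : sb.toString();
    }

    public static void main(String[] args) {
        RegionGroupingSketch sketch = new RegionGroupingSketch();
        sketch.addRegion("header", new Rectangle2D.Double(0, 0, 100, 20));
        sketch.addRegion("body", new Rectangle2D.Double(0, 20, 100, 80));
        sketch.processGlyph("H", 10, 5);    // inside "header"
        sketch.processGlyph("i", 15, 5);    // inside "header"
        sketch.processGlyph("x", 10, 50);   // inside "body"
        sketch.processGlyph("z", 200, 200); // outside every region, dropped
        System.out.println(sketch.getTextForRegion("header"));
        System.out.println(sketch.getTextForRegion("body"));
    }
}
```

In this sketch, reading a region before any glyph has been processed yields an empty string, which mirrors the documented advice to call getTextForRegion only after extractRegions has run.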
public PDFTextStripperByArea() throws IOException
Throws:
IOException - If there is an error loading properties.

public PDFTextStripperByArea(Properties props) throws IOException
Parameters:
props - The properties containing the mapping of operators to PDFOperator classes.
Throws:
IOException - If there is an error reading the properties.

public PDFTextStripperByArea(String encoding) throws IOException
Parameters:
encoding - The encoding that the output will be written in.
Throws:
IOException - If there is an error reading the properties.

public void addRegion(String regionName, Rectangle2D rect)
Parameters:
regionName - The name of the region.
rect - The rectangle area to retrieve the text from.

public void removeRegion(String regionName)
Parameters:
regionName - The name of the region to delete.

public List<String> getRegions()

public String getTextForRegion(String regionName)
Parameters:
regionName - The name of the region to get the text from.

public void extractRegions(PDPage page) throws IOException
Parameters:
page - The page to extract the regions from.
Throws:
IOException - If there is an error while extracting text.

protected void processTextPosition(TextPosition text)
Overrides:
processTextPosition in class PDFTextStripper
Parameters:
text - The text to process.

protected void writePage() throws IOException
Overrides:
writePage in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.

Copyright © 2002–2017 The Apache Software Foundation. All rights reserved.