public class PDFTextStripper extends PDFStreamEngine
Modifier and Type | Field and Description |
---|---|
protected Vector<List<TextPosition>> | charactersByArticle - The charactersByArticle is used to extract text by article divisions. |
protected PDDocument | document - The document to read. |
protected Writer | output - The stream to write the output to. |
protected String | outputEncoding - The encoding that text will be written in (or null). |
protected String | systemLineSeparator - The platform's line separator. |
Constructor and Description |
---|
PDFTextStripper() - Instantiate a new PDFTextStripper object. |
PDFTextStripper(Properties props) - Instantiate a new PDFTextStripper object. |
PDFTextStripper(String encoding) - Instantiate a new PDFTextStripper object. |
Modifier and Type | Method and Description |
---|---|
protected void | endArticle() - End an article. |
protected void | endDocument(PDDocument pdf) - This method is available for subclasses of this class. |
protected void | endPage(PDPage page) - End a page. |
boolean | getAddMoreFormatting() - This will tell if the text stripper should add some more text formatting. |
String | getArticleEnd() - Returns the string which will be used at the end of an article. |
String | getArticleStart() - Returns the string which will be used at the beginning of an article. |
float | getAverageCharTolerance() - Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. |
protected Vector<List<TextPosition>> | getCharactersByArticle() - Character strings are grouped by articles. |
protected int | getCurrentPageNo() - Get the current page number that is being processed. |
float | getDropThreshold() - Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start. |
PDOutlineItem | getEndBookmark() - Get the bookmark where text extraction should end, inclusive. |
int | getEndPage() - This will get the last page that will be extracted. |
float | getIndentThreshold() - Returns the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start. |
String | getLineSeparator() - This will get the line separator. |
protected List<Pattern> | getListItemPatterns() - Returns a list of regular expression Patterns representing different common list item formats. |
protected Writer | getOutput() - The output stream that is being written to. |
String | getPageEnd() - Returns the string which will be used at the end of a page. |
String | getPageSeparator() - Deprecated. Use getPageStart() and getPageEnd() instead. |
String | getPageStart() - Returns the string which will be used at the beginning of a page. |
String | getParagraphEnd() - Returns the string which will be used at the end of a paragraph. |
String | getParagraphStart() - Returns the string which will be used at the beginning of a paragraph. |
boolean | getSeparateByBeads() - This will tell if the text stripper should separate by beads. |
boolean | getSortByPosition() - This will tell if the text stripper should sort the text tokens before writing to the stream. |
float | getSpacingTolerance() - Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. |
PDOutlineItem | getStartBookmark() - Get the bookmark where text extraction should start, inclusive. |
int | getStartPage() - This is the page that the text extraction will start on. |
boolean | getSuppressDuplicateOverlappingText() |
String | getText(COSDocument doc) - Deprecated. |
String | getText(PDDocument doc) - This will return the text of a document. |
String | getWordSeparator() - This will get the word separator. |
protected PositionWrapper | handleLineSeparation(PositionWrapper current, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine) - Handles the line separator for a new line given the specified current and previous TextPositions. |
String | inspectFontEncoding(String str) - Reverse characters of a compound Arabic glyph. |
protected void | isParagraphSeparation(PositionWrapper position, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine) - Tests the relationship between the last text position, the current text position and the last text position that followed a line separator to decide if the gap represents a paragraph separation. |
protected Pattern | matchListItemPattern(PositionWrapper pw) - Returns the list item Pattern object that matches the text at the specified PositionWrapper, or null if the text does not match such a pattern. |
protected static Pattern | matchPattern(String string, List<Pattern> patterns) - Iterates over the specified list of Patterns until it finds one that matches the specified string. |
protected void | processPage(PDPage page, COSStream content) - This will process the contents of a page. |
protected void | processPages(List<COSObjectable> pages) - This will process all of the pages and the text that is in them. |
protected void | processTextPosition(TextPosition text) - This will process a TextPosition object and add the text to the list of characters on a page. |
void | resetEngine() - This method must be called between processing documents. |
void | setAddMoreFormatting(boolean newAddMoreFormatting) - Some additional text formatting will be added if addMoreFormatting is set to true. |
void | setArticleEnd(String articleEndValue) - Sets the string which will be used at the end of an article. |
void | setArticleStart(String articleStartValue) - Sets the string which will be used at the beginning of an article. |
void | setAverageCharTolerance(float averageCharToleranceValue) - Set the character width-based tolerance value that is used to estimate where spaces in text should be added. |
void | setDropThreshold(float dropThresholdValue) - Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start. |
void | setEndBookmark(PDOutlineItem aEndBookmark) - Set the bookmark where the text extraction should stop. |
void | setEndPage(int endPageValue) - This will set the last page to be extracted by this class. |
void | setIndentThreshold(float indentThresholdValue) - Sets the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start. |
void | setLineSeparator(String separator) - Set the desired line separator for output text. |
protected void | setListItemPatterns(List<Pattern> patterns) - Use to supply a different set of regular expression patterns for matching list item starts. |
void | setPageEnd(String pageEndValue) - Sets the string which will be used at the end of a page. |
void | setPageSeparator(String separator) - Deprecated. Use setPageStart(String) and setPageEnd(String) instead. |
void | setPageStart(String pageStartValue) - Sets the string which will be used at the beginning of a page. |
void | setParagraphEnd(String s) - Sets the string which will be used at the end of a paragraph. |
void | setParagraphStart(String s) - Sets the string which will be used at the beginning of a paragraph. |
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) - Set if the text stripper should group the text output by a list of beads. |
void | setSortByPosition(boolean newSortByPosition) - The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. |
void | setSpacingTolerance(float spacingToleranceValue) - Set the space width-based tolerance value that is used to estimate where spaces in text should be added. |
void | setStartBookmark(PDOutlineItem aStartBookmark) - Set the bookmark where text extraction should start, inclusive. |
void | setStartPage(int startPageValue) - This will set the first page to be extracted by this class. |
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) - By default the text stripper will attempt to remove overlapping duplicate text. |
void | setWordSeparator(String separator) - Set the desired word separator for output text. |
protected void | startArticle() - Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
protected void | startArticle(boolean isltr) - Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
protected void | startDocument(PDDocument pdf) - This method is available for subclasses of this class. |
protected void | startPage(PDPage page) - Start a new page. |
protected void | writeCharacters(TextPosition text) - Write the string in TextPosition to the output stream. |
protected void | writeLineSeparator() - Write the line separator value to the output stream. |
protected void | writePage() - This will print the text of the processed page to "output". |
protected void | writePageEnd() - Write something (if defined) at the end of a page. |
protected void | writePageSeperator() - Write the page separator value to the output stream. |
protected void | writePageStart() - Write something (if defined) at the start of a page. |
protected void | writeParagraphEnd() - Write something (if defined) at the end of a paragraph. |
protected void | writeParagraphSeparator() - Writes the paragraph separator string to the output. |
protected void | writeParagraphStart() - Write something (if defined) at the start of a paragraph. |
protected void | writeString(String text) - Write a Java string to the output stream. |
protected void | writeString(String text, List<TextPosition> textPositions) - Write a Java string to the output stream. |
void | writeText(COSDocument doc, Writer outputStream) - Deprecated. |
void | writeText(PDDocument doc, Writer outputStream) - This will take a PDDocument and write the text of that document to the print writer. |
protected void | writeWordSeparator() - Write the word separator value to the output stream. |
Methods inherited from class PDFStreamEngine: getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
protected final String systemLineSeparator
protected Vector<List<TextPosition>> charactersByArticle
protected String outputEncoding
protected PDDocument document
protected Writer output
public PDFTextStripper() throws IOException
IOException - If there is an error loading the properties.
public PDFTextStripper(Properties props) throws IOException
props - The properties containing the mapping of operators to PDFOperator classes.
IOException - If there is an error reading the properties.
public PDFTextStripper(String encoding) throws IOException
encoding - The encoding that the output will be written in.
IOException - If there is an error reading the properties.
public String getText(PDDocument doc) throws IOException
doc - The document to get the text from.
IOException - If the doc state is invalid or it is encrypted.
public String getText(COSDocument doc) throws IOException
doc - The document to extract the text from.
IOException - If there is an error extracting the text.
See also: getText( PDDocument )
public void writeText(COSDocument doc, Writer outputStream) throws IOException
doc - The document to extract the text from.
outputStream - The stream to write the text to.
IOException - If there is an error extracting the text.
See also: writeText( PDDocument, Writer )
public void resetEngine()
Overrides: resetEngine in class PDFStreamEngine
public void writeText(PDDocument doc, Writer outputStream) throws IOException
doc - The document to get the data from.
outputStream - The location to put the text.
IOException - If the doc is in an invalid state.
protected void processPages(List<COSObjectable> pages) throws IOException
pages - The pages object in the document.
IOException - If there is an error parsing the text.
protected void startDocument(PDDocument pdf) throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void endDocument(PDDocument pdf) throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void processPage(PDPage page, COSStream content) throws IOException
page - The page to process.
content - The contents of the page.
IOException - If there is an error processing the page.
protected void startArticle() throws IOException
IOException - If there is any error writing to the stream.
protected void startArticle(boolean isltr) throws IOException
isltr - true if primary direction of text is left to right.
IOException - If there is any error writing to the stream.
protected void endArticle() throws IOException
IOException - If there is any error writing to the stream.
protected void startPage(PDPage page) throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.
protected void endPage(PDPage page) throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.
protected void writePage() throws IOException
IOException - If there is an error writing the text.
protected void writePageSeperator() throws IOException
IOException - If there is a problem writing out the page separator to the document.
protected void writeLineSeparator() throws IOException
IOException - If there is a problem writing out the line separator to the document.
protected void writeWordSeparator() throws IOException
IOException - If there is a problem writing out the word separator to the document.
protected void writeCharacters(TextPosition text) throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
The default implementation ignores the textPositions and just calls writeString(String).
text - The text to write to the stream.
textPositions - The TextPositions belonging to the text.
IOException - If there is an error when writing the text.
protected void writeString(String text) throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.
protected void processTextPosition(TextPosition text)
Overrides: processTextPosition in class PDFStreamEngine
text - The text to process.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of 1-based startPage property.
public int getEndPage()
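The startPage and endPage properties are 1-based, and the end page is described as the last page that will be extracted. A minimal sketch of how such an inclusive range selects pages follows. The class and method names are hypothetical and use only the standard library; this is not PDFBox's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models a 1-based, inclusive page range.
final class PageRangeSketch {

    /** Returns the 1-based page numbers that fall inside [startPage, endPage]. */
    static List<Integer> selectPages(int pageCount, int startPage, int endPage) {
        List<Integer> selected = new ArrayList<Integer>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (pageNo >= startPage && pageNo <= endPage) {
                selected.add(pageNo);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Pages 2 and 3 of a five-page document.
        System.out.println(selectPages(5, 2, 3)); // prints [2, 3]
        // An end page beyond the document length simply selects through the last page.
        System.out.println(selectPages(5, 1, Integer.MAX_VALUE));
    }
}
```

An end page larger than the page count is harmless in this model, which is why a very large value can stand for "through the end of the document".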
public void setEndPage(int endPageValue)
endPageValue - New value of 1-based endPage property.
public void setLineSeparator(String separator)
separator - The desired line separator string.
public String getLineSeparator()
public void setPageSeparator(String separator)
separator - The desired page separator string.
public String getWordSeparator()
public void setWordSeparator(String separator)
separator - The desired word separator string.
public String getPageSeparator()
public boolean getSuppressDuplicateOverlappingText()
protected int getCurrentPageNo()
protected Writer getOutput()
protected Vector<List<TextPosition>> getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
public boolean getSeparateByBeads()
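The duplicate-suppression option removes text that is drawn more than once at nearly the same place. One way to picture the idea is the sketch below, which drops a glyph when the same character was already kept within a small distance. The Glyph class, the method name, and the tolerance parameter are all hypothetical; how PDFBox actually chooses its tolerance is not stated in this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of overlap suppression; not PDFBox's implementation.
final class OverlapSketch {

    /** Hypothetical stand-in for a positioned piece of text. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the first occurrence of each character within `tolerance` of a position. */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph candidate : glyphs) {
            boolean duplicate = false;
            for (Glyph earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) < tolerance
                        && Math.abs(earlier.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph("A", 10f, 10f));
        glyphs.add(new Glyph("A", 10.2f, 10.1f)); // same letter drawn again, slightly offset
        glyphs.add(new Glyph("B", 20f, 10f));
        System.out.println(suppressDuplicates(glyphs, 1.0f).size()); // prints 2
    }
}
```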
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.
public boolean getAddMoreFormatting()
public void setAddMoreFormatting(boolean newAddMoreFormatting)
newAddMoreFormatting - Tell PDFBox to add some more text formatting.
public boolean getSortByPosition()
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Tell PDFBox to sort the text positions.
public float getSpacingTolerance()
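Sorting by position matters because text tokens in a PDF file may be stored in a different order from the one in which they appear on screen. The sketch below illustrates the idea by ordering tokens top to bottom and then left to right. The Token class is hypothetical, y is assumed to grow downward, and tokens are treated as being on the same line only when their y values are exactly equal; a real comparator would likely allow some tolerance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of position-based sorting; not PDFBox's comparator.
final class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned text token. */
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the tokens with spaces after sorting by y (downward), then by x. */
    static String readInVisualOrder(List<Token> tokens) {
        List<Token> sorted = new ArrayList<Token>(tokens);
        Collections.sort(sorted, new Comparator<Token>() {
            @Override
            public int compare(Token a, Token b) {
                int byLine = Float.compare(a.y, b.y);
                return byLine != 0 ? byLine : Float.compare(a.x, b.x);
            }
        });
        StringBuilder out = new StringBuilder();
        for (Token token : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> stored = new ArrayList<Token>();
        stored.add(new Token("world", 50f, 10f));
        stored.add(new Token("second", 0f, 20f));
        stored.add(new Token("Hello", 0f, 10f));
        System.out.println(readInVisualOrder(stored)); // prints Hello world second
    }
}
```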
public void setSpacingTolerance(float spacingToleranceValue)
spacingToleranceValue - tolerance / scaling factor to use
public float getAverageCharTolerance()
public void setAverageCharTolerance(float averageCharToleranceValue)
averageCharToleranceValue - average tolerance / scaling factor to use
public float getIndentThreshold()
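The spacing tolerance and the average character tolerance are both scaling factors used to estimate where spaces should be added. The sketch below shows one plausible way two such factors could be combined: each scales a width, and a gap larger than the smaller of the two products counts as a word break. The method, its parameters, and the way the two factors are combined are assumptions for illustration, and the values 0.5 and 0.3 are example inputs only.

```java
// Hypothetical model of tolerance-based space detection; not PDFBox's implementation.
final class SpacingSketch {

    /**
     * Returns true when the gap between the end of the previous glyph and the
     * start of the current one is wide enough to be treated as a word break.
     */
    static boolean isWordGap(float lastEndX, float currentStartX,
                             float spaceWidth, float averageCharWidth,
                             float spacingTolerance, float averageCharTolerance) {
        float deltaSpace = spaceWidth * spacingTolerance;
        float deltaChar = averageCharWidth * averageCharTolerance;
        // Assumption: the stricter (smaller) of the two allowances wins.
        float expectedStartOfNextWord = lastEndX + Math.min(deltaSpace, deltaChar);
        return currentStartX > expectedStartOfNextWord;
    }

    public static void main(String[] args) {
        // Space width 4, average glyph width 6, example tolerances 0.5 and 0.3.
        System.out.println(isWordGap(100f, 103f, 4f, 6f, 0.5f, 0.3f));   // wide gap
        System.out.println(isWordGap(100f, 100.5f, 4f, 6f, 0.5f, 0.3f)); // narrow gap
    }
}
```

Raising either tolerance in this model makes the stripper less eager to insert spaces; lowering it has the opposite effect.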
public void setIndentThreshold(float indentThresholdValue)
indentThresholdValue - the number of whitespace character widths to use when detecting paragraph indents.
public float getDropThreshold()
public void setDropThreshold(float dropThresholdValue)
dropThresholdValue - the character height multiple for max allowed whitespace between lines in the same paragraph.
public String getParagraphStart()
public void setParagraphStart(String s)
s - the paragraph start string
public String getParagraphEnd()
public void setParagraphEnd(String s)
s - the paragraph end string
public String getPageStart()
public void setPageStart(String pageStartValue)
pageStartValue - the page start string
public String getPageEnd()
public void setPageEnd(String pageEndValue)
pageEndValue - the page end string
public String getArticleStart()
public void setArticleStart(String articleStartValue)
articleStartValue - the article start string
public String getArticleEnd()
public void setArticleEnd(String articleEndValue)
articleEndValue - the article end string
public String inspectFontEncoding(String str)
Overrides: inspectFontEncoding in class PDFStreamEngine
str - a string obtained from font.encoding()
protected PositionWrapper handleLineSeparation(PositionWrapper current, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine) throws IOException
current - the current text position
lastPosition - the previous text position
lastLineStartPosition - the last text position that followed a line separator.
maxHeightForLine - max height for positions since lastLineStartPosition
IOException - if something went wrong
protected void isParagraphSeparation(PositionWrapper position, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine)
This base implementation tests to see if the lastLineStartPosition is null OR if the current vertical position has dropped below the last text vertical position by at least 2.5 times the current text height OR if the current horizontal position is indented by at least 2 times the current width of a space character.
This also attempts to identify text that is indented under a hanging indent.
This method sets the isParagraphStart and isHangingIndent flags on the current position object.
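The three conditions of the base test (no previous line start, a vertical drop of at least 2.5 text heights, or an indent of at least 2 space widths) can be sketched as a single boolean check. The Pos class and method name below are hypothetical, y is assumed to grow downward, and hanging-indent detection is left out, so this is a simplified model and not the method's actual code.

```java
// Hypothetical model of the paragraph-start test; not PDFBox's implementation.
final class ParagraphSketch {

    static final float DROP_THRESHOLD = 2.5f;   // multiple of the text height
    static final float INDENT_THRESHOLD = 2.0f; // multiple of the space width

    /** Hypothetical stand-in for a text position; y is assumed to grow downward. */
    static final class Pos {
        final float x;
        final float y;
        final float height;
        final float spaceWidth;

        Pos(float x, float y, float height, float spaceWidth) {
            this.x = x;
            this.y = y;
            this.height = height;
            this.spaceWidth = spaceWidth;
        }
    }

    static boolean isParagraphStart(Pos current, Pos last, Pos lastLineStart) {
        if (lastLineStart == null) {
            return true; // nothing precedes this line
        }
        float drop = current.y - last.y;            // vertical distance from the previous text
        float indent = current.x - lastLineStart.x; // horizontal offset from the previous line start
        return drop >= DROP_THRESHOLD * current.height
                || indent >= INDENT_THRESHOLD * current.spaceWidth;
    }

    public static void main(String[] args) {
        Pos previous = new Pos(20f, 100f, 10f, 4f);
        System.out.println(isParagraphStart(new Pos(20f, 112f, 10f, 4f), previous, previous)); // ordinary next line
        System.out.println(isParagraphStart(new Pos(20f, 130f, 10f, 4f), previous, previous)); // large drop
        System.out.println(isParagraphStart(new Pos(30f, 112f, 10f, 4f), previous, previous)); // indented line
    }
}
```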
position - the current text position. This may have its isParagraphStart or isHangingIndent flags set upon return.
lastPosition - the previous text position (should not be null).
lastLineStartPosition - the last text position that followed a line separator. May be null.
maxHeightForLine - max height for text positions since lastLineStartPosition.
protected void writeParagraphSeparator() throws IOException
IOException - if something went wrong
protected void writeParagraphStart() throws IOException
IOException - if something went wrong
protected void writeParagraphEnd() throws IOException
IOException - if something went wrong
protected void writePageStart() throws IOException
IOException - if something went wrong
protected void writePageEnd() throws IOException
IOException - if something went wrong
protected Pattern matchListItemPattern(PositionWrapper pw)
The patterns tested are those returned by the getListItemPatterns() method. To add to the list, simply override that method (if sub-classing) or explicitly supply your own list using setListItemPatterns(List).
pw - position
protected void setListItemPatterns(List<Pattern> patterns)
patterns - list of patterns
protected List<Pattern> getListItemPatterns()
This method returns a list of regular expression Patterns representing common list item formats.
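List item detection comes down to testing a line's text against an ordered list of regular expressions and returning the first one that matches, or null when none does. The sketch below illustrates that with the standard library only. The three patterns are made-up examples and are not the library's defaults, and the use of a full-string match is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of first-match pattern lookup; not PDFBox's implementation.
final class ListItemSketch {

    // Example patterns only (assumed, not the library's defaults).
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("\\d+\\.\\s.*"),   // "1. item"
            Pattern.compile("[a-z]\\)\\s.*"),  // "a) item"
            Pattern.compile("[-*]\\s.*"));     // "- item" or "* item"

    /** Returns the first pattern that matches the whole text, or null if none does. */
    static Pattern firstMatch(String text, List<Pattern> patterns) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(text).matches()) {
                return pattern;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("1. Introduction", PATTERNS) == PATTERNS.get(0)); // prints true
        System.out.println(firstMatch("Plain sentence.", PATTERNS) == null);            // prints true
    }
}
```

Because the first match wins, list order decides which pattern is reported when more than one could apply.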
protected static final Pattern matchPattern(String string, List<Pattern> patterns)
Order of the supplied list of patterns is important, as the most common patterns should come first. Patterns should be strict in general, and all will be used with case sensitivity on.
string - the string to be searched
patterns - list of patterns
Copyright © 2002–2017 The Apache Software Foundation. All rights reserved.