public class PDFTextStripper extends PDFStreamEngine
Modifier and Type | Field and Description |
---|---|
protected ArrayList<List<TextPosition>> | charactersByArticle: The charactersByArticle is used to extract text by article divisions. |
protected PDDocument | document |
protected String | LINE_SEPARATOR: The platform's line separator. |
protected Writer | output |
Constructor and Description |
---|
PDFTextStripper(): Instantiate a new PDFTextStripper object. |
Modifier and Type | Method and Description |
---|---|
protected void | endArticle(): End an article. |
protected void | endDocument(PDDocument document): This method is available for subclasses of this class. |
protected void | endPage(PDPage page): End a page. |
boolean | getAddMoreFormatting(): This will tell if the text stripper should add some more text formatting. |
String | getArticleEnd(): Returns the string which will be used at the end of an article. |
String | getArticleStart(): Returns the string which will be used at the beginning of an article. |
float | getAverageCharTolerance(): Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. |
protected List<List<TextPosition>> | getCharactersByArticle(): Character strings are grouped by articles. |
protected int | getCurrentPageNo(): Get the current page number that is being processed. |
float | getDropThreshold(): Returns the minimum whitespace, as a multiple of the maximum height of the current characters, beyond which the current line start is considered to be a paragraph start. |
PDOutlineItem | getEndBookmark(): Get the bookmark where text extraction should end, inclusive. |
int | getEndPage(): This will get the last page that will be extracted. |
float | getIndentThreshold(): Returns the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start, beyond which the current line start is considered to be a paragraph start. |
String | getLineSeparator(): This will get the line separator. |
protected List<Pattern> | getListItemPatterns(): Returns a list of regular expression Patterns representing different common list item formats. |
protected Writer | getOutput(): The output stream that is being written to. |
String | getPageEnd(): Returns the string which will be used at the end of a page. |
String | getPageStart(): Returns the string which will be used at the beginning of a page. |
String | getParagraphEnd(): Returns the string which will be used at the end of a paragraph. |
String | getParagraphStart(): Returns the string which will be used at the beginning of a paragraph. |
boolean | getSeparateByBeads(): This will tell if the text stripper should separate by beads. |
boolean | getSortByPosition(): This will tell if the text stripper should sort the text tokens before writing to the stream. |
float | getSpacingTolerance(): Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. |
PDOutlineItem | getStartBookmark(): Get the bookmark where text extraction should start, inclusive. |
int | getStartPage(): This is the page that the text extraction will start on. |
boolean | getSuppressDuplicateOverlappingText() |
String | getText(PDDocument doc): This will return the text of a document. |
String | getWordSeparator(): This will get the word separator. |
protected static Pattern | matchPattern(String string, List<Pattern> patterns): Iterates over the specified list of Patterns until it finds one that matches the specified string. |
void | processPage(PDPage page): This will process the contents of a page. |
protected void | processPages(PDPageTree pages): This will process all of the pages and the text that is in them. |
protected void | processTextPosition(TextPosition text): This will process a TextPosition object and add the text to the list of characters on a page. |
void | setAddMoreFormatting(boolean newAddMoreFormatting): Some additional text formatting will be added if addMoreFormatting is set to true. |
void | setArticleEnd(String articleEndValue): Sets the string which will be used at the end of an article. |
void | setArticleStart(String articleStartValue): Sets the string which will be used at the beginning of an article. |
void | setAverageCharTolerance(float averageCharToleranceValue): Set the character width-based tolerance value that is used to estimate where spaces in text should be added. |
void | setDropThreshold(float dropThresholdValue): Sets the minimum whitespace, as a multiple of the maximum height of the current characters, beyond which the current line start is considered to be a paragraph start. |
void | setEndBookmark(PDOutlineItem aEndBookmark): Set the bookmark where the text extraction should stop. |
void | setEndPage(int endPageValue): This will set the last page to be extracted by this class. |
void | setIndentThreshold(float indentThresholdValue): Sets the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start, beyond which the current line start is considered to be a paragraph start. |
void | setLineSeparator(String separator): Set the desired line separator for output text. |
protected void | setListItemPatterns(List<Pattern> patterns): Use to supply a different set of regular expression patterns for matching list item starts. |
void | setPageEnd(String pageEndValue): Sets the string which will be used at the end of a page. |
void | setPageStart(String pageStartValue): Sets the string which will be used at the beginning of a page. |
void | setParagraphEnd(String s): Sets the string which will be used at the end of a paragraph. |
void | setParagraphStart(String s): Sets the string which will be used at the beginning of a paragraph. |
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads): Set if the text stripper should group the text output by a list of beads. |
void | setSortByPosition(boolean newSortByPosition): The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. |
void | setSpacingTolerance(float spacingToleranceValue): Set the space width-based tolerance value that is used to estimate where spaces in text should be added. |
void | setStartBookmark(PDOutlineItem aStartBookmark): Set the bookmark where text extraction should start, inclusive. |
void | setStartPage(int startPageValue): This will set the first page to be extracted by this class. |
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue): By default the text stripper will attempt to remove duplicate text that overlaps. |
void | setWordSeparator(String separator): Set the desired word separator for output text. |
protected void | showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement): This method was originally written by Ben Litchfield for PDFStreamEngine. |
protected void | startArticle(): Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
protected void | startArticle(boolean isLTR): Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
protected void | startDocument(PDDocument document): This method is available for subclasses of this class. |
protected void | startPage(PDPage page): Start a new page. |
protected void | writeCharacters(TextPosition text): Write the string in TextPosition to the output stream. |
protected void | writeLineSeparator(): Write the line separator value to the output stream. |
protected void | writePage(): This will print the text of the processed page to "output". |
protected void | writePageEnd(): Write something (if defined) at the end of a page. |
protected void | writePageStart(): Write something (if defined) at the start of a page. |
protected void | writeParagraphEnd(): Write something (if defined) at the end of a paragraph. |
protected void | writeParagraphSeparator(): Writes the paragraph separator string to the output. |
protected void | writeParagraphStart(): Write something (if defined) at the start of a paragraph. |
protected void | writeString(String text): Write a Java string to the output stream. |
protected void | writeString(String text, List<TextPosition> textPositions): Write a Java string to the output stream. |
void | writeText(PDDocument doc, Writer outputStream): This will take a PDDocument and write the text of that document to the print writer. |
protected void | writeWordSeparator(): Write the word separator value to the output stream. |
Methods inherited from class PDFStreamEngine: addOperator, applyTextAdjustment, beginText, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getResources, getTextLineMatrix, getTextMatrix, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showForm, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
protected final String LINE_SEPARATOR
protected ArrayList<List<TextPosition>> charactersByArticle
protected PDDocument document
protected Writer output
public PDFTextStripper() throws IOException
IOException - If there is an error loading the properties.

public String getText(PDDocument doc) throws IOException
doc - The document to get the text from.
IOException - If the doc state is invalid or it is encrypted.

public void writeText(PDDocument doc, Writer outputStream) throws IOException
doc - The document to get the data from.
outputStream - The location to put the text.
IOException - If the doc is in an invalid state.

protected void processPages(PDPageTree pages) throws IOException
pages - The pages object in the document.
IOException - If there is an error parsing the text.

protected void startDocument(PDDocument document) throws IOException
document - The PDF document that is being processed.
IOException - If an IO error occurs.

protected void endDocument(PDDocument document) throws IOException
document - The PDF document that is being processed.
IOException - If an IO error occurs.

public void processPage(PDPage page) throws IOException
page - The page to process.
IOException - If there is an error processing the page.

protected void startArticle() throws IOException
IOException - If there is any error writing to the stream.

protected void startArticle(boolean isLTR) throws IOException
isLTR - true if primary direction of text is left to right.
IOException - If there is any error writing to the stream.

protected void endArticle() throws IOException
IOException - If there is any error writing to the stream.

protected void startPage(PDPage page) throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.

protected void endPage(PDPage page) throws IOException
page - The page that has just been processed.
IOException - If there is any error writing to the stream.

protected void writePage() throws IOException
IOException - If there is an error writing the text.

protected void writeLineSeparator() throws IOException
IOException - If there is a problem writing out the line separator to the document.

protected void writeWordSeparator() throws IOException
IOException - If there is a problem writing out the word separator to the document.

protected void writeCharacters(TextPosition text) throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.

protected void writeString(String text, List<TextPosition> textPositions) throws IOException
The default implementation ignores textPositions and just calls writeString(String).
text - The text to write to the stream.
textPositions - The TextPositions belonging to the text.
IOException - If there is an error when writing the text.

protected void writeString(String text) throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.

protected void processTextPosition(TextPosition text)
text - The text to process.

public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of 1-based startPage property.

public int getEndPage()
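Both page properties are 1-based, and the end page is the last page extracted. As a rough illustration of what such an inclusive range selects, here is a minimal stand-alone sketch. The class `PageRangeSketch` and its `selectedPages` helper are hypothetical names introduced for this example and are not part of PDFTextStripper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: selects pages using 1-based, inclusive start and end values. */
final class PageRangeSketch {

    /** Returns the 1-based page numbers that fall inside [startPage, endPage]. */
    static List<Integer> selectedPages(int pageCount, int startPage, int endPage) {
        List<Integer> selected = new ArrayList<>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (pageNo >= startPage && pageNo <= endPage) {
                selected.add(pageNo);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // A 5-page document with startPage = 2 and endPage = 4.
        System.out.println(selectedPages(5, 2, 4)); // prints [2, 3, 4]
    }
}
```

An end page beyond the document length simply selects everything from the start page onward in this sketch.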
public void setEndPage(int endPageValue)
endPageValue - New value of 1-based endPage property.

public void setLineSeparator(String separator)
separator - The desired line separator string.

public String getLineSeparator()
public String getWordSeparator()
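To make the role of the two separators concrete, the following sketch joins words with a word separator and terminates each line with a line separator. It is only an illustration under simplifying assumptions: `SeparatorSketch` and `render` are hypothetical names, and the real class decides where words and lines break from glyph positions rather than from ready-made lists.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of how word and line separators shape extracted text. */
final class SeparatorSketch {

    /** Joins the words of each line with wordSeparator and ends each line with lineSeparator. */
    static String render(List<List<String>> lines, String wordSeparator, String lineSeparator) {
        StringBuilder out = new StringBuilder();
        for (List<String> words : lines) {
            out.append(String.join(wordSeparator, words));
            out.append(lineSeparator);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> lines = Arrays.asList(
                Arrays.asList("Hello", "PDF"),
                Arrays.asList("second", "line"));
        // Tab-separated words, newline-terminated lines.
        System.out.print(render(lines, "\t", "\n"));
    }
}
```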
public void setWordSeparator(String separator)
separator - The desired word separator string.

public boolean getSuppressDuplicateOverlappingText()
protected int getCurrentPageNo()
protected Writer getOutput()
protected List<List<TextPosition>> getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.

public boolean getSeparateByBeads()
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.

public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.

public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.

public boolean getAddMoreFormatting()
public void setAddMoreFormatting(boolean newAddMoreFormatting)
newAddMoreFormatting - Tell PDFBox to add some more text formatting.

public boolean getSortByPosition()
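Because the order of text tokens in the PDF stream need not match their visual order, sorting by position reorders them before writing. The sketch below shows the general idea with a plain top-to-bottom, then left-to-right comparison. `SortByPositionSketch` and its `Token` type are hypothetical stand-ins, the downward-growing y axis is an assumption of this example, and the library's own comparison is more involved than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of ordering text tokens by their position on the page. */
final class SortByPositionSketch {

    /** Stand-in for a positioned text token; y is assumed to grow downward. */
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the token texts ordered top-to-bottom, then left-to-right. */
    static List<String> inReadingOrder(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.<Token>comparingDouble(t -> t.y)
                .thenComparingDouble(t -> t.x));
        List<String> texts = new ArrayList<>();
        for (Token t : sorted) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Stream order differs from visual order.
        List<Token> streamOrder = Arrays.asList(
                new Token("World", 60f, 10f),
                new Token("Footer", 10f, 700f),
                new Token("Hello", 10f, 10f));
        System.out.println(inReadingOrder(streamOrder)); // prints [Hello, World, Footer]
    }
}
```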
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Tell PDFBox to sort the text positions.

public float getSpacingTolerance()
public void setSpacingTolerance(float spacingToleranceValue)
spacingToleranceValue - Tolerance / scaling factor to use.

public float getAverageCharTolerance()
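The spacing tolerance scales the width of a space and the average character tolerance scales the average character width. Both are used to estimate where spaces should be added. The sketch below shows one plausible way the two values could combine into a single gap test. `SpaceEstimateSketch`, `needsSpace`, and the choice of the smaller of the two scaled widths as the threshold are assumptions of this example, and the tolerance values in `main` are illustrative only.

```java
/** Hypothetical illustration of a gap test driven by the two tolerance values. */
final class SpaceEstimateSketch {

    /**
     * Returns true when the horizontal gap before the next glyph suggests a word break.
     * The threshold is the smaller of the two tolerance-scaled widths (assumption).
     */
    static boolean needsSpace(float previousEndX, float nextStartX,
                              float spaceWidth, float averageCharWidth,
                              float spacingTolerance, float averageCharTolerance) {
        float deltaSpace = spaceWidth * spacingTolerance;
        float deltaCharWidth = averageCharWidth * averageCharTolerance;
        float expectedStartOfNextWordX = previousEndX + Math.min(deltaSpace, deltaCharWidth);
        return nextStartX > expectedStartOfNextWordX;
    }

    public static void main(String[] args) {
        // Space width 3, average character width 5, tolerances 0.5 and 0.3.
        System.out.println(needsSpace(100f, 103f, 3f, 5f, 0.5f, 0.3f));   // prints true
        System.out.println(needsSpace(100f, 100.5f, 3f, 5f, 0.5f, 0.3f)); // prints false
    }
}
```

In this sketch, raising either tolerance makes a space less likely to be inserted, and lowering it makes one more likely.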
public void setAverageCharTolerance(float averageCharToleranceValue)
averageCharToleranceValue - Average tolerance / scaling factor to use.

public float getIndentThreshold()
public void setIndentThreshold(float indentThresholdValue)
indentThresholdValue - The number of whitespace character widths to use when detecting paragraph indents.

public float getDropThreshold()
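The indent threshold is a multiple of the whitespace character width, and the drop threshold is a multiple of the maximum character height. Either one, when exceeded, marks the current line start as a paragraph start. The following sketch expresses just those two comparisons. `ParagraphStartSketch` and `isParagraphStart` are hypothetical names, the threshold values in `main` are illustrative, and the library's actual paragraph detection considers more cases than this.

```java
/** Hypothetical illustration of the drop and indent threshold comparisons. */
final class ParagraphStartSketch {

    /**
     * Returns true when the line start qualifies as a paragraph start, either because
     * the vertical gap exceeds dropThreshold * maxCharHeight or because the indent
     * relative to the previous line start exceeds indentThreshold * spaceWidth.
     */
    static boolean isParagraphStart(float verticalGap, float maxCharHeight, float dropThreshold,
                                    float indent, float spaceWidth, float indentThreshold) {
        boolean bigDrop = verticalGap > dropThreshold * maxCharHeight;
        boolean bigIndent = indent > indentThreshold * spaceWidth;
        return bigDrop || bigIndent;
    }

    public static void main(String[] args) {
        // Character height 10, space width 3, drop threshold 2.5, indent threshold 2.
        System.out.println(isParagraphStart(30f, 10f, 2.5f, 0f, 3f, 2f)); // prints true (large drop)
        System.out.println(isParagraphStart(12f, 10f, 2.5f, 9f, 3f, 2f)); // prints true (large indent)
        System.out.println(isParagraphStart(12f, 10f, 2.5f, 1f, 3f, 2f)); // prints false
    }
}
```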
public void setDropThreshold(float dropThresholdValue)
dropThresholdValue - The character height multiple for max allowed whitespace between lines in the same paragraph.

public String getParagraphStart()
public void setParagraphStart(String s)
s - The paragraph start string.

public String getParagraphEnd()
public void setParagraphEnd(String s)
s - The paragraph end string.

public String getPageStart()
public void setPageStart(String pageStartValue)
pageStartValue - The page start string.

public String getPageEnd()
public void setPageEnd(String pageEndValue)
pageEndValue - The page end string.

public String getArticleStart()
public void setArticleStart(String articleStartValue)
articleStartValue - The article start string.

public String getArticleEnd()
public void setArticleEnd(String articleEndValue)
articleEndValue - The article end string.

protected void writeParagraphSeparator() throws IOException
IOException - If an error occurs while writing to the output.

protected void writeParagraphStart() throws IOException
IOException - If an error occurs while writing to the output.

protected void writeParagraphEnd() throws IOException
IOException - If an error occurs while writing to the output.

protected void writePageStart() throws IOException
IOException - If an error occurs while writing to the output.

protected void writePageEnd() throws IOException
IOException - If an error occurs while writing to the output.

protected void setListItemPatterns(List<Pattern> patterns)
patterns - List of patterns.

protected List<Pattern> getListItemPatterns()
Returns a list of regular expression Patterns representing different common list item formats.
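As an illustration of the kind of pattern list these accessors deal with, and of the first-match lookup that the method summary describes for matchPattern, consider the sketch below. `ListItemPatternSketch`, `firstMatch`, and `examplePatterns` are hypothetical names. The three patterns are examples only and are not the library's built-in list. Matching the whole string with `matches()` is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of a list-item pattern list and a first-match lookup. */
final class ListItemPatternSketch {

    /** Returns the first pattern that matches the whole string, or null if none does. */
    static Pattern firstMatch(String string, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(string).matches()) {
                return p;
            }
        }
        return null;
    }

    /** Illustrative, case-sensitive patterns; earlier entries are tried first. */
    static List<Pattern> examplePatterns() {
        return Arrays.asList(
                Pattern.compile("\\d+\\."),      // numbered items such as "1." or "12."
                Pattern.compile("[A-Z]\\."),     // lettered items such as "A."
                Pattern.compile("\\(\\d+\\)"));  // parenthesised items such as "(3)"
    }

    public static void main(String[] args) {
        List<Pattern> patterns = examplePatterns();
        System.out.println(firstMatch("12.", patterns) != null); // prints true
        System.out.println(firstMatch("a.", patterns) != null);  // prints false (case-sensitive)
    }
}
```

Since the lookup stops at the first match, placing the most common patterns first keeps the typical case cheap.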
protected static Pattern matchPattern(String string, List<Pattern> patterns)
Order of the supplied list of patterns is important as most common patterns should come first. Patterns should be strict in general, and all will be used with case sensitivity on.
string - The string to be searched.
patterns - List of patterns.

protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement) throws IOException
Overrides: showGlyph in class PDFStreamEngine
textRenderingMatrix - The current text rendering matrix, Trm.
font - The current font.
code - Internal PDF character code for the glyph.
unicode - The Unicode text for this glyph, or null if the PDF does not provide it.
displacement - The displacement (i.e. advance) of the glyph in text space.
IOException - If the glyph cannot be processed.

Copyright © 2002–2016 The Apache Software Foundation. All rights reserved.