java.lang.Object
org.apache.pdfbox.util.PDFStreamEngine
org.apache.pdfbox.util.PDFTextStripper
public class PDFTextStripper
This class takes a PDF document and strips out all of the text, ignoring the formatting. Please note: it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document.

The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, we fully process each page and then print it.
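The flow described above can be pictured with a small stand-alone sketch. It does not use PDFBox: `MiniStripper`, `MarkingStripper`, and the list-of-strings "document" are hypothetical stand-ins, and only the hook names (`startDocument`, `startPage`, `endPage`, `endDocument`) mirror methods listed in this reference.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: a "document" is a list of pages, a "page" is a list of lines.
class MiniStripper {
    protected Writer output;

    // Top level: process the whole document, one page at a time.
    void writeText(List<List<String>> document, Writer out) throws IOException {
        output = out;
        startDocument();
        for (List<String> page : document) {
            processPage(page);
        }
        endDocument();
    }

    // Next level down: a single page is processed and then written out.
    protected void processPage(List<String> page) throws IOException {
        startPage();
        for (String line : page) {
            output.write(line);
            output.write("\n");
        }
        endPage();
    }

    // Hooks that subclasses may override; they do nothing by default.
    protected void startDocument() throws IOException { }
    protected void endDocument() throws IOException { }
    protected void startPage() throws IOException { }
    protected void endPage() throws IOException { }
}

// A subclass that marks the start of every page through a hook.
class MarkingStripper extends MiniStripper {
    @Override
    protected void startPage() throws IOException {
        output.write("[page]\n");
    }

    // Runs the sketch on a two-page sample and returns the extracted text.
    static String demo() {
        List<List<String>> doc = Arrays.asList(
                Arrays.asList("Hello", "world"),
                Arrays.asList("Second page"));
        StringWriter out = new StringWriter();
        try {
            new MarkingStripper().writeText(doc, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In this sketch, `MarkingStripper.demo()` returns the lines of both pages with a `[page]` marker line before each page, which is the kind of customization the protected hook methods are meant to allow.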
| Field Summary | |
|---|---|
| protected Vector<List<TextPosition>> | charactersByArticle - The charactersByArticle is used to extract text by article divisions. |
| protected PDDocument | document - The document to read. |
| protected String | lineSeparator - The platform's line separator. |
| protected Writer | output - The stream to write the output to. |
| protected String | outputEncoding - The encoding that text will be written in (or null). |
| Constructor Summary | |
|---|---|
| PDFTextStripper() | Instantiate a new PDFTextStripper object. |
| PDFTextStripper(Properties props) | Instantiate a new PDFTextStripper object. |
| PDFTextStripper(String encoding) | Instantiate a new PDFTextStripper object. |
| Method Summary | |
|---|---|
| protected void | endArticle() - End an article. |
| protected void | endDocument(PDDocument pdf) - This method is available for subclasses of this class. |
| protected void | endPage(PDPage page) - End a page. |
| float | getAverageCharTolerance() - Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. |
| protected Vector<List<TextPosition>> | getCharactersByArticle() - Character strings are grouped by articles. |
| protected int | getCurrentPageNo() - Get the current page number that is being processed. |
| PDOutlineItem | getEndBookmark() - Get the bookmark where text extraction should end, inclusive. |
| int | getEndPage() - This will get the last page that will be extracted. |
| String | getLineSeparator() - This will get the line separator. |
| protected Writer | getOutput() - The output stream that is being written to. |
| String | getPageSeparator() - This will get the page separator. |
| float | getSpacingTolerance() - Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. |
| PDOutlineItem | getStartBookmark() - Get the bookmark where text extraction should start, inclusive. |
| int | getStartPage() - This is the page that the text extraction will start on. |
| String | getText(COSDocument doc) - Deprecated. |
| String | getText(PDDocument doc) - This will return the text of a document. |
| String | getWordSeparator() - This will get the word separator. |
| String | inspectFontEncoding(String str) - Reverse characters of a compound Arabic glyph. |
| protected void | processPage(PDPage page, COSStream content) - This will process the contents of a page. |
| protected void | processPages(List<COSObjectable> pages) - This will process all of the pages and the text that is in them. |
| protected void | processTextPosition(TextPosition text) - This will process a TextPosition object and add the text to the list of characters on a page. |
| void | resetEngine() - This method must be called between processing documents. |
| void | setAverageCharTolerance(float averageCharToleranceValue) - Set the character width-based tolerance value that is used to estimate where spaces in text should be added. |
| void | setEndBookmark(PDOutlineItem aEndBookmark) - Set the bookmark where the text extraction should stop. |
| void | setEndPage(int endPageValue) - This will set the last page to be extracted by this class. |
| void | setLineSeparator(String separator) - Set the desired line separator for output text. |
| void | setPageSeparator(String separator) - Set the desired page separator for output text. |
| void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) - Set whether the text stripper should group the text output by a list of beads. |
| void | setSortByPosition(boolean newSortByPosition) - The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. |
| void | setSpacingTolerance(float spacingToleranceValue) - Set the space width-based tolerance value that is used to estimate where spaces in text should be added. |
| void | setStartBookmark(PDOutlineItem aStartBookmark) - Set the bookmark where text extraction should start, inclusive. |
| void | setStartPage(int startPageValue) - This will set the first page to be extracted by this class. |
| void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) - By default the text stripper will attempt to remove text that overlaps other text. |
| void | setWordSeparator(String separator) - Set the desired word separator for output text. |
| boolean | shouldSeparateByBeads() - This will tell if the text stripper should separate by beads. |
| boolean | shouldSortByPosition() - This will tell if the text stripper should sort the text tokens before writing to the stream. |
| boolean | shouldSuppressDuplicateOverlappingText() |
| protected void | startArticle() - Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
| protected void | startArticle(boolean isltr) - Start a new article, which is typically defined as a column on a single page (also referred to as a bead). |
| protected void | startDocument(PDDocument pdf) - This method is available for subclasses of this class. |
| protected void | startPage(PDPage page) - Start a new page. |
| protected void | writeCharacters(TextPosition text) - Write the string in TextPosition to the output stream. |
| protected void | writeLineSeparator() - Write the line separator value to the output stream. |
| protected void | writePage() - This will print the text of the processed page to "output". |
| protected void | writePageSeperator() - Write the page separator value to the output stream. |
| protected void | writeString(String text) - Write a Java string to the output stream. |
| void | writeText(COSDocument doc, Writer outputStream) - Deprecated. |
| void | writeText(PDDocument doc, Writer outputStream) - This will take a PDDocument and write the text of that document to the print writer. |
| protected void | writeWordSeparator() - Write the word separator value to the output stream. |
| Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine |
|---|
| getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
protected Vector<List<TextPosition>> charactersByArticle
protected String lineSeparator
protected String outputEncoding
protected PDDocument document
protected Writer output
| Constructor Detail |
|---|
public PDFTextStripper()
throws IOException
IOException - If there is an error loading the properties.
public PDFTextStripper(Properties props)
throws IOException
props - The properties containing the mapping of operators to PDFOperator classes.
IOException - If there is an error reading the properties.
public PDFTextStripper(String encoding)
throws IOException
encoding - The encoding that the output will be written in.
IOException - If there is an error reading the properties.

| Method Detail |
|---|
public String getText(PDDocument doc)
throws IOException
doc - The document to get the text from.
IOException - if the doc state is invalid or it is encrypted.
public String getText(COSDocument doc)
throws IOException
doc - The document to extract the text from.
IOException - If there is an error extracting the text.
See also: getText( PDDocument )
public void writeText(COSDocument doc,
Writer outputStream)
throws IOException
doc - The document to extract the text from.
outputStream - The stream to write the text to.
IOException - If there is an error extracting the text.
See also: writeText( PDDocument, Writer )
public void resetEngine()
Overrides: resetEngine in class PDFStreamEngine
public void writeText(PDDocument doc,
Writer outputStream)
throws IOException
doc - The document to get the data from.
outputStream - The location to put the text.
IOException - If the doc is in an invalid state.
protected void processPages(List<COSObjectable> pages)
throws IOException
pages - The pages object in the document.
IOException - If there is an error parsing the text.
protected void startDocument(PDDocument pdf)
throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void endDocument(PDDocument pdf)
throws IOException
pdf - The PDF document that is being processed.
IOException - If an IO error occurs.
protected void processPage(PDPage page,
COSStream content)
throws IOException
page - The page to process.
content - The contents of the page.
IOException - If there is an error processing the page.
protected void startArticle()
throws IOException
IOException - If there is any error writing to the stream.
protected void startArticle(boolean isltr)
throws IOException
isltr - true if primary direction of text is left to right.
IOException - If there is any error writing to the stream.
protected void endArticle()
throws IOException
IOException - If there is any error writing to the stream.
protected void startPage(PDPage page)
throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.
protected void endPage(PDPage page)
throws IOException
page - The page we are about to process.
IOException - If there is any error writing to the stream.
protected void writePage()
throws IOException
IOException - If there is an error writing the text.
protected void writePageSeperator()
throws IOException
IOException - If there is a problem writing out the page separator to the document.
protected void writeLineSeparator()
throws IOException
IOException - If there is a problem writing out the line separator to the document.
protected void writeWordSeparator()
throws IOException
IOException - If there is a problem writing out the word separator to the document.
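To illustrate how the word, line, and page separators shape the output text, here is a minimal stand-alone sketch. It is not PDFBox code: `SeparatorDemo` and its nested-list input are hypothetical, and the placement of each separator (between words, after every line, after every page) is an assumption made for the example.

```java
import java.util.List;

// Hypothetical stand-in: a page is a list of lines, a line is a list of words.
class SeparatorDemo {
    static String join(List<List<List<String>>> pages,
                       String wordSeparator,
                       String lineSeparator,
                       String pageSeparator) {
        StringBuilder out = new StringBuilder();
        for (List<List<String>> page : pages) {
            for (List<String> line : page) {
                out.append(String.join(wordSeparator, line)); // between words
                out.append(lineSeparator);                    // after each line
            }
            out.append(pageSeparator);                        // after each page
        }
        return out.toString();
    }
}
```

Changing only the separator arguments changes the layout of the result without touching the words themselves, which is the role the `setWordSeparator`, `setLineSeparator`, and `setPageSeparator` setters play for the real class.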
protected void writeCharacters(TextPosition text)
throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.
protected void writeString(String text)
throws IOException
text - The text to write to the stream.
IOException - If there is an error when writing the text.
protected void processTextPosition(TextPosition text)
Overrides: processTextPosition in class PDFStreamEngine
text - The text to process.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of property startPage.
public int getEndPage()
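As a rough illustration of how the startPage and endPage properties bound the extraction, the sketch below filters page numbers with an inclusive range. It is not PDFBox code: the class `PageRange`, the 1-based page numbering, the inclusive upper bound, and the default values are all assumptions made for the example and should be checked against the library version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class PageRange {
    private int startPage = 1;                 // assumed 1-based first page
    private int endPage = Integer.MAX_VALUE;   // assumed default: no upper limit

    void setStartPage(int startPageValue) { startPage = startPageValue; }
    void setEndPage(int endPageValue) { endPage = endPageValue; }

    // Returns the page numbers in [startPage, endPage] that exist in a
    // document with pageCount pages.
    List<Integer> pagesToExtract(int pageCount) {
        List<Integer> pages = new ArrayList<>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (pageNo >= startPage && pageNo <= endPage) {
                pages.add(pageNo);
            }
        }
        return pages;
    }
}
```

Under these assumptions, setting the start page to 2 and the end page to 3 on a five-page document selects pages 2 and 3 only.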
public void setEndPage(int endPageValue)
endPageValue - New value of property endPage.
public void setLineSeparator(String separator)
separator - The desired line separator string.
public String getLineSeparator()
public void setPageSeparator(String separator)
separator - The desired page separator string.
public String getWordSeparator()
public void setWordSeparator(String separator)
separator - The desired word separator string.
public String getPageSeparator()
public boolean shouldSuppressDuplicateOverlappingText()
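The idea of suppressing duplicate overlapping text can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm PDFBox uses: `Glyph` and `OverlapFilter` are hypothetical, and the fixed `tolerance` parameter is an invented way of deciding that two pieces of identical text occupy the same spot.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; not PDFBox's TextPosition.
class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class OverlapFilter {
    // Keeps a glyph unless the same text was already kept within
    // `tolerance` of its position on both axes.
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

With a tolerance of 0.5, two "A" glyphs placed 0.2 units apart collapse into one, while a "B" glyph further along the line is kept.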
protected int getCurrentPageNo()
protected Writer getOutput()
protected Vector<List<TextPosition>> getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
public boolean shouldSeparateByBeads()
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.
public boolean shouldSortByPosition()
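Because text tokens in a PDF file may be stored in a different order from the one seen on screen, sorting by position can be pictured with the sketch below. It is an assumption-laden illustration rather than PDFBox's implementation: `Token` and `PositionSort` are hypothetical, y is taken to grow downward, and tokens on one line are assumed to share exactly the same y value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned text token; y grows downward here.
class Token {
    final String text;
    final float x;
    final float y;

    Token(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class PositionSort {
    // Orders tokens top to bottom, then left to right within the same y,
    // and joins them with single spaces.
    static String readingOrder(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        StringBuilder out = new StringBuilder();
        for (Token t : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(t.text);
        }
        return out.toString();
    }
}
```

A real implementation would need a tolerance when deciding that two tokens are on the same line; the exact-match comparison is used here only to keep the sketch short.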
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Tell PDFBox to sort the text positions.
public float getSpacingTolerance()
public void setSpacingTolerance(float spacingToleranceValue)
spacingToleranceValue - tolerance / scaling factor to use
public float getAverageCharTolerance()
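The two tolerance values are described as scaling factors used to estimate where spaces belong. One way such factors could be combined is sketched below; this is an assumption for illustration, not a statement of how PDFBox computes word breaks. `SpaceEstimator`, its initial values of 0.5 and 0.3, and the rule of taking the smaller of the two thresholds are all hypothetical.

```java
// Hypothetical helper, not part of PDFBox.
class SpaceEstimator {
    private float spacingTolerance = 0.5f;      // illustrative value
    private float averageCharTolerance = 0.3f;  // illustrative value

    void setSpacingTolerance(float spacingToleranceValue) {
        spacingTolerance = spacingToleranceValue;
    }

    void setAverageCharTolerance(float averageCharToleranceValue) {
        averageCharTolerance = averageCharToleranceValue;
    }

    // True if the horizontal gap between two glyphs is wide enough to be
    // treated as a word break under the assumed rule.
    boolean isWordBreak(float previousEndX, float nextStartX,
                        float spaceWidth, float averageCharWidth) {
        float bySpace = spaceWidth * spacingTolerance;
        float byChar = averageCharWidth * averageCharTolerance;
        float threshold = Math.min(bySpace, byChar);
        return nextStartX - previousEndX > threshold;
    }
}
```

In this sketch, raising either tolerance makes the estimator less eager to insert spaces, and lowering it makes the estimator more eager.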
public void setAverageCharTolerance(float averageCharToleranceValue)
averageCharToleranceValue - average tolerance / scaling factor to use
public String inspectFontEncoding(String str)
Overrides: inspectFontEncoding in class PDFStreamEngine
str - a string obtained from font.encoding()