PDFTextStripper (Apache PDFBox 1.8.10 API)

java.lang.Object
- org.apache.pdfbox.util.PDFStreamEngine
- - org.apache.pdfbox.util.PDFTextStripper

Direct Known Subclasses:

PDFHighlighter, PDFText2HTML, PDFTextStripperByArea
```
public class PDFTextStripper
extends PDFStreamEngine
```
This class takes a PDF document and strips out all of the text, ignoring the formatting. Note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, each page is fully processed and then printed.

Author:

Ben Litchfield

Field Summary

Fields
Modifier and Type	Field and Description
`protected Vector<List<TextPosition>>`	`charactersByArticle` The charactersByArticle is used to extract text by article divisions.
`protected PDDocument`	`document` The document to read.
`protected Writer`	`output` The stream to write the output to.
`protected String`	`outputEncoding` The encoding that text will be written in (or null).
`protected String`	`systemLineSeparator` The platform's line separator.

Constructor Summary

Constructors
Constructor and Description
`PDFTextStripper()` Instantiate a new PDFTextStripper object.
`PDFTextStripper(Properties props)` Instantiate a new PDFTextStripper object.
`PDFTextStripper(String encoding)` Instantiate a new PDFTextStripper object.

Method Summary

Methods
Modifier and Type	Method and Description
`protected void`	`endArticle()` End an article.
`protected void`	`endDocument(PDDocument pdf)` This method is available for subclasses of this class.
`protected void`	`endPage(PDPage page)` End a page.
`boolean`	`getAddMoreFormatting()` This will tell if the text stripper should add some more text formatting.
`String`	`getArticleEnd()` Returns the string which will be used at the end of an article.
`String`	`getArticleStart()` Returns the string which will be used at the beginning of an article.
`float`	`getAverageCharTolerance()` Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added.
`protected Vector<List<TextPosition>>`	`getCharactersByArticle()` Character strings are grouped by articles.
`protected int`	`getCurrentPageNo()` Get the current page number that is being processed.
`float`	`getDropThreshold()` Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
`PDOutlineItem`	`getEndBookmark()` Get the bookmark where text extraction should end, inclusive.
`int`	`getEndPage()` This will get the last page that will be extracted.
`float`	`getIndentThreshold()` Returns the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start.
`String`	`getLineSeparator()` This will get the line separator.
`protected List<Pattern>`	`getListItemPatterns()` returns a list of regular expression Patterns representing different common list item formats.
`protected Writer`	`getOutput()` The output stream that is being written to.
`String`	`getPageEnd()` Returns the string which will be used at the end of a page.
`String`	`getPageSeparator()` Deprecated. use `getPageStart()` and `getPageEnd()` instead
`String`	`getPageStart()` Returns the string which will be used at the beginning of a page.
`String`	`getParagraphEnd()` Returns the string which will be used at the end of a paragraph.
`String`	`getParagraphStart()` Returns the string which will be used at the beginning of a paragraph.
`boolean`	`getSeparateByBeads()` This will tell if the text stripper should separate by beads.
`boolean`	`getSortByPosition()` This will tell if the text stripper should sort the text tokens before writing to the stream.
`float`	`getSpacingTolerance()` Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added.
`PDOutlineItem`	`getStartBookmark()` Get the bookmark where text extraction should start, inclusive.
`int`	`getStartPage()` This is the page that the text extraction will start on.
`boolean`	`getSuppressDuplicateOverlappingText()`
`String`	`getText(COSDocument doc)` Deprecated.
`String`	`getText(PDDocument doc)` This will return the text of a document.
`String`	`getWordSeparator()` This will get the word separator.
`protected PositionWrapper`	`handleLineSeparation(PositionWrapper current, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine)` handles the line separator for a new line given the specified current and previous TextPositions.
`String`	`inspectFontEncoding(String str)` Reverse characters of a compound Arabic glyph.
`protected void`	`isParagraphSeparation(PositionWrapper position, PositionWrapper lastPosition, PositionWrapper lastLineStartPosition, float maxHeightForLine)` tests the relationship between the last text position, the current text position and the last text position that followed a line separator to decide if the gap represents a paragraph separation.
`protected Pattern`	`matchListItemPattern(PositionWrapper pw)` returns the list item Pattern object that matches the text at the specified PositionWrapper or null if the text does not match such a pattern.
`protected static Pattern`	`matchPattern(String string, List<Pattern> patterns)` iterates over the specified list of Patterns until it finds one that matches the specified string.
`protected void`	`processPage(PDPage page, COSStream content)` This will process the contents of a page.
`protected void`	`processPages(List<COSObjectable> pages)` This will process all of the pages and the text that is in them.
`protected void`	`processTextPosition(TextPosition text)` This will process a TextPosition object and add the text to the list of characters on a page.
`void`	`resetEngine()` This method must be called between processing documents.
`void`	`setAddMoreFormatting(boolean newAddMoreFormatting)` Additional text formatting will be added if addMoreFormatting is set to true.
`void`	`setArticleEnd(String articleEndValue)` Sets the string which will be used at the end of an article.
`void`	`setArticleStart(String articleStartValue)` Sets the string which will be used at the beginning of an article.
`void`	`setAverageCharTolerance(float averageCharToleranceValue)` Set the character width-based tolerance value that is used to estimate where spaces in text should be added.
`void`	`setDropThreshold(float dropThresholdValue)` Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
`void`	`setEndBookmark(PDOutlineItem aEndBookmark)` Set the bookmark where the text extraction should stop.
`void`	`setEndPage(int endPageValue)` This will set the last page to be extracted by this class.
`void`	`setIndentThreshold(float indentThresholdValue)` Sets the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start.
`void`	`setLineSeparator(String separator)` Set the desired line separator for output text.
`protected void`	`setListItemPatterns(List<Pattern> patterns)` use to supply a different set of regular expression patterns for matching list item starts.
`void`	`setPageEnd(String pageEndValue)` Sets the string which will be used at the end of a page.
`void`	`setPageSeparator(String separator)` Deprecated. use `setPageStart(String)` and `setPageEnd(String)` instead
`void`	`setPageStart(String pageStartValue)` Sets the string which will be used at the beginning of a page.
`void`	`setParagraphEnd(String s)` Sets the string which will be used at the end of a paragraph.
`void`	`setParagraphStart(String s)` Sets the string which will be used at the beginning of a paragraph.
`void`	`setShouldSeparateByBeads(boolean aShouldSeparateByBeads)` Set if the text stripper should group the text output by a list of beads.
`void`	`setSortByPosition(boolean newSortByPosition)` The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
`void`	`setSpacingTolerance(float spacingToleranceValue)` Set the space width-based tolerance value that is used to estimate where spaces in text should be added.
`void`	`setStartBookmark(PDOutlineItem aStartBookmark)` Set the bookmark where text extraction should start, inclusive.
`void`	`setStartPage(int startPageValue)` This will set the first page to be extracted by this class.
`void`	`setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)` By default the text stripper will attempt to remove text that overlaps.
`void`	`setWordSeparator(String separator)` Set the desired word separator for output text.
`protected void`	`startArticle()` Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
`protected void`	`startArticle(boolean isltr)` Start a new article, which is typically defined as a column on a single page (also referred to as a bead).
`protected void`	`startDocument(PDDocument pdf)` This method is available for subclasses of this class.
`protected void`	`startPage(PDPage page)` Start a new page.
`protected void`	`writeCharacters(TextPosition text)` Write the string in TextPosition to the output stream.
`protected void`	`writeLineSeparator()` Write the line separator value to the output stream.
`protected void`	`writePage()` This will print the text of the processed page to "output".
`protected void`	`writePageEnd()` Write something (if defined) at the end of a page.
`protected void`	`writePageSeperator()` Write the page separator value to the output stream.
`protected void`	`writePageStart()` Write something (if defined) at the start of a page.
`protected void`	`writeParagraphEnd()` Write something (if defined) at the end of a paragraph.
`protected void`	`writeParagraphSeparator()` writes the paragraph separator string to the output.
`protected void`	`writeParagraphStart()` Write something (if defined) at the start of a paragraph.
`protected void`	`writeString(String text)` Write a Java string to the output stream.
`protected void`	`writeString(String text, List<TextPosition> textPositions)` Write a Java string to the output stream.
`void`	`writeText(COSDocument doc, Writer outputStream)` Deprecated.
`void`	`writeText(PDDocument doc, Writer outputStream)` This will take a PDDocument and write the text of that document to the print writer.
`protected void`	`writeWordSeparator()` Write the word separator value to the output stream.

Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - systemLineSeparator
```
protected final String systemLineSeparator
```
    The platform's line separator.
  - charactersByArticle
```
protected Vector<List<TextPosition>> charactersByArticle
```
    The charactersByArticle is used to extract text by article divisions. For example, for a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of the charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are shown below:
    
    - Text before first article
    - first article text
    - text between first article and second article
    - second article text
    - text after second article
    
    Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
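
    The division count described above can be modelled in a few lines. This is only an illustrative sketch, not PDFBox source code: the class ArticleDivisionsSketch and its methods are hypothetical names, and the formula (one slot per bead plus one slot around each bead) is inferred from the two-column example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; the names here are hypothetical, not PDFBox API. */
final class ArticleDivisionsSketch {

    /** Number of divisions on a page with the given number of beads. */
    static int divisionCount(int beadCount, boolean separateByBeads) {
        // One slot per bead, plus one slot before the first bead, one
        // between each pair of beads, and one after the last bead.
        return separateByBeads ? 2 * beadCount + 1 : 1;
    }

    /** Labels for each division, in the order they would be written. */
    static List<String> divisionLabels(int beadCount) {
        List<String> labels = new ArrayList<String>();
        for (int i = 0; i < beadCount; i++) {
            labels.add(i == 0
                    ? "text before article 1"
                    : "text between article " + i + " and article " + (i + 1));
            labels.add("article " + (i + 1) + " text");
        }
        labels.add(beadCount == 0 ? "all text" : "text after article " + beadCount);
        return labels;
    }
}
```

    With two beads this yields five divisions; with no beads it yields a single division, matching the description above.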
  - outputEncoding
```
protected String outputEncoding
```
    The encoding that text will be written in (or null).
  - document
```
protected PDDocument document
```
    The document to read.
  - output
```
protected Writer output
```
    The stream to write the output to.
- Constructor Detail
  - PDFTextStripper
```
public PDFTextStripper()
                throws IOException
```
    Instantiate a new PDFTextStripper object. This object will load properties from PDFTextStripper.properties and will not do anything special to convert the text to a more encoding-specific output.
    
    Throws:
    
    IOException - If there is an error loading the properties.
  - PDFTextStripper
```
public PDFTextStripper(Properties props)
                throws IOException
```
    Instantiate a new PDFTextStripper object, loading all of the operator mappings from the properties object that is passed in. Does not convert the text to more encoding-specific output.
    
    Parameters:
    props - The properties containing the mapping of operators to PDFOperator classes.
    
    Throws:
    
    IOException - If there is an error reading the properties.
  - PDFTextStripper
```
public PDFTextStripper(String encoding)
                throws IOException
```
    Instantiate a new PDFTextStripper object. This object will load properties from PDFTextStripper.properties and will apply encoding-specific conversions to the output text.
    
    Parameters:
    encoding - The encoding that the output will be written in.
    
    Throws:
    
    IOException - If there is an error reading the properties.
- Method Detail
  - getText
```
public String getText(PDDocument doc)
               throws IOException
```
    This will return the text of a document. See writeText.
    NOTE: The document must not be encrypted when coming into this method.
    
    Parameters:
    doc - The document to get the text from.
    
    Returns:
    The text of the PDF document.
    
    Throws:
    
    IOException - if the doc state is invalid or it is encrypted.
  - getText
```
public String getText(COSDocument doc)
               throws IOException
```
    Deprecated.
    
    Parameters:
    doc - The document to extract the text from.
    
    Returns:
    The document text.
    
    Throws:
    
    IOException - If there is an error extracting the text.
    See Also:
    getText( PDDocument )
  - writeText
```
public void writeText(COSDocument doc,
             Writer outputStream)
               throws IOException
```
    Deprecated.
    
    Parameters:
    doc - The document to extract the text from.
    outputStream - The stream to write the text to.
    
    Throws:
    
    IOException - If there is an error extracting the text.
    See Also:
    writeText( PDDocument, Writer )
  - resetEngine
```
public void resetEngine()
```
    This method must be called between processing documents. The PDFStreamEngine caches information for the document between pages and this will release the cached information. This only needs to be called if processing a new document.
    
    Overrides:
    
    resetEngine in class PDFStreamEngine
  - writeText
```
public void writeText(PDDocument doc,
             Writer outputStream)
               throws IOException
```
    This will take a PDDocument and write the text of that document to the print writer.
    
    Parameters:
    doc - The document to get the data from.
    outputStream - The location to put the text.
    
    Throws:
    
    IOException - If the doc is in an invalid state.
  - processPages
```
protected void processPages(List<COSObjectable> pages)
                     throws IOException
```
    This will process all of the pages and the text that is in them.
    
    Parameters:
    pages - The pages object in the document.
    
    Throws:
    
    IOException - If there is an error parsing the text.
  - startDocument
```
protected void startDocument(PDDocument pdf)
                      throws IOException
```
    This method is available for subclasses of this class. It will be called before processing of the document starts.
    
    Parameters:
    pdf - The PDF document that is being processed.
    
    Throws:
    
    IOException - If an IO error occurs.
  - endDocument
```
protected void endDocument(PDDocument pdf)
                    throws IOException
```
    This method is available for subclasses of this class. It will be called after processing of the document finishes.
    
    Parameters:
    pdf - The PDF document that is being processed.
    
    Throws:
    
    IOException - If an IO error occurs.
  - processPage
```
protected void processPage(PDPage page,
               COSStream content)
                    throws IOException
```
    This will process the contents of a page.
    
    Parameters:
    page - The page to process.
    content - The contents of the page.
    
    Throws:
    
    IOException - If there is an error processing the page.
  - startArticle
```
protected void startArticle()
                     throws IOException
```
    Start a new article, which is typically defined as a column on a single page (also referred to as a bead). This assumes that the primary direction of text is left to right. Default implementation is to do nothing. Subclasses may provide additional information.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - startArticle
```
protected void startArticle(boolean isltr)
                     throws IOException
```
    Start a new article, which is typically defined as a column on a single page (also referred to as a bead). Default implementation is to do nothing. Subclasses may provide additional information.
    
    Parameters:
    isltr - true if primary direction of text is left to right.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - endArticle
```
protected void endArticle()
                   throws IOException
```
    End an article. Default implementation is to do nothing. Subclasses may provide additional information.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - startPage
```
protected void startPage(PDPage page)
                  throws IOException
```
    Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.
    
    Parameters:
    page - The page we are about to process.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
  - endPage
```
protected void endPage(PDPage page)
                throws IOException
```
    End a page. Default implementation is to do nothing. Subclasses may provide additional information.
    
    Parameters:
    page - The page that has just been processed.
    
    Throws:
    
    IOException - If there is any error writing to the stream.
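
    The article and page hooks above do nothing by default and exist for subclasses to override. As a rough sketch of how such hooks fit together, the following self-contained model uses a hypothetical MiniStripper class (not part of PDFBox) whose process method calls the hooks in the order their descriptions suggest. The exact call order inside PDFBox and the one-article-per-page simplification are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a stripper; it models hook order only. */
class MiniStripper {
    // Hooks do nothing by default, mirroring the documented defaults.
    protected void startDocument() { }
    protected void startPage(int pageNo) { }
    protected void startArticle() { }
    protected void endArticle() { }
    protected void endPage(int pageNo) { }
    protected void endDocument() { }

    /** Assumed driver loop; one article per page keeps the model small. */
    final void process(int pageCount) {
        startDocument();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            startPage(pageNo);
            startArticle();
            endArticle();
            endPage(pageNo);
        }
        endDocument();
    }
}

/** Subclass that records each hook call, where a real one might emit markup. */
final class RecordingStripper extends MiniStripper {
    final List<String> calls = new ArrayList<String>();

    @Override protected void startDocument() { calls.add("startDocument"); }
    @Override protected void startPage(int pageNo) { calls.add("startPage " + pageNo); }
    @Override protected void startArticle() { calls.add("startArticle"); }
    @Override protected void endArticle() { calls.add("endArticle"); }
    @Override protected void endPage(int pageNo) { calls.add("endPage " + pageNo); }
    @Override protected void endDocument() { calls.add("endDocument"); }
}
```

    Calling process(2) on a RecordingStripper records the document hooks once and the page and article hooks once per page.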
  - writePage
```
protected void writePage()
                  throws IOException
```
    This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
    
    Throws:
    
    IOException - If there is an error writing the text.
  - writePageSeperator
```
protected void writePageSeperator()
                           throws IOException
```
    Write the page separator value to the output stream.
    
    Throws:
    
    IOException - If there is a problem writing out the page separator to the document.
  - writeLineSeparator
```
protected void writeLineSeparator()
                           throws IOException
```
    Write the line separator value to the output stream.
    
    Throws:
    
    IOException - If there is a problem writing out the line separator to the document.
  - writeWordSeparator
```
protected void writeWordSeparator()
                           throws IOException
```
    Write the word separator value to the output stream.
    
    Throws:
    
    IOException - If there is a problem writing out the word separator to the document.
  - writeCharacters
```
protected void writeCharacters(TextPosition text)
                        throws IOException
```
    Write the string in TextPosition to the output stream.
    
    Parameters:
    text - The text to write to the stream.
    
    Throws:
    
    IOException - If there is an error when writing the text.
  - writeString
```
protected void writeString(String text,
               List<TextPosition> textPositions)
                    throws IOException
```
    Write a Java string to the output stream. The default implementation will ignore the textPositions and just calls writeString(String).
    
    Parameters:
    text - The text to write to the stream.
    textPositions - The TextPositions belonging to the text.
    
    Throws:
    
    IOException - If there is an error when writing the text.
  - writeString
```
protected void writeString(String text)
                    throws IOException
```
    Write a Java string to the output stream.
    
    Parameters:
    text - The text to write to the stream.
    
    Throws:
    
    IOException - If there is an error when writing the text.
  - processTextPosition
```
protected void processTextPosition(TextPosition text)
```
    This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
    
    Overrides:
    
    processTextPosition in class PDFStreamEngine
    
    Parameters:
    text - The text to process.
  - getStartPage
```
public int getStartPage()
```
    This is the page that the text extraction will start on. The pages start at page 1. For example in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
    
    Returns:
    Value of property startPage.
  - setStartPage
```
public void setStartPage(int startPageValue)
```
    This will set the first page to be extracted by this class.
    
    Parameters:
    startPageValue - New value of 1-based startPage property.
  - getEndPage
```
public int getEndPage()
```
    This will get the last page that will be extracted. This is inclusive; for example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
    
    Returns:
    Value of property endPage.
  - setEndPage
```
public void setEndPage(int endPageValue)
```
    This will set the last page to be extracted by this class.
    
    Parameters:
    endPageValue - New value of 1-based endPage property.
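
    The startPage and endPage semantics (1-based, both inclusive) can be checked with a small stand-alone model. The class PageRangeSketch and its method are hypothetical and only mirror the rules stated above; they are not how PDFBox selects pages internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the 1-based, inclusive page range; not PDFBox code. */
final class PageRangeSketch {

    /** Returns the page numbers that fall inside [startPage, endPage]. */
    static List<Integer> pagesToExtract(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<Integer>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            // Both bounds are inclusive, and pages are numbered from 1.
            if (pageNo >= startPage && pageNo <= endPage) {
                pages.add(pageNo);
            }
        }
        return pages;
    }
}
```

    For a 5 page document, a start page of 4 with the default end page selects pages 4 and 5, and an end page of 2 with the default start page selects pages 1 and 2.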
  - setLineSeparator
```
public void setLineSeparator(String separator)
```
    Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.
    
    Parameters:
    separator - The desired line separator string.
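
    The fallback to the line.separator system property can be pictured as a field that starts out with the platform value and is overwritten by the setter. The class below is a hypothetical model of that behaviour, not the PDFBox implementation.

```java
/** Illustrative model of a separator that defaults to the platform value. */
final class LineSeparatorSketch {

    // Starts as the platform separator until set explicitly.
    private String lineSeparator = System.getProperty("line.separator");

    void setLineSeparator(String separator) {
        lineSeparator = separator;
    }

    String getLineSeparator() {
        return lineSeparator;
    }
}
```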
  - getLineSeparator
```
public String getLineSeparator()
```
    This will get the line separator.
    
    Returns:
    The desired line separator string.
  - setPageSeparator
```
public void setPageSeparator(String separator)
```
    Deprecated. use setPageStart(String) and setPageEnd(String) instead
    
    Set the desired page separator for output text. The line.separator system property is used if the page separator preference is not set explicitly using this method.
    
    Parameters:
    separator - The desired page separator string.
  - getWordSeparator
```
public String getWordSeparator()
```
    This will get the word separator.
    
    Returns:
    The desired word separator string.
  - setWordSeparator
```
public void setWordSeparator(String separator)
```
    Set the desired word separator for output text. The PDFBox text extraction algorithm will output a space character if there is enough space between two words. By default a space character is used. If you need an accurate count of characters that are found in a PDF document then you might want to set the word separator to the empty string.
    
    Parameters:
    separator - The desired word separator string.
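
    To see why an empty word separator gives an accurate character count, consider this minimal sketch. It assumes the words have already been detected; WordSeparatorSketch and joinWords are hypothetical names used only for illustration.

```java
import java.util.List;

/** Illustrative model of how the word separator affects output length. */
final class WordSeparatorSketch {

    /** Joins already-detected words using the given separator. */
    static String joinWords(List<String> words, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            if (i > 0) {
                out.append(wordSeparator); // inserted between words only
            }
            out.append(words.get(i));
        }
        return out.toString();
    }
}
```

    With a single space as the separator the output contains characters that are not in the PDF; with the empty string the output length equals the number of extracted characters.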
  - getPageSeparator
```
public String getPageSeparator()
```
    Deprecated. use getPageStart() and getPageEnd() instead
    
    This will get the page separator.
    
    Returns:
    The page separator string.
  - getSuppressDuplicateOverlappingText
```
public boolean getSuppressDuplicateOverlappingText()
```
    Returns:
    Returns the suppressDuplicateOverlappingText.
  - getCurrentPageNo
```
protected int getCurrentPageNo()
```
    Get the current page number that is being processed.
    
    Returns:
    A 1 based number representing the current page.
  - getOutput
```
protected Writer getOutput()
```
    The output stream that is being written to.
    
    Returns:
    The stream that output is being written to.
  - getCharactersByArticle
```
protected Vector<List<TextPosition>> getCharactersByArticle()
```
    Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects, the inner lists will contain TextPosition objects.
    
    Returns:
    A double List of TextPositions for all text strings on the page.
  - setSuppressDuplicateOverlappingText
```
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
```
    By default the text stripper will attempt to remove text that overlaps. Word, for example, paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but extraction will be faster.
    
    Parameters:
    suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
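
    The idea of suppressing duplicate overlapping text can be sketched as follows. This is not the PDFBox algorithm: the Glyph type, the tolerance parameter and the simple pairwise comparison are assumptions chosen to show why the check costs extra time.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of duplicate suppression; not the PDFBox algorithm. */
final class OverlapFilterSketch {

    /** Hypothetical glyph: the character drawn and where it was drawn. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Drops glyphs that repeat the same text at nearly the same position. */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph candidate : glyphs) {
            boolean duplicate = false;
            // Comparing against every kept glyph is the extra work that
            // is avoided when suppression is switched off.
            for (Glyph existing : kept) {
                if (existing.text.equals(candidate.text)
                        && Math.abs(existing.x - candidate.x) < tolerance
                        && Math.abs(existing.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```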
  - getSeparateByBeads
```
public boolean getSeparateByBeads()
```
    This will tell if the text stripper should separate by beads.
    
    Returns:
    If the text will be grouped by beads.
  - setShouldSeparateByBeads
```
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
```
    Set if the text stripper should group the text output by a list of beads. The default value is true.
    
    Parameters:
    aShouldSeparateByBeads - The new grouping of beads.
  - getEndBookmark
```
public PDOutlineItem getEndBookmark()
```
    Get the bookmark where text extraction should end, inclusive. Default is null.
    
    Returns:
    The ending bookmark.
  - setEndBookmark
```
public void setEndBookmark(PDOutlineItem aEndBookmark)
```
    Set the bookmark where the text extraction should stop.
    
    Parameters:
    aEndBookmark - The ending bookmark.
  - getStartBookmark
```
public PDOutlineItem getStartBookmark()
```
    Get the bookmark where text extraction should start, inclusive. Default is null.
    
    Returns:
    The starting bookmark.
  - setStartBookmark
```
public void setStartBookmark(PDOutlineItem aStartBookmark)
```
    Set the bookmark where text extraction should start, inclusive.
    
    Parameters:
    aStartBookmark - The starting bookmark.
  - getAddMoreFormatting
```
public boolean getAddMoreFormatting()
```
    This will tell if the text stripper should add some more text formatting.
    
    Returns:
    true if some more text formatting will be added
  - setAddMoreFormatting
```
public void setAddMoreFormatting(boolean newAddMoreFormatting)
```
    Additional text formatting will be added if addMoreFormatting is set to true. Default is false.
    
    Parameters:
    newAddMoreFormatting - Tell PDFBox to add some more text formatting
  - getSortByPosition
```
public boolean getSortByPosition()
```
    This will tell if the text stripper should sort the text tokens before writing to the stream.
    
    Returns:
    true if the text tokens will be sorted before being written.
  - setSortByPosition
```
public void setSortByPosition(boolean newSortByPosition)
```
    The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text first, then make a second pass and write out the normal text.
    The default is to not sort by position.
    
    A PDF writer could choose to write each character in a different order. By default PDFBox does not sort the text tokens before processing them, for performance reasons.
    
    Parameters:
    newSortByPosition - Tell PDFBox to sort the text positions.
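
    The effect of sorting by position can be illustrated with a small model in which the bold headings are written to the stream before the body text. The Token type, the top-down y coordinate and the simple y-then-x comparison are assumptions; PDFBox's own ordering logic is more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Illustrative model only; PDFBox's own comparator is more involved. */
final class PositionSortSketch {

    /** Hypothetical token: text plus its position on the page. */
    static final class Token {
        final String text;
        final float x;
        final float y; // assumed to grow downwards from the top of the page

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins tokens in stream order, or in reading order when sorting is on. */
    static String render(List<Token> tokens, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<Token>(tokens);
        if (sortByPosition) {
            // Top-to-bottom first, then left-to-right within a line.
            Collections.sort(ordered, new Comparator<Token>() {
                public int compare(Token a, Token b) {
                    int byY = Float.compare(a.y, b.y);
                    return byY != 0 ? byY : Float.compare(a.x, b.x);
                }
            });
        }
        StringBuilder out = new StringBuilder();
        for (Token token : ordered) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token.text);
        }
        return out.toString();
    }
}
```

    Without sorting, the output follows the stream order (both headings first); with sorting, the body text lands between the headings as it appears on the page.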
  - getSpacingTolerance
```
public float getSpacingTolerance()
```
    Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
    
    Returns:
    The current tolerance / scaling factor
  - setSpacingTolerance
```
public void setSpacingTolerance(float spacingToleranceValue)
```
    Set the space width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
    
    Parameters:
    spacingToleranceValue - tolerance / scaling factor to use
  - getAverageCharTolerance
```
public float getAverageCharTolerance()
```
    Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
    
    Returns:
    The current tolerance / scaling factor
  - setAverageCharTolerance
```
public void setAverageCharTolerance(float averageCharToleranceValue)
```
    Set the character width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
    
    Parameters:
    averageCharToleranceValue - average tolerance / scaling factor to use
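
    The two tolerance values scale the gap that must be exceeded before a space is inserted. The sketch below shows one plausible way to combine them; the method, its parameters and the use of the smaller of the two thresholds are assumptions for illustration, not a statement of the exact PDFBox rule.

```java
/** Illustrative model of a gap-based space heuristic; not PDFBox code. */
final class SpaceHeuristicSketch {

    /**
     * Decides whether the horizontal gap between two glyphs is wide
     * enough to count as a word break.
     */
    static boolean shouldInsertSpace(float gap,
                                     float spaceWidth,
                                     float averageCharWidth,
                                     float spacingTolerance,
                                     float averageCharTolerance) {
        float spaceThreshold = spaceWidth * spacingTolerance;
        float charThreshold = averageCharWidth * averageCharTolerance;
        // Larger tolerances raise both thresholds, so fewer gaps qualify.
        return gap > Math.min(spaceThreshold, charThreshold);
    }
}
```

    In this model a gap of 2 units counts as a space with tolerances of 0.5 and 0.3, but not when both tolerances are raised to 1.0, which is consistent with larger values reducing the number of spaces added.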
  - getIndentThreshold
```
public float getIndentThreshold()
```
    Returns the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start.
    
    Returns:
    the number of whitespace character widths to use when detecting paragraph indents.
  - setIndentThreshold
```
public void setIndentThreshold(float indentThresholdValue)
```
    Sets the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start. The default value is 2.0.
    
    Parameters:
    indentThresholdValue - the number of whitespace character widths to use when detecting paragraph indents.
  - getDropThreshold
```
public float getDropThreshold()
```
    Returns the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start.
    
    Returns:
    the character height multiple for max allowed whitespace between lines in the same paragraph.
  - setDropThreshold
```
public void setDropThreshold(float dropThresholdValue)
```
    Sets the minimum whitespace, as a multiple of the max height of the current characters, beyond which the current line start is considered to be a paragraph start. The default value is 2.5.
    
    Parameters:
    dropThresholdValue - the character height multiple for max allowed whitespace between lines in the same paragraph.
  - getParagraphStart
```
public String getParagraphStart()
```
    Returns the string which will be used at the beginning of a paragraph.
    
    Returns:
    the paragraph start string
  - setParagraphStart
```
public void setParagraphStart(String s)
```
    Sets the string which will be used at the beginning of a paragraph.
    
    Parameters:
    s - the paragraph start string
  - getParagraphEnd
```
public String getParagraphEnd()
```
    Returns the string which will be used at the end of a paragraph.
    
    Returns:
    the paragraph end string
  - setParagraphEnd
```
public void setParagraphEnd(String s)
```
    Sets the string which will be used at the end of a paragraph.
    
    Parameters:
    s - the paragraph end string
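    
    To illustrate the effect of the paragraph start and end strings, the sketch below wraps each paragraph in configurable markers while writing to a Writer. ParagraphMarkerSketch and its methods are hypothetical and use only the JDK. In the stripper itself, the corresponding hooks are writeParagraphStart() and writeParagraphEnd().
```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

public class ParagraphMarkerSketch {
    /** Writes every paragraph surrounded by the given start and end strings. */
    static void write(Writer out, List<String> paragraphs, String start, String end)
            throws IOException {
        for (String paragraph : paragraphs) {
            out.write(start);
            out.write(paragraph);
            out.write(end);
        }
    }

    /** Convenience wrapper that collects the output in memory. */
    static String render(List<String> paragraphs, String start, String end) {
        StringWriter out = new StringWriter();
        try {
            write(out, paragraphs, start, end);
        } catch (IOException e) {
            // StringWriter does not fail in practice; rethrow to keep the signature simple.
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example markers only; any strings can be used.
        System.out.print(render(Arrays.asList("First.", "Second."), "<p>", "</p>\n"));
    }
}
```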
  - getPageStart
```
public String getPageStart()
```
    Returns the string which will be used at the beginning of a page.
    
    Returns:
    the page start string
  - setPageStart
```
public void setPageStart(String pageStartValue)
```
    Sets the string which will be used at the beginning of a page.
    
    Parameters:
    pageStartValue - the page start string
  - getPageEnd
```
public String getPageEnd()
```
    Returns the string which will be used at the end of a page.
    
    Returns:
    the page end string
  - setPageEnd
```
public void setPageEnd(String pageEndValue)
```
    Sets the string which will be used at the end of a page.
    
    Parameters:
    pageEndValue - the page end string
  - getArticleStart
```
public String getArticleStart()
```
    Returns the string which will be used at the beginning of an article.
    
    Returns:
    the article start string
  - setArticleStart
```
public void setArticleStart(String articleStartValue)
```
    Sets the string which will be used at the beginning of an article.
    
    Parameters:
    articleStartValue - the article start string
  - getArticleEnd
```
public String getArticleEnd()
```
    Returns the string which will be used at the end of an article.
    
    Returns:
    the article end string
  - setArticleEnd
```
public void setArticleEnd(String articleEndValue)
```
    Sets the string which will be used at the end of an article.
    
    Parameters:
    articleEndValue - the article end string
  - inspectFontEncoding
```
public String inspectFontEncoding(String str)
```
    Reverse characters of a compound Arabic glyph. When getSortByPosition() is true, inspect the sequence encoded by one glyph. If the glyph encodes two or more Arabic characters, reverse these characters from a logical order to a visual order. This ensures that the bidirectional algorithm that runs later will convert them back to a logical order.
    
    Overrides:
    
    inspectFontEncoding in class PDFStreamEngine
    
    Parameters:
    str - a string obtained from font.encoding()
    
    Returns:
    the reversed string
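    
    The reversal described above can be sketched with the JDK alone. The version below reverses a string only when position sorting is enabled, the string holds at least two characters and every character has the Arabic right-to-left bidirectional class. The class name ArabicGlyphReverseSketch and the "all characters are Arabic" test are assumptions for illustration, not a copy of the library code.
```java
public class ArabicGlyphReverseSketch {
    /** True if the character has the Arabic right-to-left bidirectional class. */
    static boolean isArabic(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** Reverses the characters encoded by one glyph from logical to visual order. */
    static String reverseCompound(String str, boolean sortByPosition) {
        if (!sortByPosition || str == null || str.length() < 2) {
            return str; // nothing to reverse
        }
        for (int i = 0; i < str.length(); i++) {
            if (!isArabic(str.charAt(i))) {
                return str; // leave non-Arabic sequences untouched
            }
        }
        return new StringBuilder(str).reverse().toString();
    }

    public static void main(String[] args) {
        String lamAlef = "\u0644\u0627"; // lam followed by alef, in logical order
        String visual = reverseCompound(lamAlef, true);
        System.out.println(Integer.toHexString(visual.charAt(0)));
    }
}
```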
  - handleLineSeparation
```
protected PositionWrapper handleLineSeparation(PositionWrapper current,
                                   PositionWrapper lastPosition,
                                   PositionWrapper lastLineStartPosition,
                                   float maxHeightForLine)
                                        throws IOException
```
    Handles the line separator for a new line, given the specified current and previous TextPositions.
    
    Parameters:
    current - the current text position
    lastPosition - the previous text position
    lastLineStartPosition - the last text position that followed a line separator.
    maxHeightForLine - max height for positions since lastLineStartPosition
    
    Returns:
    start position of the last line
    
    Throws:
    
    IOException - if something went wrong
  - isParagraphSeparation
```
protected void isParagraphSeparation(PositionWrapper position,
                         PositionWrapper lastPosition,
                         PositionWrapper lastLineStartPosition,
                         float maxHeightForLine)
```
    Tests the relationship between the last text position, the current text position and the last text position that followed a line separator to decide whether the gap represents a paragraph separation. This should only be called for consecutive text positions that have already passed the line separation test.
    This base implementation tests to see if the lastLineStartPosition is null OR if the current vertical position has dropped below the last text vertical position by at least 2.5 times the current text height OR if the current horizontal position is indented by at least 2 times the current width of a space character.
    
    This also attempts to identify text that is indented under a hanging indent.
    
    This method sets the isParagraphStart and isHangingIndent flags on the current position object.
    
    Parameters:
    position - the current text position. This may have its isParagraphStart or isHangingIndent flags set upon return.
    lastPosition - the previous text position (should not be null).
    lastLineStartPosition - the last text position that followed a line separator. May be null.
    maxHeightForLine - max height for text positions since lastLineStartPosition.
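    
    The three conditions listed for the base implementation can be sketched as a single boolean test. ParagraphStartSketch, its LineStart type and the convention that y grows down the page are assumptions for this example; the thresholds are the documented defaults (2.5 and 2.0). Hanging-indent detection is left out.
```java
public class ParagraphStartSketch {
    static final float DROP_THRESHOLD = 2.5f;   // documented default drop threshold
    static final float INDENT_THRESHOLD = 2.0f; // documented default indent threshold

    /** Hypothetical text position; y is assumed to increase down the page. */
    static final class LineStart {
        final float x;
        final float y;
        final float widthOfSpace;

        LineStart(float x, float y, float widthOfSpace) {
            this.x = x;
            this.y = y;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** Simplified paragraph test: no previous line start, a large drop, or a large indent. */
    static boolean isParagraphStart(LineStart current, LineStart last,
                                    LineStart lastLineStart, float maxHeightForLine) {
        if (lastLineStart == null) {
            return true;
        }
        float drop = current.y - last.y;
        if (drop >= DROP_THRESHOLD * maxHeightForLine) {
            return true;
        }
        float indent = current.x - lastLineStart.x;
        return indent >= INDENT_THRESHOLD * current.widthOfSpace;
    }

    public static void main(String[] args) {
        LineStart lineStart = new LineStart(50f, 100f, 3f);
        LineStart last = new LineStart(200f, 100f, 3f);
        System.out.println(isParagraphStart(new LineStart(50f, 112f, 3f), last, lineStart, 10f));
        System.out.println(isParagraphStart(new LineStart(50f, 130f, 3f), last, lineStart, 10f));
        System.out.println(isParagraphStart(new LineStart(58f, 112f, 3f), last, lineStart, 10f));
    }
}
```
    Raising either threshold through setDropThreshold or setIndentThreshold makes the corresponding condition harder to satisfy, so fewer line starts are treated as paragraph starts.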
  - writeParagraphSeparator
```
protected void writeParagraphSeparator()
                                throws IOException
```
    Writes the paragraph separator string to the output.
    
    Throws:
    
    IOException - if something went wrong
  - writeParagraphStart
```
protected void writeParagraphStart()
                            throws IOException
```
    Write something (if defined) at the start of a paragraph.
    
    Throws:
    
    IOException - if something went wrong
  - writeParagraphEnd
```
protected void writeParagraphEnd()
                          throws IOException
```
    Write something (if defined) at the end of a paragraph.
    
    Throws:
    
    IOException - if something went wrong
  - writePageStart
```
protected void writePageStart()
                       throws IOException
```
    Write something (if defined) at the start of a page.
    
    Throws:
    
    IOException - if something went wrong
  - writePageEnd
```
protected void writePageEnd()
                     throws IOException
```
    Write something (if defined) at the end of a page.
    
    Throws:
    
    IOException - if something went wrong
  - matchListItemPattern
```
protected Pattern matchListItemPattern(PositionWrapper pw)
```
    Returns the list item Pattern object that matches the text at the specified PositionWrapper, or null if the text does not match such a pattern. The list of Patterns tested against is given by the getListItemPatterns() method. To add to the list, override that method (if subclassing) or explicitly supply your own list using setListItemPatterns(List).
    
    Parameters:
    pw - position
    
    Returns:
    the matching pattern
  - setListItemPatterns
```
protected void setListItemPatterns(List<Pattern> patterns)
```
    Supplies a different set of regular expression patterns for matching list item starts.
    
    Parameters:
    patterns - list of patterns
  - getListItemPatterns
```
protected List<Pattern> getListItemPatterns()
```
    Returns a list of regular expression Patterns representing common list item formats. For example, numbered items of the form:
    1. some text
    2. more text
    or
    - some text
    - more text
    etc., all begin with some character pattern. Examples of such patterns, written as Java string literals, are "\\d+\\." (matches "1.", "2.", ...) and "\\[\\d+\\]" (matches "[1]", "[2]", ...).
    This method returns a list of such regular expression Patterns.
    
    Returns:
    a list of Pattern objects.
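    
    As an illustration, the two example expressions above can be compiled with java.util.regex and tested against the leading token of a line. The class ListItemPatternSketch, the two-entry pattern list and the choice to match only the first whitespace-delimited token are assumptions for this sketch; the list actually returned by getListItemPatterns() may contain other patterns.
```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ListItemPatternSketch {
    /** Example patterns only; not the library's full default list. */
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("\\d+\\."),      // "1.", "2.", ...
            Pattern.compile("\\[\\d+\\]"));  // "[1]", "[2]", ...

    /** True if the first token of the line fully matches one of the patterns. */
    static boolean startsListItem(String line) {
        String token = line.trim().split("\\s+", 2)[0];
        for (Pattern p : PATTERNS) {
            if (p.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(startsListItem("1. some text"));
        System.out.println(startsListItem("[12] a reference"));
        System.out.println(startsListItem("1.5 metres of cable"));
    }
}
```
    Matching the whole token rather than a prefix keeps the patterns strict, so a decimal number such as "1.5" is not mistaken for a numbered item.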
  - matchPattern
```
protected static final Pattern matchPattern(String string,
                   List<Pattern> patterns)
```
    Iterates over the specified list of Patterns until it finds one that matches the specified string, then returns that Pattern.
    The order of the supplied list matters: the most common patterns should come first. Patterns should generally be strict, and all are applied case-sensitively.
    
    Parameters:
    string - the string to be searched
    patterns - list of patterns
    
    Returns:
    matching pattern
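    
    A first-match search of this kind can be sketched as follows. FirstMatchSketch is a hypothetical stand-in; the use of Matcher.matches() (a full match) and the null return when nothing matches are assumptions of this example.
```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchSketch {
    /** Returns the first pattern that fully matches the string, or null if none does. */
    static Pattern firstMatch(String string, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(string).matches()) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Order matters: the stricter, more common pattern is listed first.
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("\\d+\\."),
                Pattern.compile("[a-z0-9]+\\."));
        System.out.println(firstMatch("1.", patterns)); // both match; the first one wins
        System.out.println(firstMatch("A.", patterns)); // case-sensitive, so no match
    }
}
```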
