public class NonSequentialPDFParser extends PDFParser
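A PDFParser that first reads startxref and the xref tables to locate all valid objects, and then parses only those objects.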
This class can be used as a PDFParser replacement. parse() must be
called first, before page objects can be retrieved, e.g. via
getPDDocument().

This class is a much enhanced version of QuickParser, presented
in PDFBOX-1104 by
Jeremy Villalobos.

| Modifier and Type | Field and Description |
|---|---|
| protected static int | DEFAULT_TRAIL_BYTECOUNT |
| protected static char[] | EOF_MARKER: EOF-marker. |
| protected static char[] | OBJ_MARKER: obj-marker. |
| protected SecurityHandler | securityHandler: The security handler. |
| protected static char[] | STARTXREF_MARKER: StartXRef-marker. |
| static String | SYSPROP_EOFLOOKUPRANGE |
| static String | SYSPROP_PARSEMINIMAL |
| static String | TMP_FILE_PREFIX |

Fields inherited from PDFParser: isFDFDocment, xrefTrailerResolver

Fields inherited from BaseParser: DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE

| Constructor and Description |
|---|
| NonSequentialPDFParser(File file, RandomAccess raBuf): Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword): Constructs parser for given file using given buffer for temporary storage. |
| NonSequentialPDFParser(InputStream input): Constructor. |
| NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword): Constructor. |
| NonSequentialPDFParser(String filename): Constructs parser for given file using memory buffer. |

| Modifier and Type | Method and Description |
|---|---|
| protected void | decrypt(COSBase pb, int objNr, int objGenNr): Decrypts given object. |
| protected void | decryptDictionary(COSDictionary dict, long objNr, long objGenNr) |
| protected void | decryptString(COSString str, long objNr, long objGenNr): Decrypts given COSString. |
| protected void | deleteTempFile(): Remove the temporary file. |
| PDPage | getPage(int pageNr): Returns the page requested with all the objects loaded into it. |
| int | getPageNumber(): Returns the number of pages in a document. |
| PDDocument | getPDDocument(): This will get the PD document that was parsed. |
| protected File | getPdfFile(): Return the pdf file. |
| SecurityHandler | getSecurityHandler(): Returns security handler of the document or null if document is not encrypted or parse() wasn't called before. |
| protected long | getStartxrefOffset(): Looks for and parses startxref. |
| protected void | initialParse(): The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects. |
| boolean | isLenient(): Return true if parser is lenient. |
| protected int | lastIndexOf(char[] pattern, byte[] buf, int endOff): Searches last appearance of pattern within buffer. |
| void | parse(): This will parse the stream and populate the COSDocument object. |
| protected COSStream | parseCOSStream(COSDictionary dic, RandomAccess file): This will read a COSStream from the input stream using length attribute within dictionary. |
| protected COSBase | parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state. |
| protected COSBase | parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj): This will parse the next object from the stream and add it to the local state. |
| protected void | readPattern(char[] pattern): Reads given pattern from BaseParser.pdfSource. |
| protected void | releasePdfSourceInputStream(): Enable handling of alternative pdfSource implementation. |
| void | setEOFLookupRange(int byteCount): Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. |
| void | setLenient(boolean lenient): Change the parser leniency flag. |
| protected void | setPdfSource(long fileOffset): Sets BaseParser.pdfSource to start next parsing at given file offset. |

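The initialParse() entry describes collecting an offset for every object from the xref tables before any object is parsed. As a rough illustration of that idea, and not of PDFBox's own code, the sketch below reads a classic text xref table into a map from object number to byte offset. The class and method names are hypothetical, and the sketch assumes a well-formed classic table: it ignores xref streams and generation numbers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, independent of PDFBox; for illustration only.
final class XrefSketch {

    /**
     * Parses a classic (text) xref table and returns a map from object
     * number to byte offset for all in-use entries.
     */
    static Map<Integer, Long> parseXrefTable(String table) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        String[] lines = table.trim().split("\\r?\\n|\\r");
        int i = 0;
        if (lines.length > 0 && lines[0].trim().equals("xref")) {
            i = 1; // skip the "xref" keyword line
        }
        while (i < lines.length) {
            // Subsection header: "<first object number> <entry count>"
            String[] header = lines[i].trim().split("\\s+");
            if (header.length != 2) {
                break; // e.g. the "trailer" keyword ends the table
            }
            int first = Integer.parseInt(header[0]);
            int count = Integer.parseInt(header[1]);
            i++;
            for (int n = 0; n < count && i < lines.length; n++, i++) {
                // Entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free)
                String[] entry = lines[i].trim().split("\\s+");
                if (entry.length == 3 && entry[2].equals("n")) {
                    offsets.put(first + n, Long.parseLong(entry[0]));
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String sample = "xref\n0 3\n0000000000 65535 f \n"
                + "0000000017 00000 n \n0000000081 00000 n \ntrailer\n";
        System.out.println(parseXrefTable(sample)); // prints {1=17, 2=81}
    }
}
```

With such a map available, an object can be located by seeking directly to its offset instead of scanning the file sequentially.
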
Methods inherited from PDFParser: clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory

Methods inherited from BaseParser: isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces

public static final String SYSPROP_PARSEMINIMAL
public static final String SYSPROP_EOFLOOKUPRANGE
protected static final int DEFAULT_TRAIL_BYTECOUNT
protected static final char[] EOF_MARKER
protected static final char[] STARTXREF_MARKER
protected static final char[] OBJ_MARKER
protected SecurityHandler securityHandler
public static final String TMP_FILE_PREFIX
public NonSequentialPDFParser(String filename) throws IOException
Constructs parser for given file using memory buffer.
Parameters:
filename - the filename of the pdf to be parsed
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(File file, RandomAccess raBuf) throws IOException
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) throws IOException
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(InputStream input) throws IOException
Constructor.
Parameters:
input - input stream representing the pdf.
Throws:
IOException - If something went wrong.

public NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword) throws IOException
Constructor.
Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
Throws:
IOException - If something went wrong.

public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.
In case system property SYSPROP_EOFLOOKUPRANGE is defined, this value will be set on initialization but can be overwritten later.
Parameters:
byteCount - number of trailing bytes

protected void initialParse() throws IOException
The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects.
Throws:
IOException - If something went wrong.

protected final void setPdfSource(long fileOffset) throws IOException
Sets BaseParser.pdfSource to start next parsing at given file offset.
Parameters:
fileOffset - file offset
Throws:
IOException - If something went wrong.

protected final void releasePdfSourceInputStream() throws IOException
Enable handling of alternative pdfSource implementation.
Throws:
IOException - If something went wrong.

protected final long getStartxrefOffset() throws IOException
Looks for and parses startxref. The last EOF marker is searched for within the last DEFAULT_TRAIL_BYTECOUNT bytes (or range set via setEOFLookupRange(int)); from there the parser goes back to find startxref.
Throws:
IOException - If something went wrong.

protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches last appearance of pattern within buffer.
Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns:
start offset of the pattern within the buffer, or -1 if pattern could not be found

protected final void readPattern(char[] pattern) throws IOException
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.
Parameters:
pattern - pattern to be skipped
Throws:
IOException - if pattern could not be read

public void parse() throws IOException
This will parse the stream and populate the COSDocument object.
Overrides:
parse in class PDFParser
Throws:
IOException - If there is an error reading from the stream or corrupt data is found.

protected File getPdfFile()
Return the pdf file.

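The getStartxrefOffset(), lastIndexOf(...) and setEOFLookupRange(int) entries above describe a backward search over the trailing bytes of the file. The following is a minimal sketch of that technique, assuming an in-memory byte array; it is not PDFBox's implementation. The class name, the property name `sketch.eofLookupRange`, and the default of 2048 are hypothetical values chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, independent of PDFBox; for illustration only.
final class StartXrefSketch {
    // Assumed names and values, not taken from the library.
    static final String PROP_LOOKUP_RANGE = "sketch.eofLookupRange";
    static final int DEFAULT_RANGE = 2048;

    /** Lookup range: the system property if defined, otherwise the default. */
    static int lookupRange() {
        return Integer.getInteger(PROP_LOOKUP_RANGE, DEFAULT_RANGE);
    }

    /**
     * Returns the start index of the last occurrence of pattern that ends
     * at or before endOff (exclusive), or -1 if there is none.
     * Assumes endOff <= buf.length and an ASCII pattern.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        for (int start = endOff - pattern.length; start >= 0; start--) {
            int k = 0;
            while (k < pattern.length && buf[start + k] == (byte) pattern[k]) {
                k++;
            }
            if (k == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    /**
     * Finds the last %%EOF within the trailing range bytes, goes back to the
     * preceding startxref keyword, and parses the number between them.
     * Returns -1 if either marker is missing; throws NumberFormatException
     * if the text between the markers is not a number.
     */
    static long startXrefOffset(byte[] pdf, int range) {
        int from = Math.max(0, pdf.length - range);
        byte[] tail = Arrays.copyOfRange(pdf, from, pdf.length);
        char[] startxref = "startxref".toCharArray();
        int eof = lastIndexOf("%%EOF".toCharArray(), tail, tail.length);
        if (eof < 0) {
            return -1;
        }
        int marker = lastIndexOf(startxref, tail, eof);
        if (marker < 0) {
            return -1;
        }
        int numStart = marker + startxref.length;
        String between = new String(tail, numStart, eof - numStart, StandardCharsets.ISO_8859_1);
        return Long.parseLong(between.trim());
    }

    public static void main(String[] args) {
        // The value 116 is arbitrary sample data, not a real offset into this string.
        byte[] pdf = "%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n116\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(startXrefOffset(pdf, lookupRange())); // prints 116
    }
}
```

Searching only the tail keeps the lookup cheap for large files, at the cost of missing markers that lie further from the end than the configured range; this is the trade-off a larger lookup range addresses.
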
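public boolean isLenient()
Return true if parser is lenient.

public void setLenient(boolean lenient) throws IllegalArgumentException
Change the parser leniency flag.
Parameters:
lenient - the new value of the leniency flag
Throws:
IllegalArgumentException - if the method is called after parsing.

protected void deleteTempFile()
Remove the temporary file.
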
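The setLenient(boolean) contract above, which rejects changes once parsing has taken place, can be pictured with a small guard flag. This is only a sketch of the pattern under stated assumptions: the class, its `parsingStarted` field, and the default value of the flag are hypothetical and are not taken from the library source.

```java
// Hypothetical stand-in for a parser; for illustration only.
final class LenientFlagSketch {
    private boolean lenient = true;          // assumed default
    private boolean parsingStarted = false;  // assumed internal state

    boolean isLenient() {
        return lenient;
    }

    /** Changes the flag; refuses once parse() has been called. */
    void setLenient(boolean lenient) {
        if (parsingStarted) {
            throw new IllegalArgumentException("Cannot change leniency after parsing");
        }
        this.lenient = lenient;
    }

    /** Marks parsing as started; the actual parsing work is omitted. */
    void parse() {
        parsingStarted = true;
    }

    public static void main(String[] args) {
        LenientFlagSketch parser = new LenientFlagSketch();
        parser.setLenient(false); // allowed: parsing has not started
        parser.parse();
        try {
            parser.setLenient(true); // rejected: called after parsing
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In practice this means the leniency flag has to be configured between construction and the call to parse().
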
public SecurityHandler getSecurityHandler()
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.

public PDDocument getPDDocument() throws IOException
This will get the PD document that was parsed.
Overrides:
getPDDocument in class PDFParser
Throws:
IOException - If there is an error getting the document.

public int getPageNumber() throws IOException
Returns the number of pages in a document.
Throws:
IOException - if PAGES or other needed object is missing

public PDPage getPage(int pageNr) throws IOException
Returns the page requested with all the objects loaded into it.
Parameters:
pageNr - page index, starting from 0 up to the number of pages.
Throws:
IOException - If something went wrong.

protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. Taken from PDFParser and reduced to parsing an indirect object.
Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true, the object to be parsed must not be contained within a compressed stream
Throws:
IOException - If an IO error occurs.

protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. Taken from PDFParser and reduced to parsing an indirect object.
Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true, the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
Throws:
IOException - If an IO error occurs.

protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr) throws IOException
Parameters:
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

protected final void decryptString(COSString str, long objNr, long objGenNr) throws IOException
Decrypts given COSString.
Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

protected final void decrypt(COSBase pb, int objNr, int objGenNr) throws IOException
Decrypts given object.
Parameters:
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong

protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws IOException
This will read a COSStream from the input stream using length attribute within dictionary.
Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.

Copyright © 2002-2016 The Apache Software Foundation. All Rights Reserved.