public class COSParser extends BaseParser
PDFParser.parse() or FDFParser.parse() must be called before page objects can be retrieved, e.g. via PDFParser.getPDDocument().
This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
Modifier and Type | Field and Description |
---|---|
protected static char[] | EOF_MARKER - EOF-marker. |
protected long | fileLen - file length. |
protected boolean | initialParseDone |
protected static char[] | OBJ_MARKER - obj-marker. |
protected SecurityHandler | securityHandler - The security handler. |
protected RandomAccessRead | source |
static String | SYSPROP_EOFLOOKUPRANGE - The range within which the %%EOF marker will be searched. |
static String | SYSPROP_PARSEMINIMAL - Only parse the PDF file minimally, allowing access to basic information. |
static String | TMP_FILE_PREFIX - The prefix for the temp file being used. |
protected XrefTrailerResolver | xrefTrailerResolver - Collects all Xref/trailer objects and resolves them into a single object using the startxref reference. |
Fields inherited from class BaseParser: A, ASCII_CR, ASCII_LF, B, D, DEF, document, E, ENDOBJ_STRING, ENDSTREAM_STRING, J, M, N, O, R, S, seqSource, STREAM_STRING, T
Constructor and Description |
---|
COSParser(RandomAccessRead source) - Default constructor. |
Modifier and Type | Method and Description |
---|---|
protected void | checkPages(COSDictionary root) - Check if all entries of the pages dictionary are present. |
COSDocument | getDocument() - This will get the document that was parsed. |
protected long | getStartxrefOffset() - Looks for and parses startxref. |
protected boolean | isCatalog(COSDictionary dictionary) - Tell if the dictionary is a PDF catalog. |
boolean | isLenient() - Return true if parser is lenient. |
protected int | lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer. |
protected COSStream | parseCOSStream(COSDictionary dic) - This will read a COSStream from the input stream using the length attribute within the dictionary. |
protected void | parseDictObjects(COSDictionary dict, COSName... excludeObjects) - Will parse every object necessary to load a single page from the PDF document. |
protected boolean | parseFDFHeader() - Parse the header of an FDF. |
protected COSBase | parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
protected COSBase | parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state. |
protected boolean | parsePDFHeader() - Parse the header of a PDF. |
protected COSBase | parseTrailerValuesDynamically(COSDictionary trailer) - Parse the values of the trailer dictionary and return the root object. |
protected COSDictionary | parseXref(long startXRefOffset) - Parses cross reference tables. |
protected boolean | parseXrefTable(long startByteOffset) - This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored. |
protected COSDictionary | rebuildTrailer() - Rebuild the trailer dictionary if startxref can't be found. |
protected COSDictionary | retrieveTrailer() - Read the trailer information and provide a COSDictionary containing the trailer information. |
void | setEOFLookupRange(int byteCount) - Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. |
void | setLenient(boolean lenient) - Change the parser leniency flag. |
Methods inherited from class BaseParser: isClosing, isClosing, isDigit, isDigit, isEndOfName, isEOL, isEOL, isSpace, isSpace, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseDirObject, readExpectedChar, readExpectedString, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, skipSpaces, skipWhiteSpaces
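The method summary describes getStartxrefOffset() as looking for and parsing startxref, and setEOFLookupRange(int) as limiting how many trailing bytes are searched for the EOF marker and the 'startxref' marker. The following is a minimal, self-contained sketch of that kind of lookup, not PDFBox's implementation: the class name StartxrefLookupSketch, the method findStartxrefOffset, and the choice to return -1 on failure are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; names and error handling are assumptions, not PDFBox code. */
final class StartxrefLookupSketch {

    /**
     * Returns the number that follows the last "startxref" keyword preceding the
     * last "%%EOF" marker within the trailing lookupRange bytes, or -1 if either
     * marker is missing or the value is not a number.
     */
    static long findStartxrefOffset(byte[] file, int lookupRange) {
        int windowLen = Math.max(0, Math.min(file.length, lookupRange));
        int windowStart = file.length - windowLen;
        // ISO-8859-1 maps each byte to exactly one char, so indices match byte offsets.
        String tail = new String(file, windowStart, windowLen, StandardCharsets.ISO_8859_1);

        int eofPos = tail.lastIndexOf("%%EOF");
        if (eofPos < 0) {
            return -1;
        }
        // Go back from the EOF marker to the closest preceding "startxref".
        int keywordPos = tail.lastIndexOf("startxref", eofPos);
        if (keywordPos < 0) {
            return -1;
        }
        String digits = tail.substring(keywordPos + "startxref".length(), eofPos).trim();
        try {
            return Long.parseLong(digits);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n(body)\nstartxref\n1234\n%%EOF\n<html>junk</html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartxrefOffset(pdf, 2048)); // prints 1234
        System.out.println(findStartxrefOffset(pdf, 8));    // prints -1: window holds only trailing garbage
    }
}
```

The second call illustrates why a lookup range that is too small fails when garbage follows the EOF marker: the search window no longer contains the marker at all.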
protected final RandomAccessRead source
public static final String SYSPROP_PARSEMINIMAL
public static final String SYSPROP_EOFLOOKUPRANGE
protected static final char[] EOF_MARKER
protected static final char[] OBJ_MARKER
protected long fileLen
protected boolean initialParseDone
protected SecurityHandler securityHandler
protected XrefTrailerResolver xrefTrailerResolver
public static final String TMP_FILE_PREFIX
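EOF_MARKER and OBJ_MARKER are declared as char[], while lastIndexOf(char[] pattern, byte[] buf, int endOff) searches a byte buffer, with endOff documented as the exclusive offset where the lookup starts and -1 as the not-found result. A backward search with that signature could be sketched as follows. This is an assumption-laden illustration, not the library's code: the class name LastIndexOfSketch is hypothetical, and the sketch returns the start offset of the match.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not PDFBox code. */
final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * entirely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        int end = Math.min(endOff, buf.length);
        // Try candidate start positions from right to left.
        for (int start = end - pattern.length; start >= 0; start--) {
            boolean match = true;
            for (int i = 0; i < pattern.length; i++) {
                // Treat bytes as unsigned so they compare cleanly with chars.
                if ((buf[start + i] & 0xFF) != pattern[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] buf = "%%EOF junk %%EOF tail".getBytes(StandardCharsets.US_ASCII);
        char[] eof = {'%', '%', 'E', 'O', 'F'};
        System.out.println(lastIndexOf(eof, buf, buf.length)); // prints 11
        System.out.println(lastIndexOf(eof, buf, 15));         // prints 0
    }
}
```

Passing a smaller endOff, as in the second call, lets a caller step back to an earlier occurrence after rejecting a later one.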
public COSParser(RandomAccessRead source)
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.
The new value is checked to be at least 16. For practical use cases, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.
If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later.
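The rules described here (a lower bound of 16, an optional system property applied on initialization, and a later override through the setter) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the PDFBox implementation: the class name, the default of 2048 bytes, the property key "example.eofLookupRange", and the decision to silently ignore values below 16 are all hypothetical.

```java
/** Illustrative sketch only; names, default value and property key are assumptions. */
final class EofLookupRangeSketch {
    // Assumed default; the actual value of DEFAULT_TRAIL_BYTECOUNT is not stated here.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Hypothetical key standing in for the value of SYSPROP_EOFLOOKUPRANGE.
    static final String HYPOTHETICAL_SYSPROP = "example.eofLookupRange";

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property once, at initialization.
        String configured = System.getProperty(HYPOTHETICAL_SYSPROP);
        if (configured != null) {
            try {
                setEOFLookupRange(Integer.parseInt(configured.trim()));
            } catch (NumberFormatException e) {
                // Unparseable property value: keep the default.
            }
        }
    }

    /** Accepts the new range only if it is at least 16 bytes (assumed: otherwise ignored). */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= 16) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }

    public static void main(String[] args) {
        EofLookupRangeSketch sketch = new EofLookupRangeSketch();
        sketch.setEOFLookupRange(8);     // below 16, so the previous value is kept
        System.out.println(sketch.getEOFLookupRange());
        sketch.setEOFLookupRange(4096);  // accepted, overrides any earlier value
        System.out.println(sketch.getEOFLookupRange());
    }
}
```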
byteCount - number of trailing bytes
protected COSDictionary retrieveTrailer() throws IOException
IOException - if something went wrong
protected COSDictionary parseXref(long startXRefOffset) throws IOException
startXRefOffset - start offset of the first table
IOException - if something went wrong
protected final long getStartxrefOffset() throws IOException
Searches the last DEFAULT_TRAIL_BYTECOUNT bytes (or the range set via setEOFLookupRange(int)) for the EOF marker and goes back from there to find startxref.
IOException - If something went wrong.
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns -1 if pattern could not be found.
public boolean isLenient()
public void setLenient(boolean lenient)
lenient - try to handle malformed PDFs.
protected void parseDictObjects(COSDictionary dict, COSName... excludeObjects) throws IOException
dict - the COSObject from the parent pages.
excludeObjects - dictionary object reference entries with these names will not be parsed
IOException - if something went wrong
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
obj - object to be parsed (only the object number and generation number are used to look up the start offset)
requireExistingNotCompressedObj - if true, the object to be parsed must not be contained within a compressed stream
IOException - If an IO error occurs.
protected COSBase parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true, the object to be parsed must be defined in xref (note: null objects may be missing from xref) and it must not be a compressed object within an object stream (this is used to avoid getting stuck in a loop in a malicious PDF)
IOException - If an IO error occurs.
protected COSStream parseCOSStream(COSDictionary dic) throws IOException
dic - dictionary that goes with this stream.
IOException - if an error occurred reading the stream, like problems with reading the length attribute, stream does not end with 'endstream' after data read, stream too short etc.
protected final COSDictionary rebuildTrailer() throws IOException
IOException - if something went wrong
protected void checkPages(COSDictionary root)
root - the root dictionary of the PDF
protected boolean isCatalog(COSDictionary dictionary)
dictionary -
protected boolean parsePDFHeader() throws IOException
IOException - if something went wrong
protected boolean parseFDFHeader() throws IOException
IOException - if something went wrong
protected boolean parseXrefTable(long startByteOffset) throws IOException
startByteOffset - the offset to start at
IOException - If an IO error occurs.
public COSDocument getDocument() throws IOException
IOException - If there is an error getting the document.
protected COSBase parseTrailerValuesDynamically(COSDictionary trailer) throws IOException
trailer - The trailer dictionary.
IOException - If an IO error occurs or if the root object is missing in the trailer dictionary.
Copyright © 2002–2017 The Apache Software Foundation. All rights reserved.