There are several ideas for enhancing PDFBox. They are outlined below, together with comments and the releases they are planned for once there is agreement to implement them.
Enhance the type safety of PDFBox, make more use of generic collections, and clean up the code.
This is an ongoing effort, and most, if not all, deprecated methods will be removed in PDFBox 2.0.0.
Apart from the PDF parsing itself, PDFBox does not always handle large PDF files well, because some of the references are implemented as int instead of long.
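The effect of the narrower type can be shown with a small sketch. This is not PDFBox code: the class name, method names, and the sample offset are made up for illustration. It only shows what happens when a byte offset beyond 2 GiB is forced into an int.

```java
public class OffsetWidthDemo {
    // Hypothetical byte offset of an object in a file larger than 2 GiB.
    static final long LARGE_OFFSET = 3000000000L;

    /** Narrowing to int silently wraps around for values above Integer.MAX_VALUE. */
    static int asIntOffset(long offset) {
        return (int) offset;
    }

    /** Reading the offset text as a long keeps the full value. */
    static long parseOffset(String offsetField) {
        return Long.parseLong(offsetField.trim());
    }

    public static void main(String[] args) {
        // The int version turns into a negative, unusable file position.
        System.out.println(asIntOffset(LARGE_OFFSET));
        // The long version preserves the offset.
        System.out.println(parseOffset("3000000000"));
    }
}
```

Note that `Integer.parseInt("3000000000")` would throw a `NumberFormatException`, so an int-based reader either fails outright or works with a corrupted position, depending on where the narrowing happens.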
PDFBox 2.0.0 has Java 6 as a minimum requirement.
In order to support different use cases and provide a minimal toolset, PDFBox 2.0.0 should be separated into different modules. This goes in line with rearranging some of the code, e.g. removing AWT from PDDocument.
PDFBox 2.0.0 will render most of the fonts without using AWT.
The old “classic” PDF parser in PDFBox is not in line with the PDF specification, as it parses a PDF from top to bottom instead of respecting the XRef information. The NonSequentialParser improved that situation, but there is a need for a cleaner foundation broken into several levels.
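As a rough illustration of what respecting the XRef information involves: a conforming reader starts at the end of the file, where the `startxref` keyword is followed by the byte offset of the cross-reference section and the `%%EOF` marker. The sketch below is not PDFBox code; the class and method names are assumptions, and it works on the file tail as a plain string for brevity.

```java
public class StartXrefSketch {
    private static final String KEYWORD = "startxref";

    /**
     * Returns the byte offset that follows the last "startxref" keyword
     * in the given tail of a PDF file.
     */
    static long findStartXref(String fileTail) {
        int marker = fileTail.lastIndexOf(KEYWORD);
        if (marker < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        int eof = fileTail.indexOf("%%EOF", marker);
        if (eof < 0) {
            throw new IllegalArgumentException("no %%EOF marker after startxref");
        }
        // The offset is the only content between the keyword and %%EOF.
        String offsetText = fileTail.substring(marker + KEYWORD.length(), eof).trim();
        return Long.parseLong(offsetText);
    }

    public static void main(String[] args) {
        String tail = "trailer\n<< /Size 8 >>\nstartxref\n1234\n%%EOF\n";
        System.out.println(findStartXref(tail));
    }
}
```

From that offset, a parser can read the cross-reference entries and jump directly to individual objects rather than scanning the file from the first byte.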
In addition, handling non-conforming documents shouldn’t be part of the core parser but of an extensible approach, e.g. by adding hooks that allow parsing exceptions to be handled.
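One possible shape for such a hook is sketched below. Everything here is hypothetical: the `ParseErrorHandler` interface, the `STRICT` and `LENIENT` handlers, and the toy integer "parser" merely stand in for a real core parser, which stays strict and delegates every recovery decision to a pluggable handler.

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryHookSketch {
    /** Hypothetical hook consulted whenever the strict parser hits a problem. */
    interface ParseErrorHandler {
        /** Returns a replacement value to continue with, or null to abort. */
        Integer recover(int tokenIndex, String badToken);
    }

    /** Strict behaviour: never recover. */
    static final ParseErrorHandler STRICT = new ParseErrorHandler() {
        public Integer recover(int tokenIndex, String badToken) {
            return null;
        }
    };

    /** Lenient behaviour: substitute 0 for anything unreadable. */
    static final ParseErrorHandler LENIENT = new ParseErrorHandler() {
        public Integer recover(int tokenIndex, String badToken) {
            return Integer.valueOf(0);
        }
    };

    /** Toy core parser: reads whitespace-separated integers. */
    static List<Integer> parse(String input, ParseErrorHandler handler) {
        List<Integer> values = new ArrayList<Integer>();
        String[] tokens = input.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            try {
                values.add(Integer.valueOf(tokens[i]));
            } catch (NumberFormatException e) {
                // The core parser does not guess; it asks the hook.
                Integer replacement = handler.recover(i, tokens[i]);
                if (replacement == null) {
                    throw new IllegalArgumentException(
                            "Malformed token at index " + i + ": " + tokens[i]);
                }
                values.add(replacement);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(parse("1 2 x 4", LENIENT));
    }
}
```

With this split, repair strategies for broken files can be added or exchanged without touching the parser itself.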
The current PDFBox version is limited to WinANSI encoded text. 2.0.0 should have Unicode support as well.
The COS level objects need to be refactored to be in line with the new parser. In addition, method signatures, constructing … should be made similar across the COS objects.
Instead of always parsing the complete document, PDFs should be parsable on demand, making objects available only as they are needed, to enhance performance and minimize the memory footprint.
This might be achieved with a layered approach, where a base (non-caching) parser provides the on-demand parsing and a caching parser built on top caches objects for use cases where this is beneficial, e.g. rendering, debugging …
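The layering could look roughly like the following sketch. All names (`ObjectParser`, `OnDemandParser`, `CachingParser`) are hypothetical, and the string returned by the base layer is a stand-in for a real parsed object.

```java
import java.util.HashMap;
import java.util.Map;

public class LayeredParserSketch {
    /** Hypothetical interface shared by both layers. */
    interface ObjectParser {
        String getObject(int objectNumber);
    }

    /** Base layer: parses on every request and keeps nothing in memory. */
    static class OnDemandParser implements ObjectParser {
        int parseCount = 0;

        public String getObject(int objectNumber) {
            parseCount++;
            // Stand-in for seeking to the object's offset and parsing it there.
            return "obj " + objectNumber;
        }
    }

    /** Optional layer on top: remembers objects that were already parsed. */
    static class CachingParser implements ObjectParser {
        private final ObjectParser delegate;
        private final Map<Integer, String> cache = new HashMap<Integer, String>();

        CachingParser(ObjectParser delegate) {
            this.delegate = delegate;
        }

        public String getObject(int objectNumber) {
            String cached = cache.get(objectNumber);
            if (cached == null) {
                cached = delegate.getObject(objectNumber);
                cache.put(objectNumber, cached);
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        OnDemandParser base = new OnDemandParser();
        ObjectParser cachedView = new CachingParser(base);
        cachedView.getObject(7);
        cachedView.getObject(7);
        // Number of times the base layer actually had to parse.
        System.out.println(base.parseCount);
    }
}
```

Callers that only touch a few objects can use the base layer directly and pay no memory cost, while callers that revisit objects repeatedly wrap it in the caching layer.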
The current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add some support for explicitly handling versions in PDFBox, e.g. by marking certain methods and properties with the PDF version they require. This could in addition be a good basis for PDF/A and other compliance checks.
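One way to mark methods with a version level is a runtime annotation, sketched below. The `SincePdf` annotation, the `Page` class, and its method names are hypothetical and do not exist in PDFBox; the version numbers reflect that transparency was introduced with PDF 1.4 and optional content with PDF 1.5.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

public class PdfVersionSketch {
    /** Hypothetical marker: the lowest PDF version that defines a feature. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface SincePdf {
        int major();
        int minor();
    }

    /** Hypothetical API class with version-marked methods. */
    static class Page {
        @SincePdf(major = 1, minor = 4)
        public void setTransparencyGroup() { }

        @SincePdf(major = 1, minor = 5)
        public void setOptionalContent() { }
    }

    /** Checks whether a method may be used for a document of the given version. */
    static boolean isSupported(Class<?> type, String methodName, int docMajor, int docMinor) {
        Method method;
        try {
            method = type.getDeclaredMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Unknown method: " + methodName, e);
        }
        SincePdf since = method.getAnnotation(SincePdf.class);
        if (since == null) {
            return true; // unmarked methods are treated as always available
        }
        if (since.major() != docMajor) {
            return since.major() < docMajor;
        }
        return since.minor() <= docMinor;
    }

    public static void main(String[] args) {
        System.out.println(isSupported(Page.class, "setOptionalContent", 1, 4));
        System.out.println(isSupported(Page.class, "setOptionalContent", 1, 5));
    }
}
```

A compliance checker could walk the same annotations to report which features used in a document exceed the version a given profile allows.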