External Links

This page lists projects that utilize PDFBox and articles that have been written about PDFBox. Please file an improvement issue to get new projects or articles added to this page, or to update the information on existing links.

Projects Using PDFBox

Project Name | License | Project Description
Alfresco | LGPL - commercial services/support/training is available | Alfresco is an open source, open-standards content repository built by the most experienced content management team that includes the co-founder of Documentum.
Apache Nutch | Apache License v2 | Apache Nutch is open source web-search software. It builds on Apache Lucene, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
Apache Tika | Apache License v2 | Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Jomic | GPL | Jomic is a viewer for comic book archives.
JpdfUnit | Apache License v2 | JpdfUnit is a framework for testing a generated PDF document with the JUnit test framework.
Liferay Portal | MIT | Liferay Portal is an open source portal that helps organizations collaborate more efficiently by providing a consolidated view of disparate applications.
LuceGene | Artistic License | LuceGene is an open-source document/object search and retrieval system specially tuned for bioinformatics text databases and documents.
Lutece | BSD-like | Lutece is a portal engine that allows you to easily create websites or intranets based on HTML and XML content.
MMBase Lucene Module | MPL | MMBase Lucene Module is a plugin (module) for the MMBase content management system that enables Lucene full-text search through its content and, thanks to PDFBox, also through PDF content.
OpenCms | LGPL | OpenCms is an open source website content management system.
OpenSearchServer | GPLv3 | An open source search engine and crawler based on the best open source technologies. It is a modern search engine and a suite of high-powered full-text search algorithms.
Orbeon PresentationServer | LGPL | Orbeon PresentationServer (OPS) is an open source J2EE-based platform for XML-centric web applications. OPS is built around XHTML, XForms, XSLT, XML pipelines, and Web Services, which makes it ideal for applications that capture, process and present XML data. Commercial consulting/training/support is available through Orbeon.
PDFJuice | Apache License v2 | This project provides tools that help the user extract structured information from PDF documents. Currently, the program is able to export them to HTML.
REWOO Scope | Commercial | REWOO Scope is Enterprise Content Management (ECM) software for organizing, structuring and consolidating enterprise data. Apache PDFBox is integral to reading and indexing PDF documents.
SearchBlox | Commercial | SearchBlox is high-performance corporate search software designed for the Java 2 Enterprise Edition (J2EE) platform.
Semantic Scholar | Web Based | Semantic Scholar is a new service from AI2 for scientific literature search and discovery, focusing on semantics and textual understanding.
SimplexRepaginator | Apache License v2 | Simplex Repaginator converts simplex-scanned PDFs into properly duplex-paginated PDFs and vice versa.
Terrier | MPL | Terrier is software for the rapid development of Web, intranet and desktop search engines.
Triboni GinkGO | Commercial | Triboni GinkGO is a highly scalable J2EE services platform that is based on a simple XML business object definition and scripting language. Together with XSLT, content-centric web applications can be configured in a very short time.

Articles/Books

Article Name | Article Abstract
Build an eDoc Reader for your iPod (Part 1 - User Interface; Part 2 - Document Reading Engine; Part 3 - Integration with PDFBox) | A three-part article that discusses the implementation of the PodReader application. PodReader is a Cocoa application written in Objective-C, and the article discusses how to use the Cocoa-Java bridge to integrate with the Java version of PDFBox.
Lucene In Action | A book that discusses integrating with the Lucene search engine. One chapter discusses how to index various file formats and highlights PDFBox for indexing PDF documents.
Java Developers Journal - March 2005 | An article written by the lead developer of PDFBox discussing text extraction and AcroForm integration using PDFBox functionality.
Refactoring trends across N versions of N Java open source systems: an empirical study | This article describes an empirical study of multiple versions of a range of open source Java systems in an attempt to understand whether refactoring occurs and, if so, which types of refactoring are most (and least) common. PDFBox is used as a case study.