org.apache.pdfbox.searchengine.lucene
Class LucenePDFDocument

java.lang.Object
  extended by org.apache.pdfbox.searchengine.lucene.LucenePDFDocument

public final class LucenePDFDocument
extends java.lang.Object

This class creates a document for the Lucene search engine. It should plug easily into the IndexHTML or IndexFiles classes that come with the Lucene project. This class populates the following fields.

Lucene Field Name   Description
path                File system path if loaded from a file
url                 URL to PDF document
contents            Entire contents of PDF document, indexed but not stored
summary             First 500 characters of content
modified            The modified date/time according to the url or path
uid                 A unique identifier for the Lucene document
CreationDate        From PDF meta-data if available
Creator             From PDF meta-data if available
Keywords            From PDF meta-data if available
ModificationDate    From PDF meta-data if available
Producer            From PDF meta-data if available
Subject             From PDF meta-data if available
Trapped             From PDF meta-data if available
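
The summary field is described above as the first 500 characters of the content. A minimal sketch of that truncation, assuming it is a plain prefix cut with no word-boundary handling. The class name SummarySketch and the method summarize are hypothetical and are not part of this class.

```java
final class SummarySketch {
    // Limit taken from the "First 500 characters of content" description.
    static final int SUMMARY_LENGTH = 500;

    // Returns the leading SUMMARY_LENGTH characters, or the whole text if it is shorter.
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length()); // prints 500
        System.out.println(summarize("short text"));                 // prints short text
    }
}
```

Text shorter than the limit is returned unchanged, so short PDFs end up with a summary equal to their full extracted text.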

Version:
$Revision: 1.23 $
Author:
Ben Litchfield

Constructor Summary
LucenePDFDocument()
          Constructor.
 
Method Summary
 org.apache.lucene.document.Document convertDocument(java.io.File file)
          This will take a reference to a PDF document and create a Lucene document.
 org.apache.lucene.document.Document convertDocument(java.io.InputStream is)
          Convert the PDF stream to a Lucene document.
 org.apache.lucene.document.Document convertDocument(java.net.URL url)
          Convert the document from a PDF to a Lucene document.
 org.apache.lucene.document.DateTools.Resolution getDateTimeResolution()
          Get the Lucene date/time resolution.
static org.apache.lucene.document.Document getDocument(java.io.File file)
          This will get a Lucene document from a PDF file.
static org.apache.lucene.document.Document getDocument(java.io.InputStream is)
          This will get a Lucene document from a PDF stream.
static org.apache.lucene.document.Document getDocument(java.net.URL url)
          This will get a Lucene document from a PDF at a URL.
static void main(java.lang.String[] args)
          This will test creating a document.
 void setDateTimeResolution(org.apache.lucene.document.DateTools.Resolution resolution)
          Set the Lucene date/time resolution.
 void setTextStripper(PDFTextStripper aStripper)
          Set the text stripper that will be used during extraction.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LucenePDFDocument

public LucenePDFDocument()
Constructor.

Method Detail

setTextStripper

public void setTextStripper(PDFTextStripper aStripper)
Set the text stripper that will be used during extraction.

Parameters:
aStripper - The new pdf text stripper.

getDateTimeResolution

public org.apache.lucene.document.DateTools.Resolution getDateTimeResolution()
Get the Lucene date/time resolution.

Returns:
current date/time resolution

setDateTimeResolution

public void setDateTimeResolution(org.apache.lucene.document.DateTools.Resolution resolution)
Set the Lucene date/time resolution.

Parameters:
resolution - The new date/time resolution.
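
The resolution controls how finely a timestamp is encoded before it is indexed. The sketch below approximates that idea with java.time only, assuming the sortable UTC digit-string layout that Lucene's DateTools uses (for example yyyyMMdd at day resolution). The Resolution enum and the encode method here are hypothetical stand-ins, not the Lucene API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class DateResolutionSketch {
    // Hypothetical stand-in for DateTools.Resolution; each constant carries its digit pattern.
    enum Resolution {
        YEAR("yyyy"), MONTH("yyyyMM"), DAY("yyyyMMdd"), HOUR("yyyyMMddHH"),
        MINUTE("yyyyMMddHHmm"), SECOND("yyyyMMddHHmmss");

        private final DateTimeFormatter formatter;

        Resolution(String pattern) {
            // Always format in UTC so the strings sort consistently across machines.
            this.formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        }
    }

    // Encodes a timestamp as a digit string, dropping everything finer than the resolution.
    static String encode(long epochMillis, Resolution resolution) {
        return resolution.formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long timestamp = 90_061_000L; // 1970-01-02T01:01:01Z
        System.out.println(encode(timestamp, Resolution.DAY));    // prints 19700102
        System.out.println(encode(timestamp, Resolution.SECOND)); // prints 19700102010101
    }
}
```

A coarser resolution yields fewer distinct terms in the index, at the cost of less precise date comparisons.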

convertDocument

public org.apache.lucene.document.Document convertDocument(java.io.InputStream is)
                                                    throws java.io.IOException
Convert the PDF stream to a Lucene document.

Parameters:
is - The input stream.
Returns:
The input stream converted to a Lucene document.
Throws:
java.io.IOException - If there is an error converting the PDF.

convertDocument

public org.apache.lucene.document.Document convertDocument(java.io.File file)
                                                    throws java.io.IOException
This will take a reference to a PDF document and create a Lucene document.

Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.
Throws:
java.io.IOException - If there is an exception while converting the document.

convertDocument

public org.apache.lucene.document.Document convertDocument(java.net.URL url)
                                                    throws java.io.IOException
Convert the document from a PDF to a Lucene document.

Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.
Throws:
java.io.IOException - If there is an error while converting the document.

getDocument

public static org.apache.lucene.document.Document getDocument(java.io.InputStream is)
                                                       throws java.io.IOException
This will get a Lucene document from a PDF stream.

Parameters:
is - The stream to read the PDF from.
Returns:
The Lucene document.
Throws:
java.io.IOException - If there is an error parsing or indexing the document.

getDocument

public static org.apache.lucene.document.Document getDocument(java.io.File file)
                                                       throws java.io.IOException
This will get a Lucene document from a PDF file.

Parameters:
file - The file to get the document for.
Returns:
The Lucene document.
Throws:
java.io.IOException - If there is an error parsing or indexing the document.
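
A minimal usage sketch of the two entry points documented on this page: the static getDocument(File) helper and an instance configured with setDateTimeResolution before calling convertDocument(File). It assumes the PDFBox and Lucene jars matching this Javadoc are on the classpath. The class PdfToLuceneSketch and its method names are hypothetical, and whether a given field can be read back with Document.get depends on whether it was stored.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.pdfbox.searchengine.lucene.LucenePDFDocument;

final class PdfToLuceneSketch {
    // One-shot conversion through the static helper.
    static Document viaStaticHelper(File pdf) throws IOException {
        return LucenePDFDocument.getDocument(pdf);
    }

    // Builds a converter that encodes dates at day resolution.
    static LucenePDFDocument newDayResolutionConverter() {
        LucenePDFDocument converter = new LucenePDFDocument();
        converter.setDateTimeResolution(DateTools.Resolution.DAY);
        return converter;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfToLuceneSketch <pdf-document>");
            return;
        }
        File pdf = new File(args[0]);
        Document doc = newDayResolutionConverter().convertDocument(pdf);
        // Document.get returns null when a field is absent or was not stored.
        System.out.println("path:    " + doc.get("path"));
        System.out.println("summary: " + doc.get("summary"));
    }
}
```

The static helper suits one-off conversions, while the instance form is the one to use when a custom resolution or text stripper is needed.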

getDocument

public static org.apache.lucene.document.Document getDocument(java.net.URL url)
                                                       throws java.io.IOException
This will get a Lucene document from a PDF at a URL.

Parameters:
url - The URL of the PDF to get the document for.
Returns:
The Lucene document.
Throws:
java.io.IOException - If there is an error parsing or indexing the document.

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
This will test creating a document. Usage: java org.apache.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>

Parameters:
args - command line arguments.
Throws:
java.io.IOException - If there is an error.


Copyright © 2002-2010 The Apache Software Foundation. All Rights Reserved.