public class LucenePDFDocument extends Object
Lucene Field Name | Description |
---|---|
path | File system path if loaded from a file |
url | URL to PDF document |
contents | Entire contents of PDF document, indexed but not stored |
summary | First 500 characters of content |
modified | The modified date/time according to the url or path |
uid | A unique identifier for the Lucene document. |
CreationDate | From PDF meta-data if available |
Creator | From PDF meta-data if available |
Keywords | From PDF meta-data if available |
ModificationDate | From PDF meta-data if available |
Producer | From PDF meta-data if available |
Subject | From PDF meta-data if available |
Trapped | From PDF meta-data if available |
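
The `summary` field is described above as the first 500 characters of the content. A minimal sketch of that truncation, using only the JDK, might look like the following. The class `SummarySketch` and the method `summarize` are hypothetical names chosen for illustration and are not part of LucenePDFDocument.

```java
// Hypothetical helper for illustration; not part of LucenePDFDocument.
class SummarySketch {
    // Length taken from the field table: "First 500 characters of content".
    static final int SUMMARY_LENGTH = 500;

    // Returns at most the first SUMMARY_LENGTH characters of the extracted text.
    static String summarize(String contents) {
        // Guard against documents shorter than the summary length.
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 1200; i++) {
            longText.append('x');
        }
        System.out.println(summarize(longText.toString()).length()); // prints 500
        System.out.println(summarize("short text"));                 // prints short text
    }
}
```

Clamping the end index with `Math.min` keeps short documents from triggering a `StringIndexOutOfBoundsException`.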
Modifier and Type | Field and Description |
---|---|
static org.apache.lucene.document.FieldType | TYPE_STORED_NOT_INDEXED: Not indexed, tokenized, stored. |
Constructor and Description |
---|
LucenePDFDocument(): Constructor. |
Modifier and Type | Method and Description |
---|---|
org.apache.lucene.document.Document | convertDocument(File file): Takes a reference to a PDF document and creates a Lucene document. |
org.apache.lucene.document.Document | convertDocument(InputStream is): Converts the PDF stream to a Lucene document. |
org.apache.lucene.document.Document | convertDocument(URL url): Converts the document from a PDF to a Lucene document. |
static String | createUID(File file): Creates a UID for the given file. |
static String | createUID(URL url, long time): Creates a UID for the given file using the given time. |
static org.apache.lucene.document.Document | getDocument(File file): Gets a Lucene document from a PDF file. |
static org.apache.lucene.document.Document | getDocument(InputStream is): Gets a Lucene document from a PDF stream. |
static org.apache.lucene.document.Document | getDocument(URL url): Gets a Lucene document from a PDF at a URL. |
void | setTextStripper(PDFTextStripper aStripper): Sets the text stripper that will be used during extraction. |
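
The `createUID(URL url, long time)` entry above pairs a location with a time value. As a rough illustration of that idea only, the sketch below joins the two into one string using the JDK alone. The class `UidSketch`, the method `createUid`, the `|` separator, and the raw millisecond format are all assumptions made for this example; the format that LucenePDFDocument actually produces is not documented on this page.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch; not the LucenePDFDocument implementation.
class UidSketch {
    // Assumed format: "<external form of the URL>|<time value>".
    static String createUid(URL url, long time) {
        return url.toExternalForm() + "|" + time;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = URI.create("http://example.com/a.pdf").toURL();
        System.out.println(createUid(url, 1000L)); // prints http://example.com/a.pdf|1000
        // The same URL with a different time gives a different identifier.
        System.out.println(createUid(url, 1000L).equals(createUid(url, 2000L))); // prints false
    }
}
```

Because the time is part of the identifier, the same URL yields a new UID when the supplied time changes. An indexer could plausibly use that to spot updated documents, though that use is an assumption here.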
public static final org.apache.lucene.document.FieldType TYPE_STORED_NOT_INDEXED

public void setTextStripper(PDFTextStripper aStripper)
aStripper - The new PDF text stripper.

public org.apache.lucene.document.Document convertDocument(InputStream is) throws IOException
is - The input stream.
IOException - If there is an error converting the PDF.

public org.apache.lucene.document.Document convertDocument(File file) throws IOException
file - A reference to a PDF document.
IOException - If there is an exception while converting the document.

public org.apache.lucene.document.Document convertDocument(URL url) throws IOException
url - A URL to a PDF document.
IOException - If there is an error while converting the document.

public static org.apache.lucene.document.Document getDocument(InputStream is) throws IOException
is - The stream to read the PDF from.
IOException - If there is an error parsing or indexing the document.

public static org.apache.lucene.document.Document getDocument(File file) throws IOException
file - The file to get the document for.
IOException - If there is an error parsing or indexing the document.

public static org.apache.lucene.document.Document getDocument(URL url) throws IOException
url - The URL of the file to get the document for.
IOException - If there is an error parsing or indexing the document.

public static String createUID(URL url, long time)
url - The file we have to create a UID for.
time - The time to use for the UID.

Copyright © 2002–2016 The Apache Software Foundation. All rights reserved.