I am a fan of open source. At DARPA, I work with open source technologies and create more open source as a result. However, when I had to extract information from a PDF police report, I ran into problems with this type of PDF. Here is a fragment of my document.
Now, you can see at a glance that the document breaks cleanly into (field, value) pairs. However, if you copy and paste the text, you get this:
Report no.:
Occurrence Type:
Occurrence time:
Reported time:
Place of offence:
Clearance status:
Concluded:
Concluded date:
Summary:
Remarks:
20131 234567
Impaired Operation/over 80 mg% of Motor Vehicle 253(1)(a)/(b) CC
2013/08/08 20:10 -
2013/08/08 20:10
1072 102 STREET, NORTH BATTLEFORD, SK Canada (CROWN CAB) (Div: F,
Dist: CENTRAL, Det: Battleford Municipal, Zone: BFD, Atom: C)
Cleared by charge/charge recommended
Yes
2013/08/29
Cst. SMITH
As you can see, the layout is not preserved: all the field labels come first, followed by all the values, and that makes the text very hard to parse. I tried 'save as text', Tika, and PDFBox, and I also asked the Tika people. The result was always the same: I get all the text but not the formatting.
Enter Aspose. It is closed source and comes with a price tag. But you know what? It is the only tool I tried that does the job and gives me the text output in the same layout as the PDF.
Here is the code I had to use:
private void initAsposeLicense() {
    com.aspose.pdf.License license = new com.aspose.pdf.License();
    try {
        // Attempt to stream the license from the classpath (did not work for me):
        // ClassLoader classLoader = getClass().getClassLoader();
        // File file = new File(classLoader.getResource("Aspose.Pdf.lic").getFile());
        // InputStream licenseStream = new FileInputStream(file);
        // license.setLicense(licenseStream);
        // Instead, read the license from the folder the program runs in
        license.setLicense("Aspose.Pdf.lic");
    } catch (Exception e) {
        logger.error("Aspose license problem", e);
    }
}
As you can see, I tried to stream the license in. It would be better to distribute the license inside the jar, but that did not work for some reason. Then again, keeping the license outside may be better, since you can replace it without rebuilding. So I just read it from the folder where the executable lives.
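One possible reason the commented-out attempt failed, though I have not verified it against this setup: once a resource is packed inside a jar, it is no longer a file on disk, so wrapping `getResource(...).getFile()` in a `File` does not give a readable path. The usual alternative is `getResourceAsStream`. Below is a minimal sketch of that idea with a fallback to a folder on disk. The class name `LicenseLocator`, the method `openLicense`, and the lookup order are my own illustrative choices, not part of Aspose.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: finds the license on the classpath or on disk.
class LicenseLocator {
    /**
     * Opens the named resource as a stream. Looks on the classpath first
     * (this also works when the resource is packed inside a jar), then in
     * the given fallback folder.
     */
    static InputStream openLicense(String name, Path fallbackDir) throws IOException {
        InputStream fromClasspath =
                LicenseLocator.class.getClassLoader().getResourceAsStream(name);
        if (fromClasspath != null) {
            return fromClasspath;
        }
        // Not on the classpath: read it from disk instead.
        // Throws an IOException if the file is missing there as well.
        return Files.newInputStream(fallbackDir.resolve(name));
    }
}
```

The returned stream could then be passed to the same `license.setLicense(licenseStream)` call that appears in the commented-out lines above, ideally inside a try-with-resources block so the stream gets closed.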
Extracting the text was also extremely easy:
private String extractWithAspose(File file) throws IOException {
    // Open document
    com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(file.getPath());
    // Create TextAbsorber object to extract text
    com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
    // Accept the absorber for all the pages
    pdfDocument.getPages().accept(textAbsorber);
    // Get the extracted text
    String extractedText = textAbsorber.getText();
    // System.out.println("extractedText=\n" + extractedText);
    return extractedText;
}
So now I can create a spreadsheet of fields/values for the whole document corpus:
Report no.:|Occurrence Type:|Occurrence time:|Reported time:|Place of offence:|Clearance status:|Concluded:|Concluded date:|Summary:|Remarks:|Associated occurrences:|Involved persons:|Involved addresses:|Involved comm addresses:|Involved vehicles:|Involved officers:|Involved property:|Modus operandi:|Reports:|Supplementary report:
"20131234567"|
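For the curious, here is a rough sketch of how layout-preserved text might be turned into such a row. It is not the exact code from my project. It assumes that each field label starts a line and is followed by its value, and that a line without a known label continues the previous value (as the wrapped address does). The class `FieldValueParser`, its methods, and the quoting rule are illustrative assumptions; only the first ten labels from the header row are listed, to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns "label: value" lines into a pipe-delimited row.
class FieldValueParser {
    // Labels copied from the header row of the spreadsheet; extend as needed.
    static final List<String> LABELS = Arrays.asList(
            "Report no.:", "Occurrence Type:", "Occurrence time:", "Reported time:",
            "Place of offence:", "Clearance status:", "Concluded:", "Concluded date:",
            "Summary:", "Remarks:");

    /** Maps each label found in the text to its value, joining wrapped lines. */
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        String current = null;
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            String label = matchLabel(line);
            if (label != null) {
                // A new field starts here; the rest of the line is its value.
                current = label;
                fields.put(label, line.substring(label.length()).trim());
            } else if (current != null) {
                // No known label: treat the line as a continuation of the last value.
                String previous = fields.get(current);
                fields.put(current, previous.isEmpty() ? line : previous + " " + line);
            }
        }
        return fields;
    }

    /** Returns the known label the line starts with, or null if there is none. */
    static String matchLabel(String line) {
        for (String label : LABELS) {
            if (line.startsWith(label)) {
                return label;
            }
        }
        return null;
    }

    /** Builds one quoted, pipe-delimited row in header order; missing fields stay empty. */
    static String toRow(Map<String, String> fields) {
        List<String> cells = new ArrayList<>();
        for (String label : LABELS) {
            String value = fields.getOrDefault(label, "");
            cells.add("\"" + value.replace("\"", "\"\"") + "\"");
        }
        return String.join("|", cells);
    }
}
```

Matching against a fixed label list, rather than splitting on the first colon, matters here because values such as "Div: F, Dist: CENTRAL" contain colons of their own.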
Now I can happily proceed with my text analytics tasks.