Wednesday, December 23, 2015

Friday, December 18, 2015

Houston Hadoop Meetup - Going from Hadoop to Spark

This time Mark Kerzner, the organizer, presented "Going from Hadoop to Spark" - but with a slant on basics. It worked, more than half of the audience were there for the first time, and they got an introduction into Hadoop just as into Spark.

It may have worked too well: many people present wanted to bypass Hadoop altogether and start their learning from Spark. Don't. Start with a Hadoop overview, you can get our free book here, and good Spark training contents is available at DataBricks, the company behind Spark. You might also want the introduction to "Spark Illuminated", the book we are currently writing.

The slides from the presentation can be found here.

Thank you and see you next time.

PS. The new meeting place at the Slalom office in Houston was fantastic, we plan to stay there - thanks.

Wednesday, December 9, 2015

Data Analytics at the Memex DARPA program with ASPOSE

I am a fan of open source. At DARPA, I work with open source technologies and create more open source as a result. However, when I had to extract information from a PDF police report, I ran into problems for this type of PDF. Here is a fragment of my document.

Now, you can easily see that the document easily breaks into (field, value) pairs. However, if you copy/paste the text, you get this:

Report no.:
Occurrence Type:
Occurrence time:
Reported time:
Place of offence:
Clearance status:
Concluded date:
20131 234567
Impaired Operation/over 80 mg% of Motor Vehicle 253(1)(a)/(b) CC
2013/08/08 20:10 -
2013/08/08 20:10
Dist: CENTRAL, Det: Battleford Municipal, Zone: BFD, Atom: C)
Cleared by charge/charge recommended

As you can see, the formatting is not preserved, and it becomes very hard to parse. I tried 'save as text' and I tried Tika, and I tried PdfBox, and I also asked the Tika people. The result is the same: I get all the text but not the formatting.

Well, comes in Aspose. Close source and with a price tag. But you know what? It is the only one that does the job and gives me the text output in the same format as PDF was.

Here is the code I had to use

    private void initAsposeLicense() {
        com.aspose.pdf.License license = new com.aspose.pdf.License();
        try {
//            ClassLoader classLoader = getClass().getClassLoader();
//            File file = new File(classLoader.getResource("Aspose.Pdf.lic").getFile());
//            InputStream licenseStream = new FileInputStream(file);
//            license.setLicense(licenseStream);
        } catch (Exception e) {
            logger.error("Aspose license problem", e);

As you can see, I tried to stream the license in. It would be better to distributed to whole jar, but it did not work for some reason. Well, keeping the license outside may be better, since you can replace it. So I just read it from the executable location folder.

Extracting the text was also extremely easy

    private String extractWithAspose(File file) throws IOException {
        // Open document
        com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(file.getPath());

        // Create TextAbsorber object to extract text
        com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

        // Accept the absorber for all the pages

        // Get the extracted text
        String extractedText = textAbsorber.getText();
//        System.out.println("extractedText=\n" + extractedText);
        return extractedText;

So now I can create a spreadsheet of fields/values for the whole document corpus:

Report no.:|Occurrence Type:|Occurrence time:|Reported time:|Place of offence:|Clearance status:|Concluded:|Concluded date:|Summary:|Remarks:|Associated occurrences:|Involved persons:|Involved addresses:|Involved comm addresses:|Involved vehicles:|Involved officers:|Involved property:|Modus operandi:|Reports:|Supplementary report:

Now I can happily proceed with my text analytics tasks.