Tuesday, February 2, 2016

Big Data Cartoon - Portraits of Famous People - Cos

Today we inaugurate a series of blog posts called "Big Data Famous People."

This is Konstantin Boudnik. A graduate of the St. Petersburg University Math Department, with a PhD in distributed systems, Cos holds strong opinions on many areas of programming languages, philosophy, and economics. You can ask him about them on his site here, or on the restricted version here.

Cos is the 16th most prolific committer to Hadoop. He helped incubate Ignite, Groovy, and a bunch of other projects.

Tuesday, January 26, 2016

Illustrated story of Houston Hadoop Meetup

Five years ago we came up with this logo for our book, Hadoop Illuminated. The Meetup that I started in 2011 was then gathering in the Houston Public Library and numbered in the tens of members, with 3-5 showing up at each event. When we wanted to spin up some Amazon clusters, we found out that the library blocked port 22. We started looking for other quarters.

Time flew, the Meetup grew, and we came to our first Hadoop bootcamp. We spent more money on going to the Pappadeaux restaurant than we made on the bootcamp, but fun was had by all.

Today, the meetup is close to 700 members, and the current event has about 70 registrants. We have two goals.

1. In 2011, the Bay Area Hadoop meetup had 3,000 members, and 300 would come to the Yahoo headquarters to hear about Hadoop HA. I was green with envy. Now we have a chance to grow even bigger than they did - let's try it.
2. There is still no significant Big Data presence in Houston. But there are real signs that this is going to change this year. You finally hear about Big Data projects in retail, energy, power, etc.

Guys, let's continue making Houston a Big Data capital. Cheers!


Friday, December 18, 2015

Houston Hadoop Meetup - Going from Hadoop to Spark

This time Mark Kerzner, the organizer, presented "Going from Hadoop to Spark" - but with a slant toward the basics. It worked: more than half of the audience were there for the first time, and they got an introduction to Hadoop as well as to Spark.

It may have worked too well: many people present wanted to bypass Hadoop altogether and start their learning with Spark. Don't. Start with a Hadoop overview - you can get our free book here - and good Spark training content is available from Databricks, the company behind Spark. You might also want the introduction to "Spark Illuminated", the book we are currently writing.

The slides from the presentation can be found here.

Thank you and see you next time.

PS. The new meeting place at the Slalom office in Houston was fantastic; we plan to stay there - thanks!

Wednesday, December 9, 2015

Data Analytics at the Memex DARPA program with Aspose

I am a fan of open source. At DARPA, I work with open source technologies and create more open source as a result. However, when I had to extract information from a PDF police report, I ran into problems with this particular type of PDF. Here is a fragment of my document.

[Screenshot: a fragment of the PDF police report, showing field labels with their values]
Now, you can easily see that the document breaks naturally into (field, value) pairs. However, if you copy and paste the text, you get this:

Report no.:
Occurrence Type:
Occurrence time:
Reported time:
Place of offence:
Clearance status:
Concluded:
Concluded date:
Summary:
Remarks:
20131 234567
Impaired Operation/over 80 mg% of Motor Vehicle 253(1)(a)/(b) CC
2013/08/08 20:10 -
2013/08/08 20:10
1072 102 STREET, NORTH BATTLEFORD, SK Canada (CROWN CAB) (Div: F,
Dist: CENTRAL, Det: Battleford Municipal, Zone: BFD, Atom: C)
Cleared by charge/charge recommended
Yes
2013/08/29
Cst. SMITH

As you can see, the formatting is not preserved, and the text becomes very hard to parse. I tried "save as text", I tried Tika, I tried PDFBox, and I also asked the Tika people. The result was the same: I got all the text, but not the formatting.
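
For reference, here is roughly what those attempts looked like - a minimal sketch, assuming Tika and PDFBox 2.x on the classpath; the file name is made up:

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.tika.Tika;

    public class OpenSourceAttempts {
        public static void main(String[] args) throws Exception {
            File report = new File("police-report.pdf"); // made-up name

            // Attempt 1: Tika - one call, plain text out
            String tikaText = new Tika().parseToString(report);

            // Attempt 2: PDFBox - strip the text page by page
            String pdfboxText;
            try (PDDocument doc = PDDocument.load(report)) {
                pdfboxText = new PDFTextStripper().getText(doc);
            }

            // Both return all the text, but the side-by-side
            // field/value layout of the PDF is gone.
            System.out.println(tikaText);
            System.out.println(pdfboxText);
        }
    }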


Well, in comes Aspose. Closed source, and with a price tag. But you know what? It is the only one that does the job and gives me the text output in the same layout as the PDF.

Here is the code I had to use:

    private void initAsposeLicense() {
        com.aspose.pdf.License license = new com.aspose.pdf.License();
        try {
            // First attempt: stream the license in from the classpath
            // (did not work for me - see the note below):
//            ClassLoader classLoader = getClass().getClassLoader();
//            File file = new File(classLoader.getResource("Aspose.Pdf.lic").getFile());
//            InputStream licenseStream = new FileInputStream(file);
//            license.setLicense(licenseStream);
            // What worked: read the license file from the working directory
            license.setLicense("Aspose.Pdf.lic");
        } catch (Exception e) {
            logger.error("Aspose license problem", e);
        }
    }

As you can see, I first tried to stream the license in from the classpath. It would have been better to distribute the license inside the jar, but that did not work for some reason. Then again, keeping the license outside may be better, since you can replace it without rebuilding. So I just read it from the folder the executable runs in.

Extracting the text was also extremely easy:

    private String extractWithAspose(File file) throws IOException {
        // Open document
        com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(file.getPath());

        // Create TextAbsorber object to extract text
        com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

        // Accept the absorber for all the pages
        pdfDocument.getPages().accept(textAbsorber);

        // Get the extracted text
        String extractedText = textAbsorber.getText();
//        System.out.println("extractedText=\n" + extractedText);
        return extractedText;
    }
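
To tie the two together, here is a hypothetical driver that could sit in the same class as the methods above. The class name PdfFieldExtractor and the directory argument are my inventions for illustration, not the actual Memex code:

    // Hypothetical main method in the same class as the two methods above.
    // PdfFieldExtractor is a made-up class name; args[0] is assumed to be
    // an existing directory holding the PDF corpus.
    public static void main(String[] args) throws IOException {
        PdfFieldExtractor extractor = new PdfFieldExtractor();
        extractor.initAsposeLicense();
        for (File pdf : new File(args[0]).listFiles((dir, name) -> name.endsWith(".pdf"))) {
            System.out.println(extractor.extractWithAspose(pdf));
        }
    }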

So now I can create a spreadsheet of fields/values for the whole document corpus:

Report no.:|Occurrence Type:|Occurrence time:|Reported time:|Place of offence:|Clearance status:|Concluded:|Concluded date:|Summary:|Remarks:|Associated occurrences:|Involved persons:|Involved addresses:|Involved comm addresses:|Involved vehicles:|Involved officers:|Involved property:|Modus operandi:|Reports:|Supplementary report:
"20131234567"|

Now I can happily proceed with my text analytics tasks.

Tuesday, November 24, 2015

Thanksgiving in the Big Data Land

There are a few ways this could work out in the world of Big Data:
  1. The traditional Hadoop elephant serves the traditional turkey.
  2. Two vegetarians, turkey and elephant, eat the traditional pumpkin pie.
  3. Nobody eats anybody at all, in the style of Alice in Wonderland and the pudding (if you remember, Alice was introduced to the pudding, and it is not etiquette to cut a piece off someone you have just been introduced to).
Please take your pick, then send the postcard to your friend.

Tuesday, October 27, 2015

Hadoop and Spark cartoon

Our artist outdid herself with this cartoon; it is totally hilarious.

We have been teaching a lot of Spark courses lately, though.