Thursday, August 20, 2015

Mike Drob presents Cloudera Search at Houston Hadoop Meetup

Last Tuesday it was Cloudera's turn to take the podium. Mike Drob presented "Cloudera Search". The slides are right here. Every component is available via the open source channels, mostly in Solr.

Cloudera's value add is an open source, feature-rich search engine plus all the integrations with a data management ecosystem (also open source). The goal is to streamline multi-workload search, or search and other workloads on the same data, without moving the data between systems. Cloudera also provides production tooling, audit, and security.

In addition, open source buffs can study the implementation described in these slides to glean best practices for their own solutions.

Very good, clear discussion - thank you, Mike!

Thanks again, Microsoft, for hosting the meetup at MS Campus.

Wednesday, July 22, 2015

Review of “Monitoring Hadoop” by Gurmukh Singh

This book was published recently, in April 2015, and it covers Nagios, Ganglia, Hadoop monitoring, and monitoring best practices.
The first part is rightfully devoted to Nagios. Nagios is covered in depth: installation, verification, and configuration. The book strikes the right balance: it does not repeat everything in the Nagios manual, but it gives you enough information to install Nagios and prepare it to monitor specific Hadoop daemons, ports, and hardware.
The same goes for Ganglia: it is covered in enough detail for you to install and run it, with due attention to Hadoop specifics.
What I did not find in the book, and what could be useful... to read further

Review of “Hadoop in Action,” second edition

Four years have passed since the first publication, and as Russians say, “A lot of water has passed (under the bridge) since then,” so let’s look at what’s new in this edition.

Tuesday, July 7, 2015

The power of text analytics at DARPA/Memex

One of the things we are doing in the DARPA Memex program is text analytics. One outcome of this work is an open source project called MemexGATE.

GATE itself stands for General Architecture for Text Engineering, and it is a mature and widely used tool. It is up to you to create something useful with GATE, and MemexGATE is our first step. It is an application configured to understand court documents. It detects people mentioned in the documents, dates, places, and many more characteristics that take you beyond plain keyword searches.

To achieve this, GATE combines processing pipelines (such as a sentence splitter, a language-specific word tokenizer, a part-of-speech tagger, etc.) with gazetteers. Now, what is a gazetteer? It is a list of people, places, etc. that can occur in your documents. MemexGATE includes scripts that collect, for example, the names of all US judges, so that they can be detected when they appear in a document.
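
To make the gazetteer idea concrete, here is a minimal sketch in Python. It is not GATE's API (GATE is a Java framework) and not MemexGATE code; the function names, the sample entries, and the sample sentence are all made up for illustration. It only shows the principle: a list of known names, each with a type, is matched against the text, and every hit is recorded with its position.

```python
import re

def build_gazetteer(entries):
    """Map each lowercased entry to its (original text, type) pair."""
    return {name.lower(): (name, kind) for name, kind in entries}

def annotate(text, gazetteer):
    """Return (start, end, matched_text, type) for every gazetteer hit."""
    hits = []
    for key, (_, kind) in gazetteer.items():
        # Word boundaries keep an entry from matching inside a longer word.
        pattern = r"\b" + re.escape(key) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), kind))
    return sorted(hits)

# Hypothetical entries, invented for this example only.
GAZETTEER = build_gazetteer([
    ("Jane Roe", "Judge"),
    ("Albany", "Location"),
])

SAMPLE = "The appeal was heard in Albany before Judge Jane Roe."
HITS = annotate(SAMPLE, GAZETTEER)

for start, end, found, kind in HITS:
    print(f"{kind}: {found!r} at {start}-{end}")
```

A real gazetteer holds many thousands of entries and is combined with the tokenizer and tagger stages, so a per-entry regular expression scan like this one would be replaced by a more efficient lookup structure.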

But MemexGATE does more: it is scalable. Building on the Behemoth framework, it can parallelize processing across a Hadoop cluster, so the size of the corpus is limited only by the size of the cluster. MemexGATE was designed and implemented by the Jet Propulsion Laboratory team, and the project committer is Lewis McGibbney.

The picture shown above gives an example of a processed document (from the NY Court of Appeals), with specific finds color-coded. We process more than 100,000 documents this way. Why is this useful for us at Memex? Because we are trying to find and parse court documents related to labor trafficking, so that we can analyze them and better understand the indicators of labor trafficking.

It is very exciting to work on the Memex program. Our team is called "Hyperion Gray" and has been featured in Forbes lately.

What's next? One of the plans is to add document understanding to FreeEed, the open source eDiscovery tool. Instead of just running keyword searches through the documents, lawyers will be able, with the addition of text analytics, to make more sense of them: detect people, dates, organizations, etc. This will in turn help build a picture of the case in an automated way.

Disclaimer: we are not official speakers for Memex.

Friday, July 3, 2015

Big Data Cartoons - Summer of Big Data

Since nothing much happens in Big Data in the summer (JK:), our artist took to making pictures of the breakfasts that an artist needs. Here are some examples.

Once this page has been visited by more than a million people, it will itself qualify as a "Big Data" page.

Wednesday, June 10, 2015

Joe Witt of Onyara presented Apache NiFi

Joe Witt and the Onyara team came to present Apache NiFi at the Houston Hadoop Meetup. The NiFi project is the result of eight years of development at the NSA, and it was open sourced in November 2014.

The project automates enterprise dataflows, and its salient use cases are:
  • Remote sensor delivery
  • Inter-site/global distribution
  • Intra-site distribution
  • "Big Data" ingest
  • Data Processing (enrichment, filtering, sanitization)
For the rest, in the words of Shakespeare:

"Let Lion, Moonshine, Wall, and lovers twain

At large discourse, while here they do remain."

Meaning, in our case, here are the slides, courteously provided by Joe.

Oh, and there WAS a live demo, so those who missed it - missed it.

As always, pizza was provided by Elephant Scale LLC, Big Data training and consulting.

Monday, June 8, 2015

Big Data Cartoon - Summer Fun

Summer is the time to have fun and to get some rest! While moms and dads are presumably coding away at some new Big Data app, their kids can go to summer camp. So did our Big Data cartoonist, who is now working as a summer camp artistic director. (These "cartoons" are really the large decorations there.)

But you can see the same themes, albeit hidden: the tiger is no doubt the new elephant, and the magicians are the software engineers.