Thursday, December 29, 2016

Using FreeEed in the Memex program for investigations

A common problem in investigations is that the authors of the research software being produced in the course of the Memex program are themselves not authorized to see the data that the investigating agencies deal with.

To address this problem, we added hash search to FreeEed. First, we added a metadata screen display (which was not previously available), so users can now inspect the metadata.

This screenshot presents the view of the metadata table. Metadata, of course, is "data about data." The table shows all the fields collected from the documents being searched, together with their "a.k.a." synonyms. For example, in this screenshot, you can see that field 22 can be called "From," but it can also be called "Author" or "Message-From." You can also see that there is a new field, called "Hash."

Next, the file hash is added to the metadata field settings. Users had requested this feature before, and now it is available. For emails, the hash is defined using the popular email fields. In FreeEed, this is configurable through the database.
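To make the idea concrete, here is a minimal sketch of hashing an email over a chosen set of header fields. The particular fields, the normalization, and the separator below are assumptions for illustration; FreeEed's actual choices are configurable through its database.

```python
import hashlib

def email_hash(msg, fields=("From", "To", "Subject", "Date")):
    """Compute a stable hash over selected email header fields.

    The field list, normalization, and separator are illustrative,
    not FreeEed's actual configuration.
    """
    canonical = "\n".join(msg.get(f, "").strip().lower() for f in fields)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

msg = {"From": "alice@example.com", "To": "bob@example.com",
       "Subject": "Q3 report", "Date": "Tue, 8 Nov 2016 10:00:00 -0600"}
print(email_hash(msg))  # 32-character hex digest
```

Hashing a fixed set of fields, rather than the raw message bytes, lets the same logical email produce the same hash even when transport headers differ.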

This hash is shown in the screenshot on the left, which represents the 'load file' output by FreeEed. It appears there alongside other popular metadata fields recently added by request, such as Message-ID.

The investigating agency can simply compute the hashes of the objects it is looking for, such as texts, phone numbers, or images, and search for those hashes without revealing to the authors of the software or to the processors what it is searching for. Entities other than investigating agencies may find this feature useful as well.
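A toy illustration of this "blind search" workflow: the agency shares only a digest, and the processors match document hashes against it without ever seeing the item being sought. The sample documents and the phone number are made up.

```python
import hashlib

def digest(data: bytes, algorithm: str = "sha1") -> str:
    """Hash an object (text, image bytes, etc.) for search by value."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()

# The agency computes the digest of the item it seeks (a phone
# number here) and hands over only the digest, not the item itself.
target = digest(b"555-0100")

# The processors compare document hashes against that digest
# without learning what is being looked for.
documents = {"doc-1.txt": b"call 555-0199", "doc-2.txt": b"555-0100"}
hits = [name for name, body in documents.items()
        if digest(body) == target]
```

The same comparison works for whole files, extracted text, or any other object, as long as both sides hash the same bytes with the same algorithm.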

Now, this shows up in the processing results, but is it searchable? For that, Hash has been added to the schema of the FreeEedUI search engine (which is SOLR). Now Hash shows up as one of the fields for each document, as the screenshot shows.
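For readers unfamiliar with SOLR, adding such a field to a SOLR schema looks roughly like the fragment below; the exact type and attribute choices in FreeEedUI's schema are assumptions for illustration.

```xml
<!-- schema.xml: hypothetical sketch of the added field -->
<field name="Hash" type="string" indexed="true" stored="true"/>
```

Declaring the field as an indexed string type is what makes exact-match queries on the hash value possible.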

The last question: can one search with just the hash value? The answer is yes, you can search on the hash alone. To verify this, pick one of the hashes that you saw in the documents and search for that value. You will find exactly one document - as is to be expected, since hashes such as MD5 and SHA-1 are designed to be practically unique per document. The last screenshot illustrates this.

Additionally, FreeEed can provide the results sorted by a user-defined "document significance," using user-provided functions. Such functions are supplied by the Memex groups.

Sunday, December 25, 2016

Word clouds in FreeEed

Word clouds have been added to FreeEed as an early release. To try them, download the jar from here and replace the jar of the same name in your install. Then run (.bat) as usual.

Here is an example of a word cloud and a screenshot of the Analytics menu, which features word clouds.

The word cloud is from a project included with FreeEed, which is just a collection of unconnected documents, so the cloud is not very meaningful. With your own data, you should get something more useful and related to your use cases.

Your feedback will be very much appreciated.

Monday, December 12, 2016

Hadoop going to China

Actually, Hadoop is already in China. Here is the largest Hadoop distribution company in China, called Transwarp: three hundred customers and counting, one hundred engineers and growing, and five training centers across China.

Nevertheless, there is still a "way to go" in this direction, as our cartoon aptly shows.

Thursday, December 1, 2016

Kent Graziano presents Snowflake at Houston Hadoop & Spark Meetup

Another great presentation at the Meetup, by Kent Graziano. Read all about the presenter, the subject, and the feedback here:

And here are the slides:

See you next time!

Tuesday, November 8, 2016

FreeEed eDiscovery, AI, Machine Learning, and Social Media

In the V7.0.0 release of FreeEed, we are highlighting text analytics and social media. 

You might also find interesting the articles that Mark Kerzner, the author of FreeEed, wrote recently on Bloomberg Law.

The source code

Our open source code collection is growing, and we have combined it all in one place: the SHMsoft company page on GitHub.

With gratitude and acknowledgment: this work is funded in part by the DARPA/Memex program; here is a Forbes article about our team.

Next: FreeEed as a service in the Amazon AWS cloud.


FreeEed - eDiscovery easy as popcorn.   

Friday, November 4, 2016

Using FreeEed for social media discovery

One of the areas in which the Memex/DARPA teams excel is crawling. FreeEed and the people behind it are part of Memex, so it was quite natural to integrate the discovery of crawl results into FreeEed processing and review.

Here is a recent Forbes article about the team.

Searching websites and social media has been added to FreeEed starting with version 7. The common format for storing crawl results is JSON. Each JSON description corresponds to a website page, a user post, or a similar item.

Each JSON search entry is represented by one line in the archive file. The archive is given the extension *.jl, which stands for "JSON lines."

FreeEed understands the *.jl extension, parses the JSON content of every line in the *.jl file, indexes such fields as text, authors, etc., and makes them searchable in the FreeEed review tool.
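Reading such a file is straightforward: one JSON record per line. Here is a minimal sketch; the field names in the commented usage are assumptions, since crawl schemas vary.

```python
import json

def read_jl(path):
    """Yield one JSON record per line of a *.jl ("JSON lines") file.

    Each record describes one crawled item: a web page, a user
    post, or a similar item.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)

# Hypothetical usage - "text" and "author" are example field names:
# for record in read_jl("crawl-results.jl"):
#     index(record.get("text"), record.get("author"))
```

The one-record-per-line layout is what makes *.jl files convenient for large crawls: they can be split, streamed, and processed line by line without loading the whole archive.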

Below is a screenshot of FreeEedUI review, illustrating searches in a collection from an escort services website.

How do you create your own crawler? You can use the crawler from Scraping Hub, also a member of the Memex team. Or you can use a trusted friend, Apache Nutch. Nutch has been around for more than ten years, and it is where Hadoop began.

By the way, we provide training in all these technologies.

Adding text analytics to FreeEed

Many documents in eDiscovery can be understood on a much deeper level than keyword search allows. Since groups of documents often have a similar structure, one can configure the software to extract additional fields from such documents.

Case study

We have collected all appeal documents from the NY Court of Appeals. For that, we crawled the court website and collected approximately 100,000 documents.

We have then configured the GATE (General Architecture for Text Engineering) tool to extract the information of interest from every document.

Here is a screenshot of the GATE screen configured to extract information. It takes a few minutes to extract this information from 100,000 appeal cases, and the output is a CSV file which can be opened as a spreadsheet.

To verify the quality of the information extraction, we watch the statistics. Below is an example of the statistics from one of the latest runs. It shows the percentage of cases where the information is present in the case document and successfully extracted by the software.

Files in dir: 111018
Docs processed : 100.0%
Case number: 100.0%
Metadata extracted: 100.0%
Civil: 71.0%
Criminal: 29.0%
Court: 94.7%
Gap days: 92.7%
First date: 92.8%
Appeal date: 100.0%
Judge: 85.8%
Other judges present: 98.4%
District attorney: 61.3%
Assistant district attorney: 100.0%
Crimes: 37.7%
County: 91.7%
Mode of conviction: 53.9%
Keywords: 93.3%
Interest of justice: 4.9%
References to cases: 19.9%
Number of output files: 12
Runtime: 2086 seconds

Our verification assured us that the rate of successful extraction (when the information is actually present) is high.

Below is an example screenshot of the information obtained. The output for all documents (25 MB) can be downloaded from here.

Adding this information to eDiscovery

There are two ways to add this information to FreeEed.

  1. The metadata fields can be added to the documents, and FreeEed configured to add them to the review; or
  2. The GATE workflow can be compiled and run directly within FreeEed.


The configuration of the GATE tool is an acquired skill, but even the out-of-the-box extractors provide useful information. This work was done as part of the DARPA/Memex project, and the researchers found the extracted information extremely useful.

By the way, we provide training in all these technologies.

Monday, October 31, 2016

Elephant Scale - a training powerhouse

One of SHMsoft's ventures is Elephant Scale, a Big Data / Hadoop / NoSQL training powerhouse. In the past three years, Elephant Scale has delivered hundreds of training sessions to thousands of students.

What's so special? Elephant Scale builds on its experience of building software, and all of its trainers are high-level architects, who can also write code. See for yourself: here is the training schedule.

Saturday, October 29, 2016

O&G Hadoop Use Cases

Kenneth Smith and Wade Salazar presented Hortonworks Hadoop O&G use cases - a very useful and complete discussion of some of the Big Data applications. This talk is especially meaningful in Houston, one of the major energy centers of the world.

Learn more about Ken and Wade here, and read the slides here.

See you next time, guys, we are just twenty-four people away from a thousand.

As usual, pizza and drinks are sponsored by Elephant Scale (where I am a principal). We teach Big Data the way it was meant to be!


(Image from LinkedIn)

Wednesday, August 31, 2016

Eventual consistency explained with Starbucks coffee

How do you explain eventual consistency to a novice?  You tell them, "Have you been to Starbucks? Yes? - Well, it's like this, only for databases."

That is a favorite example. I thought that an illustration would help, so here it is.

The orders do not go through all the phases in strict sequence, but eventually you get your order. There may be false starts, wrong orders, etc., and this is how NoSQL databases work as well.

One more architectural principle that Starbucks illustrates is decoupling. The workers at Starbucks communicate with each other through messages, encoded on a cup. Moreover, this message is hardware- (cup-) based, so it does not get lost. Decoupling is important for scaling: you can have two baristas, for example.
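The cup-as-message idea maps directly onto a work queue: the cashier posts orders, and any free barista picks them up. A minimal sketch with Python's standard library (the drink names are, of course, made up):

```python
import queue
import threading

# The "cup" is a message on a queue: the cashier writes the order,
# any free barista picks it up. Adding baristas scales the system
# without changing the cashier.
orders = queue.Queue()
done = []

def barista():
    while True:
        order = orders.get()
        if order is None:        # shutdown signal
            break
        done.append(f"made {order}")
        orders.task_done()

workers = [threading.Thread(target=barista) for _ in range(2)]
for w in workers:
    w.start()

for drink in ["latte", "mocha", "espresso"]:
    orders.put(drink)            # cashier never waits for a barista

orders.join()                    # every order is eventually completed
for _ in workers:
    orders.put(None)             # tell the baristas to go home
for w in workers:
    w.join()
```

Note that the completion order in `done` may differ from the submission order - exactly the "eventual" behavior described above.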

Saturday, August 27, 2016

In Search of Database Nirvana - Houston Hadoop&Spark Meetup

Database expert Rohit Jain presented "In search of database Nirvana". 
Below is the description, and here are the slides. Note that the slides contain animation. To enjoy the slides to the fullest, download and view them outside SlideShare.

See y'all at the next meetup.

In Search of Database Nirvana – one SQL engine for transactional to analytical workloads
Companies are looking for a single database engine that can address all their varied needs—from transactional to analytical workloads, against structured, semi-structured, and unstructured data, leveraging graph, document, text search, column, key value, wide column, and relational data stores; on a single platform without the latency of data transformation and replication.  They are looking for the ultimate database nirvana.
The term hybrid transactional/analytical processing (HTAP), coined by Gartner, perhaps comes closest to describing this concept. 451 Research uses the terms convergence or converged data platform. The terms multi-model or unified are also used. But can such a nirvana be achieved?  Some database vendors claim to have already achieved this nirvana.  In this talk we will discuss the following challenges on the path to this nirvana, for you to assess how accurate these claims are:
·         What is needed for a single query engine to support all workloads?
·         What does it take for that single query engine to support multiple storage engines, each serving a different need?
·         Can a single query engine support all data models?
·         Can it provide enterprise-caliber capabilities?
Attendees looking to assess query and storage engines would benefit from understanding what the key considerations are when picking an engine to run their targeted workloads. Also, developers working on such engines can better understand capabilities they need to provide in order to run workloads that span the HTAP spectrum.
Rohit Jain is the CTO at Esgyn working on Apache Trafodion™, currently in incubation. Trafodion is a transactional to analytics SQL-on-Hadoop RDBMS. Rohit worked for Tandem, Compaq, and Hewlett-Packard for the last 28 of his 40 years in application and database development. He has worked as an application developer, solutions architect, consultant, software engineer, database architect, development and QA manager, Product Manager, and CTO. His experience spans Online Transaction Processing, Operational Data Stores, Data Marts, Enterprise Data Warehouses, Business Intelligence, and Advanced Analytics, on distributed massively parallel systems.

Tuesday, July 12, 2016

Houston Hadoop Meetup - Hacking and SQLing, July 2016

Mr. Y, the hacker, presented a report on Toorcamp 2016, an unbelievable do-it, hack-it, laugh-at-it compendium. Here are the slides:

Jim Scott, the fiery orator, talked about SQL and NoSQL.

Here is a link to the blog for this talk:
There is a link to a video embedded in there as well.

Here is the original presentation:

Thanks to you, see you next time.

Wednesday, May 18, 2016

What I like about NetBeans

Developers live and die by their IDEs, and so they have religious wars about them. But thrice-blessed are those that use multiple IDEs.

Here is what I like about IntelliJ:
  • It knows the variables you are going to type and often guesses them right; it also has type-ahead support for variables and methods;
  • It runs Scala out of the box.
But here is what I like about NetBeans:
  • The UI editor (formerly Matisse, now just the editor). It is unsurpassed. For example, the reason I don't write desktop apps in Scala is the absence of such an editor.
  • It has a team with a few special guys. The name of one of them starts with Geer or Cheer, I am not sure, but he writes an excellent blog about NetBeans. 😁 (Knowledgeable people say it is a hint to this).
  • In debugging, it shows the values of variables and even of functions or code fragments.

Note: thrice-blessed is a hint to this quote from Shakespeare

Tuesday, April 26, 2016

Big Data architecture for O&G

Houston Hadoop Meetup has grown to over 800 members by now. It is lavishly hosted by the Slalom consultants in the Galleria area, and beer, wine and food are provided by Slalom.

The presenter, Dmitry Kniazev, gave an overview of a Proof-of-Concept solution created for a major Oil & Gas company. He gave a brief overview of the WITSML standard, which exists in the industry to share sensor data among different operators, and described how they tapped into it to build a near real-time alerting application that streams data into a Kafka queue and processes it with Spark Streaming.

Dmitry Kniazev is a Solutions Architect, Data Analytics at EPAM Systems (NYSE: EPAM). EPAM is a solutions integrator that outsources solutions implementation to various locations, primarily in Eastern Europe. Dmitry has been working with one of the major Oil & Gas companies here in Houston for almost 4 years and has participated in various Data Analytics projects.

The slides are found here. Again, thank you for hosting, presenting, and coming to the meeting.

Sunday, March 6, 2016

Hadoop as a service at Houston Hadoop Meetup

Hadoop as a service was presented by Ajay Jha, of Altiscale. Here are the slides.

As has become customary, our host, Slalom, provided parking ticket validation, pizza, beer and wine.

This location is in the fashionable Galleria area, where afterwards the geeks can continue downstairs at the Caracol restaurant - Mexican coastal cuisine.

Thursday, February 25, 2016

Big Data Cartoons - Paris, Jerusalem, Istanbul, Singapore, where next?

In teaching Big Data, we often travel. Lately, in our view, Big Data is picking up the world over, not only in the US. Israeli Spark meetups are just as advanced as the ones in California. So we asked our artist to show all the places where we have been. That was too hard, though, so we just used travel pointers. But the elephant is real.

(In fact, this post is written on a Turkish Airlines plane - thanks to very good WiFi).

Big Data Cartoon - Elephant Scale enters Google doodle competition

The Doodle 4 Google competition will be announced in March, but we are already thinking: what's next? So we asked our artist for an official entry. Our artist is so great that a win is virtually guaranteed - unless the entry is disqualified because of our subtle brand promotion. So here is our spoiler.

Monday, February 15, 2016

Stoppable Hadoop cluster

The title of this post was inspired by the following lines

They dined on mince, and slices of quince,
   Which they ate with a runcible spoon

from the poem by Edward Lear, "The Owl and the Pussy-Cat", which the reader is invited to ponder at leisure.

Meanwhile, as a Big Data trainer, I often need to create what I would call a "stoppable cluster" on AWS, one that I can "pause," or put to sleep for a while. The most obvious use is saving money while the students are away, so that instead of $100/day I would pay $33/day. That would be reason enough. However, at times, as a developer, I want to stop the cluster that I am running.

If you look in the literature, it cites two obstacles:
  1. Ephemeral nodes disappear on stop/start on AWS; and
  2. IP assignments change.
You can fix both by (1) using EBS-backed root drives and EBS data volumes, and (2) assigning Elastic IPs. Amazon will not let you use more than 5 Elastic IPs by default, but you can call them and ask nicely, and they will give you 10.
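The stop/start cycle with a stable address can be sketched with the AWS CLI roughly as follows. This assumes an EBS-backed instance (instance-store roots cannot be stopped) and a configured CLI; the instance and allocation IDs are placeholders.

```shell
# Give the node a stable public IP so its address survives stop/start.
aws ec2 allocate-address --domain vpc        # returns an AllocationId
aws ec2 associate-address \
    --instance-id i-0123456789abcdef0 \
    --allocation-id eipalloc-0123456789abcdef0

# Pause the cluster node while the students are away...
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# ...and resume it later; the Elastic IP is still attached.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```

Repeat the association for each node in the cluster, within the Elastic IP limit discussed above.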

Next, Hortonworks Ambari checks the cluster IP assignments and refuses to use the external IPs, even though Amazon promises you the right resolution:

So I use Cloudera Manager: stop the cluster, stop the instances, and restart them.

Now, I try to start the management services again and...

alas, CM has resolved the IPs to the old internal ones and used those in the configuration!

Next installment - constructing proper clusters in the VPC and controlling the internal IP assignments.

Tuesday, February 2, 2016

Big Data Cartoon - Portraits of Famous People - Cos

Today we inaugurate the series of blog posts called "Big Data Famous People."

This here is Konstantin Boudnik. A graduate of the Math Department of St. Petersburg University, with a PhD in distributed systems, Cos holds strong opinions in many areas of programming languages, philosophy, and economics. You can ask him on his site here, or on the restricted version here.

Cos is the 16th most prolific committer to Hadoop. He helped incubate Ignite, Groovy, and a bunch of other projects.

Tuesday, January 26, 2016

Illustrated story of Houston Hadoop Meetup

Five years ago we came up with this logo for our book, Hadoop Illuminated. The Meetup that I started in 2011 was then meeting in the Houston Public Library and numbered in the tens, with 3-5 people showing up. When we wanted to run some Amazon clusters, we found out that the library blocked port 22. We started looking for other quarters.

Time flew, the Meetup grew, and we came to our first Hadoop bootcamp. We spent more money on going to the Pappadeaux restaurant than we made on the bootcamp, but fun was had by all.

Today, the meetup is close to 700 members, and the current event has about 70 registrants. We have two goals.

1. In 2011, the Bay Area Hadoop meetup had 3,000 members, and 300 would come to the Yahoo headquarters to hear about Hadoop HA. I was green with envy. Now we have a chance to grow bigger than they did - let's try it.
2. There is still no significant Big Data presence in Houston. But there are real signs that this is going to change this year. You finally hear about Big Data projects in retail, energy, power, etc.

Guys, let's continue making Houston a Big Data capital. Cheers!