Thursday, December 1, 2016

Kent Graziano presents Snowflake at Houston Hadoop & Spark Meetup

Another great presentation at the Meetup, by Kent Graziano. Read all about the presenter, the subject, and the feedback here: https://www.meetup.com/Houston-Hadoop-Meetup-Group/events/235608911/

And here are the slides: http://www.slideshare.net/elephantscale/changing-the-game-with-cloud-dw

See you next time!

Tuesday, November 8, 2016

FreeEed eDiscovery, AI, Machine Learning, and Social Media

In the V7.0.0 release of FreeEed, we are highlighting text analytics and social media. 


You might also find interesting the articles that Mark Kerzner, the author of FreeEed, wrote recently on Bloomberg Law.

The source code

Our open source code collection is growing, and we have combined it all in one place: the SHMsoft company page on GitHub.

With gratitude and acknowledgment: this work is funded in part by the DARPA Memex program; here is a Forbes article about our team.

Next: FreeEed as a service in the Amazon AWS cloud.

Cheers, 

FreeEed - eDiscovery easy as popcorn.   

Friday, November 4, 2016

Using FreeEed for social media discovery

One of the areas in which the Memex/DARPA teams excel is crawling. FreeEed and the people behind it are part of the Memex program, so it was quite natural to integrate discovery of crawl results into FreeEed processing and review.

Here is a recent Forbes article about the team.

Searching websites and social media was added to FreeEed starting with version 7. The common format for storing crawl results is JSON. Each JSON record corresponds to a website page, user post, or a similar item.

Each search entry is represented by a single line of JSON in the archive file. The archive is given the extension *.jl, which stands for "JSON lines".

FreeEed understands the *.jl extension, parses the JSON content of every line in the *.jl file, indexes such fields as text and authors, and makes them searchable in the FreeEed Review tool.
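Conceptually, the parsing step can be sketched in a few lines of Python (an illustration only, not FreeEed's actual implementation; the text and authors field names are assumptions based on the fields mentioned above):

```python
import json

def parse_jl(path):
    """Parse a JSON lines (*.jl) file: one JSON object per line."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entries.append(json.loads(line))
    return entries

def index_fields(entries, fields=("text", "authors")):
    """Collect the searchable fields from each crawl entry,
    defaulting to an empty string when a field is absent."""
    return [{field: entry.get(field, "") for field in fields}
            for entry in entries]
```

Each line is an independent JSON object, so a corrupt page record spoils only that one entry, not the whole archive.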

Below is a screenshot of FreeEedUI review, illustrating searches in a collection from an escort services website.

How do you create your own crawler? You can use the crawler from Scrapinghub, also a member of the Memex team. Or you can use a trusted friend, Apache Nutch. Nutch has been around for more than ten years, and Hadoop itself began as part of Nutch.

By the way, we provide training in all these technologies.

Adding text analytics to FreeEed

Many documents in eDiscovery can be understood at a much deeper level than keyword search allows. Since groups of documents often share a similar structure, the software can be configured to extract additional fields from them.
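As a minimal illustration of the idea (the real pipeline below uses GATE annotation grammars, not regular expressions), fields that follow a predictable structure can be pulled out with simple patterns. The field names and patterns here are hypothetical:

```python
import re

# Hypothetical patterns for fields that appear in a similar form
# across documents; a production system would use GATE grammars.
FIELD_PATTERNS = {
    "case_number": re.compile(r"No\.\s*([\d-]+)"),
    "county": re.compile(r"County of (\w+)"),
    "judge": re.compile(r"Judge\s+([A-Z][a-z]+ [A-Z][a-z]+)"),
}

def extract_fields(text):
    """Extract whichever configured fields appear in the document."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1)
    return result
```

A field that does not match is simply absent from the result, which is what makes the per-field success statistics below meaningful.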

Case study

We collected all appeal documents from the NY Court of Appeals. For that, we crawled the court website and gathered approximately 100,000 documents.

We then configured the GATE (General Architecture for Text Engineering) tool to extract the information of interest from every document.

Here is a screenshot of the GATE screen configured to extract information. It takes only a few minutes to extract this information from 100,000 appeal cases, and the output is a CSV file that can be opened as a spreadsheet.

To verify the quality of the information extraction, we review the statistics. Below is an example from one of the latest runs. It shows, for each field, the percentage of cases in which the information was present in the case document and successfully extracted by the software.

Files in dir: 111018
Docs processed : 100.0%
Case number: 100.0%
Metadata extracted: 100.0%
Civil: 71.0%
Criminal: 29.0%
Court: 94.7%
Gap days: 92.7%
First date: 92.8%
Appeal date: 100.0%
Judge: 85.8%
Other judges present: 98.4%
District attorney: 61.3%
Assistant district attorney: 100.0%
Crimes: 37.7%
County: 91.7%
Mode of conviction: 53.9%
Keywords: 93.3%
Interest of justice: 4.9%
References to cases: 19.9%
Number of output files: 12
Runtime: 2086 seconds

Our verification confirmed that the rate of successful extraction (when the information is actually present in the document) is high.
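Per-field rates like the ones above can be computed directly from the extraction output. A hedged sketch, assuming the output rows are dictionaries keyed by field name (the field names here are illustrative, not the actual CSV columns):

```python
def extraction_rates(rows, fields):
    """Percentage of rows in which each field was successfully
    extracted, i.e. is present and non-empty."""
    total = len(rows)
    rates = {}
    for field in fields:
        filled = sum(1 for row in rows if row.get(field))
        rates[field] = round(100.0 * filled / total, 1) if total else 0.0
    return rates
```

A low rate for a field either means the extractor needs tuning or, as with "Interest of justice" above, that the information is genuinely rare in the documents.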

Below is an example screenshot of the information obtained. The output for all documents (25 MB) can be downloaded from here.

Adding this information to eDiscovery

There are two ways to add this information to FreeEed.

  1. The metadata fields can be added to the documents, and FreeEed configured to add them to the review; or
  2. The GATE workflow can be compiled and run directly within FreeEed.

Conclusions

Configuring the GATE tool is an acquired skill, but even the out-of-the-box extractors provide useful information. This work was done as part of the DARPA Memex project, and the researchers found the extracted information extremely useful.


Monday, October 31, 2016

Elephant Scale - a training powerhouse

One of SHMsoft's ventures is Elephant Scale, a Big Data / Hadoop / NoSQL training powerhouse. In the past three years, Elephant Scale has delivered hundreds of training sessions to thousands of students.

What's so special? Elephant Scale builds on its experience building software, and all of its trainers are high-level architects who can also write code. See for yourself: here is the training schedule.

Saturday, October 29, 2016

O&G Hadoop Use Cases

Kenneth Smith and Wade Salazar presented Hortonworks Hadoop O&G use cases. It was a very useful and complete discussion of some of the Big Data applications. This talk is especially meaningful in Houston, one of the major energy centers of the world.

Learn more about Ken and Wade here, and read the slides here.

See you next time, guys; we are just twenty-four people away from a thousand.

As usual, pizza and drinks are sponsored by Elephant Scale (where I am a principal). We teach Big Data the way it was meant to be!

See you next time!

(Image from LinkedIn)

Wednesday, August 31, 2016

Eventual consistency explained with Starbucks coffee

How do you explain eventual consistency to a novice? You tell them, "Have you been to Starbucks? Yes? Well, it's like this, only for databases."

That is a favorite example of mine. I thought an illustration would help, so here it is.

The orders do not go through all phases in sequence, but eventually you get your drink. There may be false starts, orders served out of sequence, and so on; this is how NoSQL databases behave as well.
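The idea can be shown with a toy model (a deliberately simplified sketch, not any real database's replication protocol): a write lands on one replica immediately and is queued for the others, so a read may return stale data until the in-flight updates are delivered.

```python
import random

class EventuallyConsistentStore:
    """Toy model of eventual consistency: writes reach one replica
    immediately and the others only after a delay."""

    def __init__(self, replicas=3):
        self.replicas = [{} for _ in range(replicas)]
        self.pending = []  # (replica_index, key, value) updates in flight

    def write(self, key, value):
        self.replicas[0][key] = value          # one replica sees it now
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key):
        # A read may hit any replica, so it can return stale data.
        return random.choice(self.replicas).get(key)

    def settle(self):
        """Deliver all in-flight updates -- the "eventually" part."""
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending = []
```

Real systems additionally need versioning (timestamps or vector clocks) to handle updates arriving out of order; this sketch delivers them in order to keep the point simple.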

One more architectural principle that Starbucks illustrates is decoupling. The workers at Starbucks communicate with each other through messages encoded on a cup. Moreover, this message is hardware (cup) based, so it does not get lost. Decoupling is important for scaling: you can add a second barista, for example.
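In code, the cup is a message on a queue: the cashier (producer) never needs to know which barista (consumer) will make a given drink, and scaling is just adding another consumer. A minimal sketch with Python's standard library:

```python
import queue
import threading

orders = queue.Queue()   # the "cups" with orders written on them
completed = []
lock = threading.Lock()

def barista(name):
    """A consumer: takes the next cup off the queue and makes the drink."""
    while True:
        cup = orders.get()
        if cup is None:              # shutdown signal
            orders.task_done()
            return
        with lock:
            completed.append((name, cup))
        orders.task_done()

# Two decoupled baristas serve the same queue.
baristas = [threading.Thread(target=barista, args=(f"barista-{i}",))
            for i in (1, 2)]
for b in baristas:
    b.start()

for drink in ["latte", "mocha", "espresso", "flat white"]:
    orders.put(drink)                # the cashier just writes on a cup

orders.join()                        # wait until every order is made
for _ in baristas:
    orders.put(None)
for b in baristas:
    b.join()
```

Neither side ever calls the other directly; the queue absorbs bursts of orders and lets producers and consumers scale independently.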