Friday, May 26, 2017

Sub-second SQL queries with LLAP from Hortonworks

The Houston Hadoop & Spark Meetup in April was graced by a presentation from Ravi Mutyala of Hortonworks. Here are the slides. Please refer to Ravi for further questions.

Wednesday, March 22, 2017

How to create an IntelliJ shortcut on Ubuntu

I always dread an IntelliJ upgrade, because I don't remember how to update my Ubuntu shortcut. So here it is, for me and other souls.

1. Unzip and run from the command line, just as you do for all versions. No problem here.
2. Choose menu Tools, then Create Desktop Entry.
3. This will create an entry under ~/.local. Copy it to your desktop:

cp ~/.local/share/applications/jetbrains-idea.desktop ~/Desktop/

Optional: drag it to the toolbar

Happy traveling! 逍遙遊

Sunday, March 12, 2017

What I saw in Bentonville, AR

Recently I was in Bentonville, taught Big Data, and visited the Crystal Bridges museum there. I share this amazing experience in my blog post here.

Thursday, February 9, 2017

FreeEed for eDiscovery response and for general research

FreeEed is a popular open source eDiscovery tool. It boasts over 1,000 users, has active projects in major consulting companies and is popular with researchers. However, it often needs to be used upside down. Here is what I mean.

In regular eDiscovery, you input directories, and FreeEed processes them, giving you these outputs:

  1. "Load file," or a CSV file with the metadata, one line per document or email.
  2. "Output file," a zip file containing native documents, extracted text, PDF images of all files, and exceptions, each in its folder.
  3. A case for review, loaded into the FreeEedUI review tool. The case is stored in SOLR as the back end, but the review itself is done in FreeEedUI.

However, there are two use cases that require the opposite: reviewing an eDiscovery response, and using FreeEed for research.

Reviewing the eDiscovery response

If you send an eDiscovery request, you may get back the load file and the documents. In essence, you are getting the data in the same format that FreeEed outputs it. What you would like then is to reverse the process, to make the load file the input, and to index the documents for search. This is now implemented in FreeEed.

When you select the input, you see a "Data Source" panel. If you choose eDiscovery, FreeEed will work as before, that is, accepting your custodians' files as input.

If you choose the "Load file" radio button as a data source, the program will do the following:
  • Read each line of the load file
  • For each line, use the given fields as metadata
  • Make the metadata and the extracted file text searchable and create a case in FreeEed for review

This is available in FreeEed V 7.3. This use case lends itself very nicely to parallelization, and can, therefore, be processed on a Hadoop cluster, to accommodate large volumes.
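
The first two steps above can be sketched in a few lines of Python. This is only an illustration, not FreeEed's actual code: the function name `read_load_file` and the column names in the sample are hypothetical, and a real load file may use a different delimiter.

```python
import csv
import io


def read_load_file(handle):
    """Yield one metadata dict per load-file line, keyed by the header names."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Keep only non-empty fields so sparse rows stay compact.
        yield {field: value for field, value in row.items() if value}


# Tiny in-memory load file; the column names are hypothetical.
SAMPLE = io.StringIO(
    "UPI,From,Subject,text_path\n"
    "00001,alice@example.com,Budget,text/00001.txt\n"
    "00002,bob@example.com,,text/00002.txt\n"
)

documents = list(read_load_file(SAMPLE))
for doc in documents:
    print(doc)
```

Each line is handled independently of the others, which is why this kind of job splits easily across many workers.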

Using FreeEed as a research tool

Often, researchers already have the metadata extracted. For example, in our Memex court document investigation, we already have elaborate parsing code that extracts metadata from the court documents. In this case, we want to be able to load the metadata and the file text into FreeEedUI for research. We should be able to answer questions like:
  • How many times was a given crime mentioned?
  • Repeat the question above for a particular judge and a specific time range (this question searches the metadata in a structured way, as well as the text).
Clearly, this is the same use case as above. The only difference is that we need a different set of metadata fields than the one used in FreeEed by default. Technically, this amounts to programmatically changing the schema in SOLR, and this will be done in the next update, V 7.4.
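
To make this concrete, here is a rough Python sketch of the two pieces involved: a Solr Schema API "add-field" command for a new metadata field, and the parameters of a count query that combines text search with structured filters. It is a sketch under assumptions, not what V 7.4 will do: the field names `text`, `judge`, and `decision_date` are hypothetical, and the command would have to be posted to the schema endpoint of a Solr collection that uses a managed schema.

```python
import json


def add_field_command(name, field_type="string"):
    """Build a Solr Schema API 'add-field' command for one metadata field."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "indexed": True,
        }
    }


def count_query(term, judge, start, end):
    """Build Solr select parameters that only count matches (rows=0)."""
    return {
        "q": f'text:"{term}"',  # full-text part of the question
        "fq": [
            f'judge:"{judge}"',  # structured filter on metadata
            f"decision_date:[{start} TO {end}]",  # date range filter
        ],
        "rows": 0,  # we only need the number of hits, not the documents
    }


# Hypothetical field names and values, for illustration only.
print(json.dumps(add_field_command("judge"), indent=2))
print(count_query("burglary", "Smith",
                  "2010-01-01T00:00:00Z", "2012-12-31T23:59:59Z"))
```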

Monday, January 23, 2017

Healthcare and Machine Learning

I have written about the current state of Machine Learning in Healthcare and about the practical steps that healthcare professionals can take today.

The major points are:
  • Quick Overview of Machine Learning
  • What can Machine Learning do for healthcare - overview of current use cases
  • What steps one can take today while waiting for big developments to come through

The blog post is on the Elephant Scale blog, so you can continue there.

Thursday, December 29, 2016

Using FreeEed in the Memex program for investigations

A common problem in investigations is that the authors of the research software, which is being produced in the course of the Memex program, are themselves not authorized to see the data that the investigation agencies deal with.

To address this problem, we added hash search to FreeEed. First, we added a metadata screen (which was not previously available), so users can now see the metadata.

This screenshot presents the view of the metadata table. Metadata, of course, is "data about data." It shows all the fields collected from the documents being searched, together with their "a.k.a." names, or synonyms. For example, in this screenshot, you can see that field 22 can be called "From," but it can also be called "Author" or "Message-From." You can also see that there is a new field, called "Hash."

Next, the file hash is added to the metadata fields settings. Users have requested this feature before, and now it is available. For emails, the hash is computed from the popular email fields. In FreeEed, this is configurable through the database.
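
As an illustration of a hash defined over email fields, here is a minimal Python sketch. The list of fields, the use of MD5, and the name `email_hash` are all assumptions made for the example; FreeEed takes its own field list from the database.

```python
import hashlib

# Hypothetical choice of fields; FreeEed reads its own list from the database.
HASH_FIELDS = ("From", "To", "Subject", "Date")


def email_hash(metadata, fields=HASH_FIELDS):
    """Return an MD5 hex digest over the selected email fields, in a fixed order."""
    digest = hashlib.md5()
    for field in fields:
        value = metadata.get(field, "")
        # A separator keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(value.encode("utf-8") + b"\x00")
    return digest.hexdigest()


message = {
    "From": "alice@example.com",
    "To": "bob@example.com",
    "Subject": "Budget",
    "Date": "2016-12-29",
}
print(email_hash(message))
```

Because only the listed fields feed the digest, two copies of the same email collected from different custodians get the same hash even if their other metadata differs.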

This hash is shown in the screenshot on the left, which represents the 'load file' output by FreeEed. There it is seen with other popular metadata fields, which were recently added by request, such as Message-ID.

The investigating agency can simply compute the hashes of the objects, such as texts, phones, images, or anything else that they are looking for, and search for these, without revealing what they are searching for, to the authors of the software or the processors. Entities other than investigating agencies may find this feature useful as well.
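
On the agency side, this step can be as small as the following Python sketch. It assumes the hash is MD5 or SHA-1 over the raw bytes of the object; the helper names `content_hashes` and `hash_query` are hypothetical, and the query string uses the "Hash" field in ordinary Solr field syntax.

```python
import hashlib


def content_hashes(data):
    """Return the MD5 and SHA-1 hex digests of a bytes object."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def hash_query(hash_value, field="Hash"):
    """Build a Solr-style field query for one hash value."""
    return f'{field}:"{hash_value}"'


# The agency hashes the object locally and shares only the query string.
hashes = content_hashes(b"text the agency is looking for")
print(hash_query(hashes["md5"]))
```

Only the digest leaves the agency, so whoever runs the search sees a string of hex characters and nothing about the object itself.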

Now, this shows in the processing results, but is it searchable? For that, Hash has been added to the schema in the FreeEedUI search engine (which is SOLR). Now Hash shows up as one of the fields for each document, as the screenshot shows.

The last question: can one search with just the hash value? The answer is yes, you can search on the hash alone. To verify this, pick one of the hashes that you saw in the documents and search for this value. You will find exactly that one document, as is to be expected, since hashes such as MD5 and SHA-1 are designed so that two different documents practically never share the same value. The last screenshot illustrates this.

Additionally, FreeEed can provide the results sorted by user-defined "document significance," using the user-provided functions. Such functions are supplied by the Memex groups.

Sunday, December 25, 2016

Word clouds in FreeEed

Word clouds have been added to FreeEed as an early release. To try them, download the jar from here, and replace the jar of the same name in your install. Then run (.bat) as usual.

Here is an example of a word cloud and a screenshot of the Analytics menu, which features word clouds.

The word cloud is from the project included with FreeEed, which is just a collection of unconnected documents, so the cloud is not very meaningful. With your own data, you should get something related to your use cases and more useful.
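
For readers curious about what drives a word cloud: it is essentially a picture of word frequencies, with the more frequent words drawn larger. Here is a minimal Python sketch of that counting step. It is not the FreeEed implementation, and the stop-word list and the name `word_frequencies` are made up for the example.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_frequencies(text, top=5):
    """Return the most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)


sample = "The court heard the case. The case was closed."
print(word_frequencies(sample, top=3))
```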

Your feedback will be very much appreciated.