Monday, March 19, 2012

If you need to do searches in the FreeEed results...

Then now, with Hive, you can.

Imagine you have processed a good amount of eDiscovery data with FreeEed™. Until now, FreeEed did not give you the capability to search the load file. You had to load it into Concordance or another tool, open it in Excel (up to a certain point), or bring it into a database. With large data sets, though, all of these approaches eventually become very slow or hit a limit on the size of the data, as Excel does. Now FreeEed™ gives you the capability to search the results. Here is how.

Select the menu item "Load with Hive." Hive is an open source tool, part of the Hadoop family, which lets you query the results in a language similar to SQL. Strictly speaking, the language is HiveQL, which extends a SQL-like syntax with features of its own, such as plugging custom map and reduce scripts into a query.

Spoiler: in the background, FreeEed™ writes the Hive scripts and loads your load file into Hive, like this (you can see this going on in the History window):


12-03-19 23:41:14   Running command: hive -f /tmp/hive_load_table.sql
12-03-19 23:41:21   Hive history file=/tmp/mark/hive_job_log_mark_201203192341_371336996.txt
12-03-19 23:41:21   Copying data from file:/home/mark/projects/FreeEed/freeeed-output/0009/output/run-120319-233739/results/metadata.txt
12-03-19 23:41:21   Loading data to table default.load_file
12-03-19 23:41:21   OK
12-03-19 23:41:21   Time taken: 5.078 seconds
12-03-19 23:41:21   Running command: xterm -e hive
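
For the curious, a load script of this kind could look roughly like the sketch below. This is not the actual content of /tmp/hive_load_table.sql: the column names, the tab delimiter, and the shortened path are assumptions made for illustration. Only the table name load_file and the file name metadata.txt come from the log above.

```sql
-- Hypothetical sketch of a Hive load script; not the script FreeEed generates.
-- Column names and the tab delimiter are assumptions for illustration.
CREATE TABLE IF NOT EXISTS load_file (
  upi       STRING,   -- Unique Production Identifier (assumed column name)
  custodian STRING,   -- assumed column
  subject   STRING,   -- assumed column
  text      STRING    -- assumed column holding extracted text
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOCAL reads from the local file system, which matches the
-- "Copying data from file:/..." line in the log.
-- Replace the placeholder path with your own results directory.
LOAD DATA LOCAL INPATH '/path/to/results/metadata.txt'
OVERWRITE INTO TABLE load_file;
```

A script like this is run with hive -f, as in the first line of the log.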


Now you can ask the data anything you want to know. For example, if you ask who is talking about energy at Enron, Hive will come back, after a few seconds, with the UPI (Unique Production Identifier) of the documents you are interested in. Why a few seconds? Because Hive uses the same MapReduce technology as FreeEed™ and runs on the same cluster. It therefore scales with the cluster to very large data sets, at the cost of a small overhead to start each Hadoop job.
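
As a rough illustration, the Enron question could be typed at the Hive prompt along the following lines. The table name load_file is the one shown in the History window; the column names upi, custodian, subject, and text are assumptions, so run DESCRIBE first to see what your load file actually contains.

```sql
-- List the real columns of the loaded table before querying it.
DESCRIBE load_file;

-- Hypothetical query: column names below are assumed, not taken from FreeEed.
-- Finds documents whose text mentions "energy", case-insensitively.
SELECT upi, custodian, subject
FROM load_file
WHERE lower(text) LIKE '%energy%'
LIMIT 20;
```

A filter like this is compiled into a MapReduce job, which is where the few seconds of startup time come from.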


This feature is included in the RC release on our site. Sorry, Windows fans: it requires Linux, with Hadoop running in pseudo-distributed mode and Hive installed. If that install is intimidating, do not despair! Soon we will offer even better features in the SHMcloud™ premium platform, whether in the cloud or inside your computing centers.

À la prochaine! (French for "see you soon"), from my favorite book, "French for Cats, All The French Your Cat Will Ever Need".


