Monday, March 19, 2012

If you need to do searches in the FreeEed results...

Now, with Hive, you can.

Imagine you have processed a good amount of eDiscovery data with FreeEed™. Until now, FreeEed did not give you the capability to search in the load file. You had to load it into Concordance or another tool, open it in Excel (up to a certain point), or bring it into a database. With large data sizes, all of these approaches eventually become very slow or hit a limit on the size of the data, as Excel does. Now FreeEed™ gives you the capability to search the results. Here is how.

You select the menu item "Load with Hive." Hive is an open source tool, part of the Hadoop family, which lets you query the results in HiveQL, a language similar to SQL but considerably more powerful.

Spoiler: in the background, FreeEed™ writes the Hive scripts and loads your load file into Hive, like this (you can see this going on in the History window):


12-03-19 23:41:14   Running command: hive -f /tmp/hive_load_table.sql
12-03-19 23:41:21   Hive history file=/tmp/mark/hive_job_log_mark_201203192341_371336996.txt
12-03-19 23:41:21   Copying data from file:/home/mark/projects/FreeEed/freeeed-output/0009/output/run-120319-233739/results/metadata.txt
12-03-19 23:41:21   Loading data to table default.load_file
12-03-19 23:41:21   OK
12-03-19 23:41:21   Time taken: 5.078 seconds
12-03-19 23:41:21   Running command: xterm -e hive


Now you can ask the data anything you want to know. For example, ask who is talking about energy at Enron, and after a few seconds it will come back with the UPIs (Unique Production Identifiers) of the documents you are interested in. Why a few seconds? Because Hive uses the same MapReduce technology as FreeEed™ and runs on the same cluster. Therefore it can handle any amount of data, but it has a small overhead to start the Hadoop job.
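
To make this concrete, such a question might look like the following in HiveQL. The column names upi, custodian, and text are my illustrative assumptions (matching the sketch above); the actual ones depend on your load file:

-- Who is talking about energy? (illustrative column names)
SELECT upi, custodian
FROM load_file
WHERE lower(text) LIKE '%energy%';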


This feature is included in the RC release on our site, and (sorry, Windows fans) it requires running on Linux, with Hadoop in pseudo-distributed mode and Hive installed. If that install is intimidating, do not despair! Soon we will offer even better features in the SHMcloud™ premium platform, whether in the cloud or inside your computing centers.

À la prochaine! (French for "see you soon"), from my favorite book, "French for Cats: All the French Your Cat Will Ever Need".



Monday, March 12, 2012

Things always improve on Amazon


Four days ago there was an announcement that I could have missed. However, while working non-stop :) I needed to create some new instances, went to http://alestic.com/, and found this: http://aws.typepad.com/aws/2012/03/ec2-updates-new-instance-64-bit-bit-ubiquity-ssh-client.html

Why is it a big deal? Read it and you will see :)

In short: even more convenient, and even cheaper.

Monday, March 5, 2012

Houston Hadoop Meetup - Hands-On Clusters on EC2

This meetup was a learning session: how to build your own Hadoop clusters on EC2, for development, testing, and profit.

We discussed the many possible ways to choose your AMIs, attach EBS volumes to them, and set up Java, Cloudera, and Hadoop. Then we launched our EC2 machines, only to discover that the library WiFi blocks ssh port 22. This is where it got interesting, with people finding ways to change the ssh port, setting up a Hadoop cluster locally on a laptop instead, or sharing WiFi from a phone.

In the end, we learned a lot about EC2, something about hacking, and not that much about logging in and working on the EC2 instances (only one or two people actually achieved that feat).

However, the feeling of doing some real work was tremendous. We all became good friends, were all exhausted enough to skip the usual Starbucks, and all decided to continue. Perhaps we need a better, more technical meeting place with a good Internet connection. Any suggestions?

But, in any case, it was a valuable experience. I will polish my instructions and put them into "Hadoop Illuminated", and we will all continue thinking about what to do at the next meeting, and whether we should combine presentations with hackathons.

Thank you, all!

Houston Hadoop Meetup - Remote Presentation by Sujee Maniyam

Sujee Maniyam gave an excellent presentation on "Cost Effective Big-Data Processing using Amazon Elastic Map Reduce". Here is the link to the slides: http://www.slideshare.net/sujee/cost-effective-bigdata-processing-on-amazon-ec2

The remote format worked out great: people from other places could join in, using Skype and joinme.com. The speakers let everybody in the room hear Sujee, we could ask him questions that he could hear and answer, the slides were projected on the wall (thanks to Alan Lipe), and fun was had by all.

The gist of it is that Amazon is great for quick prototyping, and EMR even more so, if you can make it work for you. But in the long run it can get pretty expensive (as I found out last week, when I left a 6-machine cluster running for over a week, 192 hours).

Thank you all!