Tuesday, February 21, 2012

Processing Enron Data On A FreeEed Cluster

To process eDiscovery on a Hadoop cluster prepared for FreeEed (we will call it a FreeEed Cluster), you can use your Amazon EC2 account and rent the machines there.

The machine images need to be prepared with the utilities that FreeEed requires (SHMsoft will publish and maintain these EC2 images soon). So the first step is to start the cluster!

You may notice that I use a separate namenode, a separate jobtracker node, and then as many data/work nodes as needed, four in this case. The datanode image is stored on EBS, so multiple copies of it can be started.
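To make this concrete, here is a rough sketch of how such nodes could be launched with the Amazon EC2 command-line tools. The AMI IDs, key pair, security group, and instance type below are placeholders of my own, not the actual FreeEed images.

# Namenode and jobtracker, one instance each (AMI IDs are placeholders)
ec2-run-instances ami-11111111 -n 1 -t m1.large -k my-keypair -g hadoop-cluster
ec2-run-instances ami-22222222 -n 1 -t m1.large -k my-keypair -g hadoop-cluster

# Four identical data/work nodes from the EBS-backed datanode image
ec2-run-instances ami-33333333 -n 4 -t m1.large -k my-keypair -g hadoop-cluster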

The procedure for starting the cluster after the machines have been stopped has a few caveats; I have described it in the Hadoop Illuminated book, which I am gradually publishing to the web, so you will have it soon.

Having started the Hadoop daemons, let's verify that the cluster is working. If you don't want to open your machines to HTTP access from the whole world (I don't), you can log in to the jobtracker server and run a command-line browser there:

w3m http://localhost:50030

This is what I saw:





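Another quick check, this time from the namenode, is to ask HDFS to report on its datanodes; this is a standard Hadoop command, nothing FreeEed-specific.

hadoop dfsadmin -report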
Since the Enron data already comes in zip files, staging (the step in which FreeEed packages the files into zip files for processing) is not needed, and we give the zips directly as input to FreeEed. The input parameters file looks as follows:



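To give an idea of its contents, here is a rough sketch written as a Java-style properties file; the key names and values are purely illustrative on my part and are not necessarily the exact parameters FreeEed uses.

# Illustrative only: these keys are placeholders, not FreeEed's exact parameter names
cat > enron.project <<'EOF'
project-name=Enron
input=s3://my-freeeed-bucket/enron/
file-system=s3
output=s3://my-freeeed-bucket/enron-output/
EOF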
As a courtesy to EDRM, which kindly provided the Enron data, we store all of these files in our own S3 buckets, so that we do not download the data from them every time and do not make them incur bandwidth charges.
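If you want to set up the same kind of bucket, one simple way is to upload your local copies of the EDRM zips with s3cmd; the bucket and file names below are placeholders.

# Bucket and file names are placeholders
s3cmd put edrm-enron-custodian.zip s3://my-freeeed-bucket/enron/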

Now let's run the processing, which, incidentally, goes very fast with more data. I saw first-hand that Hadoop is designed for Big Data and behaves better under load. Eight custodians' PSTs (admittedly, smallish) took about a minute.




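While the job is running, you can also keep an eye on it from the command line on the jobtracker node, using the standard Hadoop job commands; the job id below is a placeholder, use the one that hadoop job -list reports for your run.

hadoop job -list
hadoop job -status job_201202211200_0001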
For FreeEed processing on EC2, it makes sense to store the input and output data in S3 rather than in HDFS. That is because, unlike the HDFS of your own permanent cluster, the HDFS of an Amazon cluster needs to be maintained, costs money even when not used, and, if kept on ephemeral storage, disappears on cluster shutdown. S3, on the other hand, fits the bill perfectly: it is protected against failure, replicated across multiple data centers, and you pay only for the data you actually store. FreeEed has both "hdfs" and "s3" options for the file system.
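For completeness, this is roughly how that S3 data looks to Hadoop itself through the s3n file system; the bucket name is a placeholder, and the AWS credentials are assumed to be configured in core-site.xml through the standard fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties. Whether FreeEed goes through s3n internally or talks to S3 directly, the data ends up in the same buckets.

# Bucket name is a placeholder; credentials come from core-site.xml
hadoop fs -ls s3n://my-freeeed-bucket/enron/
hadoop fs -ls s3n://my-freeeed-bucket/enron-output/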

Here are the output zip files, happily residing in the designated S3 buckets:

The "attempt..." name is given by FreeEd to make each zip file, coming from an individual reducer, is unique across the job output files, so that reducers don't overwrite each other's output.

That's it. You can now load the output into Hive for analysis.
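As a sketch of that last step: assuming you have extracted the tab-delimited metadata files from the output zips into an S3 path of their own, an external Hive table over them could look like the following. The table name, column names, and bucket path are all my own placeholders, not something FreeEed prescribes.

# Everything here (table, columns, path) is illustrative
hive -e "
CREATE EXTERNAL TABLE enron_results (
  doc_id STRING,
  custodian STRING,
  subject STRING,
  date_sent STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-freeeed-bucket/enron-metadata/';
"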
