The EC2 images need to be prepared with the utilities that FreeEed requires (SHMsoft will publish and maintain these images soon). Therefore, the first step is to start the cluster!
The procedure for starting the cluster after the machines have been stopped has a few caveats; I describe it in the Hadoop Illuminated book, which I am gradually publishing to the web, so you will have that soon.
Having started the Hadoop daemons, let's verify that the cluster is operating. If you don't want to open your machines to HTTP access from the whole world (I don't), you can log in to the jobtracker server and run a command-line browser there.
This is what I saw:
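The check above can be sketched as follows. Everything here is an assumption rather than taken from the article: the ports are the Hadoop 0.20/1.x web UI defaults, and `lynx` stands in for whatever text-mode browser you prefer.

```shell
# Run these on the jobtracker machine itself, so no inbound HTTP is needed.

# JobTracker web UI (default port 50030) through a text-mode browser:
lynx -dump http://localhost:50030/jobtracker.jsp | head -n 30

# NameNode web UI (default port 50070):
lynx -dump http://localhost:50070/dfshealth.jsp | head -n 30

# HDFS health from the command line: live datanodes, capacity, missing blocks:
hadoop dfsadmin -report
```

If the daemons are up, the report should show the expected number of live datanodes and nonzero configured capacity.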
Since the Enron data already comes as zip files, staging (the step in which FreeEed packages input files into zip files for processing) is not needed, and we hand these zips to FreeEed directly. The input parameters file looks as follows:
As a courtesy to EDRM, which kindly provided the Enron data, we store all of these files in our own S3 buckets, so that we do not download the data from them every time and do not make them incur bandwidth charges.
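That one-time mirroring might look like the following; the bucket names are made up for illustration, and the AWS CLI is just one tool that can do a server-side, bucket-to-bucket copy (s3cmd works as well):

```shell
# One-time, server-side copy from the source bucket into our own bucket;
# both bucket names below are hypothetical.
aws s3 sync s3://edrm-enron-source/ s3://my-freeeed-input/

# From here on, processing jobs read from my-freeeed-input,
# and the data provider incurs no further bandwidth charges.
```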
Now let's run the processing, which, incidentally, goes faster as the data grows. I saw first-hand that Hadoop is designed for Big Data and behaves better under load: eight custodians' PSTs (admittedly smallish ones) took about a minute.
For FreeEed processing on EC2, it makes sense to store the input and output data in S3 rather than in HDFS. That is because, unlike the HDFS of your private cluster, the HDFS of an Amazon cluster needs to be maintained, costs money even when idle, and, if kept on ephemeral storage, disappears on cluster shutdown. S3, on the other hand, fits the bill perfectly: it is protected against failures, replicated across multiple data centers, and you pay only for the data you actually store. FreeEed offers both "hdfs" and "s3" options for the file system.
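For Hadoop itself to read and write S3 paths directly, the AWS credentials go into core-site.xml. These are the stock Hadoop 1.x property names for the native S3 filesystem (s3n://), with placeholder values:

```xml
<!-- core-site.xml: credentials for the native S3 filesystem (s3n://) -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

With that in place, a command like `hadoop fs -ls s3n://my-bucket/path/` works much like an HDFS path.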
Here are the output zip files, happily residing in their designated S3 buckets:
That's it. You can now load the output into Hive for analysis.
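As a sketch of that last step, an external Hive table can be pointed at the results. The table name, columns, delimiter, and bucket path below are all hypothetical — this assumes a tab-delimited metadata file extracted from the FreeEed output; see the FreeEed documentation for the real output schema.

```sql
-- Hypothetical example: an external table over delimited metadata
-- extracted from the FreeEed output (column names are made up here).
CREATE EXTERNAL TABLE freeeed_output (
  document_id STRING,
  custodian   STRING,
  subject     STRING,
  date_sent   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-freeeed-output/metadata/';
```

Because the table is EXTERNAL and lives in S3, dropping it in Hive leaves the underlying files untouched.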