Thursday, June 21, 2012

Processing Enron data on a 49-node cluster

By now, all our Hadoop clusters, regardless of size, take about 5 minutes to come up. For the HALSM presentation in Houston yesterday, we took the Enron data residing in our Amazon S3 account and duplicated some of it to increase the volume. The total was about 50 GB zipped, which is over 100 GB unzipped.

Then we used the SHMcloud(TM) Player to start the cluster and run the processing. The run took slightly over one hour. Here are some screenshots.

It was fun to show the processing live and to poke around the servers for the audience.

The next goal is to cut the processing time from just over an hour to 30 minutes through better load balancing.
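To see why load balancing can matter this much, here is a minimal sketch in Python. It is not how our cluster actually schedules work; the function names and the input sizes are made up for illustration. It compares handing out archives round-robin with handing out the largest archives first to the least-loaded worker. The job finishes only when the busiest worker finishes, so the largest per-worker load is the number to watch.

```python
import heapq


def assign_round_robin(sizes, workers):
    """Give item i to worker i % workers, ignoring item size."""
    loads = [0] * workers
    for i, size in enumerate(sizes):
        loads[i % workers] += size
    return loads


def assign_largest_first(sizes, workers):
    """Greedy heuristic: take items largest first and give each one
    to the worker that currently has the smallest total load."""
    # Min-heap of (current_load, worker_index).
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    loads = [0] * workers
    for size in sorted(sizes, reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + size
        heapq.heappush(heap, (loads[w], w))
    return loads


if __name__ == "__main__":
    # Hypothetical archive sizes in GB, not measurements from our run.
    sizes = [8, 1, 7, 2, 6, 3]
    print("round robin:  ", assign_round_robin(sizes, 2))
    print("largest first:", assign_largest_first(sizes, 2))
```

With uneven input sizes, the round-robin split can leave one worker with most of the data while the others sit idle, and the greedy split keeps the per-worker totals close to each other. The same reasoning applies to a cluster of any size.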
