Thursday, June 21, 2012

Processing Enron data on a 49-node cluster

By now, all our Hadoop clusters, regardless of size, take about 5 minutes to come up. For the HALSM presentation in Houston yesterday, we took the Enron data residing in our Amazon S3 account and duplicated some of it. The total volume was about 50 GB zipped, which comes to over 100 GB unzipped.

Then we used the SHMcloud(TM) Player to start the cluster and run the processing. It took slightly over one hour. Here are some screenshots.

It was fun to show processing live, and to poke around the servers for the audience.

The next goal is to reduce this time to 30 minutes, by better load balancing.

Wednesday, June 13, 2012

Big clusters for eDiscovery

Every programmer knows that special pleasure and satisfaction when his or her code works right, and keeps working right with more and more testing and more and more data. The special joy of clusters is when the code works with a cluster of any size.

The SHMcloud(TM) Player is now able to start and configure all the machines in a Hadoop cluster at once. This means that a cluster of 1 machine takes five minutes, a cluster of 20 machines takes five minutes, and a cluster of 50 or 100 machines also takes five minutes - the latter when Amazon approves my request for more instances :)
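Why the startup time stays flat can be shown with a small simulation. This is only a sketch, not how the SHMcloud(TM) Player is actually implemented: `start_node` and `start_cluster` are hypothetical names, and the "boot" is a short sleep standing in for launching and configuring a real machine. The point is that when all nodes start concurrently, the total time is about that of the slowest node, not the sum over all nodes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def start_node(node_id, boot_seconds=0.05):
    """Hypothetical stand-in for launching and configuring one machine."""
    time.sleep(boot_seconds)  # simulated boot + configuration time
    return node_id


def start_cluster(size, boot_seconds=0.05):
    """Start all nodes concurrently; return (ready_node_ids, elapsed_seconds)."""
    started = time.monotonic()
    # One worker thread per node, so no node waits for another to finish.
    with ThreadPoolExecutor(max_workers=size) as pool:
        ready = list(pool.map(lambda n: start_node(n, boot_seconds), range(size)))
    return ready, time.monotonic() - started


if __name__ == "__main__":
    for size in (1, 20, 50):
        ready, elapsed = start_cluster(size)
        print(f"{size:3d} nodes ready: {len(ready)}, elapsed {elapsed:.2f}s")
```

Run as a script, the elapsed time should come out roughly the same for 1, 20 and 50 simulated nodes, which mirrors the five-minute figure above. Starting the nodes one after another would instead scale linearly with cluster size.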

Update: got my limit raised to 50!

Then you can verify this in the AWS console.

And don't forget to shut them down!

Update 2: the nice folks at Amazon gave me 50 machines the next day. Now the cluster looks like this:

-rw-r--r--   1 ubuntu supergroup          0 2012-06-14 21:33 /test-output/_SUCCE
drwxr-xr-x   - ubuntu supergroup          0 2012-06-14 21:32 /test-output/_logs
-rw-r--r--   1 ubuntu supergroup        172 2012-06-14 21:33 /test-output/part-0

12-06-14 16:33:30   Cluster testing and verification is complete
setInitializedState for cluster of 49
12-06-14 16:33:33   Running instances: 49
12-06-14 16:33:33   Completely initialized: 49
setInitializedState for cluster of 49

Gioia gioia mille anni!

47 working nodes (49 total - memory master - work master) working together!

Monday, June 4, 2012

Houston Hadoop Meetup June - Hands-On!

A sign of a good meetup is when people don't want to leave. This happened again this time: the librarian wanted us out because the library was closing, and we, being good compassionate human beings, preempted her by a little. Even so, people were sighted talking next to the closed library entrance for the next half hour at least.

Why the excitement? Developers love challenges, and here they were challenged enough: the hands-on format had something for everyone, and if you had already completed your assignment, you were given the next level of complexity.

We found our next presenter, Dennis, who will be talking about his genome project, and how they use Hadoop at the Medical Center - details to follow soon. After that, the proposed format is to take one chapter in the Hadoop Illuminated book, and combine the presentation based on that chapter with the exercises found at the end of the chapter. This will also give me a push to complete the book. And the companion project, of course!

A la prochaine!

Art: Jean Charles Meissonier - Two Men Talking In A Tavern