Monday, May 7, 2012

Automated Hadoop Cluster Construction on EC2

Hi, all,

that was the topic of today's presentation, and here are the slides: http://www.slideshare.net/markkerzner/automated-hadoop-cluster-construction-on-ec2

However, it was more than that, because we had a lively discussion, almost an unconference, where people asked questions, gave explanations, and described their business cases.

Case in point: we kept talking next to the library for another hour. To an outsider (and there was one police car watching us), it looked like a veritable Silicon Valley meeting, what with the startup and technology discussions.

See you next time on Richmond, for another hands-on session. However, other presenter ideas are welcome.

Monday, April 2, 2012

The most awesome Houston Hadoop Meetup Ever

Hi, all,

we had the most interesting meeting. It's one thing to sit and listen, and it is completely another to do it yourself. But, paradoxically, this is what interests people the most.

We had a bigger turnout than expected, and everybody was burning to build their own clusters, run them on Amazon, and gain first-hand experience of what others know only at the "I heard the presentation" level.

So, of course, everybody had their own environments, things did not work on the first try, you had to debug and adjust, install and re-install, and it was so much fun that we could not stop until 9 PM, or however late it was that we all had to go home.

Resolutions for the next meetup:

  1. We will separate into two groups, beginners and advanced, and we will have two teachers;
  2. I will finish publishing whatever chapters I have of Hadoop Illuminated, with all the instructions for Hadoop on EC2;
  3. We may need a new venue, at least for this coming May meeting, and we may consider the Microsoft compound nearby.

See you all next time!

Monday, March 19, 2012

If you need to do searches in the FreeEed results...

Then now, with Hive, you can.

Imagine you have processed a good amount of eDiscovery data with FreeEed™. Up until now, FreeEed™ did not give you the capability to search in the load file. You had to load it into Concordance or another tool, open it in Excel (up to a certain point), or bring it into a database. And yet, with large data sizes, all these approaches would eventually either become very slow or hit a limit on the size of the data, as Excel does. Now FreeEed™ gives you the capability to search the results. Here is how.

You select the menu item "Load with Hive." Hive is an open source tool, part of the Hadoop family, which allows you to query the results with a language similar to SQL. The language is actually called HiveQL; it follows SQL closely and adds Hadoop-specific extensions.

Spoiler: in the background, FreeEed™ writes the Hive scripts and loads your load file into Hive, like this (you can see this going on in the History window):


12-03-19 23:41:14   Running command: hive -f /tmp/hive_load_table.sql
12-03-19 23:41:21   Hive history file=/tmp/mark/hive_job_log_mark_201203192341_371336996.txt
12-03-19 23:41:21   Copying data from file:/home/mark/projects/FreeEed/freeeed-output/0009/output/run-120319-233739/results/metadata.txt
12-03-19 23:41:21   Loading data to table default.load_file
12-03-19 23:41:21   OK
12-03-19 23:41:21   Time taken: 5.078 seconds
12-03-19 23:41:21   Running command: xterm -e hive
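
The log above shows only the commands, not the contents of /tmp/hive_load_table.sql. As a rough sketch of what such a generated script might look like, here is a small Python helper that builds one. The column names and the tab delimiter are assumptions made for illustration, not the real FreeEed™ load file layout; only the table name load_file comes from the log.

```python
# Sketch only: these columns are hypothetical, not FreeEed's actual schema.
HYPOTHETICAL_COLUMNS = ["upi", "custodian", "subject", "doc_text"]


def build_hive_load_script(metadata_path, table="load_file",
                           columns=HYPOTHETICAL_COLUMNS):
    """Return HiveQL that creates a delimited table and loads a local file."""
    column_defs = ", ".join(f"{name} STRING" for name in columns)
    lines = [
        f"CREATE TABLE IF NOT EXISTS {table} ({column_defs})",
        # Assumes a tab-delimited load file; '\\t' is written as \t in HiveQL.
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'",
        "STORED AS TEXTFILE;",
        # LOCAL reads from the local file system rather than from HDFS.
        f"LOAD DATA LOCAL INPATH '{metadata_path}' "
        f"OVERWRITE INTO TABLE {table};",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script; it could be saved to a file and run with "hive -f".
    print(build_hive_load_script("/tmp/metadata.txt"))
```

Saving the output to a file and passing it to hive -f would mirror the first command in the log.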


Now you can ask the data anything you want to know. For example, in response to your question, who is talking about energy at Enron, it will come back, after a few seconds, with the UPI (Unique Production Identifier) of the documents you are interested in. Why a few seconds? Because Hive uses the same MapReduce technology as FreeEed™ and runs on the same cluster. Therefore, it can handle any amount of data, but it has a small overhead to start the Hadoop job.
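
To make the "who is talking about energy" question concrete, here is a minimal sketch of how such a query could be put together. The column names upi and doc_text are hypothetical stand-ins, since the real load file columns are not listed here; load_file is the table name from the log.

```python
def build_keyword_query(keyword, table="load_file",
                        id_column="upi", text_column="doc_text"):
    """Return a HiveQL query for the IDs of rows whose text mentions keyword."""
    # Escape backslashes and single quotes so the keyword stays inside
    # the single-quoted HiveQL string literal.
    safe = keyword.lower().replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT {id_column} FROM {table} "
        f"WHERE lower({text_column}) LIKE '%{safe}%';"
    )


if __name__ == "__main__":
    # Paste the printed query into the Hive prompt that FreeEed opens.
    print(build_keyword_query("energy"))
```

Typed at the Hive prompt, a query of this shape would return the matching identifiers once the job completes.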


This feature is included in the RC release on our site. Sorry, Windows fans: it does require running on Linux, with Hadoop in pseudo-distributed mode and Hive installed. If that install is intimidating, do not despair! Soon we will offer even better features in the SHMcloud™ premium platform, whether in the cloud or inside your computing centers.

À la prochaine! (French for "see you next time"), from my favorite book, "French for Cats: All the French Your Cat Will Ever Need".



Monday, March 12, 2012

Things always improve on Amazon


Four days ago there was an announcement that I could easily have missed. However, while working nonstop :) I needed to create some new instances, went to http://alestic.com/ and found this: http://aws.typepad.com/aws/2012/03/ec2-updates-new-instance-64-bit-bit-ubiquity-ssh-client.html

Why is it a big deal? - Read it and you will see :)

In short, EC2 is now even more convenient and even cheaper.

Monday, March 5, 2012

Houston Hadoop Meetup - Hands-On Clusters on EC2

This meetup was a learning session: how to build your own clusters on EC2, so that you can work on your own Hadoop clusters, for development, testing, and profit.

We discussed the many possible ways to choose your AMIs, to attach EBS volumes to them, and to set up Java, Cloudera, and Hadoop. Then we launched our EC2 machines, only to discover that the library WiFi blocks ssh port 22. This is where it got interesting, with people finding ways to change the port, setting up a Hadoop cluster locally on a laptop instead, or using shared WiFi from a phone.
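
For the record, here is one way the port workaround can look from the client side, as a minimal sketch. The host name, user, key file, and port 443 are all placeholders chosen for illustration, not what anyone at the meetup actually used. The instance must already have sshd listening on that port, and its EC2 security group must allow it.

```python
import shlex


def ssh_command(host, user="ubuntu", key_file="my-key.pem", port=443):
    """Build an ssh command line that connects on a non-default port."""
    # -i selects the private key file, -p selects the remote port.
    return ["ssh", "-i", key_file, "-p", str(port), f"{user}@{host}"]


if __name__ == "__main__":
    # ec2-host.example.com is a placeholder, not a real instance.
    print(shlex.join(ssh_command("ec2-host.example.com")))
```

Port 443 is a common choice because networks that block 22 usually leave HTTPS traffic alone, but any port the network lets through would do.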

In the end, we learned a lot about EC2, something about hacking, and not that much about logging in and working on the EC2 instances (only one or two people were actually able to achieve that feat).

However, the feeling of doing some real work was tremendous. We all became good friends, all were exhausted and skipped the usual Starbucks, and all decided to continue. Perhaps we need a better, more technical meeting place, with a good Internet connection. Any suggestions?

But, in any case, that was a valuable experience. I will polish my instructions and put them into "Hadoop Illuminated", and we will all continue thinking about what to do at the next meeting, and whether we should combine presentations with hackathons.

Thank you, all!

Houston Hadoop Meetup - Remote Presentation by Sujee Maniyam

Sujee Maniyam gave an excellent presentation on "Cost Effective Big-Data Processing using Amazon Elastic Map Reduce". Here is the link to the slides, http://www.slideshare.net/sujee/cost-effective-bigdata-processing-on-amazon-ec2

The remote format worked out great: people from other places could join us, using Skype and joinme.com. The speakers let everybody in the room hear Sujee, and we could ask him questions that he could hear and answer. The slides were projected on the wall (thanks to Alan Lipe), and fun was had by all.

The gist of it is that Amazon is great for quick prototyping, and EMR even more so, if you can make it work for you. But in the long run, it can get pretty expensive (as I found out last week, when I left a 6-machine cluster running for more than a week, 192 hours).
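
To see how a forgotten cluster adds up, here is a back-of-the-envelope sketch. The 6 machines and 192 hours come from the story above; the hourly rate is a made-up placeholder, not an actual Amazon price, so plug in the rate for your own instance type.

```python
from decimal import Decimal


def instance_hours(machines, hours):
    """Total billable instance-hours for a cluster left running."""
    return machines * hours


def cluster_cost(machines, hours, hourly_rate):
    """Cost of the cluster at a flat per-instance hourly rate."""
    return instance_hours(machines, hours) * hourly_rate


# Placeholder rate in dollars per instance-hour, NOT a real EC2 price.
HYPOTHETICAL_RATE = Decimal("0.10")

if __name__ == "__main__":
    print(instance_hours(6, 192))                   # 1152
    print(cluster_cost(6, 192, HYPOTHETICAL_RATE))  # 115.20
```

The point is the multiplication: every machine bills for every hour, so six idle machines cost six times as much as one.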

Thank you all!


Tuesday, February 21, 2012

Top 20 replies by Programmers to Testers when their programs don't work


20. "That's weird..."

19. "It's never done that before."

18. "It worked yesterday."

17. "How is that possible?"

16. "It must be a hardware problem."

15. "What did you type in wrong to get it to crash?"

14. "There is something funky in your data."

13. "I haven't touched that module in weeks!"

12. "You must have the wrong version."

11. "It's just some unlucky coincidence."

10. "I can't test everything!"

9. "THIS can't be the source of THAT."

8. "It works, but it hasn't been tested."

7. "Somebody must have changed my code."

6. "Did you check for a virus on your system?"

5. "Even though it doesn't work, how does it feel?"

4. "You can't use that version on your system."

3. "Why do you want to do it that way?"

2. "Where were you when the program blew up?"

1. "It works on my machine."