Tuesday, July 26, 2011

Interview at Facebook

Took a photo as a "proof of concept", or rather as evidence that can be discovered :)

Monday, July 25, 2011

Visiting Bay Area Hadoop Meetup


Look at their numbers! 2235 total, and 291 attending.

Houston should beat that! (So far we have 23 total and 5 attending).


Friday, July 22, 2011

History in FreeEed

The open source eDiscovery project, FreeEed, now has a concept of history. History is shown in a separate window that can be opened or closed at will. You can also erase it and start with a clean slate.

History is updated in real time, which frees up the user interface for other things while the software is crunching the information.

The code is currently in "Branch 2" and will be part of the 2.5 release.

Thursday, July 21, 2011

A use-case for FreeEed: When your business model involves litigation

Imagine that you are in the business of buying other businesses. With the business you may acquire its assets, its debts, and its legal obligations. Litigation becomes a way of life, part of regular business.

A big part of litigation, of course, is eDiscovery. Now imagine that you have that part completely under your control. No payments to third-party vendors, no timing constraints imposed by them. You can experiment with your litigation strategies at will, both for plaintiff and defendant situations.

That is where FreeEed comes in. You can run your eDiscovery and be in control of it. The FreeEed team, in turn, will do everything in its power to help you advance your goals, win or settle your cases, and grow your business.

Art: Fernando Botero - Man Reading a Paper

Tuesday, July 19, 2011

SQL pagination with complex query

It is well known that SQL has an OFFSET parameter, so if you need pages, you could do something like the following


The problem is, this query is very inefficient. Since many databases today are big (believe it or not, I saw a 2008 blog post saying "since many databases today are small"), inefficiency of OFFSET, which results in fetching all rows before it, is unacceptable.
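As a rough illustration of OFFSET-based paging, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (`id`, `payload`), the row count, and the helper name `fetch_page_with_offset` are assumptions made for this example, not the query from the original post.

```python
import sqlite3

# Hypothetical table for illustration: 1000 rows with a unique integer id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO TABLE1 (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)

def fetch_page_with_offset(conn, page_number, page_size):
    """Fetch one page by skipping page_number * page_size rows."""
    offset = page_number * page_size
    return conn.execute(
        "SELECT id, payload FROM TABLE1 ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

# Page 5 (zero-based) with 100 rows per page: the 500 rows before it
# are stepped over on every call, which is the cost described above.
page = fetch_page_with_offset(conn, 5, 100)
```

The deeper the page, the more rows are skipped on each request, so the cost grows with the page number.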

The next step is to get specific and use a unique index on TABLE1 (I hope you have one), so you can sort and page on this index, which will be efficient. You can then do something like the following

SELECT * FROM TABLE1 WHERE TABLE1.id > 2500 ORDER BY TABLE1.id LIMIT 100,

assuming that the last id value you saw was 2500.
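The same idea can be sketched as a loop that remembers the last id it saw. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module, and the `payload` column, the 250-row data set, and the `fetch_page_after` helper are made up for the example.

```python
import sqlite3

# Hypothetical table for illustration: 250 rows with a unique integer id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO TABLE1 (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 251)],
)

def fetch_page_after(conn, last_id, page_size):
    """Return the next page of rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, payload FROM TABLE1 WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

# Walk through the whole table, 100 rows at a time.
last_id = 0  # assumes ids are positive, so 0 means "start from the beginning"
pages = []
while True:
    page = fetch_page_after(conn, last_id, 100)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]  # remember the last id seen for the next request
```

Because each request starts from an indexed value rather than from a row count, the database can seek directly to the first row of the page.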

What do you do when you do not have an index that you can use, but instead are doing some join with multiple indexes? For example,


The query above is actually already half the solution. You only need to add this condition

TABLE1.ID1 > last_value_id1 OR
(TABLE1.ID1 = last_value_id1 AND TABLE1.ID2 > last_value_id2).

You can continue and do this with as many indexes as you would like. This question and the solution came up in a case with four indexes, and it worked well and very efficiently.
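Here is a minimal sketch of the two-column condition, again using Python's built-in sqlite3 module. It is simplified to a single table with a composite key rather than a join, and the data and the `fetch_page_after_pair` helper are assumptions for the example. Note that the ORDER BY must list the columns in the same order as the condition.

```python
import sqlite3

# Hypothetical table for illustration: rows are unique on (ID1, ID2).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TABLE1 (ID1 INTEGER, ID2 INTEGER, PRIMARY KEY (ID1, ID2))"
)
rows = [(a, b) for a in range(1, 4) for b in range(1, 5)]  # 12 rows
conn.executemany("INSERT INTO TABLE1 (ID1, ID2) VALUES (?, ?)", rows)

def fetch_page_after_pair(conn, last_id1, last_id2, page_size):
    """Return the next page of rows that sort after (last_id1, last_id2)."""
    return conn.execute(
        """
        SELECT ID1, ID2 FROM TABLE1
        WHERE ID1 > ? OR (ID1 = ? AND ID2 > ?)
        ORDER BY ID1, ID2
        LIMIT ?
        """,
        (last_id1, last_id1, last_id2, page_size),
    ).fetchall()

# Page through everything, 5 rows at a time, carrying the last pair forward.
seen = []
last = (0, 0)  # assumes both ids are positive
while True:
    page = fetch_page_after_pair(conn, last[0], last[1], 5)
    if not page:
        break
    seen.extend(page)
    last = page[-1]
```

With more columns, the condition grows by one more OR branch per column, each branch pinning the earlier columns with equality.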

Art: Leonardo Da Vinci - Crossbow Machine

Monday, July 18, 2011

Installing Hadoop on Natty Narwhal

Here are the notes that will save you some trouble

To install Java on Natty, use this link

That is because you need a screen to accept the Java license.

Start by installing Cloudera Hadoop in pseudo-distributed mode for Ubuntu lucid, not natty.

Formatting the node must be done as user hdfs, or else permissions won't work.

Sunday, July 17, 2011

FreeEed gets graphical user interface

Version 2.0 of FreeEed is out!

It sports a clear GUI (graphical user interface) where the user can set parameters and run the processing. The GUI is intentionally a thin wrapper around the command-line utility, to keep the tool simple and efficient.

For all the rest,
Let Lion, Moonshine, Wall, and lovers twain
At large discourse, while here they do remain.

(Shakespeare, A Midsummer Night's Dream)

Friday, July 15, 2011

Steve Loughran on Hadoop commercial support options

Steve Loughran presents a great summary on the hadoop-core discussion group, worth repeating:

The picture is a bit confusing

Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release.

So those selling commercial support are:

  • Cloudera
  • HortonWorks
  • MapR
  • EMC (reselling MapRTech, but had announced their own GreenPlum (free and commercial))
  • IBM BigInsights (free and commercial)
  • DataStax

+ Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?)

Photo of Steve taken from LinkedIn

Tuesday, July 5, 2011

Houston Hadoop Meeting - July 2011

Howdy, all,

We had two new guests, Carter Cole and Donald Sutton, both of whom talked about very interesting things.

Carter is into SEO, and his node.js-based SEO plugin for Chrome is used by over thirty thousand people. He wants to take it to the next level by using the power of Hadoop.

Donald Sutton brought the news of the GreenPlum Hadoop appliance, which minimizes the work of the Hadoop cluster administrator.

Mark Kerzner presented his FreeEed, Hadoop-based open source software for eDiscovery, and after the meeting, at the coffee shop across the street, a good time was had by all.

FreeEed Slides for upcoming V2.0 presentations

Slides to be used at the Houston Hadoop Meetup today, and later at the August meeting of Women in eDiscovery in Houston.

FreeEed - Open Source eDiscovery
