Tuesday, July 26, 2011
Monday, July 25, 2011
Visiting Bay Area Hadoop Meetup
http://www.meetup.com/hadoop/events/16805556/
Look at their numbers! 2235 total, and 291 attending.
Houston should beat that! (So far we have 23 total and 5 attending).
Cheers
Friday, July 22, 2011
History in FreeEed
The open source eDiscovery project, FreeEed, now has a concept of history. History is shown in a separate window that can be opened or closed at will. You can also erase it and start with a clean slate.
History is real-time, and it frees up the user interface to do other things while the software is crunching the information.
The code is currently in "Branch 2" and will be part of the 2.5 release.
Thursday, July 21, 2011
A use-case for FreeEed: When your business model involves litigation
Imagine that you are in the business of buying other businesses. With the business you may acquire its assets, its debts, and its legal obligations. Litigation becomes a way of life, part of regular business.
The big part of litigation, of course, is eDiscovery. Now imagine that you have that part completely under your control. No payments to the third-party vendors, no timing limitations on their side. You can experiment with your litigation strategies at will, both for plaintiff and defendant situations.
That is where FreeEed comes in. You can run your eDiscovery and be in control of it. The FreeEed team, in turn, will do everything in its power to help you advance your goal, win or settle the cases, and grow your business.
Art: Fernando Botero - Man Reading a Paper
Tuesday, July 19, 2011
SQL pagination with complex query
It is well known that SQL has an OFFSET parameter, so if you need pages, you could do something like the following
SELECT * from TABLE1 LIMIT 100 OFFSET 100
The problem is that this query is very inefficient: to honor the OFFSET, the database has to fetch and discard all the rows before it, so the cost grows with every page. Since many databases today are big (believe it or not, I saw a 2008 blog post saying "since many databases today are small"), this is unacceptable.
The next step is to go specific and use a unique index on TABLE1 (hope you have one), so you can sort on this index and seek past the last row you saw, which will be efficient. You can then do something like the following
SELECT * from TABLE1 where TABLE1.id > 2500 order by TABLE1.id LIMIT 100,
assuming that the last id value you saw was 2500.
What do you do when you do not have an index that you can use, but instead are doing some join with multiple indexes? For example,
SELECT * from TABLE1 JOIN TABLE2 ORDER BY TABLE1.ID1, TABLE2.ID2 LIMIT 100
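The query above is actually already half the solution. You only need to add this condition (before the ORDER BY), where last_value_id1 and last_value_id2 are taken from the last row of the previous page
WHERE
TABLE1.ID1 > last_value_id1 OR
(TABLE1.ID1 = last_value_id1 AND TABLE2.ID2 > last_value_id2).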
You can continue and do this with as many indexes as you would like. This question and the solution came up in the case of 4 indexes, and it worked great and very efficiently.
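Here is a minimal sketch of the two-key version, wrapped in Python's built-in sqlite3 module so it can be run as is. The post does not say which database was used, so SQLite is an assumption made only for the demo. The table layout, the join column, and the names fetch_page, all_pages, and build_demo are hypothetical as well.

```python
import sqlite3


def fetch_page(conn, last_key, page_size):
    """Fetch the next page of the join, starting right after last_key.

    last_key is the (id1, id2) pair of the last row already seen,
    or None for the first page.
    """
    if last_key is None:
        where, params = "", ()
    else:
        # The condition from the post: strictly after the last seen pair
        # in (id1, id2) order.
        where = "WHERE t1.id1 > ? OR (t1.id1 = ? AND t2.id2 > ?)"
        params = (last_key[0], last_key[0], last_key[1])
    sql = f"""
        SELECT t1.id1, t2.id2
        FROM table1 AS t1
        JOIN table2 AS t2 ON t2.id1 = t1.id1
        {where}
        ORDER BY t1.id1, t2.id2
        LIMIT ?
    """
    return conn.execute(sql, params + (page_size,)).fetchall()


def all_pages(conn, page_size):
    """Walk the whole result set page by page, without OFFSET."""
    pages, last_key = [], None
    while True:
        page = fetch_page(conn, last_key, page_size)
        if not page:
            return pages
        pages.append(page)
        last_key = page[-1]  # remember the last (id1, id2) we saw


def build_demo():
    """Create a tiny in-memory database (hypothetical schema) for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id1 INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE table2 (id1 INTEGER, id2 INTEGER, PRIMARY KEY (id1, id2))"
    )
    conn.executemany("INSERT INTO table1 VALUES (?)", [(i,) for i in range(1, 4)])
    conn.executemany(
        "INSERT INTO table2 VALUES (?, ?)",
        [(i, j) for i in range(1, 4) for j in range(1, 4)],
    )
    return conn


if __name__ == "__main__":
    for page in all_pages(build_demo(), page_size=4):
        print(page)
```

Each call only needs the last pair of ids from the previous page, so the client never has to count rows. With more keys the condition grows by one OR branch per key, following the same pattern.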
Art: Leonardo Da Vinci - Crossbow Machine
Monday, July 18, 2011
Installing Hadoop on Natty Narwhal
Here are the notes that will save you some trouble
To install Java on Natty, use this link
http://www.multimediaboom.com/how-to-install-java-in-ubuntu-11-04-natty-narwhal-ppa/
That is because you need a screen to accept the Java license.
Start by installing Cloudera Hadoop in pseudo-distributed mode for Ubuntu Lucid, not Natty.
Formatting the node must be done as user hdfs, or else permissions won't work.
Sunday, July 17, 2011
FreeEed gets graphical user interface
Version 2.0 of FreeEed is out!
It sports a clear GUI (graphical user interface) where the user can set parameters and run the processing. The GUI is intentionally a thin wrapper around the command-line utility, to keep the tool simple and efficient.
For all the rest,
Let Lion, Moonshine, Wall, and lovers twain
At large discourse, while here they do remain.
(Shakespeare, Midsummer Night's Dream)
Friday, July 15, 2011
Steve Loughran on Hadoop commercial support options
Steve Loughran presents a great summary on the hadoop-core discussion group, worth repeating:
The picture is a bit confusing
Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release.
So those selling commercial support are:
- Cloudera
- HortonWorks
- MapR
- EMC (reselling MapRTech, but had announced their own GreenPlum (free and commercial))
- IBM BigInsights (free and commercial)
- DataStax
+ Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?)
Photo of Steve taken from LinkedIn
Tuesday, July 5, 2011
Houston Hadoop Meeting - July 2011
Howdy, all,
we had two new guests, Carter Cole and Donald Sutton, both of whom talked about very interesting things.
Carter is into SEO, and his node.js - based SEO plugin for Chrome is used by over thirty thousand people. He wants to take it to the next level by using the power of Hadoop.
Donald Sutton brought the news of the GreenPlum Hadoop appliance, which minimizes the work of the Hadoop cluster administrator.
Mark Kerzner presented his FreeEed, Hadoop-based open source software for eDiscovery, and after the meeting, at the coffee shop across the street, a good time was had by all.
FreeEed Slides for upcoming V2.0 presentations
Slides to be used at the Houston Hadoop Meetup today, and later for Women in eDiscovery in Houston, August meeting.
FreEed - Open Source eDiscovery