Tuesday, May 31, 2011

FreeEed V1.0 Released!

Latest changes:

  • All processing examples work;
  • Testing is automated with JUnit;
  • All tests run;
  • PST coming very soon.
Cause for celebration!!!

An open-source eDiscovery tool is now available for all.

FreeEed Roadmap

With the 1.0 release of FreeEed now close, the roadmap is becoming clear:

  • 1.0 - basics of eDiscovery processing, for small or big projects, with Java/Hadoop;
  • 2.0 - backed by a NoSQL database, which scales without a problem to thousands of projects and billions of files, with document-oriented CouchDB being a natural fit, still with Java/Hadoop;
  • 3.0 - text analytics, adding Natural Language Processing to legal review, written in the Scala language, which is a better fit for text processing than Java, and with Mahout;
  • 4.0 - data collection, forensics, preservation.

When? 1.0 - in the next few days; 2.0 - two more months; 3.0 - two more months.

After that? - improvements, improvements, improvements.

Friday, May 27, 2011

Beautiful example of Scala code

This is directly from the "Programming in Scala" book, but it illustrates the language so well.

To calculate the greatest common divisor, or gcd, one can write code like this:

def gcd(x: Long, y: Long): Long =
  if (y == 0) x else gcd(y, x % y)

and the beauty of it is that it looks exactly like the mathematical formula from which the algorithm is derived.


The biggest complaint about Scala programmers, and especially about newbie Scala programmers (like myself), is that they take the best Scala example and match it against the worst Java code. So, to be fair, here is the same code in Java:

public long gcd(long x, long y) {
  if (y == 0) return x; else return gcd(y, x % y);
}

Is it that much different? Perhaps not, and all we have achieved is to give an example of things to come. Scala is closer to a mathematical formula, but so far that's it. Well, as everyone knows, the biggest advantage is that in Scala the semicolon; is optional!
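
For anyone who wants to run the Java version and watch the recursion, here is a minimal, self-contained sketch. The class name GcdDemo and the sample inputs are made up for illustration and are not from the book; the method is the same one-liner as above, made static so it can be called from main.

```java
// Hypothetical demo class; the name and the sample inputs are assumptions.
final class GcdDemo {

    // Euclid's algorithm: gcd(x, 0) = x, otherwise gcd(x, y) = gcd(y, x mod y).
    static long gcd(long x, long y) {
        if (y == 0) return x; else return gcd(y, x % y);
    }

    public static void main(String[] args) {
        // The calls unfold as gcd(48, 18) -> gcd(18, 12) -> gcd(12, 6) -> gcd(6, 0).
        System.out.println(gcd(48, 18)); // prints 6
        // 17 and 5 share no common factor other than 1.
        System.out.println(gcd(17, 5));  // prints 1
    }
}
```

Each call replaces the pair (x, y) with (y, x mod y), so the second argument keeps shrinking until it reaches zero, and the first argument at that point is the answer.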

Art: William-Adolphe Bouguereau : La leçon difficile (The difficult lesson)

Sunday, May 22, 2011

Upcoming Houston Hadoop Meetup on June 6

This is a great chance to hear about Hadoop and other Big Data technologies, when you would and when you would not use them - from Cloudera's engineer Vikram Oberoi.

Vikram is a native Houstonian who left Texas for Silicon Valley, first to study and then to work, but we won't hold it against him for two reasons: he is working on Big Data problems, and he is coming back to tell us about it.

While the world is on fire about Big Data and Hadoop, Houston is for the most part dormant in this regard, with the notable exception of medical research.

The Big Data technologies originated at Google, Amazon, and Facebook. Lately, however, they are used by a large number of companies, and in fact your company may not be competitive without them; at least that is what McKinsey analysts are telling us. Therefore, come and hear.

Afterward, Cloudera and Houston's own SHMsoft invite y'all to Barry's Pizza on Richmond.

Art: Arthur Rackham : The Lion, Jupiter and the Elephant, illustration from Aesop's Fables, published by Heinemann, 1912

Wednesday, May 18, 2011

Big Data, Legal, and Tolkien’s Seeing Stone

How does Big Data relate to legal? See this article on Forbes.

Apache Hadoop takes top prize at Media Guardian Innovation Awards

The Apache Hadoop open source software project won the top prize at Thursday night's 2011 MediaGuardian Innovation Awards, the Megas.

Described by the judging panel as a "Swiss army knife of the 21st century", Apache Hadoop picked up the innovator of the year award for having the potential to change the face of media innovations.

The judges felt the project had the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad.

Continue reading...

Tuesday, May 17, 2011

Big data’s potential for businesses - By McKinsey analysts

Data are now part of every sector and function of the global economy and, like other essential factors of production such as hard assets and human capital, much of modern economic activity simply could not take place without them. The use of big data — large pools of data that can be brought together — will become the key basis of competition and growth for individual firms, enhancing productivity and creating significant value for the world economy.

Continue reading...

Art: Claude Oscar Monet - At Large - Open Sea

Friday, May 6, 2011

Houston Hadoop Meetup #3 with Cloudera is in the works

Cloudera's own Doug Cutting or Jeff Hammerbacher will present at Houston Hadoop Meetup on June 6. Houston Hadoopers are all excited! News of agenda and venue to follow...

Wednesday, May 4, 2011

Houston Hadoop Meetup #2

The meeting was on May 2, 2011, and it was about SQL/NoSQL and Hadoop. Here are the presentation slides. From now on, we will always publish the main points of each meeting as slides, possibly in advance.

News from the Hadoopers who were present:

  • Hal Martin is playing with Hadoop code and will prepare a presentation for July
  • Jeremy R. Easton-Marks is building his first 5-node cluster at the Cancer Center at BCM
  • Marcel Poisot is the one who registers us in the library - thank you, Marcel!
  • Kumar got a Cloudera Hadoop certification
The two suggested themes for a June meetup are
  • Practical uses of Hadoop, and/or
  • Failing memory (mine) - please fill in
See y'all in June!