SHMsoft blog: 2011

Thursday, December 22, 2011

Search news in FreeEed

We've added the automatic build time to the About screen. We are going to have many versions of FreeEed floating around, and this will be useful.

Today LinkedIn announced that they have open-sourced a number of search technologies. It is mind-boggling to think what kind of searches we can add to FreeEed very soon:

https://engineering.linkedin.com/open-source/indextank-now-open-source
http://sna-projects.com/bobo/
http://javasoze.github.com/zoie/
http://sna-projects.com/cleo/

And the big year in Solr/Lucene:
http://blog.sematext.com/2011/12/21/lucene-solr-year-2011-in-review/

Thursday, November 24, 2011

How to process Microsoft Outlook .PST files

Here is an efficient way that FreeEed uses:

Convert PST to MBOX format. Use readpst in Linux and JPST in Windows. Previously I used individual EML emails, but this is not efficient, since there are too many of them. Dealing with MBOX files that correspond to top-level PST folders fits much better with the overall Hadoop processing.
Use JavaMail in conjunction with the mstor local access provider to process these MBOX files. This approach is great because it allows the use of standard, high-quality components. It also gives full access to attachments, CC, BCC, etc.

Now this approach is something I feel very good about, because it combines the best practices with overall efficiency.
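
To make the MBOX idea concrete: an MBOX file is a single text file in which each message begins with a separator line starting with "From ". The sketch below is not the JavaMail/mstor approach described above; it is a minimal JDK-only illustration, with the hypothetical class name MboxMessageCounter, that counts messages by those separator lines. It assumes the common convention that a separator either starts the file or follows a blank line, and that body lines beginning with "From " were escaped by whatever tool wrote the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: counts messages in MBOX text by their "From " separator lines. */
final class MboxMessageCounter {

    static int countMessages(Reader source) throws IOException {
        int count = 0;
        // The start of the file is treated like a blank line, so the first message is counted.
        boolean previousLineBlank = true;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: a separator is a "From " line preceded by a blank line.
                if (previousLineBlank && line.startsWith("From ")) {
                    count++;
                }
                previousLineBlank = line.isEmpty();
            }
        }
        return count;
    }
}
```

One file per top-level PST folder then means one sequential read per folder, instead of opening thousands of small EML files.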

Tuesday, November 22, 2011

Adding image creation to FreeEed

By "image creation" in eDiscovery we mean making the PDF or TIFF images of the originals. Having these is convenient for review, because it eliminates the need for the various applications required to open the native file formats, and is useful for redacting.

Over the last three weeks I have been starting a new assignment that deals with text analytics and understanding in the context of Big Data, which is great, because a deeper knowledge of it will help me create open source tools for automated document review later on. But it also meant that I only had a couple of hours to work on FreeEed in the evening, and that only for two evenings.

Nevertheless, this was enough. OpenOffice/LibreOffice are open source free applications that allow printing MS office documents to PDF, and JodConverter is a bridge that allows the code to talk to it. Altogether, printing is done with five lines of code. Here they are:

OfficeManager officeManager = new DefaultOfficeManagerConfiguration().buildOfficeManager();
officeManager.start();
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
converter.convert(new File("test.odt"), new File("test.pdf"));
officeManager.stop();

Taking out the start/stop code, you have just one line:

converter.convert(new File("test.odt"), new File("test.pdf"));

That's it! One line of code (and lots of computing power) to convert all MS Office file formats to PDF. Isn't this amazing? Anytime you need more computing power, you get it from the cloud on the cheap, so FreeEed begins to really shine, because it is designed for parallel processing in the cloud.

Sometimes I wish I had more time for FreeEed, perhaps even doing it full-time. But then again, since I can do so much with the great open source tools, maybe it is not even necessary.

Tuesday, November 8, 2011

Evan Koblentz on eDiscovery pricing

"The other extreme is products and services that are cost-free. The open-source FreeEED project may get traction now that it's available for Microsoft Windows. In addition, FreeEDD will soon tackle automated document review, project leader Mark Kerzner said. Open source is sometimes controversial for developers' philosophical approach, and for what closed-sourced vendors allege are hidden costs of implementation and support. But at least you know what the base software costs -- nothing."

Full article

Friday, November 4, 2011

Open source eDiscovery (FreeEed) for Windows is released

All paralegals, lawyers, and do-it-yourselfers!

We have released a version of FreeEed that runs in Windows. It also runs on a Hadoop cluster, is scalable, and is free. You can find it here: http://freeeed.org.

Thank you. Sincerely,
FreeEed Team

Art: Bruegel, Pieter the Elder - Dance

Thursday, November 3, 2011

So I paid $548.30

...and I own a JPST license from http://independentsoft.com/. It gives me unlimited distribution rights, so everybody can use FreeEed for free, and it works in Windows and extracts PST. It goes into Release Candidate 2, where all known bugs are fixed. Production release coming as soon as I receive the licensing info.

Sunday, October 23, 2011

FreeEed for Windows Release Candidate 1 is available for download

The open source software for eDiscovery, FreeEed, now works in Windows, in addition to Linux. It can be downloaded here, http://freeeed.org/download. We are testing it on the Mac.

The software version is 2.9.5, and it contains multiple bug fixes and enhancements. The new release is still free and will always be free, even though it contains a closed source version of the PST extractor, which is needed for it to work in Windows. No additional software needs to be installed.

In a few weeks, when the software is officially released, it will contain a licensed version of this extractor, free for all users.

As always, your feedback is welcome.

Thursday, October 6, 2011

Loading inner maps in Hive

Sometimes you want to load maps that contain maps into Hive, that is, this structure:

map <string,map<string,string>>

Hive allows this. In fact, it allows even deeper levels of mapping. However, the question is: how do you tell it where your inner maps end, since this is not one of the parameters of the CREATE TABLE or LOAD DATA INPATH statements? Well, there is an undocumented default: '\004' separates the entries of an inner map, and '\005' separates each inner key from its value.

Here is how your data has to be formatted (using an image, to show the non-ascii separators)

and this is how you define your table

CREATE EXTERNAL TABLE map_table
(
complex_map map <string, map<string,string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\003'
MAP KEYS TERMINATED BY '\002'
STORED AS TextFile;

and load it as follows

LOAD DATA INPATH "/mydata" OVERWRITE INTO TABLE map_table;

and you can have the following queries:

hive> select * from map_table;
Result:
{"key1":{"innerkey11":"innervalue11","innerkey21":"innervalue21"},"key2":{"innerkey12":"innervalue12","innerkey22":"innervalue22"}}

also

hive> select complex_map["key1"] from map_table;
Result:
{"innerkey11":"innervalue11","innerkey21":"innervalue21"}

and even

hive> select complex_map["key1"]["innerkey11"] from map_table;
Result:
innervalue11

Now, it is true, this uses the default coding for inner-level maps for Hive load, and that's not documented, but it IS the coding which is unlikely to change.
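
Since the image with the non-ASCII separators may not be visible, here is a minimal sketch of how one row of such a file could be produced. It assumes the separators behave exactly as described above: '\003' between outer map entries and '\002' between an outer key and its value (as declared in the table definition), then '\004' between inner entries and '\005' between an inner key and its value (the undocumented defaults). The class name HiveNestedMapRow and the method encode are made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: builds one text row for a map<string,map<string,string>> column. */
final class HiveNestedMapRow {
    // Separators declared in the CREATE EXTERNAL TABLE statement of the post.
    static final char OUTER_ENTRY_SEP = '\003'; // COLLECTION ITEMS TERMINATED BY '\003'
    static final char OUTER_KEY_SEP = '\002';   // MAP KEYS TERMINATED BY '\002'
    // Assumed defaults for the inner map, as described in the post (not set in the DDL).
    static final char INNER_ENTRY_SEP = '\004';
    static final char INNER_KEY_SEP = '\005';

    static String encode(Map<String, Map<String, String>> outer) {
        StringJoiner row = new StringJoiner(String.valueOf(OUTER_ENTRY_SEP));
        for (Map.Entry<String, Map<String, String>> entry : outer.entrySet()) {
            StringJoiner inner = new StringJoiner(String.valueOf(INNER_ENTRY_SEP));
            for (Map.Entry<String, String> innerEntry : entry.getValue().entrySet()) {
                inner.add(innerEntry.getKey() + INNER_KEY_SEP + innerEntry.getValue());
            }
            row.add(entry.getKey() + OUTER_KEY_SEP + inner);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> outer = new LinkedHashMap<>();
        Map<String, String> first = new LinkedHashMap<>();
        first.put("innerkey11", "innervalue11");
        first.put("innerkey21", "innervalue21");
        outer.put("key1", first);
        Map<String, String> second = new LinkedHashMap<>();
        second.put("innerkey12", "innervalue12");
        second.put("innerkey22", "innervalue22");
        outer.put("key2", second);
        // One such line per row goes into the file that is loaded with LOAD DATA INPATH.
        System.out.println(encode(outer));
    }
}
```

The output contains control characters, so it is best redirected to a file rather than read on the terminal.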

Credits: this post talks about it, and it was brought to my attention by Steven Wong of Netflix.

Sunday, October 2, 2011

Email archiving and retention with Hadoop

Cloudera explores a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.

To read further...

Implications: you can set up a system for legal hold, email archiving and retention, and subsequent eDiscovery, all on open source scalable platform. And, it will likely work better than home-grown proprietary platforms, because Hadoop is designed for scalability and has been proven at world's largest companies.

Art: Winslow Homer : Woman and Elephant

Monday, September 26, 2011

FreeEed is close to running in Windows, also watch for Continuous Integration

I am adding Windows single-node processing, and it is going easier than I expected. Since Hadoop requires too much setup in Windows, I instead detect the Windows platform in the code, and take processing from beginning to end without Hadoop. There is a lot of useful stuff in FreeEed, such as parsing all file formats with Tika and culling with Lucene, so even without Hadoop there is much usability left.

What gives me especial pleasure is watching how every new code iteration is automatically tested with the Jenkins continuous integration server. It feels good to see that the code's build and correct operation are automatically tested. I will add more tests to the build, to better exercise all of the software's capabilities.

Sunday, September 18, 2011

Symantec: Files, Databases Overtake E-Mail in E-Discovery

Respondents gave a surprising answer to a question about how frequently various types of ESI are requested during legal and regulatory processes. Files and documents are requested in 67 percent of situations, followed by application and database records at 61 percent, and e-mail at 58 percent, they said. Microsoft SharePoint records are requested 51 percent of the time, while messaging formats such as instant messaging, texts, and BlackBerry PIN messages are needed 44 percent of the time. Data from social media trailed, being needed for 41 percent of ESI requests.

Continue reading...

Wednesday, September 7, 2011

FreeEed used to process the complete Enron data

From the beginning

To the end

The Enron data set is publicly available. In particular, EDRM provides this data in a convenient format. The data set was processed with FreeEed, and the results were made available for reference and for feedback. Each PST produced (1) a zip archive containing native files, the complete text of each email with all attachments, and exceptions, if any; (2) a CSV file with the metadata; and (3) a short summary report.

Here are some interesting statistics, which will tell users what to expect. Using the high-CPU EC2 machine, each gigabyte took on average one hour to process. The cost of processing was below $1 per gigabyte. The processing was done on one machine and took about 2 days. The time could have been shortened to under 4 hours by using 25-50 machines, but at this time we were interested in watching the process and debugging it, not in optimization.

While processing was going on, we were also fixing the bugs observed, mainly in the Tika parsers, and the Tika team fixed some bugs with a turn-around time of under one day. There is more work to do and more re-processing in sight, but the main take away: FreeEed is mature and stable and can be relied upon for processing. Now is the time to take it to the next level, by creating the Windows/Mac/Linux thin client and using Amazon EC2 for processing, which will make eDiscovery processing easily available for a non-geek user.

Thursday, September 1, 2011

The first Enron tests are in

The first 5 Enron mail boxes are in and can be found here.

The processing is done on Amazon machines, and the results are pushed to the S3 cloud, so that the transfer costs are 0. As planned, FreeEed just reprocesses all projects and puts the data in the cloud, and the links on the web site do not have to change.

For each project, you get a zip of native files, the metadata in CSV format, and a short text report.

ejusdem generis

I am reading this wonderful letter by Mike Godwin (PDF), which he wrote when he served as top lawyer for the Wikimedia Foundation, addressed to the FBI over Wikipedia's use of their seal.

I am mostly enjoying the style, but I am also looking up the terms, and I came across this, ejusdem generis, which is explained here and is used to interpret loosely written statutes. Well, amazingly enough this is the Talmud rule of prat-uklal (a specific example followed by a general category), and it is explained here!

Back to the letter, which is wonderful in style!

Tuesday, August 30, 2011

FreeEed processing is made verifiably stable

FreeEed relies on Tika to process emails. In fact, one code line tells Tika to extract all attachments in whatever formats they happen to be, and process these also, adding the extracted text to the total.

Well, today I found a bug in Tika (it was not closing those attachments) that led FreeEed to crash. But also today the nice Tika programmers Mike McCandless, Nick Burch, and Jukka Zitting fixed the bug! So FreeEed is happily churning those gigabytes of Enron PST on the EC2 machines now.

Sunday, August 28, 2011

FreeEed is ready to do Enron processing

The command-line working of FreeEed has been updated with the latest changes which resulted from GUI work. Now you can use the program equally well as a command-line utility and as a GUI tool. Why is this important? For any serious project, such as processing all of Enron data, you want to be able to script the work of FreeEed and integrate it into a larger whole.

The version 2.7.5 is uploaded to the FreeEed site, and it is ready to take on the Enron set (with Amazon machine, of course). And why is this so nice? -- Having FreeEed process a large set serves as regression testing: on every new release, we can re-process the set (which will grow by adding other data sources) and verify that the quality of processing did not go down and in fact improved.

The list of improvements can always be found here, but here it is for your reading pleasure:

Smaller FreeEed download
Capability to read remote resources as data source using URI notation. The URI syntax is documented, and the program takes you to the right web page for help. You can include ftp with user name and password, and a lot of other things - anything that is a valid URI and that the site where it resides will actually allow you to download. For example, if you want to process an Enron file from the EDRM site, you just give its URI as http://duaj3yp6waei2.cloudfront.net/edrm-enron-v2_bailey-s_pst.zip and FreeEed downloads the file.
Processing of dozens of archive formats: http://truezip.java.net/kick-start/no-maven.html
Processing of archives inside of archives recursively
Command-line running is restored and can be used for scripting large jobs
Option -enron to process Enron data set (specific test script)

What is next? Processing 'em on a Hadoop cluster, of course!

Sunday, August 21, 2011

On reading Judge Grimm's "FEDERAL RULE OF EVIDENCE 502: HAS IT LIVED UP TO ITS POTENTIAL?"

The first sentence sets the stage and excites the interest that will carry you through the 82-page article: "Nothing causes litigators greater anxiety than the possibility of doing, or failing to do, something during a civil case that waives attorney–client privilege or work-product protection."

However, as a technologist, I can't resist thinking that we are simply glorifying a technical problem and making a legal problem out of it. Imagine that computer search and analysis technology were able to be as precise or more so than the humans, would the problem exist at all? -- No! Or at least, not to that degree.

It is an old adage, but before Google, say, just 15 years ago, one would say, "Cataloging of all of the world's information, and making an answer to any question, even based on key words, available to anyone in under a second? What nonsense!" Nevertheless, today we expect that as a routine. More than that, Google's Page promises a brain implant that would whisper the answers to your thoughts.

So the challenge for FreeEed is clearly stated: after (1) making it work routinely with large data sets, such as Enron's email, for massive statistical verification, and (2) putting the processing into the compute cloud, making it available from any PC or tablet computer, the next step is (3) to automate the privilege review to make it usable and useful. Being open source and not a black box, it will inspire more trust, cooperation, and verification. Being a low-cost solution, it will be useful.

How to read data sources in eDiscovery

In addition to local files, FreeEed needs to read such resources as

http://
https://
ftp:// with user name and password

and so on. As it turns out, the regular Java URL classes do all that! Why this is not publicized by Oracle/Sun, I have no idea, but the following three lines are all that you need to read from anywhere on any network:

URL url = new URL(dir);
URLConnection con = url.openConnection();
BufferedInputStream in = new BufferedInputStream(con.getInputStream());

Hooray for Java, and FreeEed now has it - watch for the next release.
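
For completeness, here is a minimal self-contained sketch built around those three lines. The class name RemoteSourceReader and the method readAll are invented for this example and are not FreeEed's actual code. The sketch reads the whole resource into memory, which is fine for a demonstration but not for multi-gigabyte archives, where you would stream to disk instead.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

/** Hypothetical helper: reads everything behind a URL (http, https, ftp, file, ...). */
final class RemoteSourceReader {

    static byte[] readAll(String location) throws IOException {
        URL url = new URL(location);
        URLConnection con = url.openConnection();
        // try-with-resources closes the stream even if reading fails half-way.
        try (InputStream in = new BufferedInputStream(con.getInputStream());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass any URL on the command line, for example a file: or http: location.
        byte[] data = readAll(args[0]);
        System.out.println(data.length + " bytes read from " + args[0]);
    }
}
```

The same call works for a local file: URL, which is handy for trying it out without a network connection.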

Friday, August 19, 2011

Open source software for lawyers getting traction

Open source software for lawyers will be discussed at the upcoming ILTA conference; here is a link to an article by Evan Koblentz about it. Evidently it is popular, since it has been "slashdotted." FreeEed is mentioned there.

It all started with Evan Koblentz's ground-breaking article "Open Source Could Change the Future of E-Discovery".

Thursday, August 18, 2011

Google highlights trouble in detecting web-based malware

Google issued a new study on Wednesday detailing how it is becoming more difficult to identify malicious websites and attacks, with antivirus software proving to be an ineffective defense against new ones.

Continue reading...

Art: Henri De Toulouse-Lautrec - Two Knights in Armor

Wednesday, August 17, 2011

I am on Java 7 now!

OK,

it took me some going back and forth, and pulling out some hair, but it's done. I am on Java 7.

The trick was probably not in the JDK itself, but in the switch: change java and JAVA_HOME, then force your build environment and your IDE (NetBeans) to pick up the changes. In the IDE, I had to pretend to rename and refactor packages, then bring them back. Finally, I got it, so now my pom.xml lists both source and target as Java 7, that is,

source=1.7
target=1.7

and even minor language changes work, like this switch with String's

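switch (mycase) {
    case "aha!": {
        // handle "aha!" here
        break; // without break, execution falls through to the next case
    }
    case "uhu!": {
        // handle "uhu!" here
        break;
    }
}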

There may be more glitches, but based on my experience, you can resolve all.
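
If you want to check that your environment really compiles Java 7 syntax, a tiny class like the one below is enough. It is only a sketch; the class name StringSwitchDemo and the return values are made up. It will not compile with source set to 1.6 or lower. Note that a switch on a null String throws a NullPointerException, so guard against null if the value can be missing.

```java
/** Hypothetical demo: compiles only with Java 7 or later because of the String switch. */
final class StringSwitchDemo {

    static String react(String mycase) {
        switch (mycase) {
            case "aha!":
                return "discovery";
            case "uhu!":
                return "agreement";
            default:
                // Any other string ends up here; null would throw before reaching it.
                return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println(react("aha!"));
        System.out.println(react("something else"));
    }
}
```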

Tuesday, August 16, 2011

Houston eDiscovery Masters Series

It was awesome to meet so many top lawyers and eDiscovery consultants -- at happy hour! Many of them knew about FreeEed and could spell it right away. Those who did not, understood the idea of open source eDiscovery with all its implications in a moment.

Monday, August 8, 2011

Free Computer Forensics Toolkit (SIFT) from SANS

Below is a reprint of the litsupport group announcement. For FreeEed, this comes at a very opportune time: eDiscovery needs integrated forensics capabilities, and the Linux-based SIFT looks like a perfect match.

August 04, 2011, Washington DC

The SANS Institute reported today a comparison of the capabilities of the recently-upgraded SIFT forensics toolkit with the most popular commercial forensics tools. Although the commercial tools maintain advantages over SIFT in some areas, the free SIFT tool exceeds the capabilities of the commercial tools in other areas. “Even if SIFT cost tens of thousands of dollars,” says Alan Paller, director of research at SANS, “it would be a very competitive product.” At no cost, it should be part of the portfolio in every organization that has skilled forensics analysts.

The comparison:

SIFT Workstation 2.1 Guidance Software EnCase v6 AccessData FTK 3
Memory Analysis X Limited Limited
Super Timeline Generation and Analysis X
Mobile Device Support Limited
Automated Windows Registry Parsing X Limited Limited
File System Parsing X X X
Network Forensics X
Malware Analysis X

Some testimonials about the SIFT Workstation

"The SIFT Workstation has quickly become my "go to" tool when conducting an exam. The powerful open source forensic tools in the kit on top of the versatile and stable Linux operating system make for quick access to most everything I need to conduct a thorough analysis of a computer system." -Ken Pryor, GCFA, Robinson, IL Police Department

"Configuring a forensic analysis platform on your workstation can take a lot of time, and installing/setting up applications can be a pain at times. The SANS SIFT workstation has done the heavy lifting already with a wealth of useful, relevant tools - things like volatility, sleuthkit (with autopsy and ptk), pyflag and (my personal favorite) log2timeline. It gives the best of both worlds, both CLI and GUI. The best thing is, you don't need a dongle or have to worry about licensing, since it's all free/open source! SIFT is an excellent platform for analysis and I have found it to be very beneficial during investigations." -Frank McClain, GCFA, GCIH, CHFI

For more information regarding the SIFT Workstation 2.1 release or to download it, the link you should use is: http://computer-forensics.sans.org/community/downloads

Rob Lee, Lead Digital Forensics and Incident Response -
SANS institute rlee@sans.org
703-585-0630
801 4th Street SE
Washington DC 20003

Sunday, August 7, 2011

The Potential Market Size of Big Data in coming 2-3 years?

Michael Segel, Solutions Architect at Nokia, had this to say:

Its big. :-)

Ok seriously... relatively speaking, the potential is unlimited.
Even though you have a finite number of companies, the permutations and multiple projects that would fit in to this 'big data' space make the opportunity huge.

Focusing on Hadoop, you have HDFS for storage, Map/Reduce for parallel processing, HBase/Hive for persistence, and then other components in the ecosystem like Oozie for control, Flume, Scribe for data acquisition... All of this components allow a company to look at their information in a different light. You can Use HDFS for storage, Add in HBase and not use map/reduce. You can use Map/Reduce on small data sets that are computational intensive, so you don't need HBase/Hive.

Think of Hadoop as being a new tool kit that lets you do things you couldn't do before.

This is why its a disruptive technology.

The reason you are seeing a huge growth in Hadoop and derivative technology is that its currently transitioning from leading edge to mainstream. Enough early adopters have shown success that mainstream fortune 1000 companies are starting to look at it.
Add to this start-ups and SMB where the low cost and shortened runways make adoption of Hadoop easier.

Your limitations are human ones. Potential adopters vision and budget are your only real constraints.

-Mike

I think it summarizes this so well that I could not resist re-posting.

Friday, August 5, 2011

Malpractice Suit Targets Quality of BigLaw’s Temporary Lawyers

A legal malpractice suit filed against McDermott Will & Emery raises questions about outside lawyers hired to help screen documents for clients.

An amended malpractice suit filed last week by J-M Manufacturing, also known as JM Eagle, claims McDermott’s outside contract lawyers “negligently performed their duties” while screening documents, the Wall Street Journal (sub. req.) reports. The newspaper says the suit “is seen in the industry as an important case concerning the quality of work performed by a growing cadre of temp lawyers who are paid as little as $25 to $30 an hour to review documents.” To continue reading, click here.

My comment: Interesting. Mind-numbing work, and the computers should be better at such routine tasks. Once the first stage of eDiscovery with the open source FreeEed is complete, the next one is exactly that - help for automated review. Encouraging article.

FreeEed news: V 2.5, distribution and manual

We are working hard on improving FreeEed, and it's moving well, thanks in part to our users' feedback. We also got an official tester on the team, so that should improve the quality of our releases.

You can always find the latest version on the FreeEed web site, and this page now has a manual with screen shots.

What's next? Firstly, we want to start processing the Enron data, posting the results on our web site. Secondly, we are rushing to allow you to run FreeEed on any computer, be it Windows, Mac, or Linux, while harnessing the computational power of Amazon for processing. Can't wait for this to happen!

Thursday, August 4, 2011

NameNode HA from HortonWorks

While in Santa Clara, I attended the "NameNode HA" presentation by Sanjay Radia, and was very much impressed by it. Back in Houston, I retold what I had heard at the Houston Hadoop Meetup, and was at a loss to explain exactly what is new in the HortonWorks project. I asked Sanjay, who graciously provided the details. With a little editing, it goes as follows.

To answer your question, Mark, there were no previous offerings like that, and certainly not from Cloudera. Cloudera does not have such an offering, and that is why they are working with us.

The only solution currently in use is Avatar NN, which is used at Facebook for doing manual failover.

So, what is different? Firstly, we deal with all the corner cases for fencing. Secondly, our NN HA is hot, not cold or warm. And thirdly - the NN does not have the necessary states such as Active and Standby - you can try to fake the safe mode to do that.

Our doc on the jira gives the details.

Art: Raja Ravi Varma - A Student

Tuesday, July 26, 2011

Interview at Facebook

Took photo as a "proof of concept" or rather as evidence that can be discovered :)

Monday, July 25, 2011

Visiting Bay Area Hadoop Meetup

http://www.meetup.com/hadoop/events/16805556/

Look at their numbers! 2235 total, and 291 attending.

Houston should beat that! (So far we have 23 total and 5 attending).

Cheers

Friday, July 22, 2011

History in FreeEed

The open source eDiscovery project, FreeEed, now has a concept of history. History is shown in a separate window that can be opened or closed at will. You can also erase it and start with a clean slate.

History is real-time, and it frees up the user interface to do other things while the software is crunching the information.

The code is currently in "Branch 2" and will be part of the 2.5 release.

Thursday, July 21, 2011

A use-case for FreeEed: When your business model involves litigation

Imagine that you are in the business of buying other businesses. With the business you may acquire its assets, its debts, and its legal obligations. Litigation becomes a way of life, part of regular business.

The big part of litigation, of course, is eDiscovery. Now imagine that you have that part completely under your control. No payments to the third-party vendors, no timing limitations on their side. You can experiment with your litigation strategies at will, both for plaintiff and defendant situations.

That is where FreeEed comes in. You can run your eDiscovery and be in control of it. The FreeEed team, in turn, will do everything in its power to help you advance your goal, win or settle the cases, and grow your business.

Art: Fernando Botero - Man Reading a Paper

Tuesday, July 19, 2011

SQL pagination with complex query

It is well known that SQL has an OFFSET parameter, so if you need pages, you could do something like the following

SELECT * from TABLE1 LIMIT 100 OFFSET 100

The problem is, this query is very inefficient. Since many databases today are big (believe it or not, I saw a 2008 blog post saying "since many databases today are small"), inefficiency of OFFSET, which results in fetching all rows before it, is unacceptable.

The next step is to get specific and use a unique index in TABLE1 (hope you have one), so you can sort and seek on this index, which will be efficient. You can then do something like the following

SELECT * from TABLE1 where TABLE1.id > 2500 order by TABLE1.id LIMIT 100,

assuming that the last id value you saw was 2500.

What do you do when you do not have an index that you can use, but instead are doing some join with multiple indexes? For example,

SELECT * from TABLE1 JOIN TABLE2 ORDER BY TABLE1.ID1, TABLE2.ID2 LIMIT 100

The query above is actually already half the solution. You only need to add this condition

WHERE
TABLE1.ID1 > last_value_id1 OR
(TABLE1.ID1 = last_value_id1 AND TABLE2.ID2 > last_value_id2).

You can continue and do this with as many indexes as you would like. This question and the solution came up in the case of 4 indexes, and it worked great and very efficiently.
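
The pattern generalizes mechanically: for sort columns c1, c2, ..., ck the condition is (c1 > ?) OR (c1 = ? AND c2 > ?) OR ... and so on, one branch per column. Here is a minimal sketch that generates this condition with JDBC-style placeholders. The class name KeysetPredicate is invented for this example, and the column names are assumed to come from your own code, never from user input, since they are concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds the "rows after the last one seen" condition for N sort columns. */
final class KeysetPredicate {

    static String build(List<String> sortColumns) {
        List<String> branches = new ArrayList<>();
        for (int i = 0; i < sortColumns.size(); i++) {
            List<String> terms = new ArrayList<>();
            // All earlier columns must equal the last values seen...
            for (int j = 0; j < i; j++) {
                terms.add(sortColumns.get(j) + " = ?");
            }
            // ...and the current column must be strictly greater.
            terms.add(sortColumns.get(i) + " > ?");
            branches.add("(" + String.join(" AND ", terms) + ")");
        }
        return String.join(" OR ", branches);
    }

    public static void main(String[] args) {
        // Condition for a query ordered by TABLE1.ID1, TABLE2.ID2.
        System.out.println(build(Arrays.asList("TABLE1.ID1", "TABLE2.ID2")));
    }
}
```

The placeholders are bound left to right with the last values seen, so k sort columns need k(k+1)/2 parameters: 3 for two columns, 10 for four.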

Art: Leonardo Da Vinci - Crossbow Machine

Monday, July 18, 2011

Installing Hadoop on Natty Narwhal

Here are the notes that will save you some trouble

To install Java on Natty, use this link
http://www.multimediaboom.com/how-to-install-java-in-ubuntu-11-04-natty-narwhal-ppa/

That is because you need a screen to accept the Java license.

Start by installing Cloudera Hadoop in pseudo-distributed mode for Ubuntu lucid, not natty.

Formatting the node must be done as user hdfs, or else permissions won't work.

Sunday, July 17, 2011

FreeEed gets graphical user interface

Version 2.0 of FreeEed is out!

It sports a clear GUI (graphical user interface) where the user can set parameters and run the processing. The GUI is intentionally a thin wrapper around the command-line utility, to keep the tool simple and efficient.

For all the rest,
Let Lion, Moonshine, Wall, and lovers twain
At large discourse, while here they do remain.

(Shakespeare, Midsummer Night's Dream)

Friday, July 15, 2011

Steve Loughran on Hadoop commercial support options

Steve Loughran presents a great summary on the hadoop-core discussion group, worth repeating:

The picture is a bit confusing

Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release.

So those selling commercial support are:

Cloudera

HortonWorks

MapR

EMC (reselling MapRTech, but had announced their own GreenPlum (free and commercial))

IBM BigInsights (free and commercial)

DataStax

+ Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?)

Photo of Steve taken from LinkedIn

Tuesday, July 5, 2011

Houston Hadoop Meeting - July 2011

Howdy, all,

we had two new guests, Carter Cole and Donald Sutton, both of whom talked about very interesting things.

Carter is into SEO, and his node.js - based SEO plugin for Chrome is used by over thirty thousand people. He wants to take it to the next level by using the power of Hadoop.

Donald Sutton brought the news of the GreenPlum Hadoop appliance, which minimizes the work of the Hadoop cluster administrator.

Mark Kerzner presented his FreeEed, Hadoop-based open source software for eDiscovery, and after the meeting, at the coffee shop across the street, a good time was had by all.

FreeEed Slides for upcoming V2.0 presentations

Slides to be used at the Houston Hadoop Meetup today, and later for Women in eDiscovery in Houston, August meeting.

FreEed - Open Source eDiscovery


Monday, June 27, 2011

Toward FreeEed V2.0

The following are planned and for the most part already implemented for V2.0:

Graphical User Interface
Exception processing
All logs in a separate "logs" directory
Clear separation between command-line and parameter_file parameters
FreeEed.org web site with the documentation

Once the GUI is completed, the release will be ready.

Friday, June 24, 2011

Open source Linux OCR

Here is a good comparison link to start with. From my personal experience using tesseract, you need to play with the image resolution (if you create those images), because it makes a difference in accuracy.

For testing, one can use this link, http://code.google.com/p/isri-ocr-evaluation-tools/

Digital Forensics with Open Source Tools

By Ken Pryor

"I was excited awhile back to learn Digital Forensics with Open Source Tools was being written and even more pleased when I heard who its authors were. I worked almost exclusively with open source tools while beginning my foray into the digital forensics world and happily continue using them today, so I knew this book would be of great interest to me. I had a general idea of what I thought the book would be like, but what I found in it was so much better than I expected. This book is an excellent introduction to open source forensic tools, but in many ways it's also a "how to do forensics" book. In the interest of full disclosure, I did receive a review copy of this book without cost to me, although I did buy a second copy to keep at my office as well."

Continue reading...

Friday, June 17, 2011

Open Source in eDiscovery – Discussion Continues

A commentary/opinion published today in Law Technology News is titled “The Cost of Open Source in eDiscovery.” It starts by saying that “There has been a lot of talk about open-source software and the dramatic effect it may have on e-discovery.” The author, attorney Sean Doherty, makes a number of good points. Let us see what open source can learn from them, and in doing so, let us also analyze the other points of view.

Any discussion is good, because it makes one wise, but what makes this discussion especially interesting is the fact that the original article by Evan Koblentz to which Doherty refers was itself published only three days ago. A "lot of talk" takes on a special meaning when applied to just three days.

After stating that the effect of open source may be limited to more technology-savvy lawyers, or that it may fit well into the eDiscovery market as it becomes commoditized, or that it may even go as far as bringing eDiscovery to every lawyer who needs it, Doherty makes an important point: open-source and free are not one and the same.

The meaning of the word “free” in the context of open source has been discussed since its inception in 1998. The usual definition given is “free like in freedom, not like in free beer.” That still depends on whom you ask, but in my opinion it is fair to say that open-source is free in the sense that the source code for the software is freely available, although its usage is regulated by the license under which it is released. It may be a license that requires that products that derive from it, or build on top of it, are also open-sourced, or it may be less restrictive. As you can see, “less restrictive” means that you are more free to use it. In general, open-source publishes its source code and specifies what one can do with it.

In practice, however, one can charge money for open source products, and certainly for services based on them and for the support offered with them. The meaning of “free” is thus a general argument about open source, not one specific to eDiscovery.

Specific to eDiscovery (and we will use the FreeEed open-source tool as an example), you need to hire someone to run it for you, especially since lawyers usually do not deal with technical matters. That, however, is true for any eDiscovery and any technology used in the practice of law, as evidenced by the existence of paralegals and IT departments in law firms. Still, the software being free might make a difference.

You also need to take care of upgrades, although here open-source may come out a winner, because its upgrades are free and frequent. Unless, as Doherty mentions, you use software-as-a-service. Here too there is nothing specific to open-source. In particular, FreeEed is designed to work on the Amazon cloud, thus being software-as-a-service and also providing the scalability that comes with the use of compute clouds.

The next important point is licensing. There are a number of open-source licenses out there, and there is an ongoing disagreement between proponents of the Apache 2.0 license used by Apache and of the GPL V2.0 license used by Linux, for example. However, lawyers do not shy away from licensing issues. In fact, they will probably have an advantage in this over a lay business person. FreeEed uses the Apache V2.0 license, because the software packages on which it builds (Hadoop for cluster processing, Lucene for text search, and Cassandra as a scalable fast NoSQL database) all use the Apache V2.0 license.

The next argument is that “Open source tools often require a level of sophistication that exceeds that needed for plug-and-play and click-to-receive software.” This may or may not be true. Some tools that are designed for programmers (a web crawler tool called Nutch comes to mind) expect one to be able to configure and run Linux applications. However, there are open-source applications that are quite easy to use. Consider such examples as OpenOffice, a Microsoft Office replacement, Ubuntu Linux for desktop use, Red Hat Linux for the enterprise, and the Firefox browser. FreeEed, for example, will do well to offer an easy-to-use graphical interface. This, in fact, was the first suggestion by Ron Chichester when he came on the FreeEed team, and the implementation has already begun. It will all depend on hard work and user feedback, not on any inherent limitations of closed-source or open-source software.

Next, “End point, it's one thing to be locked into a proprietary code base and another to be locked in by the developers and administrators to your open source tools.” Assuming that the tool is easy to use, its adoption is no different from the closed source: for closed source, you have to evaluate the vendor, and for open source you have to evaluate the developer community and/or the commercial company providing support. Commercial eDiscovery vendors may be slow to respond, they may not care about you specifically, and then they may be bought out, or even go out of business. One extra option of open-source is that your IT departments can participate in the development, offer their contributions, and if implemented for in-house processing, they have an additional guarantee that the code will always be available in the eventuality that they have to continue using and developing it themselves.

The final argument is “Whether open source or proprietary code, you get what you pay for. And that value is often found in the service and support from manufacturers, not the software.” It is true that you get what you pay for; however, sometimes it costs less. And sometimes excellent products, like the ones mentioned above, are free. Google and Bing search are examples of free but excellent services. The service and support from manufacturers is another story. Any company is free to offer its support for any open-source tool, and SHMsoft is already offering this support for FreeEed. Any eDiscovery vendor can use FreeEed or any other tool of their choosing. Here everything will depend on the execution.

Wednesday, June 15, 2011

Search in eDiscovery

After you process all the files you have collected, your next step is searching through the data and slicing and dicing it in every way possible. This may be done at different times by both the defendant and the plaintiff. Case analysts need to try and answer different "what if" questions and scenarios. But with millions of files, SQL solutions become terribly slow. Anybody who has seen that, please raise your hand.

So here is how I formulated this question on the HBase and Cassandra user groups, because I needed an answer for FreeEed, my open source eDiscovery tool:

Imagine I need to store, say, 10M-100M documents, with each document having say 100 fields, like author, creation date, access date, etc., and then I want to ask questions like "give me all documents whose author is like abc**, and creation date any time in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions, matching a list of some keywords."

What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan and compare of every record?

Well, the very nice and knowledgeable people on these groups came up with a number of solutions:

I can't claim its the best, but I'd say solar or katta (Joey Echeverria)

'Add search to HBase' - HBASE-3529 is in development. (Jason Rutherglen)

I, for one, am interested in learning more about elasticsearch with HBase after reading the article over at StumbleUpon (Matt Davies)

I think this's the key line: "ElasticSearch is the search analogue to HBase that frees us from some restrictions that Solr imposes". It is quite true however if search is inside of HBase ones gets the same thing. Solr does have serious limitations in terms of scaling etc. I think ES has done a great job there, though this could have been done with Solr just as easily, eg, upgrade Solr with the same functionality and remove the need for schemas. Solr does allow schema-less, with eg, dynamic fields. (Jason Rutherglen)

I'd say give Lily a spin. Currently, we rely on Solr for search. In the next few months, we'll take a good look at "HBase-native" secondary indexes as well (Steven Noels)

There's also DrawnToScale (M. C. Srivas)

I store over 500M documents in HBase, and index using Solr with dynamic fields. This gives you tremendous flexibility to do the type of queries you are looking for -- and to make them simple and intuitive via a faceted interface. However, there was quite a bit of software that we had to write to get things going, and I can neither release all of it open source, or support other people using it. If I had to start again, I would seriously look at solutions like elastic search and lily (David Buttler)

Check out Solandra (Jake Luciani)

Thank you, all! Now I need to investigate all suggestions and select the one most fitting for eDiscovery.
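As a point of comparison for the suggestions above, the "plain scan and compare of every record" option from the question can be sketched in a few lines of Scala. This is only a minimal illustration under stated assumptions: the `Doc` record, its four fields, the `matches` and `scan` names, and the sample data are all hypothetical, and "matching a list of some keywords" is read here as "contains at least one of the keywords".

```scala
import java.time.LocalDate

object ScanSketch {
  // Hypothetical record: four fields stand in for the ~100 fields of a real document.
  final case class Doc(author: String, created: LocalDate, accessed: LocalDate, text: String)

  // One query is a conjunction of per-field conditions plus a keyword list.
  def matches(doc: Doc, authorPrefix: String, keywords: Seq[String]): Boolean = {
    val body = doc.text.toLowerCase
    doc.author.startsWith(authorPrefix) &&
      doc.created.getYear == 2010 &&
      (2010 to 2011).contains(doc.accessed.getYear) &&
      keywords.exists(k => body.contains(k.toLowerCase))
  }

  // Plain scan: every record is compared, so the cost grows linearly with the collection.
  def scan(docs: Seq[Doc], authorPrefix: String, keywords: Seq[String]): Seq[Doc] =
    docs.filter(matches(_, authorPrefix, keywords))

  // Made-up sample data, for illustration only.
  val sample: Seq[Doc] = Seq(
    Doc("abcdef", LocalDate.of(2010, 3, 1), LocalDate.of(2011, 5, 2), "Quarterly energy report"),
    Doc("abc", LocalDate.of(2009, 1, 1), LocalDate.of(2010, 1, 1), "Energy memo"),
    Doc("xyz", LocalDate.of(2010, 6, 6), LocalDate.of(2010, 7, 7), "Energy budget")
  )
}
```

Calling `ScanSketch.scan(ScanSketch.sample, "abc", Seq("energy"))` keeps only the first sample document. At 10M-100M documents every such query touches every record, which is exactly why the index-based suggestions (Solr, elasticsearch, Lily, Solandra) are worth investigating.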

An open-source Hadoop alternative from LexisNexis and FreeEed

As if by coincidence, on the day after the LTN article by Evan Koblentz which mentioned FreeEed, LexisNexis announced that it will open-source its Hadoop alternative for handling Big Data:

"LexisNexis announced today that it will open-source its High Performance Computing Cluster (HPCC) technology, as well as offer an enterprise version with commercial support. The company is positioning HPCC Systems, developed internally by its Risk Solutions unit, as an alternative to Apache Hadoop. A virtual machine for testing purposes will be available soon, and code will be available in a few weeks." For fuller announcement, go here.

As a first impression, what are the major comparison points?

LexisNexis has been using its technology for a while and has marketing clout to match, but it announced only plans to make the VM "available soon" and code "in a few weeks." One wonders if this is a reaction to the momentum that FreeEed has been gaining. FreeEed, on the other hand, is already out on GitHub;

LexisNexis is essentially a closed-source company, so one wonders how open-sourced the offering is really going to be. But they may be successful - look at Microsoft's open-source contributions. In LexisNexis's own words, "Only the core technology is being released, LexisNexis' own data linking techniques aren't being released, nor are its data sources." In contrast, FreeEed is pure open source (with commercial support options), and people are already investigating using it in ways beyond eDiscovery. This illustrates the flexibility of an open source offering.

LexisNexis has Roxie, a system for query and data warehousing, but FreeEed will have the same based on Cassandra.

LexisNexis sports ECL (Enterprise Control Language), but Cassandra has CQL (Cassandra Query Language).

LexisNexis's "HPCC team has been working with Amazon Web Services to make sure the product work well on AWS servers," but the FreeEed team has planned on the use of EC2 from the start and is actively working on it now.

The two are not exactly competitors at this point: LexisNexis releases the technology for high performance cluster computing and its risk handling applications, but they are close in their approach to open source and to handling Big Data, so it is worth watching.

Tuesday, June 14, 2011

"Open source could change the future of eDiscovery"

FreeEed is being profiled on LAW.COM

Sunday, June 12, 2011

Implementing design patterns with Scala: Loan Pattern

Programmers are familiar with design patterns - these are best practice solutions to common problems. Then one often adds, almost apologetically, "you know, the code is different in every case, but you get to re-use the idea!"

Well, why would the code have to be different, and why can't you re-use even the code, and not only the idea? That's because in Java you cannot easily pass functions around. However, in Scala you can. Take, for example, the "loan pattern." It lets you borrow a resource and makes sure you close it later. Here is what I mean. You may have thousands of little functions that open the SQL connection, use it, then close it, like in this picture:

This screen image is to show off the syntax coloring of Scala code in NetBeans, but here is the same code in plain words:

import java.sql.{Connection, DriverManager, SQLException}
import java.util.Date

def doSqlStuff(): Unit = {
  var conn: Connection = null
  try {
    val url = "jdbc:mysql://localhost:3306/"
    val dbName = "beren"
    val driver = "com.mysql.jdbc.Driver"
    val userName = "beren"
    val password = "beren"
    Class.forName(driver).newInstance()
    conn = DriverManager.getConnection(url + dbName, userName, password)
    val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
    stmt.setString(1, "" + new Date())
    stmt.executeUpdate()
  } catch {
    case e: SQLException => {
      e.printStackTrace()
      println("SQLException: " + e.getMessage)
    }
    case ex: Exception => {
      println("Exception: " + ex.getMessage)
    }
  } finally {
    // conn is still null if getConnection failed, so guard the close
    if (conn != null) conn.close()
  }
}

But only three lines are really important:

val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
stmt.setString(1, "" + new Date())
stmt.executeUpdate

Nevertheless, because of those three lines I am forced to replicate the boilerplate over and over. If I could somehow pass these lines to my boilerplate code, it would be so much nicer!

And indeed, it is better when you do it like this. Here is the boilerplate. You now promise to pass it the function that does the real work.

def doSqlStuffBoilerplate(sqlDoer: Connection => Int): Unit = {
  var conn: Connection = null
  try {
    val url = "jdbc:mysql://localhost:3306/"
    val dbName = "beren"
    val driver = "com.mysql.jdbc.Driver"
    val userName = "beren"
    val password = "beren"
    Class.forName(driver).newInstance()
    conn = DriverManager.getConnection(url + dbName, userName, password)

    // the borrowed connection is handed to the caller's function
    sqlDoer(conn)

  } catch {
    case e: SQLException => {
      e.printStackTrace()
      println("SQLException: " + e.getMessage)
    }
    case ex: Exception => {
      println("Exception: " + ex.getMessage)
    }
  } finally {
    // conn is still null if getConnection failed, so guard the close
    if (conn != null) conn.close()
  }
}

Here are the three lines that you will pass as a function,

def sqlDoer(conn: Connection): Int = {
  val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
  stmt.setString(1, "" + new Date())
  stmt.executeUpdate
}

and this is how to call the method:

def doSqlStuffBetter(): Unit = {
  doSqlStuffBoilerplate(sqlDoer)
}

We tried this idea in Java, but the code was unreadable, and we went back to the old way of copying and pasting. With Scala, you can abstract and CODE the design pattern. (Even better is to in-line the function when passing, but I will do that later.)
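The in-lining mentioned above might look roughly like the following. This is a sketch and not the code from the project: `withResource`, `FakeConnection`, and `demo` are hypothetical names, and the fake connection stands in for a real `java.sql.Connection` so that the example runs without a database.

```scala
object LoanSketch {
  // Hypothetical stand-in for java.sql.Connection, so the sketch runs without a database.
  final class FakeConnection extends AutoCloseable {
    var closed = false
    def executeUpdate(sql: String): Int = {
      require(!closed, "connection already closed")
      1 // pretend exactly one row was inserted
    }
    override def close(): Unit = { closed = true }
  }

  // The boilerplate, written once: lend the resource to `body`, then always close it.
  def withResource[R <: AutoCloseable, A](resource: R)(body: R => A): A = {
    try { body(resource) }
    finally { resource.close() }
  }

  // The function is written in-line at the call site instead of being named separately.
  def demo(): (Int, Boolean) = {
    val conn = new FakeConnection
    val rows = withResource(conn) { c =>
      c.executeUpdate("insert into stuff (id) values ('x')")
    }
    (rows, conn.closed)
  }
}
```

The second parameter list is what makes the call site read like a built-in control structure: the block in braces is the function being passed, and the connection is closed whether or not that block throws.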

Tuesday, June 7, 2011

EddUpdate talks about FreeEed, the first open source tool for eDiscovery

Mark Kerzner of FreeEed (https://github.com/markkerzner/FreeEed) released version 1 of a free open source e-discovery processing tool today. Aimed at the do-it-yourself crowd, law firms looking to bring data processing in-house, and vendors seeking a way to process client data license-free, it provides an egalitarian alternative to the various solutions already on the market.

Continue reading...

Monday, June 6, 2011

Houston Hadoop Meetup #3

Vikram Oberoi, a Big Data engineer from Cloudera, presented Hadoop use cases, explaining when and why you would use Hadoop. Here is Vikram's presentation, which he graciously provided for us - and note that he gave it a Houston-specific spin for the energy and medical applications.

Thanks to Erin Mynatt, also of Cloudera, for bringing Vikram from California and for coming from Denver, as well as for sponsoring the pizza & beer afterwards.

Thanks also to John Bland II for joining the group just in time. By the way, John is doing some awesome graphics work for CNN, Fox, etc., so check him out!

Thursday, June 2, 2011

FreeEed processes its first Enron PST file

Version 1.0.2 fixes processing of PST and loose EML files. Thanks to the latest fixes by the Tika community, unusual email formats are processed unimpeded. When processing emails, Tika extracts the text from all the attachments, so a single line of code does the complete email processing.

These are the benefits of building on top of other open source tools and communities.

FreeEed is an open source tool for eDiscovery that can be downloaded on GitHub here.

Tuesday, May 31, 2011

FreeEed V1.0 Released!

Latest changes:

All processing examples work;

Testing is automated with JUnit;

All tests run;

PST coming very soon.

Cause for celebration!!!

eDiscovery open-source tool is available for all.

FreeEed Roadmap

As the 1.0 release of FreeEed draws close, the roadmap is becoming clear:

1.0 - basics of eDiscovery processing, for small or big projects, with Java/Hadoop;

2.0 - backed by a NoSQL database, which scales without a problem to thousands of projects and billions of files, with document-oriented CouchDB being a natural fit, still with Java/Hadoop;

3.0 - text analytics, adding Natural Language Processing to legal review, written in the Scala language, which is a better fit for text processing than Java, and with Mahout;

4.0 - data collection, forensics, preservation.

When? 1.0 - in the next few days; 2.0 - two more months; 3.0 - two more months.

After that? - improvements, improvements, improvements.

Friday, May 27, 2011

Beautiful example of Scala code

This is directly from the "Programming in Scala" book, but it illustrates the language so well.

In calculating the greatest common divisor, or gcd, one can write code like this:

def gcd(x: Long, y: Long): Long =
  if (y == 0) x else gcd(y, x % y)

and the beauty of it is that it looks exactly like the mathematical formula from which the algorithm is derived.
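To watch the recursion unfold, here is a small sketch that wraps the same definition in an object. The `GcdSketch` wrapper, the `@tailrec` annotation, and the sample numbers are additions for illustration and are not from the book.

```scala
import scala.annotation.tailrec

object GcdSketch {
  // Same definition as above; @tailrec asks the compiler to verify that the
  // recursive call is in tail position, so it compiles down to a loop.
  @tailrec
  def gcd(x: Long, y: Long): Long =
    if (y == 0) x else gcd(y, x % y)

  // gcd(48, 18) -> gcd(18, 12) -> gcd(12, 6) -> gcd(6, 0) -> 6
  val example: Long = gcd(48, 18)
}
```

Each step replaces the pair (x, y) with (y, x mod y), and the second argument shrinks until it reaches zero.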

HOWEVER

The biggest complaint against Scala programmers, and especially against newbie Scala programmers (like myself), is that they take the best Scala example and match it against the worst Java code. So, to be fair, here is the same code in Java:

public long gcd(long x, long y) {
    if (y == 0) return x; else return gcd(y, x % y);
}

Is it that much different? Perhaps not, and all we have achieved is to give an example of things to come. Scala is closer to a mathematical formula, but so far that's it. Well, as everyone knows, the biggest advantage is that in Scala the semicolon; is optional!

Art: William-Adolphe Bouguereau : La leçon difficile (The difficult lesson)

Sunday, May 22, 2011

Upcoming Houston Hadoop Meetup on June 6

This is a great chance to hear about Hadoop and other Big Data technologies, when you would and when you would not use them - from Cloudera's engineer Vikram Oberoi.

Vikram is a native Houstonian who left Texas for Silicon Valley, first to study and then to work, but we won't hold it against him for two reasons: he is working on Big Data problems, and he is coming back to tell us about it.

While the world is on fire about Big Data and Hadoop, Houston is for the most part dormant in this regard, with the notable exception of medical research.

The Big Data technologies originated at Google, Amazon, and Facebook. Lately, however, they are being used by a large number of companies, and in fact your company may not be competitive without them; at least that is what McKinsey analysts are telling us. Therefore, come and hear.

Afterward, Cloudera and Houston's own SHMsoft invite y'all to Barry's Pizza on Richmond.

Art: Arthur Rackham : The Lion, Jupiter and the Elephant, illustration from Aesop's Fables, published by Heinemann, 1912

Wednesday, May 18, 2011

Big Data, Legal, and Tolkien’s Seeing Stone

How does Big Data relate to legal? See this article on Forbes.

Apache Hadoop takes top prize at Media Guardian Innovation Awards

The Apache Hadoop open source software project won the top prize at Thursday night's 2011 MediaGuardian Innovation Awards, the Megas.

Described by the judging panel as a "Swiss army knife of the 21st century", Apache Hadoop picked up the innovator of the year award for having the potential to change the face of media innovations.

The judges felt the project had the potential as a greater catalyst for innovation than other nominees including WikiLeaks and the iPad.

Continue reading...

Tuesday, May 17, 2011

Big data’s potential for businesses - By McKinsey analysts

Data are now part of every sector and function of the global economy and, like other essential factors of production such as hard assets and human capital, much of modern economic activity simply could not take place without them. The use of big data — large pools of data that can be brought together — will become the key basis of competition and growth for individual firms, enhancing productivity and creating significant value for the world economy.

Continue reading...

Art: Claude Oscar Monet - At Large- Open Sea

Friday, May 6, 2011

Houston Hadoop Meetup #3 with Cloudera is in the works

Cloudera's own Doug Cutting or Jeff Hammerbacher will present at Houston Hadoop Meetup on June 6. Houston Hadoopers are all excited! News of agenda and venue to follow...

Wednesday, May 4, 2011

Houston Hadoop Meetup #2

The meeting was on May 2, 2011, and it was about SQL/NoSQL and Hadoop. Here are the presentation slides. From now on, we will always publish the main points of each meeting as slides, possibly in advance.

News from the Hadoopers who were present:

Hal Martin is playing with Hadoop code and will prepare a presentation for July

Jeremy R. Easton-Marks is building his first 5-node cluster at the Cancer Center at BCM

Marcel Poisot is the one who registers us in the library - thank you, Marcel!

Kumar got a Cloudera Hadoop certification

The two suggested themes for a June meetup are

Practical uses of Hadoop, and/or

Failing memory (mine) - please fill in

See y'all in June!

Tuesday, April 5, 2011

Metadata according to Judge Shira Scheindlin

In her recent decision, Judge Shira Scheindlin did one more service for eDiscovery - she outlined the standard metadata fields to be produced. That was a joy - these fields went straight into FreeEed.

Now it has standard fields from this list, and all the non-standard, or application-defined fields, following the standard ones.

I had a spare hour (not really, I should have been working on the book), but okay, and I added these fields. It never stops surprising me how fast you can add features if only you use the right tools. Look up the code :), it is committed. Now I only need to fill the fields out, but that too should be a breeze.

The first Houston Hadoop Meetup

It was definitely a success, all due to the wonderful and colorful people who came.

Nick Popov described how using a super-size database like HBase could have helped his previous company, a mortgage conglomerate that needed to process thousands of events per second - not foreclosures :) as one of those present suggested, but all sorts of events.

Edward Wiener wanted to know what Hadoop and BigData are all about.

Helen Jiang sees definite opportunities for BigData at her current company, Structure Group, and is looking for ways to deepen her knowledge and also to introduce those ideas to the management.

Hal Martin came all the way from Clear Lake, and found that Montrose Library is a good compromise location for everybody. He is an experienced software contractor, and wants to expand his horizons.

Ron Chichester's appearance was a pleasant and welcome surprise: Ron is a lawyer, a legal expert, a forensics expert, an open source expert, and a software expert. Oof! Ron's opinions on politics and education (he also teaches law at the UH) were so fascinating that the group kept asking him and listening when we were forced out of the meeting room by the next group - for another half hour.

Pat Kerr is a pleasant and overall good fellow, and he will go anywhere where his friends' interests are.

The remaining members of the group - we missed you and hope to see you next time.

Oh, and what was the meeting about? We discussed what BigData is, with each guest giving his or her own example, and how Hadoop and its friends can help, and we were entertained by some anecdotal evidence about the BigData explosion.

Art: Waltner Charles Albert - The Pickwick Club