Monday, June 27, 2011

Toward FreeEed V2.0

The following features are planned, and for the most part already implemented, for V2.0:
  • Graphical User Interface
  • Exception processing
  • All logs in a separate "logs" directory
  • Clear separation between command-line and parameter_file parameters
  • FreeEed.org web site with the documentation
Once the GUI is completed, the release will be ready.

Friday, June 24, 2011

Open source Linux OCR

Here is a good link to start with, comparing the available tools. From my personal experience using Tesseract, you need to experiment with the image resolution (if you create the images yourself), because it makes a difference in accuracy.

For testing, one can use this link, http://code.google.com/p/isri-ocr-evaluation-tools/

Digital Forensics with Open Source Tools

By Ken Prior

"I was excited awhile back to learn Digital Forensics with Open Source Tools was being written and even more pleased when I heard who its authors were. I worked almost exclusively with open source tools while beginning my foray into the digital forensics world and happily continue using them today, so I knew this book would be of great interest to me. I had a general idea of what I thought the book would be like, but what I found in it was so much better than I expected. This book is an excellent introduction to open source forensic tools, but in many ways it's also a "how to do forensics" book. In the interest of full disclosure, I did receive a review copy of this book without cost to me, although I did buy a second copy to keep at my office as well."


Friday, June 17, 2011

Open Source in eDiscovery – Discussion Continues

A commentary/opinion published today in Law Technology News is titled “The Cost of Open Source in eDiscovery.” It starts by saying that “There has been a lot of talk about open-source software and the dramatic effect it may have on e-discovery.” The author, attorney Sean Doherty, makes a number of good points. Let us see what open source can learn from them, and in doing so, let us also analyze the other points of view.

Any discussion is good, because it makes one wise, but what makes this discussion especially interesting is the fact that the original article by Evan Koblentz to which Doherty refers was itself published only three days ago. A "lot of talk" takes on a special meaning when applied to just three days.

After stating that the effect of open source may be limited to more technology-savvy lawyers, or that it may fit well into the eDiscovery market as it becomes commoditized, or that it may even go as far as bringing eDiscovery to every lawyer who needs it, Doherty makes the important point that open-source and free are not one and the same.

The meaning of the word “free” in the context of open source has been discussed since its inception in 1998. The usual definition given is “free as in freedom, not as in free beer.” That still depends on whom you ask, but in my opinion it is fair to say that open source is free in the sense that the source code for the software is freely available, although its usage is regulated by the license under which it is released. It may be a license that requires that products that derive from it, or build on top of it, are also open-sourced, or it may be less restrictive. As you can see, “less restrictive” means that you are more free to use it. In general, an open-source project publishes its source code and specifies what one can do with it.

In practice, however, one can charge money for open-source products, and certainly for services based on them and for the support offered with them. The meaning of “free” is thus a general argument about open source, not one specific to eDiscovery.

Specific to eDiscovery (and we will use the FreeEed open-source tool as an example), you need to hire someone to do it for you, especially since lawyers usually do not deal with technical matters. That, however, is true for any eDiscovery and any technology used in the practice of law, as evidenced by the existence of paralegals and IT departments in law firms. Still, the software being free might make a difference.

You also need to take care of upgrades – even though here open source may come out the winner, because its upgrades are free and frequent. Unless, as Doherty mentions, you use software-as-a-service. Here too there is nothing specific to open source. FreeEed, in particular, is designed to work on the Amazon cloud, thus being software-as-a-service and also providing the scalability that comes with the use of compute clouds.

The next important point is licensing. There are a number of open-source licenses out there, and there is an ongoing disagreement between, for example, the Apache 2.0 license used by Apache and the GPL V2.0 license used by Linux. However, lawyers do not shy away from licensing issues. In fact, they will probably have an advantage here over a lay business person. FreeEed uses the Apache V2.0 license, because the software packages on which it builds – Hadoop for cluster processing, Lucene for text search, and Cassandra as a scalable, fast NoSQL database – all use the Apache V2.0 license.

The next argument is that “Open source tools often require a level of sophistication that exceeds that needed for plug-and-play and click-to-receive software.” This may or may not be true. Some tools that are designed for programmers (a web crawler tool called Nutch comes to mind) expect one to be able to configure and run Linux applications. However, there are open-source applications that are quite easy to use. Consider such examples as OpenOffice, a Microsoft Office replacement, Ubuntu Linux for desktop use, RedHat Linux for the enterprise, and the Firefox browser. FreeEed, for example, will do well to offer an easy-to-use graphical interface. This, in fact, was the first suggestion by Ron Chichester when he joined the FreeEed team, and the implementation has already begun. It will all depend on hard work and user feedback, but not on any inherent limitations of closed-source or open-source software.

Next, “End point, it's one thing to be locked into a proprietary code base and another to be locked in by the developers and administrators to your open source tools.” Assuming that the tool is easy to use, its adoption is no different from that of closed source: for closed source, you have to evaluate the vendor, and for open source you have to evaluate the developer community and/or the commercial company providing support. Commercial eDiscovery vendors may be slow to respond, they may not care about you specifically, and they may be bought out, or even go out of business. One extra option with open source is that your IT department can participate in the development and offer its contributions. If the tool is deployed for in-house processing, the department has the additional guarantee that the code will always be available, should it have to continue using and developing it on its own.

The final argument is “Whether open source or proprietary code, you get what you pay for. And that value is often found in the service and support from manufacturers, not the software.” It is true that you get what you pay for; sometimes, however, it costs less. And sometimes excellent products, like the ones mentioned above, are free. Google and Bing search are examples of free but excellent services. The service and support from manufacturers is another story. Any company is free to offer its support for any open-source tool, and SHMsoft is already offering this support for FreeEed. Any eDiscovery vendor can use FreeEed or any other tool of its choosing. Here everything will depend on the execution.



Wednesday, June 15, 2011

Search in eDiscovery

After you process all the files you have collected, your next step is searching through the data and slicing and dicing it in every way possible. This may be done at different times by both the defendant and the plaintiff. Case analysts need to try to answer different "what if" questions and scenarios. But with millions of files, SQL solutions become terribly slow. Anybody who has seen that, please raise your hand.

So here is how I formulated this question on the HBase and Cassandra user groups, because I needed it for FreeEed, my open source eDiscovery tool:

Imagine I need to store, say, 10M-100M documents, with each document having say 100 fields, like author, creation date, access date, etc., and then I want to ask questions like "give me all documents whose author is like abc**, and creation date any time in 2010 and access date in 2010-2011, and so on, perhaps 10-20 conditions, matching a list of some keywords."

What's best, Lucene, Katta, HBase CF with secondary indices, or plain scan and compare of every record?


Well, the very nice and knowledgeable people on these groups came up with a number of solutions:

  • I can't claim it's the best, but I'd say Solr or Katta (Joey Echeverria)
  • 'Add search to HBase' - HBASE-3529 is in development. (Jason Rutherglen)
  • I, for one, am interested in learning more about elasticsearch with HBase after reading the article over at StumbleUpon (Matt Davies)
  • I think this's the key line: "ElasticSearch is the search analogue to HBase that frees us from some restrictions that Solr imposes".  It is quite true however if search is inside of HBase ones gets the same thing.  Solr does have serious limitations in terms of scaling etc.  I think ES has done a great job there, though this could have been done with Solr just as easily, eg, upgrade Solr with the same functionality and remove the need for schemas.  Solr does allow schema-less, with eg, dynamic fields. (Jason Rutherglen)
  • I'd say give Lily a spin. Currently, we rely on Solr for search. In the next few months, we'll take a good look at "HBase-native" secondary indexes as well (Steven Noels)
  • There's also DrawnToScale (M. C. Srivas)
  • I store over 500M documents in HBase, and index using Solr with dynamic fields.  This gives you tremendous flexibility to do the type of queries you are looking for -- and to make them simple and intuitive via a faceted interface. However, there was quite a bit of software that we had to write to get things going, and I can neither release all of it open source, or support other people using it.  If I had to start again, I would seriously look at solutions like elastic search and lily (David Buttler)
  • Check out Solandra (Jake Luciani)
Thank you, all! Now I need to investigate all suggestions and select the one most fitting for eDiscovery.
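
For comparison, the "plain scan and compare of every record" baseline from the question above can be sketched in a few lines of Scala. This is only an illustration under stated assumptions, not FreeEed code: the `Doc` model, the field names, the ISO date format, and the helpers `fieldStartsWith`, `yearBetween`, `containsAny`, and `scan` are all hypothetical.

```scala
import scala.util.Try

// Hypothetical document model: named metadata fields plus the extracted text.
case class Doc(fields: Map[String, String], text: String)

// Condition: the named field exists and starts with the given prefix ("author is like abc*").
def fieldStartsWith(name: String, prefix: String): Doc => Boolean =
  doc => doc.fields.get(name).exists(_.startsWith(prefix))

// Condition: the named date field falls within a range of years.
// Assumes dates are stored as ISO strings ("yyyy-MM-dd"); unparsable values never match.
def yearBetween(name: String, from: Int, to: Int): Doc => Boolean =
  doc => doc.fields.get(name)
    .flatMap(d => Try(d.take(4).toInt).toOption)
    .exists(year => year >= from && year <= to)

// Condition: the text contains at least one keyword, ignoring case.
def containsAny(keywords: Seq[String]): Doc => Boolean =
  doc => keywords.exists(k => doc.text.toLowerCase.contains(k.toLowerCase))

// Plain scan: every document is tested against every condition.
def scan(docs: Seq[Doc], conditions: Seq[Doc => Boolean]): Seq[Doc] =
  docs.filter(doc => conditions.forall(c => c(doc)))

// Made-up sample data, for illustration only.
val docs = Seq(
  Doc(Map("author" -> "abcde", "created" -> "2010-03-01", "accessed" -> "2011-01-15"), "Quarterly energy report"),
  Doc(Map("author" -> "xyz", "created" -> "2010-05-20", "accessed" -> "2010-06-01"), "Lunch plans"),
  Doc(Map("author" -> "abcfg", "created" -> "2009-11-11", "accessed" -> "2010-02-02"), "Energy trading memo")
)

val hits = scan(docs, Seq(
  fieldStartsWith("author", "abc"),
  yearBetween("created", 2010, 2010),
  yearBetween("accessed", 2010, 2011),
  containsAny(Seq("energy"))
))
```

Every query touches every document, so the cost grows with the number of documents times the number of conditions. That is the reason to look at the indexed solutions suggested above once the collection reaches tens of millions of documents.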

An open-source Hadoop alternative from LexisNexis and FreeEed

As if by coincidence, the day after the LTN article by Evan Koblentz that mentioned FreeEed, LexisNexis announced that it would open-source its Hadoop alternative for handling Big Data:

"LexisNexis announced today that it will open-source its High Performance Computing Cluster (HPCC) technology, as well as offer an enterprise version with commercial support. The company is positioning HPCC Systems, developed internally by its Risk Solutions unit, as an alternative to Apache Hadoop. A virtual machine for testing purposes will be available soon, and code will be available in a few weeks." For the full announcement, go here.

As a first impression, what are the major comparison points?

  • LexisNexis has been using its technology for a while and has marketing clout to match, but it announced only plans to make the VM "available soon" and the code "in a few weeks." One wonders if this is a reaction to the momentum that FreeEed has been gaining. FreeEed, on the other hand, is already out on GitHub.
  • LexisNexis is essentially a closed-source company, so one wonders how open the offering is really going to be. But they may be successful - look at Microsoft's open-source contributions. In LexisNexis's own words, "Only the core technology is being released, LexisNexis' own data linking techniques aren't being released, nor are its data sources." In contrast, FreeEed is pure open source (with commercial support options), and people are already investigating using it in ways beyond eDiscovery. This illustrates the flexibility of an open source offering.
  • LexisNexis has Roxie, a system for query and data warehousing, but FreeEed will have the same based on Cassandra.
  • LexisNexis sports ECL (Enterprise Control Language), but Cassandra has CQL (Cassandra Query Language).
  • LexisNexis's "HPCC team has been working with Amazon Web Services to make sure the product work well on AWS servers," but the FreeEed team has planned on using EC2 from the start and is actively working on it now.
The two are not exactly competitors at this point: LexisNexis is releasing technology for high-performance cluster computing and its risk handling applications. But they are close in their approach to open source and to handling Big Data, so it is worth watching.

Sunday, June 12, 2011

Implementing design patterns with Scala: Loan Pattern

Programmers are familiar with design patterns - these are best practice solutions to common problems. Then one often adds, almost apologetically, "you know, the code is different in every case, but you get to re-use the idea!"

Well, why would the code have to be different, and why can't you re-use the code itself, and not only the idea? That's because in Java you cannot easily pass functions around. In Scala, however, you can. Take, for example, the "loan pattern." It lets you borrow a resource and makes sure that it is closed afterwards. Here is what I mean. You may have thousands of little functions that open a SQL connection, use it, and then close it, like in this picture:




The screen image is there to show off the syntax coloring of Scala code in NetBeans, but here is the same code as plain text:

  import java.sql.{Connection, DriverManager, SQLException}
  import java.util.Date

  def doSqlStuff: Unit = {
    var conn: Connection = null
    try {
      // Boilerplate: connection settings and driver loading
      val url = "jdbc:mysql://localhost:3306/"
      val dbName = "beren"
      val driver = "com.mysql.jdbc.Driver"
      val userName = "beren"
      val password = "beren"
      Class.forName(driver).newInstance()
      conn = DriverManager.getConnection(url + dbName, userName, password)
      // The real work
      val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
      stmt.setString(1, "" + new Date())
      stmt.executeUpdate
    } catch {
      case e: SQLException => {
          e.printStackTrace
          println("SQLException: " + e.getMessage)
        }
      case ex: Exception => {
          println("Exception: " + ex.getMessage)
        }
    } finally {
      // conn is still null if the connection was never opened
      if (conn != null) conn.close
    }
  }




But only three lines are really important:


val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
stmt.setString(1, "" + new Date())
stmt.executeUpdate

Nevertheless, because of those three lines I am forced to replicate the boilerplate over and over. If I could somehow pass these lines to my boilerplate code, life would be so much nicer!

And indeed, it is better when you do it like this. Here is the boilerplate. You now promise to pass it the function that does the real work.

def doSqlStuffBoilerplate (sqlDoer: Connection => Int): Unit = {
  var conn: Connection = null
  try {
    val url = "jdbc:mysql://localhost:3306/"
    val dbName = "beren"
    val driver = "com.mysql.jdbc.Driver"
    val userName = "beren"
    val password = "beren"
    Class.forName(driver).newInstance()
    conn = DriverManager.getConnection(url + dbName, userName, password)

    // Lend the open connection to the caller's function
    sqlDoer(conn)

  } catch {
    case e: SQLException => {
        e.printStackTrace
        println("SQLException: " + e.getMessage)
      }
    case ex: Exception => {
        println("Exception: " + ex.getMessage)
    }
  } finally {
    // conn is still null if the connection was never opened
    if (conn != null) conn.close
  }
}


Here are the three lines that you will pass as a function,

def sqlDoer(conn: Connection): Int = {
  val stmt = conn.prepareStatement("insert into stuff (id) values (?) ")
  stmt.setString(1, "" + new Date())
  stmt.executeUpdate
}

and this is how to call the method:

def doSqlStuffBetter(): Unit = {
  doSqlStuffBoilerplate (sqlDoer)
}

We tried this idea in Java, but the code was unreadable, and we went back to the old way of copying and pasting. With Scala, you can abstract and CODE the design pattern. (Even better is to in-line the function when passing it, but I will do that later.)
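
As a rough sketch of where this can go, here is a generic version of the loan with the function passed in-line. None of this is from the code above: `withResource` is a hypothetical helper name, and `FakeConnection` is a made-up stand-in for a JDBC connection so that the example runs without a database. It is written as top-level definitions, which works in a Scala script, worksheet, or REPL.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for a JDBC Connection, so the sketch runs without a database.
class FakeConnection extends AutoCloseable {
  val statements = ListBuffer.empty[String]
  var closed = false
  // Records the statement and reports one affected row.
  def execute(sql: String): Int = { statements += sql; 1 }
  override def close(): Unit = { closed = true }
}

// The loan: obtain the resource, lend it to `work`, and close it whatever happens.
def withResource[R <: AutoCloseable, A](open: => R)(work: R => A): A = {
  val resource = open
  try work(resource)
  finally resource.close()
}

// The function literal is passed in-line; no separately named sqlDoer is needed.
val conn = new FakeConnection
val rows = withResource(conn) { c =>
  c.execute("insert into stuff (id) values (?)")
}
```

Unlike the boilerplate method above, this sketch lets exceptions propagate to the caller instead of printing them; the only thing it guarantees is that the resource is closed.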


Tuesday, June 7, 2011

EddUpdate talks about FreeEed, the first open source tool for eDiscovery

Mark Kerzner of FreeEed (https://github.com/markkerzner/FreeEed) released version 1 of a free open source e-discovery processing tool today. Aimed at the do-it-yourself crowd, law firms looking to bring data processing in-house, and vendors seeking a way to process client data license-free, it provides an egalitarian alternative to the various solutions already on the market.



Monday, June 6, 2011

Houston Hadoop Meetup #3

Vikram Oberoi, a Big Data engineer from Cloudera, presented Hadoop use cases, explaining when and why you would use Hadoop. Here is Vikram's presentation, which he graciously provided for us - and note that he gave it a Houston-specific spin for the energy and medical applications.


Thanks to Erin Mynatt, also of Cloudera, for bringing Vikram from California and for coming from Denver, as well as for sponsoring the pizza & beer afterwards.

Thanks also to John Bland II for joining the group just in time. By the way, John is doing some awesome graphics work for CNN, Fox, etc., so check him out!

Thursday, June 2, 2011

FreeEed processes its first Enron PST file

Version 1.0.2 fixes the processing of PST and loose EML files. Thanks to the latest fixes by the Tika community, unusual email formats are processed unimpeded. When processing emails, Tika extracts the text from all the attachments, so one line of code does the complete email processing.

These are the benefits of building on top of other open source tools and communities.

FreeEed is an open source tool for eDiscovery that can be downloaded from GitHub here.