Tuesday, February 21, 2012

Top 20 replies by Programmers to Testers when their programs don't work

20. "That's weird..."

19. "It's never done that before."

18. "It worked yesterday."

17. "How is that possible?"

16. "It must be a hardware problem."

15. "What did you type in wrong to get it to crash?"

14. "There is something funky in your data."

13. "I haven't touched that module in weeks!"

12. "You must have the wrong version."

11. "It's just some unlucky coincidence."

10. "I can't test everything!"

9. "THIS can't be the source of THAT."

8. "It works, but it hasn't been tested."

7. "Somebody must have changed my code."

6. "Did you check for a virus on your system?"

5. "Even though it doesn't work, how does it feel?"

4. "You can't use that version on your system."

3. "Why do you want to do it that way?"

2. "Where were you when the program blew up?"

1. "It works on my machine."

Processing Enron Data On A FreeEed Cluster

To process eDiscovery on a Hadoop cluster prepared for FreeEed (we will call it a FreeEed Cluster), you can use your Amazon EC2 account and rent the machines there.

The images need to be prepared with the utilities that FreeEed requires (SHMsoft will publish and maintain these EC2 images soon). Therefore, the first step is to start the cluster!
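As a sketch of that first step, assuming the classic ec2-api-tools are installed; the AMI id, keypair name, and instance type below are placeholders, since the FreeEed images are not yet published:

```shell
# Launch one namenode, one jobtracker, and four data/work nodes
# from a FreeEed-prepared image. The AMI id ("ami-xxxxxxxx"),
# keypair ("my-keypair"), and instance type are stand-ins.
ec2-run-instances ami-xxxxxxxx -n 1 -t m1.large -k my-keypair   # namenode
ec2-run-instances ami-xxxxxxxx -n 1 -t m1.large -k my-keypair   # jobtracker
ec2-run-instances ami-xxxxxxxx -n 4 -t m1.large -k my-keypair   # data/work nodes
```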

You may notice that I use a separate namenode, a separate jobtracker node, and then as many data/work nodes as needed - four in this case. The datanode image is stored as an EBS volume, so multiple copies of it can be started.

The procedure for starting the cluster after the machines have been stopped has a few caveats; I have described it in the Hadoop Illuminated book, which I am gradually publishing to the web, so you will have that soon.

Having started the Hadoop daemons, let's verify the cluster operation. If you don't want to open your machines to HTTP access from the world (I don't), you can log in to the jobtracker server and run a command-line browser there:

w3m http://localhost:50030

This is what I saw:
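Besides the web UI, the daemons can be checked from the shell on the cluster itself, using standard Hadoop 0.20 commands (nothing FreeEed-specific):

```shell
# Ask HDFS for a report of live datanodes and remaining capacity
hadoop dfsadmin -report

# List the root of HDFS to confirm the filesystem is answering
hadoop fs -ls /
```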

Since the Enron data already comes in zip files, staging (the step where FreeEed packages the input files into zip files for processing) is not needed, and we give these zips directly as input to FreeEed. The input parameters file looks as follows:
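The actual parameters file appeared here as a screenshot; as a rough sketch only, with property names that are illustrative stand-ins rather than FreeEed's real keys:

```shell
# Illustrative only: these property names are placeholders,
# not FreeEed's actual parameter keys.
cat > enron.project <<'EOF'
project-name=Enron
input=s3://sample-bucket/enron/edrm-enron-v2_ZIPS
file-system=s3
staging=false
EOF
```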

As a courtesy to EDRM, which kindly provided the Enron data, we store all of these files in our own S3 buckets, so as not to download the data from EDRM every time and not to make them incur bandwidth charges.

Now let's run the processing, which, incidentally, goes proportionally faster with more data. I saw first-hand that Hadoop is designed for Big Data and behaves better under load. Eight custodians' PSTs (admittedly, smallish ones) took about a minute.

For FreeEed processing on EC2, it makes sense to store the input and output data in S3, rather than in HDFS. That is because, unlike the HDFS of your private cluster, the HDFS of an Amazon cluster needs to be maintained, costs money even when idle, and, if placed on ephemeral storage, disappears on cluster shutdown. S3, on the other hand, fits the bill perfectly: it is protected against failure, replicated across multiple data centers, and you pay only for the data you actually store. FreeEed offers both "hdfs" and "s3" options for the file system.

Here are the output zip files, happily residing in their designated S3 buckets:
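The listing itself was shown as a screenshot; with the s3cmd tool and a placeholder bucket name, it could be reproduced like this:

```shell
# Bucket and prefix are placeholders for your own output location
s3cmd ls s3://freeeed-output/enron-run/
```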

The "attempt..." name is given by FreeEed to make each zip file, which comes from an individual reducer, unique across the job's output files, so that reducers don't overwrite each other's output.

That's it. You can now load the output in Hive for analysis.

Tuesday, February 14, 2012

Using FreeEed for Early Case Assessment

FreeEed, an open source tool for eDiscovery, can be used on your Windows, Mac, or Linux workstation. What's more, since it is built on Hadoop, it can run on a Hadoop cluster, whether in your computing center or on Amazon EC2, and it can scale to tens or hundreds of machines. It produces a CSV file with metadata (the load file) and an archive containing the native documents, the extracted text, and any exceptions that happen during processing.

This is all very well, you may say, but what do I do with a load file? Enter Hive! Hive is an open source tool, and it runs in the same environments where FreeEed runs. You can load the CSV metadata file, regardless of how large it is, and then run SQL queries against your results. Any question about your data that you can formulate as a SQL query, Hive can answer.

How does Hive do that? It runs on the same Hadoop cluster, harnessing all the same machines that you just used to run FreeEed. It runs locally too, in case you want to test it on a workstation. Strictly speaking, its language is called HiveQL; it is closely modeled on SQL and adds extensions of its own, so really you can do even more.

The caveat? Since Hive runs a MapReduce job to find the answer, even a simple query takes at least a minute. More data will require more time. On the other hand,

  • Any amount of data is not a problem, precisely because the cluster computes the answer;
  • Hive is open source and easily available;
  • The HiveQL language is powerful and is used by analysts to solve Big Data problems daily.
Let's get to the code (complete scripts for Hive come with FreeEed).

Create the Hive table:

create table load_file (Hash string, UPI string, File_Name string, Custodian string, .... )
row format delimited
fields terminated by '\t'
stored as textfile;

Load metadata into the Hive table we just created:

load data local inpath 'freeeed_output/output/part-r-00000'
overwrite into table load_file;

By the way, to run the script outside the Hive shell, you use this command:

hive -f scripts/hive_load_table.sql

Let's see what we've done so far: we created the table and loaded the metadata into Hive. Let's start the Hive shell, show all tables, then describe our 'load_file' table:
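That shell session appeared as a screenshot; the equivalent commands, run non-interactively with Hive's `-e` flag, would be:

```shell
# List all tables in the default database
hive -e 'show tables;'

# Show the columns of the table we just created
hive -e 'describe load_file;'
```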

Now let us run some simple queries.

How many rows do I have?

hive> select count (*) from load_file;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_201202101358_0020, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201202101358_0020
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=localhost:8021 -kill job_201202101358_0020
2012-02-14 21:06:20,180 Stage-1 map = 0%,  reduce = 0%
2012-02-14 21:06:23,188 Stage-1 map = 100%,  reduce = 0%
2012-02-14 21:06:32,220 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201202101358_0020

For the rest of the queries, we will omit the Hadoop messages.

Who is writing, and how many emails did each person write?

select email_from, count(email_from) from load_file group by email_from;

"Abrams  Clement " 1
"Adams  Suzanne " 1
"Anderson  Diane " 11
"Apollo  Beth " 1
"Aronowitz  Alan " 9
"Bailey  Derek " 1
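A natural follow-up on the same table is to rank the busiest senders; as a sketch, reusing the `email_from` column from the query above:

```shell
# Rank senders by email count, busiest first (top 10)
hive -e 'select email_from, count(email_from) as cnt
         from load_file
         group by email_from
         order by cnt desc
         limit 10;'
```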

How else can you use this FreeEed/Hive functionality? Investigators and compliance auditors may find it useful, and companies can make it the engine for their enterprise search.