Monday, August 11, 2008

litsupport summary for the week ending on 08/10/08

A lot of important and useful information is posted to litsupport each week. The following is a distilled summary, in the form of questions and answers.

Q. Is there a QuickBooks Viewer?
A. One can download the "Simple Start" application from the Intuit website (search there). It's free and should open QuickBooks files. For proper chain of custody, one can try to make the file read-only, or at least keep a backup copy.

Q. How can one study Summation?
A. One can request an evaluation copy from the web site; if granted, it will be valid for a year. There is also a "Lawyer's Guide to Summation (Paperback)", published in 2004, and there are webinars here.

Q. How to authenticate MS Word docs as evidence?
  1. Hashing the file objects covers their content and their embedded (object) metadata. To cover file system metadata as well, encapsulate both file objects in an archive (ZIP, RAR, TAR, ISO) and hash the archive;
  2. For the OS info, the data should have been collected with a validated tool for the best defensibility. If not, you could go back to the original and use a safe metadata viewer to pull the original OS info (assuming that it has not been modified in the meantime). One can use, for example, Pinpoint Metaviewer;
  3. Look at Judge Grimm's opinion in Lorraine v. Markel Amer. Ins. Co., 241 F.R.D. 534 (D. Md. 2007);
  4. Summarized and expanded upon in a newsletter here.
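
The hashing step in point 1 can be sketched in Python as follows. This is only an illustration using the standard library: the function names `hash_file` and `archive_and_hash` are made up for this example, SHA-256 is an assumed choice of algorithm, and a plain ZIP records only each file's modification time, not the full set of file system metadata that a validated forensic tool would capture.

```python
import hashlib
import zipfile
from pathlib import Path


def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file's bytes, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_and_hash(paths, archive_path):
    """Store the files uncompressed in a ZIP and hash the archive.

    The ZIP entry keeps each file's modification time, so the archive
    hash covers that timestamp in addition to the file contents.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for path in paths:
            archive.write(path, arcname=Path(path).name)
    return hash_file(archive_path)
```

Hashing the same unchanged file again later should give the same digest; a different digest means the bytes have changed.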

Q. What is near-deduplication and how reliable is the process?
  1. Near-duplicate identification uses a similarity measure to group versions of an item. It applies to finding almost identical versions of emails, MS Word docs, and other documents. It is useful in investigations and for consistency of review;
  2. Near-duplicate detection breaks documents into overlapping shingles of a certain length. A shingle is a sequence of words (or letters): the first shingle starts with the first word in a file, the next one starts with the second word, and so forth. The common algorithm then chooses a sample of these shingles from each document, using a rule that is likely to yield the same shingles from different documents (if they are present). Simplifying a bit, the probability that two documents are near duplicates is the proportion of the sampled shingles that are shared by the two documents. See more here and here.
  3. There's no such thing as "reliable" near de-duplication. The entire science is subjective and prone to error :)
  4. Although bulk coding of near-dupes is not recommended, the foundation methods of Equivio, Attenex, Syngence, etc. are just as scientific and repeatable as full-text search for keywords. Every tool has an appropriate use;
  5. Google has a patent for "Method and Apparatus for Estimating Similarity." Google needs it in order not to list essentially the same pages in the search results (as some people use this to direct traffic to their sites). Compared to the bottom-up methods described above, the Google patent is top-down in that it generates sketches of the objects being compared, and similarity is based on these sketches.
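
The shingling approach in point 2 can be sketched in Python like this. It is a simplified illustration, not the method of any vendor named above: the function names, the three-word shingle length, and the sampling rule (keep the 64 shingles with the smallest hash values) are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of overlapping word shingles of the given length."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def stable_hash(shingle):
    """Hash a shingle to an integer that is the same on every run."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, size=3, keep=64):
    """Sample shingles by keeping the `keep` smallest hash values.

    The rule depends only on the shingle itself, so a shingle present
    in two documents tends to be sampled from both.
    """
    return set(sorted(stable_hash(s) for s in shingles(text, size))[:keep])


def estimated_similarity(text_a, text_b, size=3, keep=64):
    """Estimate the share of sampled shingles common to both documents."""
    sample_a = sketch(text_a, size, keep)
    sample_b = sketch(text_b, size, keep)
    smallest = set(sorted(sample_a | sample_b)[:keep])
    if not smallest:
        return 0.0
    return len(smallest & sample_a & sample_b) / len(smallest)
```

Two identical texts score 1.0 and two texts with no shingles in common score 0.0. Where to draw the near-duplicate threshold in between is a judgment call for the reviewer.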

This summary of the Litsupport Group postings, created by the wonderful and talented members of the group, has been culled by Mark Kerzner and edited by Aline Bernstein.
