A lot of important and useful information is posted to litsupport each week. The following is a distilled summary, in the form of questions and answers.
Q. How should one select deduplication options for emails, such as those in Outlook (PST) files? For vendors who do not document their method completely (using Clearwell as an example), what can be guessed?
A.
- Assume that they all (Trident Wave, Law, etc.) dedupe in basically the same way, using an MD5 or SHA-1 hash value http://www.secure-hash-algorithm-md5-sha-1.co.uk/ . Assume that Clearwell probably does the same thing. Basically, the program looks at a number of fields (FROM, TO, CC, SUBJECT) and calculates a hash value (like a fingerprint) for the electronic message. It then compares the hash values so that it can eliminate the duplicates;
- The problem is that getting an exact list of which fields are used can be difficult. Some systems just list them on a selection page and leave it up to you which ones to use. Some problems to watch for are: (a) identical header/subject/body content but different attachment contents; (b) use of the Microsoft MSGID, which can have collisions in as few as 10,000 emails; (c) the reverse issue: systems that are so picky that they effectively dedupe only on entire PST/MSGs, using the path and other delivery/usage MAPI fields, so that you still end up with 20+ copies of the lunch notice from all of your custodians. Clearwell seems to be using a good hash of fields. Advice: always run a couple of tests on your sample sets;
- The specific fields used for Law are located in the help file under the dedupe section;
- Clearwell has a 4-page document that outlines how de-duplication works in their product. A number of fields from the email data are used, and these fields are different from those used by LAW or Trident. For loose file de-duplication, Clearwell uses some meta fields together with the hash of the content, which is a different approach from just hashing the content. It can identify files that have the same content but different filenames and meta fields. The feature is called File Analysis;
- Clearwell does deduplication differently in version 4.5 than in 4.0, due to foreign language changes;
- It would be nice to have a standard for deduplication of electronic evidence. However, it would be complicated: there is a legal standard for identifying 'identical evidence' or duplicates, by which a deduplication strategy can be crafted. It is called the 'rules of evidence' in whatever jurisdiction the case is heard. The definition varies with the evidence and the nature of the case. Today, this necessitates various options in the processing software, and an understanding of them;
- Can anyone identify a court that explicitly defines, dictates or publishes guidelines for ESI duplicate detection and handling? - Take a look at a well-crafted Case Management Order in which deduplication was discussed between educated lawyers in the Meet & Confer. The factual underpinnings of the case will define duplicates, and the regimen to be used to de-duplicate or re-populate, which is why a "universal standard" is utopian.
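To make the field-hash idea above concrete, here is a minimal Python sketch of deduplicating emails by hashing selected fields. It is an illustration only, not the method of Clearwell, LAW, Trident or any other product: the field list, the normalization step, the dictionary layout of a message, and the names `email_fingerprint` and `dedupe` are all assumptions made for this example. It also shows why problem (a) matters: leaving the attachments out of the hash makes messages with different attachments look like duplicates.

```python
import hashlib


def normalize(value):
    # Assumption: collapse whitespace and case so trivial formatting
    # differences do not change the fingerprint.
    return " ".join(value.split()).lower()


def email_fingerprint(msg, fields=("from", "to", "cc", "subject", "body"),
                      include_attachments=True):
    """Return a SHA-1 hex digest over the chosen fields of one message.

    `msg` is a hypothetical dict; real tools read these values from
    the PST/MSG properties.
    """
    h = hashlib.sha1()
    for name in fields:
        # Field name and a separator keep adjacent values from running together.
        h.update(name.encode("utf-8") + b"\x00")
        h.update(normalize(msg.get(name, "")).encode("utf-8") + b"\x00")
    if include_attachments:
        # Hash each attachment's bytes; sort so attachment order is irrelevant.
        for digest in sorted(hashlib.sha1(a).hexdigest()
                             for a in msg.get("attachments", [])):
            h.update(digest.encode("ascii"))
    return h.hexdigest()


def dedupe(messages, **options):
    """Keep the first message seen for each fingerprint."""
    seen = set()
    unique = []
    for m in messages:
        fp = email_fingerprint(m, **options)
        if fp not in seen:
            seen.add(fp)
            unique.append(m)
    return unique


# Made-up sample set: the same lunch notice held by three custodians,
# one copy carrying a different attachment.
alice = {"from": "hr@example.com", "to": "all@example.com", "cc": "",
         "subject": "Lunch notice", "body": "Pizza at noon.",
         "attachments": [b"menu v1"], "custodian": "alice"}
bob = dict(alice, custodian="bob")
carol = dict(alice, custodian="carol", attachments=[b"menu v2"])

print(len(dedupe([alice, bob, carol])))                             # 2
print(len(dedupe([alice, bob, carol], include_attachments=False)))  # 1
```

The custodian is deliberately left out of the hash, so copies held by different custodians collapse into one. Adding path or delivery fields to `fields` would reproduce the "too picky" behavior described in (c). Running a comparison like this on a sample set is one way to follow the advice above about testing before committing to a setting.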
This summary from the Litsupport Group postings created by the wonderful and talented members of the group has been culled by Mark Kerzner and edited by Aline Bernstein.