Thursday, November 24, 2011

How to process Microsoft Outlook .PST files

Here is the efficient approach that FreeEed uses:
  • Convert PST to MBOX format. Use readpst on Linux and JPST on Windows. I used to extract individual EML messages, but that is inefficient because there are too many of them. MBOX files that correspond to top-level PST folders fit much better with the overall Hadoop processing;
  • Use JavaMail together with the mstor local access provider to process these MBOX files. This approach is great because it lets you use standard, high-quality components. It also gives full access to attachments, CC, BCC, etc.
Now this is an approach I feel very good about, because it combines best practices with overall efficiency.
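
To make the second step more concrete, here is a minimal sketch of what per-message access to an MBOX file looks like. It is not FreeEed's code: it uses Python's standard-library mailbox module as a stand-in for JavaMail plus mstor, purely so that it runs with no extra dependencies. The file name Inbox.mbox, the sample message, and the helper names build_sample_mbox and summarize_mbox are all made up for illustration; in practice the MBOX file would come out of readpst or JPST.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def build_sample_mbox(path):
    """Write a one-message MBOX file so the sketch runs without a real PST export."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Cc"] = "carol@example.com"
    msg["Bcc"] = "dave@example.com"
    msg["Subject"] = "Quarterly report"
    msg.set_content("Report attached.")
    msg.add_attachment(
        b"col1,col2\n1,2\n",
        maintype="application",
        subtype="octet-stream",
        filename="report.csv",
    )

    box = mailbox.mbox(str(path))
    box.lock()
    try:
        box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()


def summarize_mbox(path):
    """Return one dict per message with its subject, CC, BCC and attachment names."""
    summaries = []
    box = mailbox.mbox(str(path), create=False)
    try:
        for message in box:
            # walk() visits every MIME part; keep only those marked as attachments.
            attachments = [
                part.get_filename()
                for part in message.walk()
                if part.get_content_disposition() == "attachment"
            ]
            summaries.append(
                {
                    "subject": message["Subject"],
                    "cc": message["Cc"],
                    "bcc": message["Bcc"],
                    "attachments": attachments,
                }
            )
    finally:
        box.close()
    return summaries


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = Path(tmp) / "Inbox.mbox"  # hypothetical top-level folder export
        build_sample_mbox(mbox_path)
        for row in summarize_mbox(mbox_path):
            print(row)
```

The point of the sketch is the shape of the work: one file per folder is opened once, and every message in it is reached through a regular mail API that exposes headers and MIME parts, rather than opening thousands of small EML files one by one.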

1 comment:

Mark Kerzner said...

Well, one should never rush to feel good, because the reality is more complex. I ended up processing EML with the JavaMail API and MBOX with mstor.