- Convert PST to MBOX format, using readpst on Linux and JPST on Windows. I used to extract individual EML emails, but that is inefficient because there are too many of them. MBOX files that correspond to top-level PST folders fit much better with the overall Hadoop processing;
- Use JavaMail together with the mstor provider for local mailbox access to process these MBOX files. This approach is attractive because it relies on standard, high-quality components. It also gives full access to attachments, CC, BCC, etc.
I feel very good about this approach, because it combines best practices with overall efficiency.
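
As a quick sanity check after the conversion step, it can help to count how many messages ended up in each MBOX file before handing it to JavaMail. The sketch below is not the JavaMail/mstor path itself; it uses only the Java standard library and relies on the mbox convention that each message begins with a line starting with "From " at the top of the file or right after a blank line. The class name MboxCounter and its method are hypothetical, and the count is an approximation if a file does not escape body lines that start with "From ".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; not part of JavaMail or mstor.
class MboxCounter {

    // Counts lines that look like mbox message separators: a line starting
    // with "From " at the start of the file or directly after a blank line.
    static int countMessages(Iterable<String> lines) {
        int count = 0;
        boolean previousBlank = true; // the start of the file counts as a boundary
        for (String line : lines) {
            if (previousBlank && line.startsWith("From ")) {
                count++;
            }
            previousBlank = line.isEmpty();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to one MBOX file produced by the conversion step.
        // ISO-8859-1 maps every byte to a character, so arbitrary mail
        // content cannot cause a decoding error here.
        List<String> lines =
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        System.out.println(args[0] + ": " + countMessages(lines) + " messages");
    }
}
```

This version reads the whole file into memory, which is fine for a spot check but not for very large mailboxes; a streaming reader would be the obvious next step there.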