Saturday, 28 February 2009

PDF documents and metadata - some examples

Before I take a deeper dive into what metadata a PDF document contains, let's look at what must have been the most headline-grabbing example in 2008 of sensitive information being discovered in PDF metadata.

I am referring to the situation Google found themselves in with a submission they made, supposedly anonymously, to the Australian Competition and Consumer Commission regarding eBay and their proposal to force their users to use PayPal. After speculation on many blogs about the author of the anonymous submission, one Dave Bromage took a look at the metadata in the PDF document and let the world know who it was. Although the submission was replaced with a new version without the revealing metadata, the word was out. I won’t comment on the reasons why this was at least embarrassing to Google (this is one report that gives the details as well as showing the metadata contents), but will add that there was an additional chuckle in the techie community, The Register being one example, that the metadata also showed that the document had not been created using Google’s own word processing app. My main comment is that this unintentional leakage of information involved a regulator and caused embarrassment, at the very least, to the originator (both author and company).

The submission had also masked what would otherwise have been visible text about the submitter. However, the PDF did not have any security applied to it, so it was very easy to copy that area of the document and paste it into another text processor to see the underlying information. Facebook/ConnectU have fallen foul of the same problem just this month. There are numerous other examples in this area, GE and the US Justice Department being a couple from 2008. If you want to mask visible text, at the very least add security settings to the PDFs that you generate to disallow copying and pasting of text. Also look at redaction software, which fully removes the text as well as masking it, whilst retaining the layout of the PDF document.

I am sure it is pure coincidence that one of the other headlines in 2008 about information gleaned from PDF metadata also involved Google, but from the other side of the fence. As reported here, metadata in a PDF version of a lobbying letter from the Corn Farmers to Congress linked the author, albeit tentatively, back to some of Google’s political adversaries.

The lesson from these examples is that you should not assume that converting and sending/publishing a PDF removes metadata that could contain sensitive information.
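
To give a feel for how little effort this kind of discovery takes, here is a minimal Python sketch that scans the raw bytes of a PDF for the standard document information fields (Title, Author, Subject, Keywords, Creator, Producer). It is an illustration only, not a complete metadata reader. It assumes the fields are stored as plain, unencrypted literal strings outside any compressed stream, it does not handle nested parentheses or hex strings, and it ignores XMP metadata entirely. The names find_info_fields and INFO_KEYS, and the sample bytes, are made up for this example.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")

# Matches "/Key (literal string)", allowing backslash-escaped characters.
# Assumption: values are simple literal strings without nested parentheses.
_PATTERN = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(((?:\\.|[^\\()])*)\)"
)


def find_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the first plain-text value found for each known key."""
    found: dict[str, str] = {}
    for key, raw in _PATTERN.findall(pdf_bytes):
        # Undo the escapes \( \) and \\ ; other escapes are left as they are.
        value = re.sub(rb"\\([()\\])", rb"\1", raw)
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Made-up fragment standing in for the relevant part of a PDF file.
sample = (
    b"<< /Author (J. Smith) /Creator (Some Word Processor) "
    b"/Producer (Example PDF Library 1.0) >>"
)

for name, value in find_info_fields(sample).items():
    print(f"{name}: {value}")
```

For a real file you would pass in the result of open(path, "rb").read(). An empty result does not prove the document is clean, because the same details can sit in compressed objects or in XMP, so treat a check like this as a first look before publishing rather than as a guarantee.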