Transiently Damaged PDF Attachments
Published 11 years, 1 month past
I have this very odd problem that seems to be some combination of PDF, Acrobat, Outlook, Thunderbird, and maybe even IMAP and GMail. I know, right?
The problem is that certain PDFs sent to me by a single individual won’t open at first. I’ll get one as an email attachment. I drag the attachment to a folder in my (Snow Leopard) Finder and double-click it to open. The error dialog I immediately get from Acrobat Professional is:
There was an error opening this document. The file is damaged and could not be repaired.
Preview, on the other hand, tells me:
The file “[redacted]” could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.
When this happens, I tell the person who sent me the file that The Problem has happened again. She sends me the exact same file as an attachment. Literally, she just takes the same file she sent before and drags it onto the new message to send to me again.
And this re-sent file opens without incident. Every time. Furthermore, extra re-sends open without incident. I recently had her send me the same initially damaged file five times, some attached to replies and others to brand-new messages. All of them opened flawlessly. The initially damaged file remained damaged.
Furthermore, if I go through the GMail web interface, I can view the initial attached PDF (the one my OS X applications say is damaged) through the GMail UI without trouble. If I download that attachment to my hard drive, it similarly opens in Acrobat (and Preview) without trouble.
A major indication of damage: that first download is a different size than all the others. In the most recent instance, the damaged file is 680,302 bytes. The undamaged files are all 689,188 bytes. If only I knew why it’s damaged the first time, and not all the others!
So far, I’ve yet to see this happen with PDFs from anyone else, but then I receive very few attached PDFs from people other than this one (our events manager at An Event Apart, who sends and receives PDFs and Office documents like they’re conversational speech — an occupational hazard of her line of work), and it only seems to happen with PDFs of image scans that she’s created. Other types of PDFs, whether she generated them or not, seem to come through fine; ditto for other file types, like Word documents. I’d be tempted to blame the scanning software, but again: the exact same file is damaged the first time, and fine on every subsequent re-attachment.
I’ve done some Googling, and found scattered advice on ways to clear up corrupted-PDF-attachment problems in Thunderbird. I’ve followed these pieces of advice, and nothing has helped. In summary, I have so far:
- Tried the Thunderbird extension OPENATTACHMENTBYEXTENSION. That failed, and so I immediately uninstalled it because handling files by extension alone is just asking to be pwned, regardless of your operating system or personal level of datanoia. (I wouldn’t have left it installed had it worked; I just wanted to see if it did work as a data point.)
Here’s what I know about the various systems in play here:
- I’m using Thunderbird 11.0.1 on OS X 10.6.8.
- The attachments are always sent via Outlook 2010 on Windows 7.
- The software used for the scanning is the HP scanning software that was installed with the scanner. Scans are saved to the hard drive, renamed, and then manually attached to the email. On resend, the same file is manually attached to the email.
- My email account is a GMail IMAP account.
So. Any ideas?
Well, test-install another email software on your OS X. That way you should be able to determine if it is a Thunderbird issue in combination with the scans.
Can you diff the two versions? It sounds like the trouble you get with FTP programs when they convert LF to CR+LF or vice versa during transfer, though why that would happen is a mystery.
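A rough sketch of that kind of diff, assuming you have both downloads saved locally. The function name and the file names in the usage comment are made up for illustration; the point is only to find where the two files first diverge and what the bytes look like there.

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def describe(a: bytes, b: bytes, context: int = 8) -> str:
    """Summarize sizes and show a few bytes (in hex) from each file at the split."""
    offset = first_difference(a, b)
    if offset is None:
        return "identical"
    return (
        f"sizes {len(a)} vs {len(b)}; first difference at byte {offset}: "
        f"{a[offset:offset + context].hex(' ')} vs "
        f"{b[offset:offset + context].hex(' ')}"
    )


# Usage (hypothetical file names):
#   from pathlib import Path
#   print(describe(Path("damaged.pdf").read_bytes(),
#                  Path("good.pdf").read_bytes()))
```

If the line-ending theory is right, the hex at the first difference should show `0d 0a` in one file where the other has a bare `0a`.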
I’ve had the exact same problems with outgoing Outlook attachments, which I suspect may be the root of your problem, too. I’ve had PDF attachments behave weird exactly as you describe, and also various image attachments. Last summer we even had an office mystery where we kept seeing transparent regions in images become opaque. Yup, also traced to them having been attached to an outgoing email sent from Outlook.
My only scientific inquiry into this involved having the same attachments resent from a Gmail account and the problem was never replicated, so that’s why I suspect Outlook (outgoing) in every case. We never identified any other variables that seemed to contribute to the situation.
I’ve been seeing the exact same thing with multiple filetypes (but particularly Office files); but I’m getting it in Firefox+Preview, not Thunderbird. With mine I’ve noticed that the downloaded file will actually open correctly from the Downloads folder, even if it failed the first time when the browser launched Preview. So my assumption was that this is just a race condition: somehow Preview is opening these files before they are ‘ready’. The files with the problem are generally on a server I’ve got a slow connection to.
So – have you tried opening the broken files from ‘Downloads’?
I wonder if there might be .js in there, perhaps doing something funky or even mean?
I’m wondering if it’s not an odd gmail quirk because I have a very similar problem. My company is switching over to using gmail as their email provider, and we use the web interface. I have a customer who sends pdfs to me all the time. 9/10 if I try to open the pdf after I’ve saved it locally, it fails to open. If I resave the file (from the same email), it will then open. If I forward that email to a colleague (who’s not on gmail yet), I often have to send it 2-3 times, then it mysteriously can be opened.
Drives me insane.
I have seen smtp servers that misbehave in escaping. Any line that starts with a period should be escaped in transit but some servers fail that. An email attachment is probably base64 encoded which shouldn’t contain periods so this might not be the cause but it’s easy to test for. Just send an email with some lines of content, a line with just a period, and then more lines of content. If all content arrives at the destination then the problem lies elsewhere. If not then some smtp server is mangling emails.
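For anyone who wants to see what that escaping looks like, here is a minimal sketch of SMTP dot-stuffing in Python. It is a simulation only and sends nothing; the function names and the test body are invented for illustration. It also shows what the receiving end would see if a sender skipped the escaping.

```python
CRLF = "\r\n"


def dot_stuff(body: str) -> str:
    """Sender side: add an extra '.' to every line that starts with '.'."""
    return CRLF.join(
        "." + line if line.startswith(".") else line
        for line in body.split(CRLF)
    )


def dot_unstuff(wire: str) -> str:
    """Receiver side: drop the first '.' of every line that starts with '.'."""
    return CRLF.join(
        line[1:] if line.startswith(".") else line
        for line in wire.split(CRLF)
    )


def receive_without_stuffing(body: str) -> str:
    """What arrives if the sender skipped dot-stuffing: a lone '.' line is
    read as end-of-data, so everything after it is lost."""
    kept = []
    for line in body.split(CRLF):
        if line == ".":
            break
        kept.append(line[1:] if line.startswith(".") else line)
    return CRLF.join(kept)


# The test message described above: content, a lone period, more content.
TEST_BODY = CRLF.join(["line one", ".", "line three", ".hidden", "line five"])
```

With correct escaping, `dot_unstuff(dot_stuff(TEST_BODY))` gives back the original body. Without it, `receive_without_stuffing(TEST_BODY)` keeps only the first line, which is the kind of loss the test email would reveal.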
Is it a problem with the filenames? Do they contain characters that might cause your system problems? And what happens if she doesn’t rename the PDFs before sending them? I’d try anything in case it works!
Whenever I send a PDF from InDesign too hastily before the printing is completed it’s reported as corrupted – maybe it could be something as simple as that.
Same problem here, but with Gmail + Mail.app. Maybe something gmail related?
In reading your article, I became fascinated with a term you used “datanoia”, and I did a search query to see if there might be a definition of this word out there. I came across a site with an exact replica of your blog. It appears that razworks thinks they wrote the articles on this blog: web designer Sarasota. com.
Please feel free to moderate out of this discussion both this and my previous comment. It was FYI only. Thanks. And I like this new word (to me), datanoia. Thanks for another educational read!
Hey, Rachel, thanks for the heads-up. Getting my posts scraped and republished by others has been a frequent occurrence over the years and I’ve long since made my peace with it. The only way to prevent it is not to publish in the first place, and for me that’s not worth the cost.
I made “datanoia” up on the spot, though doubtless I wasn’t the first, so it really just meant what I had in my head, which is: “paranoia about the data one receives”, with a secondary definition of “paranoid about the data one shares”—sort of a mirror image of the first definition. That was all. Though I suppose it also makes for a useful way to ferret out blog-scrapers!
Hi, I’m a long time gmail user. I’ve found that pdfs I have stored on the gmail servers are often corrupted. I’ve read that if these files are scanned by software on the server, they will sometimes be corrupted.
I know, and have no reason to think that things have changed, that Google continues to scan the accounts and uses the information for the purpose of advertising. Has anyone considered that Google may be damaging these files while scanning them?
In my case, Czar, the files were actually viewable via the GMail web interface but broken in Thunderbird. They stayed viewable in GMail and continued to be broken in Thunderbird.
So that was my eventual solution: if an attachment is borked in Thunderbird, download it via the GMail web interface. Unsatisfying, but practical.
Mine are not retrievable in any circumstance once they’ve become corrupted. I’ve not used Thunderbird in a long time and so this isn’t applicable to my circumstance. I did read that when files are opened and scanned often, they can become corrupted. Needless to say I’ll be storing mine in someplace where they won’t get scanned by third parties. It should keep this to a minimum.
In Thunderbird if you Select “View Source” from the “Other Actions” drop down, you will find the problem.
My guess is that the PDF was not sent with the MIME (Multipurpose Internet Mail Extensions) Content Type of application/pdf, but instead used the fallback MIME Content Type of “Content-Type: application/octet-stream;”
The IETF’s RFC 2046 says:
The problem is that the developer of an email client can interpret the RFC as they see fit. As shown below every PDF attachment is done a little differently.
Content-Type: application/pdf; name="SSS.pdf"
Content-Type: application/pdf; name="ZZZ.PDF"
Content-Disposition: attachment; filename="ZZZ.PDF"; size=143065; creation-date="Tue, 29 May 2012 16:47:46 GMT";
modification-date="Tue, 29 May 2012 16:47:46 GMT"
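Rather than reading the raw source by eye, those headers can be pulled out with Python’s standard `email` package. This is only a sketch: the function name and the sample message (boundary, file name, body) are invented for illustration, and a real message would come from a saved .eml file.

```python
from email import message_from_bytes


def attachment_headers(raw: bytes):
    """List (filename, content type, transfer encoding) for each attachment."""
    msg = message_from_bytes(raw)
    found = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        filename = part.get_filename()
        if filename is None:
            continue  # not an attachment (e.g. the plain-text body)
        # A part with no Content-Transfer-Encoding header defaults to 7bit.
        encoding = part.get("Content-Transfer-Encoding", "7bit").strip().lower()
        found.append((filename, part.get_content_type(), encoding))
    return found


# Made-up message for illustration only.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--XYZ\r\n"
    b'Content-Type: application/octet-stream; name="scan.pdf"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b'Content-Disposition: attachment; filename="scan.pdf"\r\n'
    b"\r\n"
    b"JVBERi0xLjQK\r\n"
    b"--XYZ--\r\n"
)

for row in attachment_headers(SAMPLE):
    print(row)
```

Running the same function over the damaged message and a re-sent one would show whether the two were actually labeled differently.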
I’ve just hit this, and I think I’ve figured out what causes it.
Outlook is sending the PDFs using quoted printable rather than base64 encoding. Unfortunately, it’s not encoding the line breaks within the file, and when your mail client opens the file it correctly switches these to system line endings. In other words 0D 0A in the original gets replaced by 0A when you detach it. This is why the corrupted file is smaller than the original. Unfortunately, PDFs are not text files, and these substitutions will break the file.
I can’t explain why it works when it is resent. Presumably Outlook, for reasons best known to itself, switches to base64 encoding. You can check this by looking at the source of the emails and seeing which encoding Outlook has used in both cases.
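Here is a small sketch of that substitution, assuming the mail client simply rewrites every CR LF pair as LF when detaching. The sample bytes are made up and are not a real PDF; `detach_as_text` is a hypothetical stand-in for whatever the client actually does.

```python
def detach_as_text(data: bytes) -> bytes:
    """Hypothetical stand-in for a client that treats the attachment as text
    and rewrites CR LF (0D 0A) as a bare LF (0A)."""
    return data.replace(b"\r\n", b"\n")


# Made-up sample: PDF-like header and trailer around binary bytes that
# happen to contain the sequence 0D 0A.
original = b"%PDF-1.4\r\n" + bytes([0xFF, 0x0D, 0x0A, 0x00]) + b"\r\n%%EOF\r\n"
damaged = detach_as_text(original)

# Each substitution removes exactly one byte.
bytes_lost = len(original) - len(damaged)
crlf_pairs = original.count(b"\r\n")
print(bytes_lost, crlf_pairs)

# The sizes reported in the post: undamaged minus damaged.
reported_gap = 689_188 - 680_302
print(reported_gap)
```

Each substitution costs one byte, so if this theory holds, the 8,886-byte gap between the sizes in the post would equal the number of 0D 0A pairs in the good file. Counting them with `data.count(b"\r\n")` on an undamaged copy would be one way to check.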
If using Firefox, you have to disable the Adobe Acrobat NPAPI plug-in. In Firefox, go to the Tools drop-down menu and select Add-ons. In the add-ons list you will see the Adobe Acrobat NPAPI plug-in, and there should be an option to disable it. If using Safari, go to your user Library folder, then to “Internet Plug-Ins”; in that folder there should be the Adobe Acrobat NPAPI plug-in. When you locate it, drag it to the trash. You should also trash the attachment that was saved to your desktop. Restart Firefox and/or Safari and re-save the PDF attachments. After you re-save the attachments you should then be able to open them from the desktop. It corrected my problem on a MacBook Pro.
I found this by searching for the particular error. In my case the issue is that Entourage was only partially downloading attachments. I would expect that it would be smart enough to download the rest of the attachment when you try to open it, but that’s not the case!
At any rate, the easiest fix was to just set up the account to download the full attachment (the default was set to only download 4 MB).
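A quick way to spot a partially downloaded PDF is to check that the file still has its end marker. This is a rough heuristic rather than a real validator, and the function name and sample bytes are invented for illustration: it assumes a complete PDF starts with `%PDF-` and has `%%EOF` somewhere near the end.

```python
def looks_complete(data: bytes, tail: int = 1024) -> bool:
    """Heuristic: a complete PDF starts with '%PDF-' and has '%%EOF'
    within its last `tail` bytes. A truncated download loses the marker."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-tail:]


# Made-up PDF-like bytes for illustration only.
whole = b"%PDF-1.4\n" + b"x" * 5000 + b"\nstartxref\n0\n%%EOF\n"
partial = whole[:4000]  # simulate a download that stopped early

print(looks_complete(whole), looks_complete(partial))
```

A file that fails this check was probably cut short in transit; one that passes can still be damaged in other ways.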
I have read this thread & many others like it, and have made the 3 Config setting changes to False. In general, all attachments (pdf, jpg, doc, etc) from any email source (Outlook, iphone, gmail, etc) arrive corrupted in my Thunderbird IMAP account (me.com), but arrive perfectly in my Thunderbird POP account (***.com). Unless I can find an IMAP solution, I am going to have to revert to using a POP address.