Correcting Corrupted Characters
Published 14 years, 11 months past
At some point, for some reason I cannot quite fathom, a WordPress or PHP or MySQL or some other upgrade took all of my WordPress database’s UTF-8 and translated it to (I believe) ISO-8859-1 and then dumped the result right back into the database. So “Emil Björklund” became “Emil BjÃ¶rklund”. (If those looked the same to you, then I see “BjÃ¶rklund” for the second one, and you should tell me which browser and OS you’re using in the comments.) This happened all throughout the WordPress database, including to commonly-used characters like ‘smart’ quotes, both single and double; em and en dashes; ellipses; and so on. It also apparently happened in all the DB fields, so not only were posts and comments affected, but commenters’ names as well (for example).
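For the curious, here is a minimal Python sketch of what I think happened to that name. It assumes the stored UTF-8 bytes were read back as ISO-8859-1 (which Python calls `latin-1`); that is my guess at the mechanism, not a confirmed diagnosis.

```python
original = "Emil Björklund"

# Assumption: the UTF-8 bytes were reinterpreted as ISO-8859-1,
# so every byte became its own character.
utf8_bytes = original.encode("utf-8")
corrupted = utf8_bytes.decode("latin-1")

print(corrupted)                      # Emil BjÃ¶rklund
print(len(original), len(corrupted))  # 14 15
```

The two-byte UTF-8 sequence for “ö” turns into two separate characters, which is why the corrupted string is one character longer than the original.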
And I’m pretty sure this isn’t just a case of the correct characters lurking in the DB and being downsampled on their way to me, as I have WordPress configured to use UTF-8, the site’s head contains a meta that declares UTF-8, and a peek at the HTTP response headers shows that I’m serving UTF-8. Of course, I’m not really expert at this, so it’s possible that I’ve misunderstood or misinterpreted, well, just about anything. To be honest, I find it deeply objectionable that this kind of stuff is still a problem here on the eve of 2010, and in general, enduring the effluvia of erroneous encoding makes my temples throb in a distinctly unhealthy fashion.
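The header check can be sketched in a few lines of Python. The helper name `declared_charset` is made up for illustration, and the header value below is an example rather than a copy of what my server actually sends.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the charset parameter if present.
    return msg.get_content_charset()

# Example header value (an assumption, not captured from a real response).
print(declared_charset("text/html; charset=UTF-8"))
```

A check like this only confirms what the server claims to send. It says nothing about whether the bytes stored in the database are what they should be.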
Anyway. Moving on.
I found a search-and-replace plugin—ironically enough, one written by a person whose name contains a character that would currently be corrupted in my database—that lets me fix the errors I know about, one at a time. But it’s a sure bet there are going to be tons of these things littered all over the place and I’m not likely to find them all, let alone be able to fix them all by hand, one find-and-replace at a time.
What I need is a WordPress plugin or something that will find the erroneous character strings in various fields and turn them back into good old UTF-8. Failing that, I need a good table that shows the ISO-8859-1 equivalents of as many UTF-8 characters as possible, or else a way to generate that table for myself. With that table in hand, I at least have a chance of writing a plugin to go through and undo the mess. I might even have it monitor the DB to see if it happens again, and give me a big “Clean up!” button if it does.
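As a starting point, here is a rough Python sketch of both ideas: generating the table and undoing the damage. The function names (`mojibake_table`, `repair`, `looks_corrupted`) are hypothetical. The whole thing assumes exactly one round of UTF-8 being misread as strict ISO-8859-1. If the culprit was really Windows-1252 (which, as far as I can tell, is what MySQL's latin1 actually is), the smart quotes and dashes would need a different codec.

```python
def mojibake_table(chars):
    """Map each character's corrupted form back to the character itself."""
    return {ch.encode("utf-8").decode("latin-1"): ch for ch in chars}

def repair(text):
    """Undo one round of UTF-8 misread as ISO-8859-1.

    Returns the text unchanged when the reverse conversion fails,
    which is what happens for strings that were never corrupted.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text

def looks_corrupted(text):
    """True if repair() would change the text."""
    return repair(text) != text

# o-umlaut, em dash, en dash, ellipsis, double and single smart quotes
table = mojibake_table("\u00f6\u2014\u2013\u2026\u201c\u201d\u2018\u2019")

print(repair("Emil Bj\u00c3\u00b6rklund"))
print(looks_corrupted("Emil Björklund"))
```

Running `repair` over every text field would be the “Clean up!” button, and `looks_corrupted` could serve as the monitor. It is not foolproof: a legitimate string that happens to look like misread UTF-8 would be altered, so a backup first seems wise.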
So: anyone got some pointers they could share, information that might help, even code that might make the whole thing go away?