Correcting Corrupted Characters

Published 16 years, 8 months past

At some point, for some reason I cannot quite fathom, a WordPress or PHP or mySQL or some other upgrade took all of my WordPress database’s UTF-8 and translated it to (I believe) ISO-8859-1 and then dumped the result back right back into the database. So “Emil Björklund” became “Emil BjÃ¶rklund”. ~~(If those looked the same to you, then I see “BÃ¶rklund” for the second one, and you should tell me which browser and OS you’re using in the comments~~.) This happened all throughout the WordPress database, including to commonly-used characters like ‘smart’ quotes, both single and double; em and en dashes; ellipses; and so on. It also apparently happened in all the DB fields, so not only were posts and comments affected, but commenters’ names as well (for example).

And I’m pretty sure this isn’t just a case of the correct characters lurking in the DB and being downsampled on their way to me, as I have WordPress configured to use UTF-8, the site’s head contains a meta that declares UTF-8, and a peek at the HTTP response headers shows that I’m serving UTF-8. Of course, I’m not really expert at this, so it’s possible that I’ve misunderstood or misinterpreted, well, just about anything. To be honest, I find it deeply objectionable that this kind of stuff is still a problem here on the eve of 2010, and in general, enduring the effluvia of erroneous encoding makes my temples throb in a distinctly unhealthy fashion.

Anyway. Moving on.

I found a search-and-replace plugin—ironically enough, one written by a person whose name contains a character that would currently be corrupted in my database—that lets me fix the errors I know about, one at a time. But it’s a sure bet there are going to be tons of these things littered all over the place and I’m not likely to find them all, let alone be able to fix them all by hand, one find-and-replace at a time.

What I need is a WordPress plugin or something that will find the erroneous character strings in various fields and turn them back into good old UTF-8. Failing that, I need a good table that shows the ISO-8859-1 equivalents of as many UTF-8 characters as possible, or else a way to generate that table for myself. With that table in hand, I at least have a chance of writing a plugin to go through and undo the mess. I might even have it monitor the DB to see if it happens again, and give me a big “Clean up!” button if it does.

So: anyone got some pointers they could share, information that might help, even code that might make the whole thing go away?

Correcting Corrupted Characters was published on Thursday, November 19th, 2009.
It was assigned to the Web and WordPress categories.
There have been seventy-two replies.

Comments (72)

The person that created this page: http://rishida.net/tools/conversion/ also uses wordpress. While, sadly, I cannot offer you any advice perhaps contacting him could be a help. I use his converter a great deal as my job requires converting characters all too frequently.

Signed,
Mitchell Shelton
Thursday, November 19th, 2009 9:23am
The list of characters: http://rishida.net/scripts/uniview/
The conversion from unicode to just about any other form:
http://rishida.net/scripts/uniview/conversion.php?origin=codepoint&codepoints=2026

Signed,
Steven Pemberton
Thursday, November 19th, 2009 9:23am
Are you sure the encoding isn’t just wrong and that it still isn’t really utf-8?

Could you test with perl and try decoding as utf-8?

use utf8;
use Encode;
decode(“utf8”, $string);
encode(“utf8”, $string);

try copying and pasting into xxd and see what the hex of the characters really is?

Signed,
Peter Mahnke
Thursday, November 19th, 2009 9:24am
Hi Eric,

If you are (relatively) sure that the data got converted to latin1/iso-8859-1, you can convert it back to utf-8 using mysql.

You would need to use a query like below – adjusted for your tables and columns – to do it:

INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;

For a complete php script to convert your database:
http://www.phpwact.org/php/i18n/utf-8/mysql

More info on CONVERT() and CAST():
http://dev.mysql.com/doc/refman/5.1/en/charset-convert.html
http://dev.mysql.com/doc/refman/5.1/en/cast-functions.html

Hope this helps and if you need more info, please feel free to contact me with twitter or email.

Juliette

Signed,
jrf_nl
Thursday, November 19th, 2009 9:25am
Hey,

I dunno a script, but for my part, I made sure that the table structure and the database is set to f.e. utf8_generel_ci or utf8_unicode_ci before importing the SQL (the SQL file should be in UTF-8 too).

handling with encoding problems is very annoying. I just had to convert a 2GB ISO8859-9 MySQL Dump to UTF-8. In addition the dump was from a MySQL 4.0.x database, whereas the target server was 5.0. I am happy to have an ssh access. Otherwise I had to download 2GB and upload again. Then the next problem was that iconv could only handle ca. 90 MB file at once, so splitted, converted, combined, removed duplicate entry inconsistencies, too large varchar length values, …… I never ever wanna be a UNIX administrator

Signed,
Halil K
Thursday, November 19th, 2009 9:43am
Hi,
I’m no expert either but I’d go with Peter Mahnke’s suggestion before really messing with your data.

It might be that you’ve upgraded from a MySQL version that doesn’t support charsets to one that does, and that the upgrade has assumed your data is ISO-8859-1 (it is the MySQL default I believe). i.e. it’s taken your raw utf-8 and simply “labelled” it iso-8859-1.
Ben

Signed,
Ben Holland
Thursday, November 19th, 2009 9:59am
Your problem is that in

Signed,
Priit Pirita
Thursday, November 19th, 2009 10:14am
Do you (and all your scripts) always execute SET NAMES UTF-8 before running any MySQL queries?

If not, that is likely to blame. This is really really painful gotcha. MySQL defaults to latin1, and UTF-8-based applications that fail to declare UTF-8 each time every time when talking to MySQL, will screw up encoding badly.

Signed,
kL
Thursday, November 19th, 2009 10:14am
Can’t help you with the problem… :( But you might find this one worth a look once you are up and running again: http://ilfilosofo.com/blog/wp-db-backup There are at least a couple alternatives but this is pretty robust…

Signed,
Frederik Raabye
Thursday, November 19th, 2009 10:16am
I don’t feel I can be very helpful at the moment with regard to the post subject matter, but I’m pretty sure that plugin stuff is not irony. :D

Signed,
Coheed
Thursday, November 19th, 2009 10:16am
Do you have this in wp-config.php?

define(‘DB_CHARSET’, ‘utf8’); // force utf8
define(‘DB_COLLATE’, ‘utf8_general_ci’); // force collation

Signed,
Steven Pemberton
Thursday, November 19th, 2009 10:19am
Oh and I forgot to mention that possibly somebody/something took your UTF-8 database and converted it once more into UTF-8. My solution reverts that back.

Signed,
Priit Pirita
Thursday, November 19th, 2009 10:23am
Steven: no, wp-config.php says:

define(‘DB_CHARSET’, ‘utf8’);
define(‘DB_COLLATE’, ”);

To all asking if I’m sure that my data isn’t UTF-8, I’m pretty sure it is UTF-8, but was at some point converted to ISO-8859-1 and then the resulting characters were re-inserted and are now treated as UTF-8.

My reason to believe this is that I did a search-and-replace of all corrupted smart quotes (singles and doubles), finding the three-characters strings and replacing them with raw UTF-8 characters (right?), and now they display correctly instead of with the 8859 three-character strings. Here, I’ll input some straight non-ASCII characters. If they come out clean, I would assume that I’m storing things in the DB as UTF-8.

Signed,
Eric Meyer
Thursday, November 19th, 2009 10:31am
Well, I see the quotes in my previous comment correctly (and they’re not HTML entities) but the other symbols I inserted are now question marks, even though they looked good when I previewed the comment. Let me try a few more:

Signed,
Eric Meyer
Thursday, November 19th, 2009 10:33am
Very interested to see if a solution comes of this; this happens to me from time to time when I do export & imports of my SQL databases (or even using WordPress’ own built-in export/import function)

Signed,
Jeff Byrnes
Thursday, November 19th, 2009 11:15am
Priit: your solution didn’t work for me. I tried the following as a test:
```
$str = "ö";
$str = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
echo $str;
```
The output was a question mark. I checked and the page was UTF-8.
Signed,
Eric Meyer
Thursday, November 19th, 2009 11:15am
kL: I don’t know, since pretty much close to all the PHP that touches the content is WordPress, and I have no idea if it’s uniformly clean with character handling or not. Of course, as others have pointed out, it could’ve been a mySQL upgrade that did it.

Signed,
Eric Meyer
Thursday, November 19th, 2009 11:20am
That’s because your PHP files is NOT UTF-8.

Use some editor which can save PHP source’s as UTF.

And yes I did check that solution before I posted and yes now I converted my php source into MacRoman and got ?.

Signed,
Priit Pirita
Thursday, November 19th, 2009 11:21am
Ah HAH! Thanks many times over, Priit! It’s all starting to make a lot more sense. The temples are still throbbing, but a good deal less so now.

Now I’m wondering about the best way to check a great big UTF-8 database to see if its content has gotten mangled this way. I could of course search for records containing mangled string sequences but that would mean a lot of checks for all the possible mangles. It seems like there ought to be a better way.

Signed,
Eric Meyer
Thursday, November 19th, 2009 11:32am
Try to dump, then try to convert that dumpfile back as I recommened and then try to read it back in.

And no, i haven’t tried it myself :-)

Signed,
Priit Pirita
Thursday, November 19th, 2009 11:38am
Hey Eric,
It seems like you

Signed,
Oli
Thursday, November 19th, 2009 11:48am
Grrr. My further, UTF-8-powered tests are hitting a couple of snags.

Here’s an example. “Fatih Hayrio?lu’nun not defteri” (see the top of this page for what that’s really supposed to look like, since it looks like WP glitched one character in the process of posting the comment) got corrupted into “Fatih Hayrioğlu

Signed,
Eric Meyer
Thursday, November 19th, 2009 11:49am
Can you live with entities? as ğ

Signed,
Priit Pirita
Thursday, November 19th, 2009 12:00pm
No, I’m stupid. Sorry.

Use ISO-8859-15

Signed,
Priit Pirita
Thursday, November 19th, 2009 12:05pm
You also need to check the page source as well for some reason you are setting the Content-TYpe Twice once for latin1 (iso-8859-1 line 8 this page) and again as utf-8 (line 10). It will probably not solve all your problems but it does not make it any easier for the browser to know which one to use.

The headers sent by apache look ok agian this page via curl -I

HTTP/1.1 200 OK
Date: Thu, 19 Nov 2009 17:31:18 GMT
Server: Apache
X-Powered-By: PHP/5.2.8
Vary: Cookie
X-Pingback: http://meyerweb.com/eric/thoughts/xmlrpc.php
Connection: close
Content-Type: text/html; charset=UTF-8

here’s some swedish characters just to test:

Signed,
Brian Riddle
Thursday, November 19th, 2009 12:33pm
I can

Signed,
Smokey Ardisson
Thursday, November 19th, 2009 12:52pm
Brian, thanks for pointing out that erroneous, superfluous meta. I’ve corrected it. It only showed up on some pages, which is why I hadn’t seen it when I viewed source on other pages.

Priit: I changed my test script to translate from ISO-8859-15 to UTF-8 and it translated ğ into ? (entity version ľ – codepoint x13E – Unicode name LATIN SMALL LETTER L WITH CARON) instead of ? (entity version: ğ). I could live with entities as long as I could convert to them accurately, though I’d really prefer to store the raw characters. I could decide to convert to entities when displaying on the web, or just push them out raw, or whatever. The point being that I’d much rather store the real text and decide how to convert it for display than store the converted forms.

Signed,
Eric Meyer
Thursday, November 19th, 2009 1:45pm
Strange – if the result is ? then input IS already correct UTF and you shouldn’t touch it. Or it means that this record wasn’t converted or something.

Signed,
Priit Pirita
Thursday, November 19th, 2009 2:02pm
Smokey: if I change my wp-config.php to read define('DB_CHARSET', ''); then even more things go wrong, as in more characters in this post and its comments get mangled. (Though usually just as question marks.) Undoing that back to define('DB_CHARSET', 'utf-8'); gets me back to where I am now, which is better but not great.

Steven: I finally got to test the settings you recommended. Setting define('DB_COLLATE', '

Signed,
Eric Meyer
Thursday, November 19th, 2009 2:41pm
This plug in may do the whole thing in one shebang:

http://wordpress.org/extend/plugins/utf-8-database-converter/

Signed,
Mark Groen
Thursday, November 19th, 2009 2:46pm

Hi.

1. I assume that your problem has nothing to do with PHP neither WordPress. It is a bug (or a feature) of MySQL.

2. Implement a dbclass for all database connects like this:

class dbFactory {
    private static $conn='';

    public static function getConnection() {
        try {
            if (!$conn) {
                $conn = new PDO('mysql:host=localhost;dbname=yourdatabase','user','password');
                $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
		$conn->query('SET NAMES utf8;'); # This is important.
            }
        } catch (PDOException $e) {
            die($e->getMessage());
        }
        return $conn;
    }
}

3. Grap phpMyAdmin and backup every table of your application in sql format.

4. Insert the data back from the files and look at the results.

5. You’re done.

Mark: Smokey says that plugin doesn’t work for recent versions of WP.

JeyJey: that looks very cool, except I have no idea how I would go about using it. As in, no idea where I would add that class, nor how to make sure I invoked it.

Signed,
Eric Meyer
Thursday, November 19th, 2009 3:21pm

Okay, folks, I think I may have found the root of the problem. Thanks to Oli’s comment, I managed to pull the following from my WP install:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     | 
| character_set_connection | latin1                     | 
| character_set_database   | latin1                     | 
| character_set_filesystem | binary                     | 
| character_set_results    | latin1                     | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

So, yeah. I’m thinking that’s my problem. Now I just have to figure out how to fix it.

Eric: Are older posts (at least those without characters you

Signed,
Smokey Ardisson
Thursday, November 19th, 2009 3:44pm
I ran into this same problem recently, and documented how to fix it on my blog. Long story short, older wordpress databases were created in latin1, but when the utf8 option came out, wordpress happily put UTF8 data into the latin1 database. The most recent version of WP got more strict about how it reads that database, and starts puking. The solution is to export your DB in latin1 format, manually change the dump file to say it’s UTF8, and then re-import. This solved the problem on several of my older WP blogs.

Details here: http://spaceninja.com/2009/10/how-to-convert-your-old-wordpress-database-to-utf8/

Signed,
Scott Vandehey
Thursday, November 19th, 2009 3:53pm
Continuing from twitter.

DB stores bits, that’s all. Let’s take that “Bj

Signed,
Priit Pirita
Thursday, November 19th, 2009 4:15pm
Oh, and in your case some piece of software now actively CONVERTS (or converted) your data and this is causing problems. Normally this should be avoided.

Signed,
Priit Pirita
Thursday, November 19th, 2009 4:30pm
After digging around by links in comments 34&35 I found following:

If DB_COLLATE and DB_CHARSET do not exist in your wp-config.php file, DO NOT add either definition to your wp-config.php file unless you read and understand Converting Database Character Sets.

So Smokey’s advice seems to ge good one.

Signed,
Priit Pirita
Thursday, November 19th, 2009 4:42pm
Here’s the thing, though, Priit: so far as I can tell, the string actually stored in my database is actually “Björklund”, 10 bits of data. (Or is it bytes? 10 characters, anyway.) Once upon a time, though, it wasn’t. Somehow, the original text with the

Signed,
Eric Meyer
Thursday, November 19th, 2009 4:43pm
It’s bytes of course :-) and your DB seems to be OK, nothing converter there (and saved) so that’s good news.

But seriously, try this advice http://codex.wordpress.org/Editing_wp-config.php#Database_character_set

and remove DB_COLLATE and DB_CHARSET from wp_config, easy to try and costs nothing.

Signed,
Priit Pirita
Thursday, November 19th, 2009 4:50pm
Holy [REDACTED], Priit, removing (commenting out) both lines did the trick. What the [REDACTED]?

Right now, I feel like I went down another dead end and was shocked to discover the terminal wall was just a hologram, beyond which was a land of cake and ice cream.

Signed,
Eric Meyer
Thursday, November 19th, 2009 4:55pm
And ONLY this entry is messed up, because of the wrong settings, but the next ones will not.

Signed,
Priit Pirita
Thursday, November 19th, 2009 4:58pm
Wait, so you’re saying that any content using non-Latin characters that was entered while the wp-config settings were in effect will have been mangled? Because I have no idea how long those settings have been there. It might’ve been only since I upgraded to 2.8 (straight from 2.6—long story) but I don’t know when they were added to wp-config.

Signed,
Eric Meyer
Thursday, November 19th, 2009 5:06pm
Dunno, previous entry seems to be OK? Besides, I don’t know WP at all :-)

But, here is the root of that mess, if somebody is interested: Your database is old one, text fields are defined in latin1. Now when you tell mysql to set character set to UTF-8, mysql “knows” that input is in latin1 (by definition) and CONVERTS that text to you (which is actually already UTF-8). Result is mess…

Signed,
Priit Pirita
Thursday, November 19th, 2009 5:19pm
Is there any way I can clean up the root of the mess so that this doesn’t keep happening in the future? Like converting the database’s text fields to UTF-8, for example? Or would that just create more mess?

Signed,
Eric Meyer
Thursday, November 19th, 2009 5:30pm
If you look at my earlier comment, you can see how you can convert a database to UTF-8.

However, for UTF-8 to work properly, *everything* should be aimed at UTF-8, files, database, headers, php functions used etc. If you are not comfortable looking under the hood of an app or the database, you are probably best of leaving things in whichever way WP has/had designed them to be even remotely safe in getting expected results.

In the article which I’ve linked my name to, you’ll find some more info. Hopefully I’ll finally get the time sometime soon to sort through some stuff and put up the slides of a couple of presentations I’ve given on UTF-8. I’ll link to them from the site.
If you’d like, I can mail you a PDF for now. The slides contain info on (nearly) every aspect of UTF-8 use in a database driven webapplication from basic linguistics through client side to server side handling of it.

Signed,
jrf_nl
Thursday, November 19th, 2009 6:29pm
If you want to use UTF-8 (and you should), the first thing to do when installing a new WP is to create a database with the right charset/collation (e.g. utf8_general_ci). Then you can keep the default DB_CHARSET (UTF-8) in your wp_config, and correctly handle any non-Latin character.

If you are willing to attempt, you should be able to: export your actual database to a backup file, delete the database, then create a new one with the correct charset/collation and finally import back your backup file. It should work just fine.

Signed,
Samuel Santos
Thursday, November 19th, 2009 7:12pm
Hey Eric,
To see if the problem is your data or just your MySQL/WP settings you need to do a dump from the command line:

mysqldump -t -uUser -pPassword database_name >dump_data.sql
mysqldump -d -uUser -pPassword database_name >dump_schema.sql

Open the dump_data.sql file in a text editor and see if you have mojibake (garbled characters) in there. If not the problem is only in your schema and MySQL settings—open dump_schema.sql and remove DEFAULT CHARSET=some_charset, save and close. Set up MySQL to be end-to-end UTF8 (ref my article), then go into mysql -uUsername -p and CREATE DATABASE new_database_name DEFAULT CHARACTER SET utf8. Exit mysql then import your data:

mysql -uUser -pPassword new_database_name <dump_schema.sql
mysql -uUser -pPassword new_database_name <dump_data.sql

Now if you check the dumped data and there *is* mojibake in there, it’s possible that you’ve got a database saved as latin1 that’s being incorrectly displayed in UTF8. Try this: dump as latin1, find-and-replace latin1 to utf8, then reimport as utf8. Again make sure you change MySQL to have a 100% UTF8 setup before reimporting.

Once MySQL is 100% utf8 and you can see the characters correctly in PHPMyAdmin or Terminal mysql (assuming they’re set to utf8), you can then move onto to WP setup problems. But it’s probably going to be MySQL. :/

When this has happened to me I’ve generally been able to muddle through without needing to manually convert the borked entities, but it can take some muddling. Ping me on Twitter if you want

peace – oli

PS isn’t it great that Japanese *has a word* for garbled text? :-|

Signed,
Oli
Thursday, November 19th, 2009 8:28pm
Okay, so far I think that’s four recommendations to mysqldump the data in Latin1, find-and-replace “latin1” with “utf8”, and then re-import it as UTF-8 into the database. When I actually did that this afternoon, I had comments and posts that were severely truncated—so far as I could tell, at the point where the first non-Latin1 character was encountered in each field. I used the instructions in SpaceNinja’s post, but it seems like they’re basically the same as all the others people have referenced.

So I’m kind of not willing to do that again without a really compelling reason to do so.

Oh, and YES I want to use UTF-8. Would I have gone through all of this pain if I didn’t care about doing character encoding right?

Signed,
Eric Meyer
Thursday, November 19th, 2009 10:54pm
This has happened to us before, both for various diacritical symbols and European letters, and for Japanese. We love UTF-8 and Unicode and operate both English and Japanese sites. But you know what? For English sites we just don’t have time to dick around with these problems, which have repeatedly happened, so we just adopted a policy of only using straight lower-bit ASCII in our blog posts. It saves a lot of headaches and time and lets us concentrate on stuff that matters and makes up money. Some day in the future this will not be necessary, and you’d think it shouldn’t be necessary today, but unfortunately if these glitches only happen occasionally, it’s still a huge waste of time to fix them, and we have washed our hands of it.

Signed,
Aklap
Friday, November 20th, 2009 3:20am
no,no,no,

The reason of this mess lies entirely in WP. WP did use defaults and now decided to do the safe thing and not to trust the defaults. To ensure that your DB does not convert anything and acts like dumb storage you need to do three things a) specify charset when creating the table b) specify charset when accessing the table c) ensure that the two are the same.

So far it worked with default – WP (afaik) did not specify those charsets nowhere. The problem is that also some other software can use the same DB and someday some admin may change some global DB charset default settings. Then you have problems, because your DB is created with different charset and now you are accessing the DB with new, different, charset (exactly what happened to you). To avoid that WP now does the three things I mentioned earlier. New databases are created as UTF-8 and by access charset is set to UTF-8.

Only now all the old installations are srewed, but fortunately there’s an easy fix. As in your case – use the DB’s default charset by access.

There’s no such thing as converting database to UTF-8. That “converting” actually means describing – you just SAY to the DB that “this text field is UTF-8”, but the actual bytes remain the same. Like in programming where signed integer and word are the same data…

Look at the advice in comment 48 – you dump the database, DO NOTHING WITH THE DATA, then change description (charset to UTF) and read THE SAME data back in. Where’s the converting?

My advice is to let it be. Relax. In future WP upgrades be sure to comment that line out again and everything should be normal (read UTF-8 :-) ). Do not start messing with DB settings.

Signed,
Priit Pirita
Friday, November 20th, 2009 3:38am
Priit Pirita is right, by dumping and importing back, you are *not* converting the data.
To convert the data you need the query/script I posted/linked to in the 4th comment.

Signed,
jrf_nl
Friday, November 20th, 2009 6:13am
I think I got all that. However, I did exactly what comment #48 describes: “dump as latin1, find-and-replace latin1 to utf8, then reimport as utf8”. I did that. The end result was truncated content in comments, and probably posts. I immediately imported the original, un-altered dump, and everything went back to the way it was. (Which, at this point, is only slightly broken.)

Per your recommendation, Priit, I’m probably going to let things be, although it irks me that I may have to re-edit wp-config in future upgrades. Actually, scratch that: what irks me is that I’m going to have to remember what to do if a future upgrade borks things up again. I was trying to avoid that possibility. However, I’m willing to live with it for now. If I ever get a free day, maybe I’ll re-examine the export/FnR/import path again.

Signed,
Eric Meyer
Friday, November 20th, 2009 9:45am
“The end result was truncated content in comments, and probably posts.”

I suspect that this is because your editor settings, although I’m not sure. If you open that dump and treat it as UTF-8 and save as UTF-8 this shouldn’t happen. I have done this kind of things many times and it worked, only I find-replaced other things… One possible cause might be if you dumped with charset settings. I think this is wrong, try to dump with default settings, _exactly_ as in post 48.

Signed,
Priit Pirita
Friday, November 20th, 2009 10:13am
Wow, just wow.

This whole post (and the comments) in my opinion demonstrates quite well why you should not trust your data to a database.

It’s simply too much of a pain (certainly for the average user, but also for the knowledgeable user who doesn’t want to waste their time) to be an admin and deal with crap like this.

For HTML5 files, all I have to do is put this at the top (square brackets instead of angle brackets for sake of comment form)

[!DOCTYPE html]
[meta charset=”utf-8″]

And I’m done. It works. Today. Forever.

No dealing with database corruption nonsense, no weird config incantations to make sure a database doesn’t screw itself up. etc.

Databases are not worth the time tax of being a database admin to find/fix/maintain problems like corrupted characters.

Signed,
Tantek Çelik
Friday, November 20th, 2009 6:04pm
Having not used PHP or WordPress in years, I’m a bit mind-boggled. Frankly, any technology that wouldn’t support, and default to, UTF-8 in 2009 scares the hell out of me. Wow.

Signed,
Jeff Croft
Friday, November 20th, 2009 6:24pm
This looks like a classic double-encoding problem.

ie, your data stored in your database is UTF-8, but something in between the database and your HTML output thinks that it’s LATIN-1, so goes ahead and converts it to UTF-8. The result is that multi-byte characters get treated as multiple single byte characters, each of which is converted into a single multi-byte UTF-8 character.

Skimming through your comments, it looks like your database is set to latin1. Check if your tables are also set to latin1 or utf-8 (SHOW CREATE TABLE).

If they are latin1, then what you have to do is dump all your data out as latin1 or binary (so that it doesn’t do any conversion). You could just use mysqldump to do this and it will use the default character set.

After that, you need to delete all data in the database. This can be done with by issuing a drop database. This will break your website, and probably give the first visitor the option of creating a new blog, so you might want to turn off access to your blog while this is happening.

After you’ve dropped the database, do two things. In the /etc/my.cnf file, make sure the default character sets are set to utf8, you’ll need to check the mysql website for the specific names they use for collation. I think it’s utf8_general_ci or utf8_unicode_ci. Secondly, go through the dump file in a text editor and make sure none of the tables have the latin1 charset. If they do, you need to change this as well (make a backup of the file first in case a typo gets in there).

Once done, recreate the DB using mysql < dump-file.sql

After this, all your data should still be in utf8, but now mysql actually knows it's in utf8. You can verify by doing a SHOW CREATE TABLE on all your tables and make sure they're set to utf8.

Now re-enable your blog and hope it works.

Note that WordPress may have a way to dump the data and re-import it, but the dumped data may have the same problems that you're seeing. One option is to maybe tell wordpress that you expect latin1 data in the db and that you want your output to also be latin1. That way WP will ask MySQL for latin1 data (no conversion done) and will output latin1 data (no conversion done), and what you get is what is in the db. Then before re-importing, change the charsets everywhere to utf8.

Note that I don't use WP, so have not tested this on my own with WP.

@Tantek: while I agree with you that users who don't want to be DBAs shouldn't be writing their data to a DB that they control, the problem here is neither the database's fault, nor the user's fault. It appears to be a piece of software (WP?) making assumptions about the data and the database's character set. One of those it controls, and the other is easy to check.

Signed,
Philip Tellis
Friday, November 20th, 2009 7:21pm
Okay, now that’s five recommendations to dump and re-import after I already tried that and it didn’t work. Made things much, much worse, in fact.

Tantek, I sort of agree with you, but there are things WP does for me that hand-rolling wouldn’t provide. Like comments, for example, which I am emphatically not willing to outsource to a third-party cloud service; and which simply listing inbound links does not come close to replicating. Perhaps there are solutions now that would do all this but not rely on a database, but I don’t remember seeing any back in 2004.

Jeff, I believe that if I freshly installed WP in 2009, it would set things up using UTF-8 and there’d be no issue. I installed it almost six years ago, though. Things have advanced a bit since then.

Signed,
Eric Meyer
Friday, November 20th, 2009 9:44pm
comments: tl;dr, just that I’ve had exactly zero succes with dumping the database (via phpMyAdmin) and reimporting.

I’ve ran across this many times, the problem is that MySQL is encoding aware, and once you get bogus data in the database, no amount of recoding between different encodings or setting “set names” will help (in fact, the worst thing you can do is try to repair it by simply setting the correct encoding for your tables – if the target encoding doesn’t have a glyph, it gets irreversibly replaced by ?).

There’s a simple workaround that worked for me many times.

First, change the database column type from TEXT (or varchar) to BLOB (or VARBINARY). This makes MySQL “forget” about any encoding it thinks the data is in, and prevents any recoding of the data that goes around behind the scenes (in my case, the data was often encoded in CP1250 or UTF-8, but column encoding was set to LATIN1).

Then you have to find out in what encoding the data is in, and reset the column type to text/varchar/char with the encoding that matches the physical encoding of the data (in my case often CP1250). Once the physical encoding in the database matches the “logical” encoding of the columns, it’s possible to simply change the encoding of the columns (to UTF-8), and with the correct SET NAMES, you can have your webpage output anything you want (from UTF-8 to LATIN1).

When the data is double encoded (it originally was in UTF-8, it got reimported into the database as latin1 and then the encoding of the columns changed to UTF-8), you first have to set the encoding of the columns back to what it was when it was imported – this changes doubly encoded UTF-8 to physically singly encoded UTF-8 that the database thinks is in LATIN1 (for example), and then you go the route from TEXT (latin1) -> BLOB -> TEXT(UTF-8).

I think I have seen some scripts that try to do this automatically (by being really smart and getting information from the data dictionary), but for smaller scale databases such as wordpress, doing everything manually might be more tedious, but I think it’s safer.

A few short points:
1. It is vital to get the physical encoding to match the encoding that is set in the table column type (I’m not sure if search and replace will help because it works on already encoded data)
2. The encoding that is set in the HTML pages only determines what encoding the browser sends to PHP
3. PHP doesn’t know (or, unfortunately, care) what encoding you get from the browser. GIGO.
4. The MySQL cares about the encoding of the data from the browser (and what encoding it sends back). Use “SET ENCODING” SQL command to tell the database this information (AFAIK, WP does this).
5. The database performs a lot of conversion behind the scenes – if the database thinks you send it data in latin1, but you have tables columns in UTF-8, it WILL do a conversion from latin1 to utf-8, even if the data already was in UTF-8 (or worse, cp1250).
6. Changing the encoding of a column from one encoding to another performs physical recoding of the data, so you have to roundtrip it via BLOB or BINARY.
7. Once you try to convert two incompatible encodings, MySQL will insert a question mark (physically) for every character it can’t convert (happens for example when changing between CP1250 and LATIN1, or importing UTF-8 data as UTF-8 data in table columns that have their encoding set to LATIN1).

Signed,
Kim Sullivan
Sunday, November 22nd, 2009 10:10am
I had the same or similar problem last year, with a client, where my WordPress data got encoded to UTF-8 twice. I rolled my own script to “double decode” the binary mess in my SQL dump file back to some sane text with the script that is linked above. Perhaps that can be of any help, if you’re the scripting kind of person.

Signed,
Jeroen Pulles
Sunday, November 22nd, 2009 11:49am
@Eric,

I had to do this last year and blogged about it at the time. I remember an early draft including instructions “open vim and…” – I quickly realised as soon as you get an editor involved you’re fooked. Fortunately, MySQL has a command-line tool for doing search-and-replace so you don’t have to worry about editor settings or other random phenomena. (the instructions are in my latin1 to utf8 conversion post)

BTW. Using Safari 4.0.4 (the latest) on OS X, the encoding in this article looks fine, but the comments are screwed up. Forcing the text-encoding to ISO Latin 1 fixes the comments, but borks the names of the commenters (e.g. Tantek Çelik) I don’t know how far you think you’ve got fixing the issues, but it looks like there’s some way to go… (Note: using the web inspector / firebug you can check document.characterSet for the displayed character-set – which is handy when you’re checking you’ve overridden the text-encoding via browser menus.)

Signed,
Ash Searle
Tuesday, November 24th, 2009 12:08pm
Eric: Just quickly commenting to say that I’m having the exact same problem of late, and will read through the comments and any updates from you on a workable solution. Tired of doing find/replace.

Signed,
Wade Kwon
Saturday, November 28th, 2009 6:55pm
Finally solved this problem for myself, using the method described in this post:

http://tlug.dnho.net/?q=node/276

Yes, it’s another MySQL dump & import procedure. Haven’t checked for truncated content, but so far all my em & en dashes look good.

Signed,
Matt Sharkey
Wednesday, December 2nd, 2009 6:31pm
I’ve had to deal with this a bunch of times…. what I do – and it’s always worked for me – is 1st do a dump. Then – assuming you’re on Mac or Linux – run from the commandline:

iconv -f latin1 -t utf8 myDump.sql > myDumpUTF8.sql

Reimport….

Signed,
Aeron Glemann
Friday, December 4th, 2009 8:32am
“This whole post (and the comments) in my opinion demonstrates quite well why you should not trust your data to a database.”

Funny.

Seriously though, you’re probably already going to do this but please post a follow-up post with an overview of the problem and the eventual solution, when you find it. Going through all of these comments makes me feel like a total N00000000B. This has happened to me in WordPress a couple of times and each time I’ve just done manual search-and-replace for the characters I know about.

Signed,
Mike D.
Sunday, December 6th, 2009 9:15pm
After reading this article + comment thread, I’ve decided that the easiest solution to these pesky characted encoding problems is if I just change my name.

I was thinking maybe Emil Borkedchar?

Signed,
Emil Björklund
Saturday, December 12th, 2009 5:01am
I’m not sure if this helps, and I know some people already pointed some of it out, but I recently converted my DB to UTF-8 and this is what I did:

1. mysqldump the whole thing to a file
2. Add a special character (like “Ö”) to said file that looks good in the editor
3. Open the file with Firefox and check which encoding is used when the “Ö” looks ok (to find out exactly what encoding the file is)
4. Run iconv on the file to actually convert it to UTF-8 (from whatever encoding Firefox said it was)
5. Manually convert bad characters to good ones (and change potential encoding=latin1-settings in the sql-file to utf8)
6. Create new database where everything is UTF-8
7. Import the new, clean, utf8 SQL

That worked for me at least and I’ve had problems with encodings as far as I can remember.

I think one important bit I didn’t see in the comments (although it may have been mentioned) is to not only convert the characters but also convert the actual file (which I used iconv for).

Signed,
Andreas Lagerkvist
Saturday, January 2nd, 2010 6:31am
This powerful but simple perl script is marvelous to convert your MySQL database charsets to UTF8 quick and easy. I use it every time.

http://www.pablowe.net/convert_charset

Signed,
Josue Rodriguez
Wednesday, January 6th, 2010 6:29pm
To make a very long story short – this is what you need to do:

– Amend DB specific entries (host, user, pass, db, fields and table).
– Don’t fiddle with the rest of the code.
– Upload to browsable part of your web server/site.
– Call the “page” from firefox.
– Wait (and don’t reload) until complete.

If you need any of the code explained, feel free to drop me an email.

hth, cheers.
/j.

ps. the code tag strips brackets, which is a bit annoying (converted to lt|gt)…

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="uid" content="10" /> </head> <body>
<?php $DB = new mysqli('host','user','pass','db'); $DB->query("SET NAMES latin1"); if ($res = $DB->query ("SELECT unique_field, fix_field_1, fix_field_2, fix_field_3, fix_field_4 FROM fix_table WHERE 1=1")) { echo 'rows: '.$res->num_rows; $cnt = 0; while ($data = $res->fetch_object() ) { $DB->query("SET NAMES utf8;"); $unique_field = $data->unique_field; $fix_field_1 = bin2hex($data->fix_field_1); $fix_field_2 = bin2hex($data->fix_field_2); $fix_field_3 = bin2hex($data->fix_field_3); $fix_field_4 = bin2hex($data->fix_field_4); $result = $DB->query (" UPDATE fix_table SET fix_field_1 = UNHEX('".$DB->real_escape_string($fix_field_1)."'), fix_field_2 = UNHEX('".$DB->real_escape_string($fix_field_2)."'), fix_field_3 = UNHEX('".$DB->real_escape_string($fix_field_3)."'), fix_field_4 = UNHEX('".$DB->real_escape_string($fix_field_4)."') WHERE unique_field = '".$unique_field."'"); echo $cnt." - ".$unique_field."<br />"; unset($unique_field); unset($fix_field_1); unset($fix_field_2); unset($fix_field_3); unset($fix_field_4); echo $DB->error; $cnt++; } }
?>

Signed,
Johan Sand
Thursday, January 7th, 2010 3:11am
and for Friday Fun – if you want to update the entire database including all potentially affected records in all relevant fields in all tables, then this would be a crazy kenobi option.

This time only amend db host, user, pass and name.

Again, upload to site and run through firefox as is.

ps. make sure there’s enough execution time for php to wrap it up.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="uid" content="10" /> </head> <body>
<?php $db_host = 'host'; $db_user = 'user'; $db_pass = 'pass'; $db_name = 'name'; $DB = new mysqli($db_host, $db_user, $db_pass, $db_name); $field_types = array('varchar','text','tinytext','longtext'); if ($res_tables = $DB->query ("SHOW TABLES")) { while ($tables = $res_tables->fetch_array(MYSQLI_NUM) ) { if ($res_fields = $DB->query ("SHOW COLUMNS FROM ".$tables[0])) { if ($res_key = $DB->query ("SHOW COLUMNS FROM ".$tables[0]." WHERE `Key` LIKE 'PRI'")) { $key = $res_key->fetch_assoc(); $unique_key = $key['Field']; } while ($fields = $res_fields->fetch_array(MYSQLI_ASSOC) ) { if (in_array($fields['Type'], $field_types)) { $DB->query("SET NAMES latin1"); if ($res = $DB->query ("SELECT ".$unique_key.", ".$fields['Field']." FROM ".$tables[0]." WHERE 1=1")) { while ($data = $res->fetch_object() ) { $DB->query("SET NAMES utf8;"); $unique_field = $data->$unique_key; $fix_field = bin2hex($data->$fields['Field']); $result = $DB->query (" UPDATE ".$tables[0]." SET ".$fields['Field']." = UNHEX('".$DB->real_escape_string($fix_field)."') WHERE ".$unique_key." = '".$unique_field."' "); unset($unique_field); unset($fix_field); } } } } } unset($key); unset($unique_key); } }
?>

Signed,
Johan Sand
Thursday, January 7th, 2010 5:34pm
I’m not sure if this is still relevant, but since I didn’t see any mention of this in the other comments, and going by character set details you posted, the issue is probably in the connection character set / collation.

It’s a common issue that MySQL selects an inappropriate connection collation, regardless of the headers in the HTTP request (those are irrelevant, since it is the PHP script that connects to the database). You can either force the connection to UTF8 in the MySQL configuration, or issue two queries on every queries that set the connection to UTF.

Those would be:
SET CHARACTER SET UTF8;
SET NAMES UTF8;

You can read on those on the MySQL docs –
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

Signed,
Eran Galperin
Thursday, February 4th, 2010 12:54pm
Pingback :: Sunday, November 28th, 2010 5:58pm
Find, Search, Replace, and Delete In A WordPress Database - WordCast
[…] Eric’s Archived Thoughts: Correcting Corrupted Characters in WordPress […]

Correcting Corrupted Characters

Comments (72)

Add Your Thoughts

Comment Preview

Browse the Archive

Feeds