Correcting Corrupted Characters
Published 15 years, 1 month past
At some point, for some reason I cannot quite fathom, a WordPress or PHP or MySQL or some other upgrade took all of my WordPress database’s UTF-8 and translated it to (I believe) ISO-8859-1 and then dumped the result right back into the database. So “Emil Björklund” became “Emil BjÃ¶rklund”. (If those looked the same to you, then I see “BjÃ¶rklund” for the second one, and you should tell me which browser and OS you’re using in the comments.) This happened all throughout the WordPress database, including to commonly-used characters like ‘smart’ quotes, both single and double; em and en dashes; ellipses; and so on. It also apparently happened in all the DB fields, so not only were posts and comments affected, but commenters’ names as well (for example).
And I’m pretty sure this isn’t just a case of the correct characters lurking in the DB and being downsampled on their way to me, as I have WordPress configured to use UTF-8, the site’s head contains a meta that declares UTF-8, and a peek at the HTTP response headers shows that I’m serving UTF-8. Of course, I’m not really an expert at this, so it’s possible that I’ve misunderstood or misinterpreted, well, just about anything. To be honest, I find it deeply objectionable that this kind of stuff is still a problem here on the eve of 2010, and in general, enduring the effluvia of erroneous encoding makes my temples throb in a distinctly unhealthy fashion.
Anyway. Moving on.
I found a search-and-replace plugin—ironically enough, one written by a person whose name contains a character that would currently be corrupted in my database—that lets me fix the errors I know about, one at a time. But it’s a sure bet there are going to be tons of these things littered all over the place and I’m not likely to find them all, let alone be able to fix them all by hand, one find-and-replace at a time.
What I need is a WordPress plugin or something that will find the erroneous character strings in various fields and turn them back into good old UTF-8. Failing that, I need a good table that shows the ISO-8859-1 equivalents of as many UTF-8 characters as possible, or else a way to generate that table for myself. With that table in hand, I at least have a chance of writing a plugin to go through and undo the mess. I might even have it monitor the DB to see if it happens again, and give me a big “Clean up!” button if it does.
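Thinking out loud: if the mangling really was a single “UTF-8 bytes read as ISO-8859-1 and re-encoded” pass, it seems like PHP could generate that table for me, since utf8_encode() treats its input bytes as ISO-8859-1 and so should reproduce the exact corruption. A rough, untested sketch (the file would have to be saved as UTF-8):
<?php
// Build a corruption table by re-applying the suspected mangle:
// utf8_encode() reinterprets its input bytes as ISO-8859-1 and encodes
// them to UTF-8, which is exactly the double encode suspected here.
$chars = array('ö', '“', '”', '‘', '’', '–', '—', '…');  // characters to map
foreach ($chars as $good) {
    $bad = utf8_encode($good);   // what a mangled database would contain
    echo "$bad => $good\n";      // e.g. "Ã¶ => ö"
}
// Undoing a mangled string would then just be utf8_decode($bad).
?>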
So: anyone got some pointers they could share, information that might help, even code that might make the whole thing go away?
Comments (72)
The person that created this page: http://rishida.net/tools/conversion/ also uses WordPress. While, sadly, I cannot offer you any advice, perhaps contacting him could be a help. I use his converter a great deal, as my job requires converting characters all too frequently.
The list of characters: http://rishida.net/scripts/uniview/
The conversion from unicode to just about any other form:
http://rishida.net/scripts/uniview/conversion.php?origin=codepoint&codepoints=2026
Are you sure the encoding isn’t just wrong and that it still isn’t really utf-8?
Could you test with perl and try decoding as utf-8?
use utf8;
use Encode;
decode("utf8", $string);
encode("utf8", $string);
Try copying and pasting into xxd to see what the hex of the characters really is.
Hi Eric,
If you are (relatively) sure that the data got converted to latin1/iso-8859-1, you can convert it back to utf-8 using mysql.
You would need to use a query like below – adjusted for your tables and columns – to do it:
INSERT INTO utf8table (utf8column)
SELECT CONVERT(latin1field USING utf8)
FROM latin1table;
For a complete php script to convert your database:
http://www.phpwact.org/php/i18n/utf-8/mysql
More info on CONVERT() and CAST():
http://dev.mysql.com/doc/refman/5.1/en/charset-convert.html
http://dev.mysql.com/doc/refman/5.1/en/cast-functions.html
Hope this helps and if you need more info, please feel free to contact me with twitter or email.
Juliette
Hey,
I don’t know of a script, but for my part, I made sure that the table structure and the database are set to e.g. utf8_general_ci or utf8_unicode_ci before importing the SQL (the SQL file should be in UTF-8 too).
Handling encoding problems is very annoying. I just had to convert a 2GB ISO-8859-9 MySQL dump to UTF-8. In addition, the dump was from a MySQL 4.0.x database, whereas the target server was 5.0. I am happy to have SSH access; otherwise I would have had to download 2GB and upload it again. Then the next problem was that iconv could only handle a ca. 90 MB file at once, so I split, converted, combined, removed duplicate-entry inconsistencies, too-large varchar length values… I never ever wanna be a UNIX administrator.
Hi,
I’m no expert either but I’d go with Peter Mahnke’s suggestion before really messing with your data.
It might be that you’ve upgraded from a MySQL version that doesn’t support charsets to one that does, and that the upgrade has assumed your data is ISO-8859-1 (it is the MySQL default I believe). i.e. it’s taken your raw utf-8 and simply “labelled” it iso-8859-1.
Ben
Your problem is that in
Do you (and all your scripts) always execute
SET NAMES utf8
before running any MySQL queries? If not, that is likely to blame. This is a really, really painful gotcha. MySQL defaults to latin1, and UTF-8-based applications that fail to declare UTF-8 each and every time they talk to MySQL will screw up encoding badly.
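For example (a minimal sketch, not WordPress’s own code, with placeholder credentials; mysqli’s set_charset() is the safer way to send that declaration):
$db = new mysqli('localhost', 'user', 'password', 'wordpress');
$db->set_charset('utf8');  // equivalent to SET NAMES utf8, and also informs the client library
$res = $db->query("SELECT post_title FROM wp_posts LIMIT 1");  // now exchanged as UTF-8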
Can’t help you with the problem… :( But you might find this one worth a look once you are up and running again: http://ilfilosofo.com/blog/wp-db-backup There are at least a couple alternatives but this is pretty robust…
I don’t feel I can be very helpful at the moment with regard to the post subject matter, but I’m pretty sure that plugin stuff is not irony. :D
Do you have this in wp-config.php?
define('DB_CHARSET', 'utf8'); // force utf8
define('DB_COLLATE', 'utf8_general_ci'); // force collation
Oh and I forgot to mention that possibly somebody/something took your UTF-8 database and converted it once more into UTF-8. My solution reverts that back.
Steven: no, wp-config.php says:
define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');
To all asking if I’m sure that my data isn’t UTF-8, I’m pretty sure it is UTF-8, but was at some point converted to ISO-8859-1 and then the resulting characters were re-inserted and are now treated as UTF-8.
My reason to believe this is that I did a search-and-replace of all corrupted smart quotes (singles and doubles), finding the three-character strings and replacing them with raw UTF-8 characters (right?), and now they display correctly instead of with the 8859 three-character strings. Here, I’ll input some straight non-ASCII characters. If they come out clean, I would assume that I’m storing things in the DB as UTF-8.
Well, I see the quotes in my previous comment correctly (and they’re not HTML entities) but the other symbols I inserted are now question marks, even though they looked good when I previewed the comment. Let me try a few more:
Very interested to see if a solution comes of this; this happens to me from time to time when I do exports & imports of my SQL databases (or even use WordPress’ own built-in export/import function)
Priit: your solution didn’t work for me. I tried the following as a test:
The output was a question mark. I checked and the page was UTF-8.
kL: I don’t know, since pretty much close to all the PHP that touches the content is WordPress, and I have no idea if it’s uniformly clean with character handling or not. Of course, as others have pointed out, it could’ve been a mySQL upgrade that did it.
That’s because your PHP file is NOT UTF-8.
Use some editor which can save PHP sources as UTF-8.
And yes, I did check that solution before I posted; and yes, now I converted my PHP source into MacRoman and got ?.
Ah HAH! Thanks many times over, Priit! It’s all starting to make a lot more sense. The temples are still throbbing, but a good deal less so now.
Now I’m wondering about the best way to check a great big UTF-8 database to see if its content has gotten mangled this way. I could of course search for records containing mangled string sequences but that would mean a lot of checks for all the possible mangles. It seems like there ought to be a better way.
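Though maybe I don’t need a check for every possible mangle: nearly all of them seem to start with “Ã”, so a crude first pass might be as simple as this sketch (mysqli, WordPress’s default table names, and entirely untested):
$db = new mysqli('localhost', 'user', 'password', 'wordpress');
$db->set_charset('utf8');
// Most latin1-misread UTF-8 begins with "Ã", so flag rows containing it.
// Expect false positives for legitimate "Ã" (e.g. Portuguese "São").
$res = $db->query("SELECT ID, post_title FROM wp_posts WHERE post_content LIKE '%Ã%'");
while ($row = $res->fetch_object()) {
    echo "Possibly mangled: {$row->ID}: {$row->post_title}\n";
}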
Try to dump, then try to convert that dump file back as I recommended, and then try to read it back in.
And no, i haven’t tried it myself :-)
Hey Eric,
It seems like you
Grrr. My further, UTF-8-powered tests are hitting a couple of snags.
Here’s an example. “Fatih Hayrio?lu’nun not defteri” (see the top of this page for what that’s really supposed to look like, since it looks like WP glitched one character in the process of posting the comment) got corrupted into “Fatih Hayrioğlu
Can you live with entities? as ğ
No, I’m stupid. Sorry.
Use ISO-8859-15
You also need to check the page source: for some reason you are setting the Content-Type twice, once as latin1 (iso-8859-1, line 8 of this page) and again as utf-8 (line 10). It will probably not solve all your problems, but it does not make it any easier for the browser to know which one to use.
The headers sent by Apache look OK; again, this page via curl -I:
HTTP/1.1 200 OK
Date: Thu, 19 Nov 2009 17:31:18 GMT
Server: Apache
X-Powered-By: PHP/5.2.8
Vary: Cookie
X-Pingback: http://meyerweb.com/eric/thoughts/xmlrpc.php
Connection: close
Content-Type: text/html; charset=UTF-8
Here’s some Swedish characters just to test:
I can
Brian, thanks for pointing out that erroneous, superfluous meta. I’ve corrected it. It only showed up on some pages, which is why I hadn’t seen it when I viewed source on other pages.
Priit: I changed my test script to translate from ISO-8859-15 to UTF-8 and it translated ğ into ľ (entity version ľ – codepoint x13E – Unicode name LATIN SMALL LETTER L WITH CARON) instead of ğ (entity version: ğ). I could live with entities as long as I could convert to them accurately, though I’d really prefer to store the raw characters. I could decide to convert to entities when displaying on the web, or just push them out raw, or whatever. The point being that I’d much rather store the real text and decide how to convert it for display than store the converted forms.
Strange – if the result is ? then input IS already correct UTF and you shouldn’t touch it. Or it means that this record wasn’t converted or something.
Smokey: if I change my wp-config.php to read define('DB_CHARSET', ''); then even more things go wrong, as in more characters in this post and its comments get mangled. (Though usually just as question marks.) Undoing that back to define('DB_CHARSET', 'utf-8'); gets me back to where I am now, which is better but not great.
Steven: I finally got to test the settings you recommended. Setting define('DB_COLLATE', '
This plugin may do the whole thing in one shebang:
http://wordpress.org/extend/plugins/utf-8-database-converter/
Hi.
1. I assume that your problem has nothing to do with either PHP or WordPress. It is a bug (or a feature) of MySQL.
2. Implement a dbclass for all database connects (see the sketch after this list):
3. Grab phpMyAdmin and back up every table of your application in SQL format.
4. Insert the data back from the files and look at the results.
5. You’re done.
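A minimal sketch of such a dbclass (a hypothetical reconstruction, since the original code didn’t survive the comment form; its whole job is to force UTF-8 on every connection):
class dbclass extends mysqli {
    public function __construct($host, $user, $pass, $name) {
        parent::__construct($host, $user, $pass, $name);
        $this->set_charset('utf8');  // never let MySQL assume latin1
    }
}
// Use it everywhere you would otherwise construct mysqli directly:
$DB = new dbclass('host', 'user', 'pass', 'db');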
Mark: Smokey says that plugin doesn’t work for recent versions of WP.
JeyJey: that looks very cool, except I have no idea how I would go about using it. As in, no idea where I would add that class, nor how to make sure I invoked it.
Okay, folks, I think I may have found the root of the problem. Thanks to Oli’s comment, I managed to pull the following from my WP install:
So, yeah. I’m thinking that’s my problem. Now I just have to figure out how to fix it.
Eric: Are older posts (at least those without characters you
I ran into this same problem recently, and documented how to fix it on my blog. Long story short, older wordpress databases were created in latin1, but when the utf8 option came out, wordpress happily put UTF8 data into the latin1 database. The most recent version of WP got more strict about how it reads that database, and starts puking. The solution is to export your DB in latin1 format, manually change the dump file to say it’s UTF8, and then re-import. This solved the problem on several of my older WP blogs.
Details here: http://spaceninja.com/2009/10/how-to-convert-your-old-wordpress-database-to-utf8/
Continuing from twitter.
DB stores bits, that’s all. Let’s take that “Bj
Oh, and in your case some piece of software now actively CONVERTS (or converted) your data and this is causing problems. Normally this should be avoided.
After digging around in the links in comments 34 & 35, I found the following:
If DB_COLLATE and DB_CHARSET do not exist in your wp-config.php file, DO NOT add either definition to your wp-config.php file unless you read and understand Converting Database Character Sets.
So Smokey’s advice seems to be a good one.
Here’s the thing, though, Priit: so far as I can tell, the string stored in my database actually is “BjÃ¶rklund”, 10 bits of data. (Or is it bytes? 10 characters, anyway.) Once upon a time, though, it wasn’t. Somehow, the original text with the
It’s bytes, of course :-) and your DB seems to be OK, nothing converted there (and saved), so that’s good news.
But seriously, try this advice http://codex.wordpress.org/Editing_wp-config.php#Database_character_set
and remove DB_COLLATE and DB_CHARSET from wp_config, easy to try and costs nothing.
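That is, in wp-config.php, comment out (or delete) both lines:
// define('DB_CHARSET', 'utf8');
// define('DB_COLLATE', '');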
Holy [REDACTED], Priit, removing (commenting out) both lines did the trick. What the [REDACTED]?
Right now, I feel like I went down another dead end and was shocked to discover the terminal wall was just a hologram, beyond which was a land of cake and ice cream.
And ONLY this entry is messed up, because of the wrong settings, but the next ones will not be.
Wait, so you’re saying that any content using non-Latin characters that was entered while the wp-config settings were in effect will have been mangled? Because I have no idea how long those settings have been there. It might’ve been only since I upgraded to 2.8 (straight from 2.6—long story) but I don’t know when they were added to wp-config.
Dunno, previous entry seems to be OK? Besides, I don’t know WP at all :-)
But here is the root of that mess, if somebody is interested: your database is an old one, and its text fields are defined as latin1. Now when you tell MySQL to set the character set to UTF-8, MySQL “knows” that the input is latin1 (by definition) and CONVERTS that text for you (text which is actually already UTF-8). The result is a mess…
Is there any way I can clean up the root of the mess so that this doesn’t keep happening in the future? Like converting the database’s text fields to UTF-8, for example? Or would that just create more mess?
If you look at my earlier comment, you can see how you can convert a database to UTF-8.
However, for UTF-8 to work properly, *everything* should be aimed at UTF-8: files, database, headers, PHP functions used, etc. If you are not comfortable looking under the hood of an app or the database, you are probably best off leaving things in whichever way WP has/had designed them, to be even remotely safe in getting expected results.
In the article which I’ve linked my name to, you’ll find some more info. Hopefully I’ll finally get the time sometime soon to sort through some stuff and put up the slides of a couple of presentations I’ve given on UTF-8. I’ll link to them from the site.
If you’d like, I can mail you a PDF for now. The slides contain info on (nearly) every aspect of UTF-8 use in a database-driven web application, from basic linguistics through client-side to server-side handling of it.
If you want to use UTF-8 (and you should), the first thing to do when installing a new WP is to create a database with the right charset/collation (e.g. utf8_general_ci). Then you can keep the default DB_CHARSET (UTF-8) in your wp_config, and correctly handle any non-Latin character.
If you are willing to attempt it, you should be able to: export your actual database to a backup file, delete the database, then create a new one with the correct charset/collation and finally import back your backup file. It should work just fine.
Hey Eric,
To see if the problem is your data or just your MySQL/WP settings you need to do a dump from the command line:
mysqldump -t -uUser -pPassword database_name >dump_data.sql
mysqldump -d -uUser -pPassword database_name >dump_schema.sql
Open the dump_data.sql file in a text editor and see if you have mojibake (garbled characters) in there. If not, the problem is only in your schema and MySQL settings: open dump_schema.sql and remove DEFAULT CHARSET=some_charset, save and close. Set up MySQL to be end-to-end UTF8 (ref my article), then go into mysql -uUsername -p and CREATE DATABASE new_database_name DEFAULT CHARACTER SET utf8. Exit mysql, then import your data:
mysql -uUser -pPassword new_database_name <dump_schema.sql
mysql -uUser -pPassword new_database_name <dump_data.sql
Now if you check the dumped data and there *is* mojibake in there, it’s possible that you’ve got a database saved as latin1 that’s being incorrectly displayed in UTF8. Try this: dump as latin1, find-and-replace latin1 to utf8, then reimport as utf8. Again make sure you change MySQL to have a 100% UTF8 setup before reimporting.
Once MySQL is 100% utf8 and you can see the characters correctly in phpMyAdmin or Terminal mysql (assuming they’re set to utf8), you can then move on to WP setup problems. But it’s probably going to be MySQL. :/
When this has happened to me I’ve generally been able to muddle through without needing to manually convert the borked entities, but it can take some muddling. Ping me on Twitter if you want
peace – oli
PS isn’t it great that Japanese *has a word* for garbled text? :-|
Okay, so far I think that’s four recommendations to mysqldump the data in Latin1, find-and-replace “latin1” with “utf8”, and then re-import it as UTF-8 into the database. When I actually did that this afternoon, I had comments and posts that were severely truncated—so far as I could tell, at the point where the first non-Latin1 character was encountered in each field. I used the instructions in SpaceNinja’s post, but it seems like they’re basically the same as all the others people have referenced.
So I’m kind of not willing to do that again without a really compelling reason to do so.
Oh, and YES I want to use UTF-8. Would I have gone through all of this pain if I didn’t care about doing character encoding right?
This has happened to us before, both for various diacritical symbols and European letters, and for Japanese. We love UTF-8 and Unicode and operate both English and Japanese sites. But you know what? For English sites we just don’t have time to dick around with these problems, which have repeatedly happened, so we just adopted a policy of only using straight lower-bit ASCII in our blog posts. It saves a lot of headaches and time and lets us concentrate on stuff that matters and makes us money. Some day in the future this will not be necessary, and you’d think it shouldn’t be necessary today, but unfortunately if these glitches only happen occasionally, it’s still a huge waste of time to fix them, and we have washed our hands of it.
no,no,no,
The reason for this mess lies entirely in WP. WP used the defaults, and has now decided to do the safe thing and not trust the defaults. To ensure that your DB does not convert anything and acts like dumb storage, you need to do three things: a) specify the charset when creating the table, b) specify the charset when accessing the table, and c) ensure that the two are the same.
So far it worked with the defaults: WP (afaik) did not specify those charsets anywhere. The problem is that some other software can use the same DB, and someday some admin may change a global DB charset default setting. Then you have problems, because your DB was created with one charset and now you are accessing it with a new, different charset (exactly what happened to you). To avoid that, WP now does the three things I mentioned earlier. New databases are created as UTF-8, and on access the charset is set to UTF-8.
Only now all the old installations are screwed, but fortunately there’s an easy fix. As in your case: use the DB’s default charset on access.
There’s no such thing as converting a database to UTF-8. That “converting” actually means describing: you just SAY to the DB that “this text field is UTF-8”, but the actual bytes remain the same. Like in programming, where a signed integer and a word are the same data…
Look at the advice in comment 48 – you dump the database, DO NOTHING WITH THE DATA, then change description (charset to UTF) and read THE SAME data back in. Where’s the converting?
My advice is to let it be. Relax. In future WP upgrades be sure to comment that line out again and everything should be normal (read UTF-8 :-) ). Do not start messing with DB settings.
Priit Pirita is right, by dumping and importing back, you are *not* converting the data.
To convert the data you need the query/script I posted/linked to in the 4th comment.
I think I got all that. However, I did exactly what comment #48 describes: “dump as latin1, find-and-replace latin1 to utf8, then reimport as utf8”. I did that. The end result was truncated content in comments, and probably posts. I immediately imported the original, un-altered dump, and everything went back to the way it was. (Which, at this point, is only slightly broken.)
Per your recommendation, Priit, I’m probably going to let things be, although it irks me that I may have to re-edit wp-config in future upgrades. Actually, scratch that: what irks me is that I’m going to have to remember what to do if a future upgrade borks things up again. I was trying to avoid that possibility. However, I’m willing to live with it for now. If I ever get a free day, maybe I’ll re-examine the export/FnR/import path again.
“The end result was truncated content in comments, and probably posts.”
I suspect that this is because of your editor settings, although I’m not sure. If you open that dump and treat it as UTF-8 and save it as UTF-8, this shouldn’t happen. I have done this kind of thing many times and it worked; only I find-replaced other things… One possible cause might be if you dumped with charset settings. I think this is wrong; try to dump with default settings, _exactly_ as in post 48.
Wow, just wow.
This whole post (and the comments) in my opinion demonstrates quite well why you should not trust your data to a database.
It’s simply too much of a pain (certainly for the average user, but also for the knowledgeable user who doesn’t want to waste their time) to be an admin and deal with crap like this.
For HTML5 files, all I have to do is put this at the top (square brackets instead of angle brackets for sake of comment form)
[!DOCTYPE html]
[meta charset="utf-8"]
And I’m done. It works. Today. Forever.
No dealing with database corruption nonsense, no weird config incantations to make sure a database doesn’t screw itself up. etc.
Databases are not worth the time tax of being a database admin to find/fix/maintain problems like corrupted characters.
Having not used PHP or WordPress in years, I’m a bit mind-boggled. Frankly, any technology that wouldn’t support, and default to, UTF-8 in 2009 scares the hell out of me. Wow.
This looks like a classic double-encoding problem.
ie, your data stored in your database is UTF-8, but something in between the database and your HTML output thinks that it’s LATIN-1, so goes ahead and converts it to UTF-8. The result is that multi-byte characters get treated as multiple single byte characters, each of which is converted into a single multi-byte UTF-8 character.
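In PHP terms, the mechanics look something like this (an illustration only, using utf8_encode(), which assumes latin1 input; not code from this site):
$good = "ö";                   // bytes C3 B6, assuming this file is saved as UTF-8
$bad  = utf8_encode($good);    // the double encode: each byte re-encoded separately
echo bin2hex($good), "\n";     // c3b6
echo bin2hex($bad), "\n";      // c383c2b6, i.e. "Ã¶"
echo utf8_decode($bad), "\n";  // "ö" again: the inverse, when it applies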
Skimming through your comments, it looks like your database is set to latin1. Check if your tables are also set to latin1 or utf-8 (SHOW CREATE TABLE).
If they are latin1, then what you have to do is dump all your data out as latin1 or binary (so that it doesn’t do any conversion). You could just use mysqldump to do this and it will use the default character set.
After that, you need to delete all data in the database. This can be done by issuing a drop database. This will break your website, and probably give the first visitor the option of creating a new blog, so you might want to turn off access to your blog while this is happening.
After you’ve dropped the database, do two things. In the /etc/my.cnf file, make sure the default character sets are set to utf8, you’ll need to check the mysql website for the specific names they use for collation. I think it’s utf8_general_ci or utf8_unicode_ci. Secondly, go through the dump file in a text editor and make sure none of the tables have the latin1 charset. If they do, you need to change this as well (make a backup of the file first in case a typo gets in there).
Once done, recreate the DB using mysql < dump-file.sql
After this, all your data should still be in utf8, but now mysql actually knows it's in utf8. You can verify by doing a SHOW CREATE TABLE on all your tables and make sure they're set to utf8.
Now re-enable your blog and hope it works.
Note that WordPress may have a way to dump the data and re-import it, but the dumped data may have the same problems that you're seeing. One option is to maybe tell wordpress that you expect latin1 data in the db and that you want your output to also be latin1. That way WP will ask MySQL for latin1 data (no conversion done) and will output latin1 data (no conversion done), and what you get is what is in the db. Then before re-importing, change the charsets everywhere to utf8.
Note that I don't use WP, so have not tested this on my own with WP.
@Tantek: while I agree with you that users who don't want to be DBAs shouldn't be writing their data to a DB that they control, the problem here is neither the database's fault, nor the user's fault. It appears to be a piece of software (WP?) making assumptions about the data and the database's character set. One of those it controls, and the other is easy to check.
Okay, now that’s five recommendations to dump and re-import after I already tried that and it didn’t work. Made things much, much worse, in fact.
Tantek, I sort of agree with you, but there are things WP does for me that hand-rolling wouldn’t provide. Like comments, for example, which I am emphatically not willing to outsource to a third-party cloud service; and which simply listing inbound links does not come close to replicating. Perhaps there are solutions now that would do all this but not rely on a database, but I don’t remember seeing any back in 2004.
Jeff, I believe that if I freshly installed WP in 2009, it would set things up using UTF-8 and there’d be no issue. I installed it almost six years ago, though. Things have advanced a bit since then.
Comments: tl;dr. Just that I’ve had exactly zero success with dumping the database (via phpMyAdmin) and reimporting.
I’ve run across this many times. The problem is that MySQL is encoding-aware, and once you get bogus data into the database, no amount of recoding between different encodings or setting “set names” will help (in fact, the worst thing you can do is try to repair it by simply setting the correct encoding for your tables: if the target encoding doesn’t have a glyph, it gets irreversibly replaced by ?).
There’s a simple workaround that worked for me many times.
First, change the database column type from TEXT (or varchar) to BLOB (or VARBINARY). This makes MySQL “forget” about any encoding it thinks the data is in, and prevents any recoding of the data that goes around behind the scenes (in my case, the data was often encoded in CP1250 or UTF-8, but column encoding was set to LATIN1).
Then you have to find out in what encoding the data is in, and reset the column type to text/varchar/char with the encoding that matches the physical encoding of the data (in my case often CP1250). Once the physical encoding in the database matches the “logical” encoding of the columns, it’s possible to simply change the encoding of the columns (to UTF-8), and with the correct SET NAMES, you can have your webpage output anything you want (from UTF-8 to LATIN1).
When the data is double encoded (it originally was in UTF-8, it got reimported into the database as latin1 and then the encoding of the columns changed to UTF-8), you first have to set the encoding of the columns back to what it was when it was imported – this changes doubly encoded UTF-8 to physically singly encoded UTF-8 that the database thinks is in LATIN1 (for example), and then you go the route from TEXT (latin1) -> BLOB -> TEXT(UTF-8).
I think I have seen some scripts that try to do this automatically (by being really smart and getting information from the data dictionary), but for smaller scale databases such as wordpress, doing everything manually might be more tedious, but I think it’s safer.
A few short points:
1. It is vital to get the physical encoding to match the encoding that is set in the table column type (I’m not sure if search and replace will help because it works on already encoded data)
2. The encoding that is set in the HTML pages only determines what encoding the browser sends to PHP
3. PHP doesn’t know (or, unfortunately, care) what encoding you get from the browser. GIGO.
4. MySQL cares about the encoding of the data from the browser (and what encoding it sends back). Use the “SET NAMES” SQL command to tell the database this information (AFAIK, WP does this).
5. The database performs a lot of conversion behind the scenes: if the database thinks you send it data in latin1, but you have table columns in UTF-8, it WILL do a conversion from latin1 to utf-8, even if the data already was in UTF-8 (or worse, cp1250).
6. Changing the encoding of a column from one encoding to another performs physical recoding of the data, so you have to roundtrip it via BLOB or BINARY (see the sketch after this list).
7. Once you try to convert two incompatible encodings, MySQL will insert a question mark (physically) for every character it can’t convert (happens for example when changing between CP1250 and LATIN1, or importing UTF-8 data as UTF-8 data in table columns that have their encoding set to LATIN1).
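A sketch of that round trip for one hypothetical WordPress column (sizes kept large so nothing truncates; untested against a live install):
// wp_posts.post_content holds UTF-8 bytes but is declared latin1.
// Going through LONGBLOB makes MySQL forget the old encoding, so the
// final ALTER relabels the bytes as utf8 without converting them.
$db = new mysqli('localhost', 'user', 'password', 'wordpress');
$db->query("ALTER TABLE wp_posts MODIFY post_content LONGBLOB");
$db->query("ALTER TABLE wp_posts MODIFY post_content LONGTEXT CHARACTER SET utf8");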
I had the same or similar problem last year, with a client, where my WordPress data got encoded to UTF-8 twice. I rolled my own script to “double decode” the binary mess in my SQL dump file back to some sane text with the script that is linked above. Perhaps that can be of some help, if you’re the scripting kind of person.
@Eric,
I had to do this last year and blogged about it at the time. I remember an early draft including instructions “open vim and…” – I quickly realised as soon as you get an editor involved you’re fooked. Fortunately, MySQL has a command-line tool for doing search-and-replace so you don’t have to worry about editor settings or other random phenomena. (the instructions are in my latin1 to utf8 conversion post)
BTW. Using Safari 4.0.4 (the latest) on OS X, the encoding in this article looks fine, but the comments are screwed up. Forcing the text-encoding to ISO Latin 1 fixes the comments, but borks the names of the commenters (e.g. Tantek Çelik) I don’t know how far you think you’ve got fixing the issues, but it looks like there’s some way to go… (Note: using the web inspector / firebug you can check document.characterSet for the displayed character-set – which is handy when you’re checking you’ve overridden the text-encoding via browser menus.)
Eric: Just quickly commenting to say that I’m having the exact same problem of late, and will read through the comments and any updates from you on a workable solution. Tired of doing find/replace.
Finally solved this problem for myself, using the method described in this post:
http://tlug.dnho.net/?q=node/276
Yes, it’s another MySQL dump & import procedure. Haven’t checked for truncated content, but so far all my em & en dashes look good.
I’ve had to deal with this a bunch of times…. what I do – and it’s always worked for me – is 1st do a dump. Then – assuming you’re on Mac or Linux – run from the commandline:
iconv -f latin1 -t utf8 myDump.sql > myDumpUTF8.sql
Reimport….
“This whole post (and the comments) in my opinion demonstrates quite well why you should not trust your data to a database.”
Funny.
Seriously though, you’re probably already going to do this but please post a follow-up post with an overview of the problem and the eventual solution, when you find it. Going through all of these comments makes me feel like a total N00000000B. This has happened to me in WordPress a couple of times and each time I’ve just done manual search-and-replace for the characters I know about.
After reading this article + comment thread, I’ve decided that the easiest solution to these pesky character encoding problems is if I just change my name.
I was thinking maybe Emil Borkedchar?
I’m not sure if this helps, and I know some people already pointed some of it out, but I recently converted my DB to UTF-8 and this is what I did:
1. mysqldump the whole thing to a file
2. Add a special character (like “Ö”) to said file that looks good in the editor
3. Open the file with Firefox and check which encoding is used when the “Ö” looks ok (to find out exactly what encoding the file is)
4. Run iconv on the file to actually convert it to UTF-8 (from whatever encoding Firefox said it was)
5. Manually convert bad characters to good ones (and change potential encoding=latin1-settings in the sql-file to utf8)
6. Create new database where everything is UTF-8
7. Import the new, clean, utf8 SQL
That worked for me at least and I’ve had problems with encodings as far as I can remember.
I think one important bit I didn’t see in the comments (although it may have been mentioned) is to not only convert the characters but also convert the actual file (which I used iconv for).
This powerful but simple perl script is marvelous to convert your MySQL database charsets to UTF8 quick and easy. I use it every time.
http://www.pablowe.net/convert_charset
To make a very long story short – this is what you need to do:
– Amend DB specific entries (host, user, pass, db, fields and table).
– Don’t fiddle with the rest of the code.
– Upload to browsable part of your web server/site.
– Call the “page” from firefox.
– Wait (and don’t reload) until complete.
If you need any of the code explained, feel free to drop me an email.
hth, cheers.
/j.
ps. the code tag strips brackets, which is a bit annoying (converted to lt|gt)…
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="uid" content="10" />
</head>
<body>
<?php
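// The trick in what follows: read each row over a latin1 connection
// (MySQL hands the bytes back unconverted), hex-encode them with
// bin2hex(), then write the very same bytes back over a utf8 connection
// via UNHEX(), relabeling the data as UTF-8 without ever converting it.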
$DB = new mysqli('host','user','pass','db');
$DB->query("SET NAMES latin1");
if ($res = $DB->query ("SELECT unique_field, fix_field_1, fix_field_2, fix_field_3, fix_field_4 FROM fix_table WHERE 1=1")) {
echo 'rows: '.$res->num_rows;
$cnt = 0;
while ($data = $res->fetch_object() ) {
$DB->query("SET NAMES utf8;");
$unique_field = $data->unique_field;
$fix_field_1 = bin2hex($data->fix_field_1);
$fix_field_2 = bin2hex($data->fix_field_2);
$fix_field_3 = bin2hex($data->fix_field_3);
$fix_field_4 = bin2hex($data->fix_field_4);
$result = $DB->query ("
UPDATE fix_table
SET
fix_field_1 = UNHEX('".$DB->real_escape_string($fix_field_1)."'),
fix_field_2 = UNHEX('".$DB->real_escape_string($fix_field_2)."'),
fix_field_3 = UNHEX('".$DB->real_escape_string($fix_field_3)."'),
fix_field_4 = UNHEX('".$DB->real_escape_string($fix_field_4)."')
WHERE unique_field = '".$unique_field."'");
echo $cnt." - ".$unique_field."<br />";
unset($unique_field);
unset($fix_field_1);
unset($fix_field_2);
unset($fix_field_3);
unset($fix_field_4);
echo $DB->error;
$cnt++;
}
}
?>
</body>
</html>
and for Friday Fun – if you want to update the entire database including all potentially affected records in all relevant fields in all tables, then this would be a crazy kenobi option.
This time only amend db host, user, pass and name.
Again, upload to site and run through firefox as is.
ps. make sure there’s enough execution time for php to wrap it up.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="uid" content="10" />
</head>
<body>
<?php
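// Same approach as the previous script, generalized: walk every table,
// find the text-type columns and the primary key, then rewrite each
// value byte-for-byte (bin2hex()/UNHEX()) over a utf8 connection.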
$db_host = 'host';
$db_user = 'user';
$db_pass = 'pass';
$db_name = 'name';
$DB = new mysqli($db_host, $db_user, $db_pass, $db_name);
$field_types = array('varchar','text','tinytext','longtext');
if ($res_tables = $DB->query ("SHOW TABLES")) {
while ($tables = $res_tables->fetch_array(MYSQLI_NUM) ) {
if ($res_fields = $DB->query ("SHOW COLUMNS FROM ".$tables[0])) {
if ($res_key = $DB->query ("SHOW COLUMNS FROM ".$tables[0]." WHERE `Key` LIKE 'PRI'")) {
$key = $res_key->fetch_assoc();
$unique_key = $key['Field'];
}
while ($fields = $res_fields->fetch_array(MYSQLI_ASSOC) ) {
if (in_array(preg_replace('/\(.*$/', '', $fields['Type']), $field_types)) { // strip the length from e.g. "varchar(255)" so it matches
$DB->query("SET NAMES latin1");
if ($res = $DB->query ("SELECT ".$unique_key.", ".$fields['Field']." FROM ".$tables[0]." WHERE 1=1")) {
while ($data = $res->fetch_object() ) {
$DB->query("SET NAMES utf8;");
$unique_field = $data->$unique_key;
$fix_field = bin2hex($data->$fields['Field']);
$result = $DB->query ("
UPDATE ".$tables[0]."
SET ".$fields['Field']." = UNHEX('".$DB->real_escape_string($fix_field)."')
WHERE ".$unique_key." = '".$unique_field."'
");
unset($unique_field);
unset($fix_field);
}
}
}
}
}
unset($key);
unset($unique_key);
}
}
?>
</body>
</html>
I’m not sure if this is still relevant, but since I didn’t see any mention of this in the other comments, and going by character set details you posted, the issue is probably in the connection character set / collation.
It’s a common issue that MySQL selects an inappropriate connection collation, regardless of the headers in the HTTP request (those are irrelevant, since it is the PHP script that connects to the database). You can either force the connection to UTF-8 in the MySQL configuration, or issue two queries on every connection that set it to UTF-8.
Those would be:
SET CHARACTER SET UTF8;
SET NAMES UTF8;
You can read on those on the MySQL docs –
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
Pingback ::
Find, Search, Replace, and Delete In A WordPress Database - WordCast
[…] Eric’s Archived Thoughts: Correcting Corrupted Characters in WordPress […]