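I have been in exactly the same position. The production MySQL server was set to latin1: old data was latin1, newer data was UTF-8 but stored in latin1 columns, and then UTF-8 columns were added on top of that. Any given row could contain any number of encodings.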
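The biggest problem is that no single solution fixes everything, because many legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my Utf8Voodoo class there is a huge array covering the bytes from 128 to 255 (0x80-0xFF), i.e. the non-ASCII characters of the legacy single-byte encodings.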
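To make the ambiguity concrete, here is a minimal sketch (not part of Utf8Voodoo) that decodes one and the same byte under each of the four candidate encodings. It assumes the mbstring extension is loaded; the variable names are made up for illustration.

```php
<?php
// Decode the single byte 0xA4 under each candidate legacy encoding.
$byte = "\xA4";
$encodings = array('ISO-8859-1', 'ISO-8859-15', 'Windows-1252', 'CP850');

$decoded = array();
foreach ($encodings as $encoding) {
    // mb_convert_encoding(string, to_encoding, from_encoding)
    $decoded[$encoding] = mb_convert_encoding($byte, 'UTF-8', $encoding);
}

foreach ($decoded as $encoding => $char) {
    printf("%-12s 0xA4 => %s\n", $encoding, $char);
}
```

The byte comes out as the currency sign, the Euro sign, or n-tilde depending on which encoding you assume, and nothing in the byte itself tells you which one was meant.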
// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.
/**
* This array contains the magic bytes that determine possible encodings.
* It works by elimination: the most specific byte patterns (the array's
* keys) are listed first. When a match is found, the possible encodings
* are that entry's value.
*/
public static $legacyEncodingsMagicBytes = array(
'/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
'/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
'/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);
/**
* This array contains the bytes that make it more likely for a string to
* be a certain encoding. The keys are the pattern, the values are arrays
* with (encoding => likeliness score modifier).
*/
public static $legacyEncodingsLikelyBytes = array(
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0x80 | - | - | € | Ç
'/\x80/' => array(
'Windows-1252' => +10,
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0x93 | - | - | “ | ô
// 0x94 | - | - | ” | ö
// 0x95 | - | - | • | ò
// 0x96 | - | - | – | û
// 0x97 | - | - | — | ù
// 0x99 | - | - | ™ | Ö
'/[\x93-\x97\x99]/' => array(
'Windows-1252' => +1,
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0x86 | - | - | † | å
// 0x87 | - | - | ‡ | ç
// 0x89 | - | - | ‰ | ë
// 0x8A | - | - | Š | è
// 0x8C | - | - | Œ | î
// 0x8E | - | - | Ž | Ä
// 0x9A | - | - | š | Ü
// 0x9C | - | - | œ | £
// 0x9E | - | - | ž | ×
'/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
'Windows-1252' => -1,
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0xA4 | ¤ | € | ¤ | ñ
'/\xA4/' => array(
'ISO-8859-15' => +10,
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0xA6 | ¦ | Š | ¦ | ª
// 0xBD | ½ | œ | ½ | ¢
'/[\xA6\xBD]/' => array(
'ISO-8859-15' => -1,
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0x82 | - | - | ‚ | é
// 0xA7 | § | § | § | º
// 0xFD | ý | ý | ý | ²
'/[\x82\xA7\xCF\xFD]/' => array(
'CP850' => +1
),
// Byte | ISO-1 | ISO-15 | W-1252 | CP850
// 0x91 | - | - | ‘ | æ
// 0x92 | - | - | ’ | Æ
// 0xB0 | ° | ° | ° | ░
// 0xB1 | ± | ± | ± | ▒
// 0xB2 | ² | ² | ² | ▓
// 0xB3 | ³ | ³ | ³ | │
// 0xB9 | ¹ | ¹ | ¹ | ╣
// 0xBA | º | º | º | ║
// 0xBB | » | » | » | ╗
// 0xBC | ¼ | Œ | ¼ | ╝
// 0xC1 | Á | Á | Á | ┴
// 0xC2 | Â | Â | Â | ┬
// 0xC3 | Ã | Ã | Ã | ├
// 0xC4 | Ä | Ä | Ä | ─
// 0xC5 | Å | Å | Å | ┼
// 0xC8 | È | È | È | ╚
// 0xC9 | É | É | É | ╔
// 0xCA | Ê | Ê | Ê | ╩
// 0xCB | Ë | Ë | Ë | ╦
// 0xCC | Ì | Ì | Ì | ╠
// 0xCD | Í | Í | Í | ═
// 0xCE | Î | Î | Î | ╬
// 0xD9 | Ù | Ù | Ù | ┘
// 0xDA | Ú | Ú | Ú | ┌
// 0xDB | Û | Û | Û | █
// 0xDC | Ü | Ü | Ü | ▄
// 0xDF | ß | ß | ß | ▀
// 0xE7 | ç | ç | ç | þ
// 0xE8 | è | è | è | Þ
'/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
'CP850' => -1
),
/* etc. */
Then loop over the bytes (not the characters) in the string and keep the scores. Let me know if you want more information.
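For illustration, the byte loop could look roughly like the sketch below. This is not the actual Utf8Voodoo code: the function name `scoreLegacyEncodings`, the way the magic-bytes entries narrow the candidate list, and the tie-breaking rule are all assumptions, and the two tables are abbreviated copies of the arrays above.

```php
<?php
// Abbreviated copies of the two tables (only a few entries each).
$magicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);
$likelyBytes = array(
    '/\x80/' => array('Windows-1252' => +10),
    '/[\x93-\x97\x99]/' => array('Windows-1252' => +1),
    '/\xA4/' => array('ISO-8859-15' => +10),
    '/[\x82\xA7\xCF\xFD]/' => array('CP850' => +1),
);

/**
 * Hypothetical helper: returns the most likely legacy encoding of $string,
 * or null if every candidate has been eliminated.
 */
function scoreLegacyEncodings($string, array $magicBytes, array $likelyBytes)
{
    // Start from the least specific candidate list (the last entry).
    $candidates = end($magicBytes);
    $scores = array_fill_keys($candidates, 0);

    $length = strlen($string); // strlen() counts bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        $byte = $string[$i];
        if (ord($byte) < 0x80) {
            continue; // plain ASCII says nothing about the encoding
        }
        // Elimination: the first (most specific) matching pattern wins
        // and restricts the remaining candidates.
        foreach ($magicBytes as $pattern => $encodings) {
            if (preg_match($pattern, $byte)) {
                $candidates = array_values(array_intersect($candidates, $encodings));
                break;
            }
        }
        // Likeliness: apply every matching score modifier.
        foreach ($likelyBytes as $pattern => $modifiers) {
            if (preg_match($pattern, $byte)) {
                foreach ($modifiers as $encoding => $modifier) {
                    $scores[$encoding] += $modifier;
                }
            }
        }
    }

    // Highest score among the surviving candidates; ties go to the one
    // listed first.
    $best = null;
    foreach ($candidates as $encoding) {
        if ($best === null || $scores[$encoding] > $scores[$best]) {
            $best = $encoding;
        }
    }
    return $best;
}

echo scoreLegacyEncodings("Price: \x80 5", $magicBytes, $likelyBytes), "\n"; // Windows-1252
echo scoreLegacyEncodings("\xA4 10", $magicBytes, $likelyBytes), "\n";       // ISO-8859-15
```

Once an encoding has been picked, the string can be converted with something like `mb_convert_encoding($string, 'UTF-8', $encoding)`. With real data you would want the full tables, and probably a check for already-valid UTF-8 before running any of this.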