Windows 7 UTF-8 and Unicode
2014-07
Hi: Could someone please explain what has changed in Windows 7 (Pro 64-bit)?
Details: Previously I had Windows XP and had some translation files (UTF-8 encoded) in CSV format. I was able to view the text in both Notepad and Excel. After upgrading to Windows 7, when I open these files, all I see is square boxes (just so you know, if I open them in a browser, I am able to see all the translations). I wondered what was going on, and finally I saved those files as Unicode; now everything seems to be fine.
So, what exactly is going on? Why does Windows 7 work with Unicode and not with UTF-8?
Thanks in advance.
Why Windows 7 works with Unicode and not with UTF-8?
Terminology
Unicode and UTF-8 are not the same kind of thing: Unicode is a character-set which defines a set of characters (a repertoire) and assigns numbers (code points) to each of those characters. UTF-8 is one of several encodings that can be used to represent a stream of Unicode characters on disk or in transmission. The same stream of Unicode characters could also be encoded as UTF-16, UTF-32 or UTF-7 for example.
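To make the distinction concrete, here is a small Python sketch (the sample string is just an illustration I picked): the code points stay the same, while the bytes depend on which encoding you choose.

```python
# Three sample characters: 'A', 'é' and '€' (chosen only for illustration).
text = "A\u00e9\u20ac"

# Unicode assigns each character a code point, independent of any encoding.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# The same three characters serialized with three different encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(code_points)
for name, data in encoded.items():
    # bytes.hex(sep) needs Python 3.8 or later.
    print(f"{name:10} {len(data):2d} bytes  {data.hex(' ')}")
```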
However, Notepad offers you "encoding" options including ANSI, Unicode, Unicode big-endian and UTF-8. The Microsoft developers who wrote this have used the wrong terms. When they say "Unicode" they most likely mean "UTF-16 little-endian". When they say "ANSI" they mean the system's default Windows code page, which is Code Page 1252 (CP1252) on Western-language systems.
Microsoft Notepad
I believe Microsoft's Notepad writes UTF-16 with a byte order mark (BOM) and that Notepad looks for the BOM when reading a text file. The BOM tells the app the file is UTF-16 and indicates whether it is big-endian or little-endian.
[Edited] If Notepad doesn't find the BOM, it calls a library function, IsTextUnicode, which looks at the data and attempts to guess what encoding was used. Inevitably, it sometimes guesses wrong; for example, it may guess that an "ANSI" file is "Unicode". Trying to interpret a UTF-16 or UTF-8 file as Code Page 1252 would cause it to display the wrong glyphs and be unable to find glyphs to render some 8-bit values - these would then be shown as squares.
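A rough way to reproduce that kind of misreading outside Notepad is to decode UTF-8 bytes as CP1252 in Python. This is only a sketch: Python's `cp1252` codec is stricter than Windows and treats a few byte values (such as 0x81) as undefined, so I use `errors="replace"` to stand in for the "square" glyph.

```python
# UTF-8 bytes for two accented letters: b'\xc3\xa9 \xc3\x81'
utf8_bytes = "é Á".encode("utf-8")

# Decoding as CP1252 turns every byte into its own (wrong) character.
# Byte 0x81 has no mapping in Python's cp1252 codec, so it becomes U+FFFD.
wrong = utf8_bytes.decode("cp1252", errors="replace")

print(repr(wrong))
```

The first letter comes out as two unrelated characters, and the second one contains a byte with no glyph at all, which is roughly what the squares in Notepad correspond to.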
As harrymc says, in his answer below, there are better alternatives to Notepad. But Notepad lets you explicitly choose the encoding when opening a file (rather than leaving Notepad to try to guess).
Byte Order Marks
According to the Unicode consortium, Byte Order Marks (BOMs) are optional. However Windows relies on BOMs to distinguish between some encodings.
So in short, maybe your files lacked a BOM for some reason? Maybe the BOM was lost sometime during the upgrade process?
If you still have the original files that show as squares, you could make a hex dump of them to see if they contain a BOM.
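If you don't have a hex dump tool handy, a few lines of Python can do the same check. This is a minimal sketch; `detect_bom` and `sniff_file` are names I made up, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Known BOMs, longest first so UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Return the first four bytes of a file as hex, plus the BOM guess."""
    with open(path, "rb") as f:
        head = f.read(4)
    return head.hex(" "), detect_bom(head)
```

Calling `sniff_file` on one of the problem CSV files would show `ef bb bf` at the start if a UTF-8 BOM is present; a result of `None` means the file has no BOM and the reading program has to guess.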
[Edit]
Plain text file standards
The problem is that there are effectively none. No universal standards for plain text files. Instead we have a number of incompatibilities and unknowns.
How have line-endings been marked? Some platforms use the control-characters Carriage Return (CR) followed by Line Feed (LF), some use CR alone and some use LF alone.
Are the above terminators or separators? This has an effect at the end of a file and has been known to cause problems.
Treatment of tabs and other control characters. We might assume that a tab is used to align to a multiple of 8 standard character widths from the start of the line but really there is no certainty to this. Many programs allow tab positions to be altered.
Character set & Encoding? There is no universal standard for indicating which of these have been used for the text in the file. The nearest we have is to look for the presence of a BOM which indicates the encoding is one of those used for Unicode. From the BOM value the program reading the file can distinguish between UTF-8 and UTF-16 etc and between Little Endian and Big-Endian variants of UTF-16 etc. There is no universal standard for indicating that a file is encoded in any other popular encoding such as CP-1252 or KOI-8.
And so on. None of the above metadata is written into the text file - so the end-user must inform the program when reading the file. The end-user has to know the metadata values for any specific file or run the risk that their program will use the wrong metadata values.
Bush hid the facts
Try this on Windows XP.
- Open Notepad
- Set the font to Arial Unicode MS
- Enter the text "Bush hid the facts"
- Choose Save As and select Encoding ANSI
- Close Notepad
- Reopen the document (e.g. using Start, My Recent Documents)
- You will see 畂桳栠摩琠敨映捡獴 instead of "Bush hid the facts"
This illustrates that the IsTextUnicode function used by Notepad incorrectly guesses that the ANSI (really Code Page 1252) text is Unicode UTF-16LE without a BOM. There is no BOM in a file saved as ANSI.
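You can reproduce the misreading without Notepad. The sketch below only mimics the effect (decoding the ANSI bytes as UTF-16LE); it does not call or emulate IsTextUnicode itself.

```python
# What Notepad writes when saving as "ANSI": 18 single-byte characters, no BOM.
ansi_bytes = "Bush hid the facts".encode("cp1252")

# What you get if those same bytes are read as UTF-16 little-endian:
# each pair of bytes is combined into one 16-bit code unit.
misread = ansi_bytes.decode("utf-16-le")

print(misread)
print([f"U+{ord(ch):04X}" for ch in misread])
```

The 18 bytes pair up into 9 code units, and each pair happens to land in the CJK range, e.g. "Bu" is 0x42 0x75, which read little-endian is U+7542.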
Windows 7
With Windows 7, Microsoft adjusted IsTextUnicode so that the above does not happen. In the absence of a BOM, it is now more likely to guess ANSI (CP 1252) than Unicode (UTF-16LE). With Windows 7 I expect you are therefore more likely to have the reverse problem: a file containing Unicode characters with code points greater than 255, but with no BOM, is now more likely to be guessed as being ANSI - and therefore displayed incorrectly.
Preventing encoding problems
Currently, the best approach seems to be to use UTF-8 everywhere. Ideally you would re-encode all old text files into UTF-8 and only ever save text files as UTF-8. There are tools such as recode and iconv that can help with this.
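If you prefer a script over recode or iconv, something like the following Python sketch would do the conversion. The function names are my own, and it assumes you already know the source encoding of each file (nothing here tries to guess it). The optional BOM is there because some Windows programs rely on it to recognise UTF-8.

```python
def reencode_to_utf8(data: bytes, source_encoding: str, add_bom: bool = False) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    text = data.decode(source_encoding)
    # "utf-8-sig" prepends the UTF-8 BOM (EF BB BF); plain "utf-8" does not.
    return text.encode("utf-8-sig" if add_bom else "utf-8")


def reencode_file(src_path, dst_path, source_encoding, add_bom=False):
    """Read src_path in binary, convert it, and write the result to dst_path."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(reencode_to_utf8(data, source_encoding, add_bom))
```

For a file that Notepad saved as "Unicode", the source encoding would be `"utf-16"` (Python's `utf-16` codec reads the BOM and strips it); for an "ANSI" file on a Western system it would be `"cp1252"`.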
A remark: You can use Notepad++ to view these files, using the Encoding menu.
Once the files are displayed correctly, saving them will add the correct BOM.
When I try to save a text file with non-English text in Notepad, I get an option to choose between Unicode, Unicode Big Endian and UTF-8. What is the difference between these formats?
Assuming I do not want any backward compatibility (with older OS versions or apps) and I do not care about the file size, which of these formats is better?
(Assume that the text can be in languages like Chinese or Japanese, in addition to other languages.)
Note: From the answers and comments below it seems that in Notepad lingo, Unicode is UTF-16 (Little Endian), Unicode Big Endian is UTF-16 (Big Endian) and UTF-8 is, well, UTF-8.
Dunno. Which is better: a saw or a hammer? :-)
There's a bit in the article that's a bit more relevant to the subject at hand though:
- UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable length representation: each character is represented on 1 to 4 bytes, and ASCII characters all fit on 1 byte). As Joel puts it:
“Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings
- UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping directly the Unicode code-point to 4 bytes. Obviously, it’s not very size-efficient.
- UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).
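To put numbers on those trade-offs, here is a quick Python sketch that counts the encoded size of a few sample strings (the samples and the `sizes` helper are my own, and the "-le" variants are used so that no BOM is counted).

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# Sample strings chosen only for illustration.
samples = {
    "English": "Hello",
    "French": "déjà",
    "Japanese": "日本語",
    "Emoji": "😀",  # outside the BMP
}


def sizes(text):
    """Return the number of bytes the text takes in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for label, text in samples.items():
    print(f"{label:9} {sizes(text)}")
```

ASCII text is half the size in UTF-8, the Japanese sample is smaller in UTF-16 (2 bytes per character versus 3), and a character outside the BMP takes 4 bytes in all three.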
For European languages, UTF-8 is smaller. For East Asian languages, the difference is not so clear-cut: most Chinese and Japanese characters take 3 bytes in UTF-8 but only 2 in UTF-16.
Both will handle all possible Unicode characters, so it should make no difference in compatibility.
There are more Unicode character encodings than you may think.
UTF-8
The UTF-8 encoding is variable-width, ranging from 1-4 bytes, with the upper bits of each byte reserved as control bits. The leading bits of the first byte indicate the total number of bytes used for that character. The scalar value of a character's code point is the concatenation of the non-control bits. In this table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits higher than that.

Unicode             Byte1     Byte2     Byte3     Byte4
U+0000-U+007F       0xxxxxxx
U+0080-U+07FF       110yyyxx  10xxxxxx
U+0800-U+FFFF       1110yyyy  10yyyyxx  10xxxxxx
U+10000-U+10FFFF    11110zzz  10zzyyyy  10yyyyxx  10xxxxxx
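The table translates fairly directly into code. Below is a hand-rolled Python sketch of the bit layout (`utf8_encode` is my own name; in practice you would just call `str.encode("utf-8")`, which the last lines use as a cross-check).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the bit layout table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check against Python's built-in encoder, one code point per table row.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```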
- UCS-2
- UCS-2BE
- UCS-2LE
- UTF-16
- UTF-16BE
- UTF-16LE
- UTF-32
- UTF-32BE
The only real advantage with small files like text files is the resulting file size. UTF-8 generally produces smaller files. But this difference may be less pronounced with Chinese/Japanese text.
In Notepad, "Unicode" is another term for "UTF-16", which encodes the Unicode character set using 16-bit units. UTF-8 encodes it using 8-bit units.
In both cases, a character that does not fit in a single unit is spread over additional units: a second 16-bit unit in UTF-16, or up to three more bytes in UTF-8.