Computer not displaying unicode characters

08
2014-07
  • msarchet

    Until recently, my computer displayed Unicode characters fine, but it has now stopped displaying them. I can display the same characters in Notepad on another computer running the same Windows 7, except that it is x86 instead of x64. Both machines are using the same font (Arial). Switching to a Unicode-capable font on the x64 machine fixes the problem. Any thoughts?

    음식

  • Answers
  • RedGrittyBrick

    Any thoughts?

    Notepad Font

    In Windows Notepad, you cannot (in general) mix fonts, you can only select one font at a time. However, this statement has to be qualified for recent versions of Windows.


    Windows XP and earlier

    On Windows XP and before, Notepad could not display characters which were not in the selected Font. Therefore a missing or incorrectly displayed character (typically shown as an empty box) could be caused by:

    Using Arial instead of Arial Unicode.
    Arial is 778,552 bytes; Arial Unicode is 23,275,812 bytes. The difference is that a huge number of characters are in Arial Unicode but not in Arial. (These sizes are from Vista, not XP, but the difference will be similar.)

    Omitting the Byte Order Mark (BOM)
    Windows expects Unicode files (UTF-8, UTF-16 LE, etc.) to contain a BOM. If there is none, Windows guesses the encoding using a Windows API function (IsTextUnicode) that is well known to make mistakes, resulting in multiple Latin-1 characters being shown instead of a single correct character.
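The effect of a missing BOM can be sketched in a few lines of Python. This is only an illustration: the helper `decode_with_bom` and its Latin-1 fallback are assumptions made for the example, not a description of what Notepad or IsTextUnicode actually does. UTF-32 is left out on purpose, because its little-endian BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# BOM prefixes this sketch recognises, with a codec that strips the BOM on decode.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data: bytes, fallback: str = "latin-1") -> str:
    """Decode using the BOM if present, otherwise fall back to a single-byte guess."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    # No BOM: the fallback is a guess, and a wrong guess yields several
    # Latin-1 characters for each original character.
    return data.decode(fallback)


korean = "음식"
with_bom = decode_with_bom(codecs.BOM_UTF8 + korean.encode("utf-8"))
without_bom = decode_with_bom(korean.encode("utf-8"))
```

Here `with_bom` is the original two-character string, while `without_bom` contains six Latin-1 characters, one per UTF-8 byte.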


    (update)

    Windows Vista onwards

    Notepad has now adopted the strategy first seen in web-browsers - if the character does not exist in the current font, find a font that does contain that character and, for that character only, use the other font. Therefore if you have different additional fonts on one computer, it can behave differently to other computers (even though the OS is the same).

    "Wrong" Arial
    I don't know the algorithm used but it seems possible that if you have a corrupted or much smaller Arial Unicode that lacks some characters, Notepad may believe the font contains a character it does not. An Arial font of 3,395 KB is not what I would expect on Windows 7. Perhaps installing some application has replaced the default font with one that is faulty in some way?

    Additional "bad" font
    Alternatively, Notepad might search a different, recently added font for the missing character before looking at Arial Unicode. If this different font claims to contain the character but does not (e.g. because of incorrect layout tags), Notepad could fail to display the character.
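A toy model of per-character font fallback shows how one "bad" font could mask a good one. Everything here is hypothetical: the `Font` class, the font names, and the search order are assumptions for illustration, since the real algorithm Notepad uses is not known to the answer.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Font:
    name: str
    claimed: Set[str]  # characters the font says it covers
    glyphs: Set[str]   # characters it can really draw


def pick_font(ch: str, selected: Font, fallbacks: List[Font]) -> Optional[Font]:
    """Return the first font, in search order, that claims to cover ch."""
    for font in [selected, *fallbacks]:
        if ch in font.claimed:
            return font
    return None


def render(text: str, selected: Font, fallbacks: List[Font]) -> str:
    """Replace every character that cannot be drawn with an empty box."""
    out = []
    for ch in text:
        font = pick_font(ch, selected, fallbacks)
        out.append(ch if font is not None and ch in font.glyphs else "□")
    return "".join(out)


arial = Font("Arial", set("abc"), set("abc"))
good = Font("GoodUnicodeFont", set("음식"), set("음식"))
bad = Font("BadFont", set("음식"), set())  # claims coverage, has no glyphs

ok = render("a음식", arial, [good])           # fallback works
broken = render("a음식", arial, [bad, good])  # bad font is found first
```

In this model `ok` keeps the Korean characters, while `broken` shows boxes because the faulty font is consulted before the good one.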


    (update 2)

    Suggested action

    On the computer that has the 3,395 KB Arial regular, copy the font file to a safe place, un-install it, then install the 761 KB Arial regular font file copied from the other computer.


  • Related Question

    unicode - Why do (Russian) characters in some received emails change when reading in David InfoCenter?
  • waszkiewicz

    I'm using David InfoCenter as my email software, and I have trouble with some of my emails in Russian. Only a few letters are affected, in some emails (sent by different people); for example, the "R" ("P" in Russian) is shown as a "T". In other emails in Russian, the problem doesn't appear. Isn't that strange? Has anyone had the same problem and found out where it comes from?

    When I transmit that email to an external mailbox (internet email account), it's even worse, and gives me symbols instead of all Russian letters...

    The default encoding was "Russian (ISO)"; I changed it to "Russian (Windows)", but the problem remained. Another odd behaviour: when I write an internal email and name it TEST in Russian (Тест), with Тест in the text window, the title changes to "Oano", but the content stays in Russian...


    With Mailinator I got the following, for message and subject "тест":

    Subject: ????
    [..]
    MIME-Version: 1.0
    Content-Type: multipart/alternative;
     boundary="----_=_NextPart_000_00017783.4AF7FB71"
    This message is in MIME format. Since your mail reader does not understand
    this format, some or all of this message may not be legible.
    ------_=_NextPart_000_00017783.4AF7FB71
    Content-Type: text/plain;
     charset="utf-8"
    Content-Transfer-Encoding: base64
    0KLQtdGB0YI=
    ------_=_NextPart_000_00017783.4AF7FB71
    Content-Type: text/html;
     charset="utf-8"
    Content-Transfer-Encoding: base64
    PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv
    L0VOIj4NCjxIVE1MPjxIRUFEPg0KPE1FVEEgaHR0cC1lcXVpdj1Db250ZW50LVR5cGUgY29udGVu
    dD0idGV4dC9odG1sOyBjaGFyc2V0PXV0Zi04Ij4NCjxNRVRBIG5hbWU9R0VORVJBVE9SIGNvbnRl
    bnQ9Ik1TSFRNTCA4LjAwLjYwMDEuMTg4NTIiPjwvSEVBRD4NCjxCT0RZIHN0eWxlPSJGT05UOiAx
    MHB0IENvdXJpZXIgTmV3OyBDT0xPUjogIzAwMDAwMCIgbGVmdE1hcmdpbj01IHRvcE1hcmdpbj01
    Pg0KPERJViBzdHlsZT0iRk9OVDogMTBwdCBDb3VyaWVyIE5ldzsgQ09MT1I6ICMwMDAwMDAiPtCi
    0LXRgdGCPFNQQU4gDQppZD10b2JpdF9ibG9ja3F1b3RlPjxTUEFOIGlkPXRvYml0X2Jsb2NrcXVv
    dGU+PC9ESVY+PC9TUEFOPjwvU1BBTj48L0JPRFk+PC9IVE1MPg==
    ------_=_NextPart_000_00017783.4AF7FB71--

  • Related Answers
  • Arjan

    To break down the message:

    Subject: ????

    Too bad, your David InfoCenter is not doing things right. The above should have been something like:

    Subject: =?utf-8?Q?=D0=A2=D0=B5=D1=81=D1=82?=

    So, this is a bug that should be reported, and fixed.
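The encoded Subject shown above can be checked with Python's standard `email.header` module. This is a minimal sketch; the variable names are illustrative, and the encoder is free to pick either the Q or the B form of the encoded word, so the re-encoded header need not match the one above byte for byte.

```python
from email.header import Header, decode_header, make_header

# The header the answer says the client should have sent.
encoded = "=?utf-8?Q?=D0=A2=D0=B5=D1=81=D1=82?="

# decode_header splits the value into (bytes, charset) chunks;
# make_header joins them back into readable text.
subject = str(make_header(decode_header(encoded)))

# Going the other way: produce an encoded word for a non-ASCII subject.
reencoded = Header(subject, "utf-8").encode()
roundtrip = str(make_header(decode_header(reencoded)))
```

After this runs, `subject` and `roundtrip` are both the Cyrillic word Тест, which is what a correct client should put on the wire in encoded form instead of question marks.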

    Next:

    MIME-Version: 1.0
    Content-Type: multipart/alternative;
     boundary="----_=_NextPart_000_00017783.4AF7FB71"

    The above tells the recipient that after each line "----_=_NextPart_000_00017783.4AF7FB71" it will find the very same message in a different format. Good.

    Next:

    This message is in MIME format. Since your mail reader does not understand
    this format, some or all of this message may not be legible.

    The above will be visible to users of old email software that does not understand MIME. Good.

    Next:

    ------_=_NextPart_000_00017783.4AF7FB71
    Content-Type: text/plain;
     charset="utf-8"
    Content-Transfer-Encoding: base64
    0KLQtdGB0YI=

    The above is the plain text, without bold, italic, etcetera. Using the great Online Base64 Decoder from FileFormat.info, the 0KLQtdGB0YI= translates back to Тест. Aha, not the lowercase тест like you wrote...? Anyway, seems fine, and a good email client should understand this part.

    In some more detail: 0KLQtdGB0YI= actually decodes to hexadecimal d0 a2 d0 b5 d1 81 d1 82 and you (should) see the same hexadecimal numbers in the Subject above. (When not properly decoded as UTF-8, for example when erroneously interpreted as a single-byte encoding such as Windows-1252, this would show as something like Ð¢ÐµÑÑ‚.)
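The same decoding can be reproduced locally instead of with an online decoder. A short sketch using only the standard library; Latin-1 stands in here for "some single-byte encoding" and is an assumption of the example.

```python
import base64

raw = base64.b64decode("0KLQtdGB0YI=")

hex_bytes = raw.hex(" ")      # 'd0 a2 d0 b5 d1 81 d1 82'
text = raw.decode("utf-8")    # 'Тест'

# Decoding the same bytes with a single-byte codec gives one character
# per byte, which is where the garbled two-characters-per-letter look comes from.
mojibake = raw.decode("latin-1")
```

`text` has four characters, while `mojibake` has eight, two for every Cyrillic letter.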

    Next:

    ------_=_NextPart_000_00017783.4AF7FB71
    Content-Type: text/html;
     charset="utf-8"
    Content-Transfer-Encoding: base64
    PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv
    L0VOIj4NCjxIVE1MPjxIRUFEPg0KPE1FVEEgaHR0cC1lcXVpdj1Db250ZW50LVR5cGUgY29udGVu
    dD0idGV4dC9odG1sOyBjaGFyc2V0PXV0Zi04Ij4NCjxNRVRBIG5hbWU9R0VORVJBVE9SIGNvbnRl
    bnQ9Ik1TSFRNTCA4LjAwLjYwMDEuMTg4NTIiPjwvSEVBRD4NCjxCT0RZIHN0eWxlPSJGT05UOiAx
    MHB0IENvdXJpZXIgTmV3OyBDT0xPUjogIzAwMDAwMCIgbGVmdE1hcmdpbj01IHRvcE1hcmdpbj01
    Pg0KPERJViBzdHlsZT0iRk9OVDogMTBwdCBDb3VyaWVyIE5ldzsgQ09MT1I6ICMwMDAwMDAiPtCi
    0LXRgdGCPFNQQU4gDQppZD10b2JpdF9ibG9ja3F1b3RlPjxTUEFOIGlkPXRvYml0X2Jsb2NrcXVv
    dGU+PC9ESVY+PC9TUEFOPjwvU1BBTj48L0JPRFk+PC9IVE1MPg==

    The above is the very same message, formatted as HTML. It will look about the same, though this is not at all valid HTML: the tags are not properly nested (the DIV is closed before the two SPANs opened inside it), and an id should be unique but id=tobit_blockquote is used twice in this one-line message. Actually, the word "blockquote" suggests that you might have copied the word Тест from another message?

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML><HEAD>
    <META http-equiv=Content-Type content="text/html; charset=utf-8">
    <META name=GENERATOR content="MSHTML 8.00.6001.18852"></HEAD>
    <BODY style="FONT: 10pt Courier New; COLOR: #000000" leftMargin=5 topMargin=5>
    <DIV style="FONT: 10pt Courier New; COLOR: #000000">Тест<SPAN 
    id=tobit_blockquote><SPAN id=tobit_blockquote></DIV></SPAN></SPAN>
    </BODY></HTML>
    

    Also, there's no need to send HTML for simple messages...
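The two HTML problems mentioned above (badly nested tags and a repeated id) can be detected mechanically. The following is a rough sketch built on Python's `html.parser`; the `NestingChecker` class and its small list of void elements are assumptions for this example, not a real validator.

```python
from html.parser import HTMLParser

VOID = {"meta", "br", "img", "hr", "link", "input"}  # tags that never get closed

HTML = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META name=GENERATOR content="MSHTML 8.00.6001.18852"></HEAD>
<BODY style="FONT: 10pt Courier New; COLOR: #000000" leftMargin=5 topMargin=5>
<DIV style="FONT: 10pt Courier New; COLOR: #000000">Тест<SPAN 
id=tobit_blockquote><SPAN id=tobit_blockquote></DIV></SPAN></SPAN>
</BODY></HTML>
"""


class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.problems = []  # (closed_tag, tag_that_was_still_open)
        self.id_counts = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.id_counts[value] = self.id_counts.get(value, 0) + 1
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        elif tag in self.stack:
            # Closed out of order: record it, then drop the latest matching open tag.
            self.problems.append((tag, self.stack[-1]))
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]


checker = NestingChecker()
checker.feed(HTML)
duplicate_ids = [i for i, n in checker.id_counts.items() if n > 1]
```

For the message above, `checker.problems` records that `div` was closed while a `span` was still open, and `duplicate_ids` lists `tobit_blockquote`.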

    Finally (note the two trailing dashes):

    ------_=_NextPart_000_00017783.4AF7FB71--

    This tells the email software the end of all formats is reached.
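To see the whole structure at once, the message can be fed to Python's `email` parser. This sketch makes two assumptions that differ from the listing in the question: the blank lines that separate each part's headers from its body are restored (the pasted source does not show them), and the long HTML part is replaced by a short, hypothetical `<b>Тест</b>` body.

```python
import base64
import email

BOUNDARY = "----_=_NextPart_000_00017783.4AF7FB71"

# Hypothetical, shortened stand-in for the HTML alternative.
html_b64 = base64.b64encode("<b>Тест</b>".encode("utf-8")).decode("ascii")

RAW = "\n".join([
    "Subject: ????",
    "MIME-Version: 1.0",
    "Content-Type: multipart/alternative;",
    f' boundary="{BOUNDARY}"',
    "",
    "This message is in MIME format.",
    f"--{BOUNDARY}",               # each part starts with "--" + boundary
    "Content-Type: text/plain;",
    ' charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    "0KLQtdGB0YI=",
    f"--{BOUNDARY}",
    "Content-Type: text/html;",
    ' charset="utf-8"',
    "Content-Transfer-Encoding: base64",
    "",
    html_b64,
    f"--{BOUNDARY}--",             # two trailing dashes end the message
    "",
])

msg = email.message_from_string(RAW)

alternatives = {}
for part in msg.walk():
    if part.is_multipart():
        continue  # skip the multipart container itself
    charset = part.get_content_charset() or "ascii"
    # decode=True undoes the base64 transfer encoding and returns bytes.
    alternatives[part.get_content_type()] = part.get_payload(decode=True).decode(charset)
```

`alternatives` then maps `text/plain` and `text/html` to the decoded text of each alternative, which is what a MIME-aware client chooses between.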

    This test message does not explain how Тест could become Oano, as the question marks could never translate into that. Maybe the question marks are not real question marks after all. Anyway: the Subject being wrong is a bug in your email client, which does not send the correct Subject. Also the HTML is buggy. Stop using that software.

  • 8088

    It surely must be a character set and/or encoding problem. Nowadays all the different character sets like "Russian (ISO)" and "Russian (Windows)" should no longer be required when using Unicode. And when using Unicode, most messages will be encoded using UTF-8.

    So:

    • Does changing the character set to Unicode help?
    • Does changing the encoding to UTF-8 help?
    • If not: can you post the source of the test message, after you received that? (Be careful to replace any email addresses with something like [email protected] before adding it to your question.)
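Why the choice of character set matters can be sketched as follows. The mapping of "Russian (Windows)" to the `cp1251` codec and "Russian (ISO)" to `iso8859_5` is an assumption based on the usual naming. The last step is only a guess at how a title like "Oano" could arise; nothing in the question confirms that the client behaves this way.

```python
import unicodedata

word = "Тест"

windows_bytes = word.encode("cp1251")   # assumed to be "Russian (Windows)"
iso_bytes = word.encode("iso8859_5")    # assumed to be "Russian (ISO)"
utf8_bytes = word.encode("utf-8")

# The same word is three different byte sequences, so a reader that
# assumes the wrong character set shows the wrong letters.
wrong_cyrillic = iso_bytes.decode("cp1251")

# Windows-1251 bytes read as Latin-1 give accented Latin letters.
as_latin1 = windows_bytes.decode("latin-1")   # 'Òåñò'


def strip_accents(s: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


stripped = strip_accents(as_latin1)   # 'Oano'
```

With UTF-8 used end to end, none of these conversions is needed, which is the point of the two questions above.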

    All email clients have a different way to show the true source, so using some online service might be the easiest way to explain how to see what is received:

    • Send a test message to some Mailinator account. No need to create an account: anything you put before @mailinator.com will work, but note that anyone who guesses that address can read the Inbox.
    • Go to its Inbox at mailinator.com
    • Click on the subject to open the message
    • While viewing the message, click the "(text view)" link:

    Mailinator Inbox

    • This will show something like:

      Received: from [..]
        by [..]
        for <[email protected]>; Fri, 6 Nov 2009 11:58:10 +0100 (CET)
      Subject: =?utf-8?Q?Test_/_=D1=82=D0=B5=D1=81=D1=82?=
      From: Arjan <[..]>
      Content-Type: text/plain; charset=utf-8; format=flowed
      Message-Id: [..]
      Date: Fri, 6 Nov 2009 11:58:08 +0100
      To: [email protected]
      Content-Transfer-Encoding: quoted-printable
      Mime-Version: 1.0 (Apple Message framework v1076)
      X-Mailer: Apple Mail (2.1076)
      X-Virus-Scanned: by XS4ALL Virus Scanner
      
      A test / =D1=82=D0=B5=D1=81=D1=82 for Super User.
      
      Gr=C3=BC=C3=9Fen!
      
      Arjan.=
      .
      

    Above, some personal details have been removed: no need to show us your email address or server details (like IP addresses).

    (For some reason Mailinator shows the UTF-8 encoded source of the message, "A test / тест for Super User. Grüßen!", as ASCII in the above screen capture. Seeing things like Ã¼ for ü and ÃŸ for ß typically indicates UTF-8 encoded text that has not been decoded as UTF-8. Still, it converts the subject just fine. And the last dot is actually a left-over from the SMTP communications, and could have been removed by Mailinator.)
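The quoted-printable body and the encoded Subject from the sample source can be decoded with Python's standard library. A small sketch; Windows-1252 is used as the "wrong" single-byte decoding, which is an assumption about what the viewer did.

```python
import quopri
from email.header import decode_header, make_header

# Body line from the sample: "=XX" escapes are raw UTF-8 bytes.
body_bytes = quopri.decodestring(b"Gr=C3=BC=C3=9Fen!")
body = body_bytes.decode("utf-8")            # 'Grüßen!'

# Subject from the sample: in Q encoding, "_" stands for a space.
subject = str(make_header(decode_header(
    "=?utf-8?Q?Test_/_=D1=82=D0=B5=D1=81=D1=82?="
)))                                           # 'Test / тест'

# What the body looks like when its UTF-8 bytes are read as Windows-1252.
undecoded = body_bytes.decode("cp1252")       # 'GrÃ¼ÃŸen!'
```

The two-characters-per-letter pattern in `undecoded` is the usual sign that UTF-8 text was displayed without being decoded as UTF-8.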