linux - Program to check/look up UTF-8/Unicode characters in string on command line?

2014-07-08
  • sdaau

    I've just realized I have a file on my system; it lists normally:

    $ ls -la TΕSТER.txt 
    -rw-r--r-- 1 user user 8 2013-04-11 18:07 TΕSТER.txt
    $ cat TΕSТER.txt 
    testing
    

    ... yet it crashes a piece of software with a UTF-8/Unicode-related error. I was really puzzled, since I couldn't tell why such a file would be a problem; finally I remembered to check the output of ls with hexdump:

    $ ls TΕSТER.txt 
    TΕSТER.txt
    $ ls TΕSТER.txt | hexdump -C
    00000000  54 ce 95 53 d0 a2 45 52  2e 74 78 74 0a           |T..S..ER.txt.|
    0000000d
    

    ... Well, obviously there are extra bytes in place of some of the letters, so I guess it is a Unicode encoding problem. And I can try to echo the bytes back to see what is printed:

    $ echo -e "\x54\xCE\x95\x53\xD0\xA2\x45\x52\x2E\x74\x78\x74"
    TΕSТER.txt
    

    ... but I still cannot tell which - if any - Unicode characters these are.

    So is there a command-line tool which I can use to inspect a string on the terminal and get Unicode information about its characters?

  • Answers
  • sdaau

    Well, I looked a bit on the net, and found a one-liner ugrep in Look up a unicode character by name | commandlinefu.com; but that doesn't help me much here.

    Then I saw codecs – String encoding and decoding - Python Module of the Week, which does have a lot of options - but not much related to Unicode character names.

    So finally I coded a small tool, utfinfo.pl, which accepts input only on stdin.
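
    The script itself isn't reproduced here, but a minimal sketch that produces the same kind of output (assuming Perl's core charnames and Unicode::UCD modules for the character-name and block lookups) could look like this:

    #!/usr/bin/env perl
    # Minimal sketch of a utfinfo.pl-style filter (not the original script):
    # read UTF-8 bytes from stdin, then print code point, UTF-8 bytes,
    # character name and Unicode block for each character.
    use strict;
    use warnings;
    use Encode qw(decode encode);
    use charnames ();                 # charnames::viacode() for names
    use Unicode::UCD qw(charblock);   # charblock() for block names

    binmode STDIN,  ':raw';
    binmode STDOUT, ':utf8';

    my $raw = do { local $/; <STDIN> };   # slurp raw bytes
    $raw =~ s/[\r\n]+$//;                 # drop trailing newline (e.g. from ls)
    my $str = decode('UTF-8', $raw);      # interpret the bytes as UTF-8

    printf "Got %d uchars\n", length($str);
    for my $ch (split //, $str) {
        my $cp    = ord($ch);
        my @bytes = unpack 'C*', encode('UTF-8', $ch);
        printf "Char: '%s' u: %d [0x%04X] b: %s [%s] n: %s [%s]\n",
            $ch, $cp, $cp,
            join(',', @bytes),
            join(',', map { sprintf '0x%02X', $_ } @bytes),
            charnames::viacode($cp) // 'UNKNOWN',
            charblock($cp)          // 'UNKNOWN';
    }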

    ... which gives me the following information:

    $ ls TΕSТER.txt | perl utfinfo.pl 
    Got 10 uchars
    Char: 'T' u: 84 [0x0054] b: 84 [0x54] n: LATIN CAPITAL LETTER T [Basic Latin]
    Char: 'Ε' u: 917 [0x0395] b: 206,149 [0xCE,0x95] n: GREEK CAPITAL LETTER EPSILON [Greek and Coptic]
    Char: 'S' u: 83 [0x0053] b: 83 [0x53] n: LATIN CAPITAL LETTER S [Basic Latin]
    Char: 'Т' u: 1058 [0x0422] b: 208,162 [0xD0,0xA2] n: CYRILLIC CAPITAL LETTER TE [Cyrillic]
    Char: 'E' u: 69 [0x0045] b: 69 [0x45] n: LATIN CAPITAL LETTER E [Basic Latin]
    Char: 'R' u: 82 [0x0052] b: 82 [0x52] n: LATIN CAPITAL LETTER R [Basic Latin]
    Char: '.' u: 46 [0x002E] b: 46 [0x2E] n: FULL STOP [Basic Latin]
    Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
    Char: 'x' u: 120 [0x0078] b: 120 [0x78] n: LATIN SMALL LETTER X [Basic Latin]
    Char: 't' u: 116 [0x0074] b: 116 [0x74] n: LATIN SMALL LETTER T [Basic Latin]
    

    ... which then identifies which characters are not the "plain" ASCII ones.

    Hope this helps someone,
    Cheers!


  • Related Question

    notepad - Unicode, Unicode Big Endian or UTF-8? What is the difference? Which format is better?
  • Ashwin

    When I try to save a text file with non-English text in Notepad, I get an option to choose between Unicode, Unicode Big Endian and UTF-8. What is the difference between these formats?

    Assuming I do not want any backward compatibility (with older OS versions or apps) and I do not care about the file size, which of these formats is better?

    (Assume that the text can be in languages like Chinese or Japanese, in addition to other languages.)

    Note: From the answers and comments below it seems that in Notepad lingo, Unicode is UTF-16 (Little Endian), Unicode Big Endian is UTF-16 (Big Endian), and UTF-8 is, well, UTF-8.


  • Related Answers
  • Jason Baker

    Dunno. Which is better: a saw or a hammer? :-)

    Unicode isn't UTF

    There's a part of the article that's more relevant to the subject at hand, though:

    • UTF-8 focuses on minimizing the byte size for representation of characters from the ASCII set (variable-length representation: each character is represented in 1 to 4 bytes, and ASCII characters all fit in 1 byte). As Joel puts it:

    “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings

    • UTF-32 focuses on exhaustiveness and fixed-length representation, using 4 bytes for all characters. It’s the most straightforward translation, mapping directly the Unicode code-point to 4 bytes. Obviously, it’s not very size-efficient.

    • UTF-16 is a compromise, using 2 bytes most of the time, but expanding to 2 * 2 bytes per character to represent certain characters, those not included in the Basic Multilingual Plane (BMP).

    Also see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
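
    To make those size differences concrete, here is a small sketch (in Perl, matching the utfinfo.pl tool above; the encoding names are the ones Perl's Encode module accepts) that encodes one of the non-ASCII characters from the question in each of the three encodings:

    use strict;
    use warnings;
    use Encode qw(encode);

    # 'Т', CYRILLIC CAPITAL LETTER TE, from the TΕSТER.txt filename
    my $ch = chr(0x0422);
    for my $enc ('UTF-8', 'UTF-16LE', 'UTF-32LE') {
        printf "%-9s %d bytes\n", $enc, length(encode($enc, $ch));
    }
    # Output:
    # UTF-8     2 bytes
    # UTF-16LE  2 bytes
    # UTF-32LE  4 bytes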

  • Mark Ransom

    For European languages, UTF-8 is smaller. For Oriental languages, the difference is not so clear-cut.

    Both will handle all possible Unicode characters, so it should make no difference in compatibility.
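
    As a rough illustration of that, here is a sketch (in Perl with the Encode module; the two sample sentences are arbitrary, one Latin-script and one Chinese) comparing the encoded sizes:

    use strict;
    use warnings;
    use utf8;                  # this source file itself contains non-ASCII text
    use Encode qw(encode);

    my %samples = (
        english => 'The quick brown fox jumps over the lazy dog',
        chinese => '敏捷的棕色狐狸跳过了懒狗',
    );
    for my $name (sort keys %samples) {
        my $s = $samples{$name};
        printf "%-8s chars: %2d  UTF-8: %2d bytes  UTF-16LE: %2d bytes\n",
            $name, length($s),
            length(encode('UTF-8', $s)),
            length(encode('UTF-16LE', $s));
    }
    # For the English sentence UTF-8 is half the size of UTF-16LE;
    # for the Chinese one (3 bytes vs. 2 bytes per character) it is larger.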

  • Brad Gilbert

    There are more Unicode character encodings than you may think.

    • UTF-8

      The UTF-8 encoding is variable-width, ranging from 1-4 bytes, with the upper bits of each byte reserved as control bits. The leading bits of the first byte indicate the total number of bytes used for that character. The scalar value of a character's code point is the concatenation of the non-control bits. In this table, x represents the lowest 8 bits of the Unicode value, y represents the next higher 8 bits, and z represents the bits higher than that. (A worked example of this bit layout follows the list of encodings below.)

      Unicode             Byte1     Byte2     Byte3     Byte4
      U+0000-U+007F       0xxxxxxx
      U+0080-U+07FF       110yyyxx  10xxxxxx
      U+0800-U+FFFF       1110yyyy  10yyyyxx  10xxxxxx
      U+10000-U+10FFFF    11110zzz  10zzyyyy  10yyyyxx  10xxxxxx
      
    • UCS-2
    • UCS-2BE
    • UCS-2LE

    • UTF-16
    • UTF-16BE
    • UTF-16LE

    • UTF-32
    • UTF-32BE
    • UTF-32LE
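
    As a worked example tying the UTF-8 table above back to the question's filename, here is a sketch (in Perl; the bit operations just follow the two-byte row of the table) that packs U+0395, GREEK CAPITAL LETTER EPSILON, into its UTF-8 bytes by hand:

    use strict;
    use warnings;

    my $cp = 0x0395;                         # code point; fits the 2-byte row
    my $byte2 = 0b1000_0000 | ($cp & 0x3F);  # 10xxxxxx: lowest 6 bits
    my $byte1 = 0b1100_0000 | ($cp >> 6);    # 110yyyxx: the next 5 bits
    printf "0x%02X 0x%02X\n", $byte1, $byte2;
    # prints: 0xCE 0x95  -- exactly the bytes hexdump showed in the question
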
  • zildjohn01

    For small files like text files, the only real difference is the resulting file size. UTF-8 generally produces smaller files, but the difference may be less pronounced with Chinese/Japanese text.

  • John Saunders

    "Unicode" is another term for "UTF-16", which is an encoding of the Unicode character set into sixteen-bits per character. UTF-8 encodes it into eight bits per character.

    In both cases, any overflow is allocated to another 16 or eight bits.
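
    A small sketch (again in Perl with the Encode module) of what that overflow looks like in practice: a character outside the Basic Multilingual Plane does not fit in one 16-bit unit, so UTF-16 encodes it as a surrogate pair of two 16-bit units, while UTF-8 spends four 8-bit units on it:

    use strict;
    use warnings;
    use Encode qw(encode);

    for my $cp (0x41, 0x1F600) {             # plain 'A' and an emoji
        my $ch    = chr($cp);
        my @units = unpack 'v*', encode('UTF-16LE', $ch);   # 16-bit units
        my @bytes = unpack 'C*', encode('UTF-8',    $ch);   # 8-bit units
        printf "U+%04X  UTF-16: %s  UTF-8: %s\n", $cp,
            join(' ', map { sprintf '0x%04X', $_ } @units),
            join(' ', map { sprintf '0x%02X', $_ } @bytes);
    }
    # U+0041  UTF-16: 0x0041  UTF-8: 0x41
    # U+1F600  UTF-16: 0xD83D 0xDE00  UTF-8: 0xF0 0x9F 0x98 0x80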