linux - Batch convert to UTF-8 a directory having both UTF-8 and CP-1251 files

07
2014-07
  • sashoalm

    I have a directory of files, some encoded as UTF-8 and some as CP-1251. I want to convert the CP-1251 ones to UTF-8 without corrupting the files that are already UTF-8.

    I tried iconv -f cp1251 -t utf8 <...>. It works for the CP-1251 files, but a file that is already UTF-8 gets converted as well and becomes unreadable.

  • Answers
  • grawity

    You could get a list of files that are neither UTF-8 nor US-ASCII using:

    file -0 -i *.txt | awk -F '\0' '$2 !~ /charset=(us-ascii|utf-8)$/ {print $1}'
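
    One way to act on that list, sketched under the assumption that every file it names really is CP-1251: read the names line by line and convert each one through a temporary file. The helper name convert_listed is made up for this example, and file names containing newlines are not handled.

```shell
# convert_listed (hypothetical helper): convert every file named on stdin,
# one path per line, from CP-1251 to UTF-8 in place.
convert_listed() {
    while IFS= read -r f; do
        [ -f "$f" ] || continue
        tmp=$(mktemp) || return 1
        # Convert into a temporary file first: redirecting iconv's output
        # straight onto its input would truncate the file before it is read.
        if iconv -f CP1251 -t UTF-8 "$f" > "$tmp"; then
            cat "$tmp" > "$f"   # overwrite contents, keep the file's permissions
        else
            echo "skipped (conversion failed): $f" >&2
        fi
        rm -f "$tmp"
    done
}

# Usage with the command above:
#   file -0 -i *.txt | awk -F '\0' '$2 !~ /charset=(us-ascii|utf-8)$/ {print $1}' | convert_listed
```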
    
  • sashoalm

    I found a way to do it using enconv:

    enconv -L bulgarian -x utf8 file.txt
    

    It works for both UTF-8 and CP-1251 files.
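
    Where enconv is not available, a rough substitute using only iconv might look like the following. This is a different technique, not what enconv does internally: it treats any file that decodes cleanly as UTF-8 as already converted and assumes everything else is CP-1251. The helper name to_utf8_if_needed is hypothetical.

```shell
# to_utf8_if_needed (hypothetical helper): assumes each file is either
# valid UTF-8 (pure ASCII included) or CP-1251, and nothing else.
to_utf8_if_needed() {
    for f in "$@"; do
        # A file that decodes cleanly as UTF-8 is left untouched.
        if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            continue
        fi
        tmp=$(mktemp) || return 1
        if iconv -f CP1251 -t UTF-8 "$f" > "$tmp"; then
            cat "$tmp" > "$f"
        else
            echo "could not convert: $f" >&2
        fi
        rm -f "$tmp"
    done
}

# Usage: to_utf8_if_needed *.txt
```

    The check relies on Cyrillic text in CP-1251 almost never forming valid UTF-8 byte sequences, so it is a heuristic rather than a guarantee.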


  • Related Question

    encoding - convert file type to utf-8 on unix - iconv is failing
  • pedalpete

    Possible Duplicates:
    Batch-convert files for encoding or line ending under Windows
    How can I convert multiple files to UTF-8 encoding using *nix command line tools?

    I've got a PHP file on my Windows machine that, after being moved over to *nix with WinSCP, does not show its characters correctly.

    I've dragged the file back from the Linux machine to Windows and checked the encoding with Notepad++, which says it is ANSI.

    So I tried iconv -f ANSI -t utf-8 filename.php>filename.php, but got an error that conversion from ANSI is not supported. I've also tried MS_ANSI; I get no error, but the file still doesn't show the proper encoding.

    I open the file with WinSCP to see how it looks, and many special characters show up as '?'. Seeing as the purpose of the script is to remove these special characters from my data, it is really causing a bit of an issue.

    Is there another tool for changing the encoding? I tried yum iconv, but get a no package available response.

    How would you convert this file to the proper encoding?


  • Related Answers
  • quack quixote

    I have similar troubles with MD5 hashes created on Windows XP (under Cygwin), saved to a file, then copied to a Linux system where the hashes are computed for copy verification. If the name of a file being hashed contains non-ASCII characters, md5sum reports the file missing, because it's not decoding the filename correctly. However, if I open the text file containing the hashes in Notepad and change the encoding from ANSI to UTF-8, the Linux md5sum will get the encoding correct.

    ANSI isn't really a proper encoding (to anyone but Microsoft), which is why iconv doesn't recognize it. You might get away with windows-1252 instead, but there's no guarantee it will always work:

    iconv -f windows-1252 -t utf-8 filename.from > filename.to
    

    For the record, file gives me this on one of those MD5 text files:

    $ file tequila.ansi.txt
    tequila.ansi.txt: ISO-8859 text
    
  • Matthew Talbert

    You could just convert it to UTF-8 with Notepad++.

  • CesarB

    There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.

    The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding on your system (and thus usually the one used for text files).

    So, to convert the file correctly, you first should find out which is the "ANSI" encoding for your Windows system (or simply ask your text editor there to save using a specific encoding).

  • hlovdal

    Are you sure "ANSI" is a valid encoding name for iconv? You could run "file filename.php"; file will often tell you what it thinks the encoding is. You could also try omitting the source encoding when doing the conversion, or you could just try all of them:

    # ${i%//} strips the trailing // that some iconv builds append to names in the list
    for i in $(iconv -l); do iconv -f "${i%//}" -t utf-8 filename.php > "filename.php.${i%//}"; done
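
    Trying every encoding leaves a lot of output files to inspect by eye. A narrower sketch, assuming you know a word that should appear in the correctly decoded text, is to test a handful of candidate encodings and print the ones whose output contains that word. The function name find_source_encoding and the candidate list in the usage line are illustrative.

```shell
# find_source_encoding (hypothetical helper): print each candidate encoding
# under which FILE, once converted to UTF-8, contains WORD.
# Usage: find_source_encoding FILE WORD ENCODING...
find_source_encoding() {
    src=$1
    word=$2
    shift 2
    for enc in "$@"; do
        # iconv's error messages (unsupported name, undecodable bytes) are
        # discarded; a candidate is printed only if its output contains WORD.
        if iconv -f "$enc" -t UTF-8 "$src" 2>/dev/null | LC_ALL=C grep -qF -- "$word"; then
            printf '%s\n' "$enc"
        fi
    done
}

# Usage: find_source_encoding filename.php 'some expected word' windows-1252 CP1251 KOI8-R
```

    More than one name can be printed when several candidates map the relevant bytes to the same characters, so the result narrows the choice rather than settling it.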