encoding - convert file type to utf-8 on unix - iconv is failing

2014-03
  • user7926

    Possible Duplicates:
    Batch-convert files for encoding or line ending under Windows
    How can I convert multiple files to UTF-8 encoding using *nix command line tools?

    I've got a PHP file on my Windows machine that, after being moved over to *nix with WinSCP, does not show its characters correctly.

    I've dragged the file back from the Linux machine down to Windows and checked the encoding with Notepad++, and it says it is ANSI.

    So I tried iconv -f ANSI -t utf-8 filename.php>filename.php, but got an error that conversion from ANSI is not supported. I've also tried MS_ANSI; that gives no error, but the file still doesn't show the proper encoding.

    I open the file with winSCP to see how it looks, and many special characters show up as '?'. Seeing as the purpose of the script is to remove these special characters from my data, it is really causing a bit of an issue.

    Is there another tool for changing the encoding? I tried yum iconv, but get a no package available response.

    How would you convert this file to the proper encoding?

  • Answers
  • quack quixote

    I have similar troubles with MD5 hashes created on WindowsXP (under Cygwin), saved to a file, then copied to a Linux system where the hashes are computed for copy verification. If the name of a file being hashed contains non-ASCII characters, md5sum reports the file missing, because it's not decoding the filename correctly. However, if I open the textfile containing the hashes in Notepad and change the encoding from ANSI to UTF-8, the Linux md5sum will get the encoding correct.

    ANSI isn't really a proper encoding (to anyone but Microsoft), so that's why iconv isn't picking up on it. You might get away with windows-1252 instead, but there's no guarantee it will always work:

    iconv -f windows-1252 -t utf-8 filename.from > filename.to
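
    Note that the question's own attempt redirected iconv's output onto its input (filename.php>filename.php). The shell truncates the target of > before iconv reads anything, so the input is already empty when conversion starts. A minimal sketch of a safer pattern, assuming the source really is windows-1252; the scratch directory, the sample content and the .tmp suffix are illustrative choices, not part of the original answer:

    ```shell
    # Work in a scratch directory so no real file is touched.
    workdir=$(mktemp -d)
    cd "$workdir" || exit 1

    # Sample input: "café" with é stored as the single windows-1252 byte 0xE9.
    printf 'caf\351\n' > filename.php

    # Write to a temporary file first; replace the original only if iconv succeeded.
    iconv -f windows-1252 -t utf-8 filename.php > filename.php.tmp &&
        mv filename.php.tmp filename.php

    # Dump the bytes: é should now be the two-byte UTF-8 sequence c3 a9.
    od -An -tx1 filename.php
    ```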
    

    For the record, file gives me this on one of those MD5 textfiles:

    $ file tequila.ansi.txt
    tequila.ansi.txt: ISO-8859 text
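
    file only reports a guess, and "ISO-8859 text" covers several similar encodings. A rough complementary check, sketched here, is to ask iconv to decode the file as UTF-8: it exits non-zero on byte sequences that are not valid UTF-8. The helper name is_utf8 and the sample files are made up for illustration; pure ASCII files also pass, since ASCII is a subset of UTF-8.

    ```shell
    # Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
    is_utf8() {
        iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
    }

    workdir=$(mktemp -d)
    printf 'tequila \303\261\n' > "$workdir/utf8.txt"   # ñ as UTF-8 (c3 b1)
    printf 'tequila \361\n'     > "$workdir/ansi.txt"   # ñ as windows-1252 (f1)

    for f in "$workdir"/*.txt; do
        if is_utf8 "$f"; then
            echo "$(basename "$f"): valid UTF-8"
        else
            echo "$(basename "$f"): not UTF-8, needs conversion"
        fi
    done
    ```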
    
  • Matthew Talbert

    You could just convert it to UTF-8 with Notepad++.

  • CesarB

    There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.

    The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).

    So, to convert the file correctly, you first should find out which is the "ANSI" encoding for your Windows system (or simply ask your text editor there to save using a specific encoding).
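
    To see why the choice matters, the sketch below decodes the same single byte under three Windows code pages. The byte value and the list of code pages are arbitrary illustrations, and it assumes an iconv that knows these names (glibc and GNU libiconv do).

    ```shell
    # Decode the single byte 0xE9 under several "ANSI" code pages and show
    # the UTF-8 bytes each one produces.
    for cp in windows-1251 windows-1252 windows-1253; do
        bytes=$(printf '\351' | iconv -f "$cp" -t utf-8 | od -An -tx1 | tr -d ' \n')
        echo "$cp: $bytes"
    done
    ```

    The three lines differ: windows-1252 yields c3a9 (é), windows-1251 yields d0b9 (Cyrillic й), and windows-1253 yields ceb9 (Greek ι). Picking the wrong source code page therefore converts without any error but produces the wrong text.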

  • hlovdal

    Are you sure "ANSI" is the correct character encoding/input name for iconv? You could try running "file filename.php"; file will often tell you what it thinks the encoding is. You could also try not specifying the source encoding when doing the conversion, or you could just try all of them:

    for i in $(iconv -l | sed 's|//$||'); do iconv -f "$i" -t utf-8 filename.php > "filename.php.$(printf %s "$i" | tr / _)"; done
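
    Trying every encoding leaves hundreds of output files to inspect. If you know a word that should appear in the text, a narrower variation is to try a short candidate list and report which decodings contain it. The candidate list, the sample file and the expected word here are assumptions for illustration only.

    ```shell
    workdir=$(mktemp -d)
    # Sample: "Größe" written as windows-1252 bytes (ö = f6, ß = df).
    printf 'Gr\366\337e\n' > "$workdir/sample.txt"

    # The word as it should read once decoded correctly, given as UTF-8 bytes.
    expected=$(printf 'Gr\303\266\303\237e')

    matches=""
    for enc in windows-1250 windows-1251 windows-1252 windows-1253; do
        # Skip encodings that cannot decode the input at all.
        decoded=$(iconv -f "$enc" -t utf-8 "$workdir/sample.txt" 2>/dev/null) || continue
        case $decoded in
            *"$expected"*) matches="$matches $enc" ;;
        esac
    done
    echo "candidates:$matches"
    ```

    More than one candidate can survive, because related code pages share many byte assignments; a second known word usually narrows the list further.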
    

  • Related Question

    windows - Batch-convert files for encoding
  • desolat

    How can I batch-convert files in a directory for their encoding (e.g. ANSI->UTF-8) with a command or tool?

    For single files an editor helps, but how to do the mass files job?


  • Related Answers
  • elbekko

    Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)

    Convert from one (-f) to the other (-t) with:

    $ iconv -f windows-1252 -t utf-8 infile > outfile
    

    Or in a find-all-and-conquer form:

    ## this will clobber the original files!
    $ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;
    

    Alternatively:

    ## this will clobber the original files!
    $ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
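
    Both one-liners aim to re-encode every matching file unconditionally, so a second run would double-encode text that is already UTF-8. A more defensive sketch, assuming every non-UTF-8 file is windows-1252 and that file names contain no newlines: skip files that already decode as UTF-8 and replace the rest through a temporary file. The function name to_utf8_tree is made up for this example.

    ```shell
    # Hypothetical helper: convert every *.txt under a directory to UTF-8 in place.
    to_utf8_tree() {
        find "$1" -type f -name '*.txt' | while IFS= read -r f; do
            # Already valid UTF-8 (this includes plain ASCII): leave it alone.
            if iconv -f utf-8 -t utf-8 "$f" > /dev/null 2>&1; then
                continue
            fi
            # Convert via a temporary file; replace the original only on success.
            if iconv -f windows-1252 -t utf-8 "$f" > "$f.tmp"; then
                mv "$f.tmp" "$f"
            else
                rm -f "$f.tmp"
                echo "failed: $f" >&2
            fi
        done
    }

    # Demo on a scratch tree.
    demo=$(mktemp -d)
    mkdir -p "$demo/sub"
    printf 'na\357ve\n' > "$demo/sub/a.txt"   # ï as the windows-1252 byte ef
    to_utf8_tree "$demo"
    to_utf8_tree "$demo"                       # second run changes nothing
    od -An -tx1 "$demo/sub/a.txt"
    ```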
    

    This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:

    There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.

    The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).

    The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:

    [...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
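
    The deviation described in that quote is easy to observe: bytes 0x80 to 0x9F are control codes in ISO-8859-1 but mostly printable characters in windows-1252. A small sketch, using the byte 0x80 as an arbitrary example:

    ```shell
    # Show the UTF-8 bytes produced when 0x80 is decoded under each encoding.
    for enc in iso-8859-1 windows-1252; do
        bytes=$(printf '\200' | iconv -f "$enc" -t utf-8 | od -An -tx1 | tr -d ' \n')
        echo "$enc: $bytes"
    done
    ```

    iso-8859-1 gives c280 (the control character U+0080), while windows-1252 gives e282ac (the euro sign, U+20AC). This is why a file that `file` labels as ISO-8859 may still need windows-1252 as the source encoding.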

  • Community

    With PowerShell you can do something like this:

    %  get-content IN.txt | out-file -encoding ENC -filepath OUT.txt
    

    where ENC is something like unicode, ascii, utf8, or utf32. Check out 'help out-file'.

    To convert all the *.txt files in a directory to UTF-8, do something like this:

    % foreach($i in ls -name DIR/*.txt) {
           get-content DIR/$i |
           out-file -encoding utf8 -filepath DIR2/$i
      }
    

    which creates a converted version of each .txt file in DIR2.

    EDIT: To replace the files in all subdirectories use:

    % foreach($i in ls -recurse -filter "*.java") {
        $temp = get-content $i.fullname
        out-file -filepath $i.fullname -inputobject $temp -encoding utf8 -force
    }
    
  • nagul

    The Wikipedia page on newlines has a section on conversion utilities.

    This seems to be your best bet for a conversion using only tools Windows ships with:

    TYPE unix_file | FIND "" /V > dos_file
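
    Note that this command rewrites line endings (LF to CRLF), not the character encoding. On the Unix side, both directions of that conversion can be sketched with standard tools when dos2unix is not installed; the file names are placeholders.

    ```shell
    workdir=$(mktemp -d)
    printf 'one\ntwo\n' > "$workdir/unix_file"

    # LF -> CRLF: emit each line followed by carriage return + line feed.
    awk '{ printf "%s\r\n", $0 }' "$workdir/unix_file" > "$workdir/dos_file"

    # CRLF -> LF: delete every carriage return (including any stray ones mid-line).
    tr -d '\r' < "$workdir/dos_file" > "$workdir/roundtrip"

    # Show the converted file with control characters made visible.
    od -An -c "$workdir/dos_file"
    ```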
    
  • 8088

    UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.

    UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings including UTF-8, UTF-16 and UTF-32 to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension, UTFCast can automatically detect text files and convert them.

  • nik

    There is dos2unix on unix.
    There was another similar tool for Windows (another ref here).

    How do I convert between Unix and Windows text files? has some more tricks

  • user1055927

    You can use EncodingMaster. It's free, it has Windows, Linux and Mac OS X versions, and it works really well.