windows - Batch-convert files for encoding

  • desolat (asked 2014-03)

    How can I batch-convert the files in a directory from one encoding to another (e.g. ANSI -> UTF-8) with a command or tool?

    For single files an editor helps, but how do I handle the job for many files at once?

  • Answers
  • elbekko

    Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)

    Convert from one (-f) to the other (-t) with:

    $ iconv -f windows-1252 -t utf-8 infile > outfile
    

    Or in a find-all-and-conquer form (a shell redirection can't safely target its own input file, so the conversion goes through a temporary file):

    ## this overwrites the original files in place!
    $ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;
    

    Alternatively, using iconv's -o option (don't pass the same file as both input and output; iconv truncates its output file before reading, so -o {} {} would destroy your data):

    ## this overwrites the original files in place!
    $ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 -o "$1.tmp" "$1" && mv "$1.tmp" "$1"' sh {} \;
    

    This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:

    There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.

    The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).

    The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:

    [...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
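
    If you are not sure which codepage your system treats as ANSI, you can look it up in the registry. A minimal sketch, assuming PowerShell is available (the ACP value holds the ANSI codepage number):

    % (get-itemproperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage').ACP  # prints e.g. 1252
    

    The matching iconv name is then "windows-" followed by that number, e.g. windows-1252.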

  • Community

    With PowerShell you can do something like this:

    %  get-content IN.txt | out-file -encoding ENC -filepath OUT.txt
    

    where ENC is something like unicode, ascii, utf8, or utf32. Check out 'help out-file' for the full list.

    To convert all the *.txt files in a directory to UTF-8, do something like this:

    % foreach($i in ls -name DIR/*.txt) {
          get-content DIR/$i |
          out-file -encoding utf8 -filepath DIR2/$i
      }
    

    This creates a converted version of each .txt file in DIR2.

    EDIT: To convert the files in place in all subdirectories, use the following (each file is read completely into $temp before out-file overwrites it):

    % foreach($i in ls -recurse -filter "*.java") {
        $temp = get-content $i.fullname
        out-file -filepath $i.fullname -inputobject $temp -encoding utf8 -force
    }
    
  • nagul

    The Wikipedia page on newlines has a section on conversion utilities.

    This seems to be your best bet for a conversion using only tools Windows ships with; it rewrites Unix (LF) line endings as DOS (CRLF):

    TYPE unix_file | FIND "" /V > dos_file
    
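
    If PowerShell is available, a rough equivalent is the sketch below: get-content splits its input on either line-ending style, and on Windows set-content writes the lines back out with CRLF endings.

    % get-content unix_file | set-content dos_file
    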
  • 8088

    UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.

    UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings, including UTF-8, UTF-16 and UTF-32, to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension; UTFCast can automatically detect text files and convert them.

  • nik

    There is dos2unix on unix.
    There was another similar tool for Windows.

    "How do I convert between Unix and Windows text files?" has some more tricks.

  • user1055927

    You can use EncodingMaster. It's free, it has Windows, Linux and Mac OS X versions, and it works really well.


  • Related Question

    linux - How can I convert multiple files to UTF-8 encoding using *nix command line tools?
  • jason

    Possible Duplicate:
    Batch-convert files for encoding or line ending

    I have a bunch of text files that I'd like to convert from any given charset to UTF-8 encoding.

    Are there any command-line tools or Perl (or language of your choice) one-liners I can use to do this en masse?


  • Related Answers
  • grawity

    iconv converts between many character encodings, so with a little bash magic we can write:

    for file in *.txt; do
        iconv -f ascii -t utf-8 "$file" -o "${file%.txt}.utf8.txt"
    done
    

    This runs iconv -f ascii -t utf-8 on every file ending in .txt, writing the recoded output to a file with the same name but ending in .utf8.txt instead of .txt.

    This particular conversion won't actually change anything (ASCII is a subset of UTF-8), but it shows how to convert between encodings: substitute the actual source encoding for ascii in the -f option.