linux - Converting Filename Encoding

2014-03
  • kaykun

    My operating system is Arch Linux. I am trying to extract a .zip archive that contains CJK characters in its filenames. It was most likely created on a Windows machine.

    I tried the unzip utility and it produced invalid symbols. The same with 7za, but with slightly different ones. My LANG variable is set to en_US.UTF-8, and setting it to ja_JP.ujis seems to have no effect. I'm assuming this means that the CJK filenames were encoded into the archive in a format other than UTF-8, and that I need to convert the names to UTF-8 for them to display properly.

    I know of convmv, and I used a shell script to test every encoding listed by convmv --list, to no avail. I have the Unicode equivalents of most of the filenames, but in a format that makes renaming them all by hand cumbersome; still, they let me verify whether a conversion was successful.

    Looking at a hex dump of the ls output and deducing by position, I concluded that U+4EBA (人) is represented as 0xC9 0x6C in the unzip output and as 0xC2 0x90 0x6C in the 7za output. This also means it's possible I'm not dealing with the original encoding in the first place.
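
    Since I have both the raw bytes and the expected Unicode text, one more lead is to brute-force the decoding in Python. This is a minimal sketch; the byte string assumes the 0xC2 that 7za appears to prepend can be stripped, and the candidate list is just a guess at plausible CJK encodings:

    #!/usr/bin/python
    # Which encoding turns the observed bytes into the expected character?
    raw = b"\x90\x6c"       # bytes for the character, with the 0xC2 stripped
    expected = "\u4eba"     # 人

    for enc in ("shift_jis", "cp932", "euc_jp", "gbk", "big5", "cp949"):
        try:
            if raw.decode(enc) == expected:
                print("candidate encoding:", enc)
        except UnicodeDecodeError:
            pass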

    So, why would two zip extractors produce different results, and are there any other leads to help me convert these filenames to UTF-8 correctly?

  • Answers
  • new123456

    My first guess, when dealing with UTF-8 pathnames, is to attempt to use the Python zipfile library - I'm guessing that it's cross-platform enough for your needs (OTOH, the module docs mention nothing about UTF-8...).

    Here is a small script to attempt this:

    #!/usr/bin/python
    import zipfile
    import sys
    import os
    
    if len(sys.argv) < 3:
        print("I require a file name and a directory to unzip to")
        sys.exit(1)

    # Open the archive; create the target directory if it doesn't exist.
    archive = zipfile.ZipFile(sys.argv[1])
    if not os.path.exists(sys.argv[2]):
        os.mkdir(sys.argv[2])

    archive.extractall(sys.argv[2])
    

    This can be chmod +x'd and run - see if it works in your case.

    In all infinite improbability, this will solve your problem.
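
    If the extraction succeeds but the names are still garbled, one known wrinkle (per the zipfile docs) is that entry names without the archive's UTF-8 flag are decoded as CP437. Assuming the original names are really Shift-JIS (a guess based on the bytes in the question), you can undo the CP437 decoding and reinterpret the raw bytes - a sketch:

    # Hypothetical fix-up: recover the raw name bytes from zipfile's
    # CP437 decoding, then reinterpret them as Shift-JIS (CP932).
    for info in archive.infolist():
        fixed = info.filename.encode("cp437").decode("cp932")
        print(info.filename, "->", fixed)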


  • Related Question

    windows - Batch-convert files for encoding
  • desolat

    How can I batch-convert the encoding of the files in a directory (e.g. ANSI -> UTF-8) with a command or tool?

    For single files, an editor helps, but how can I handle lots of files at once?


  • Related Answers
  • elbekko

    Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)

    Convert from one (-f) to the other (-t) with:

    $ iconv -f windows-1252 -t utf-8 infile > outfile
    

    Or in a find-all-and-conquer form:

    ## converts the files in place - keep a backup!
    $ find . -name '*.txt' -exec sh -c 'iconv --verbose -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;
    

    Alternatively:

    ## -o truncates its output before reading, so don't point it at the input;
    ## this writes a converted copy next to each original (GNU find substitutes {} even inside an argument)
    $ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {}.utf8 {} \;
    

    This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:

    There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.

    The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it means "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used on your system (and thus usually the one used for text files).

    The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:

    [...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
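
    To make that deviation concrete: the 0x80-0x9F range holds printable characters in Windows-1252 but control codes in ISO-8859-1. A quick Python check (the byte values here are just illustrative picks):

    # 0x93/0x94 are curly quotes in CP 1252, control codes in ISO-8859-1.
    sample = bytes([0x93, 0x94])
    print(sample.decode("cp1252"))   # -> “”
    print(sample.decode("latin-1"))  # -> invisible U+0093/U+0094 controls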

  • Community

    With PowerShell you can do something like this:

    %  get-content IN.txt | out-file -encoding ENC -filepath OUT.txt
    

    where ENC is something like unicode, ascii, utf8, or utf32. Check out 'help out-file'.

    To convert all the *.txt files in a directory to utf8, do something like this:

    % foreach($i in ls -name DIR/*.txt) {
           get-content DIR/$i |
           out-file -encoding utf8 -filepath DIR2/$i
      }
    

    which creates a converted version of each .txt file in DIR2.

    EDIT: To replace the files in all subdirectories use:

    % foreach($i in ls -recurse -filter "*.java") {
        $temp = get-content $i.fullname
        out-file -filepath $i.fullname -inputobject $temp -encoding utf8 -force
    }
    
  • nagul

    The Wikipedia page on newlines has a section on conversion utilities.

    This seems to be your best bet for a conversion using only tools Windows ships with:

    TYPE unix_file | FIND "" /V > dos_file
    
  • 8088

    UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.

    UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings including UTF-8, UTF-16 and UTF-32 to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension; UTFCast can automatically detect text files and convert them.

  • nik

    There is dos2unix on Unix, and there was a similar tool for Windows as well.

    How do I convert between Unix and Windows text files? has some more tricks.
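
    If neither tool is at hand, the conversion is simple enough to script. A minimal Python sketch (infile.txt and outfile.txt are placeholder names):

    # Convert Windows (CRLF) line endings to Unix (LF).
    with open("infile.txt", "rb") as src:
        data = src.read()
    with open("outfile.txt", "wb") as dst:
        dst.write(data.replace(b"\r\n", b"\n"))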

  • user1055927

    You can use EncodingMaster. It's free, it has Windows, Linux and Mac OS X versions, and it works really well.