linux - pscp and UTF-8 characters

  • lucek

    I'm having trouble copying files with UTF-8 characters in their names using pscp on Windows. I'm using the command line, with the following commands:

    chcp 65001
    pscp -scp -p -pw {pass} -batch "user@remote_host:/Справочник/file.txt" "E:\Справочник\file.txt"
    scp: E:/??????????/file.txt: Cannot create file
    

    As shown, I get an scp: E:/??????????/file.txt: Cannot create file error. How can I transfer files with UTF-8 characters in their paths?

  • Answers
  • STTR

    Set a Unicode-capable (TrueType) console font for the application in the registry (the example below uses cmd.exe).

    The same approach works for any console application; if you run pscp directly, use the path to pscp.exe instead.

    Way 1, PowerShell:

    The registry key name is the application path, with each \ changed to _:

    AppUTF8Font.ps1:

    # Key name = application path with each '\' changed to '_'
    $app='.\%SystemRoot%_system32_cmd.exe'
    SL HKCU:\Console;NI $app;SL $app   # SL = Set-Location, NI = New-Item
    
    New-ItemProperty . FaceName -t STRING -va "Lucida Console"   # TrueType font
    New-ItemProperty . FontFamily -t DWORD -va 0x00000036        # 0x36 marks a TrueType font
    


    You need to allow local PowerShell scripts to run:

    powershell -command "Set-ExecutionPolicy RemoteSigned"
    

    And run:

    powershell .\AppUTF8Font.ps1
    

    Way 2, reg file:

    Alternatively, save the following as Cmd_UTF8.reg and import it:

    REGEDIT4
    
    [HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe]
    "FaceName"="Lucida Console"
    "FontFamily"=dword:00000036
    

    Command line:

    REG IMPORT Cmd_UTF8.reg
    

    To remove the setting again, save this as Cmd_UTF8_Delete.reg:

    REGEDIT4
    
    [-HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe]
    

    Command line:

    REG IMPORT Cmd_UTF8_Delete.reg
    
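    To verify that the key was created (or removed), you can list the per-application console keys; a small check, assuming reg.exe is available:

    REG QUERY HKCU\Console
    

    Note that console font settings are read when a window is created, so open a new cmd window after importing.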

  • Related Question

    linux - How to recode to UTF-8 conditionally?
  • Jonik

    I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:

    recode ISO-8859-1..UTF-8 file.txt
    

    I of course want to do automated batch processing for all the files, and simply running the above for each file has the problem that files already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä', originally in ISO-8859-1, will appear like this when viewed as UTF-8, if the above recode is done twice: � -> ä -> Ã¤)
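    To see the corruption concretely, here is a minimal sketch (assuming bash's printf, which understands \xHH escapes, recode acting as a filter on stdin/stdout, and xxd for the hex dump): 'ä' is byte E4 in ISO-8859-1; one recode yields the valid UTF-8 bytes C3 A4, and a second recode reinterprets those as ISO-8859-1 and produces C3 83 C2 A4, i.e. "Ã¤":

    printf '\xe4' | recode ISO-8859-1..UTF-8 | recode ISO-8859-1..UTF-8 | xxd
    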

    My question is, what kind of script would run recode only if needed, i.e. only for files that weren't already in the target encoding (UTF-8 in my case)?

    From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I could use the result in a conditional in a bash script...


  • Related Answers
  • Jonik

    This script, adapted from harrymc's idea, recodes a file conditionally (based on the existence of certain UTF-8-encoded Scandinavian characters) and seems to work for me tolerably well.

    $ cat recode-to-utf8.sh
    
    #!/bin/sh
    # Recodes the specified file to UTF-8, except if it seems to be UTF-8 already
    
    result=`grep -c '[åäöÅÄÖ]' "$1"`
    if [ "$result" -eq "0" ]
    then
        echo "Recoding $1 from ISO-8859-1 to UTF-8"
        recode ISO-8859-1..UTF-8 "$1" # overwrites the file
    else
        echo "$1 was already UTF-8 (probably); skipping it"
    fi
    

    (Batch processing files is then a simple matter of e.g. for f in *.txt; do ./recode-to-utf8.sh "$f"; done.)

    NB: this totally depends on the script file itself being UTF-8. And as this is obviously a very limited solution suited to the kind of files I happen to have, feel free to add better answers which solve the problem in a more generic way.

  • harrymc

    ISO-8859-1 and UTF-8 are identical for the first 128 characters, so your problem is really how to detect files that contain funny characters, meaning bytes with values above 127.

    If the number of funny characters is not excessive, you could use egrep to scan the files and find out which ones need recoding.
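    For instance, a minimal sketch of that scan (assuming GNU grep; LC_ALL=C makes the character classes match raw bytes, so anything that is neither printable ASCII nor whitespace is flagged):

    LC_ALL=C egrep -l '[^[:print:][:space:]]' *.txt
    

    The files it lists contain bytes above 127 (or stray control characters) and are the candidates for recoding.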

  • user46971

    UTF-8 has strict rules about which byte sequences are valid. This means that if data could be UTF-8, you'll rarely get false positives if you assume that it is.

    So you can do something like this (in Python):

    def convert_to_utf8(data):
        # 'data' is a byte string (bytes in Python 3, str in Python 2)
        try:
            data.decode('UTF-8')
            return data  # already valid UTF-8; leave it untouched
        except UnicodeError:
            # not valid UTF-8: assume ISO-8859-1 and transcode
            return data.decode('ISO-8859-1').encode('UTF-8')
    

    In a shell script, you can use iconv to perform the conversion, but you'll need a means of detecting UTF-8. One way is to use iconv with UTF-8 as both the source and destination encodings: if the file is valid UTF-8, the output will be the same as the input, and iconv fails on anything that isn't.
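    A minimal sketch of that check (assuming GNU iconv and a shell variable f holding the file name); since iconv exits non-zero on invalid input, testing the exit status is enough:

    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "$f is already valid UTF-8; skipping"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
    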

  • Pierre FABIER

    This message is quite old, but I think I can still contribute to this problem. First, create a script named recodeifneeded:

    #!/bin/bash
    # Find the current encoding of the file ($1 = target encoding, $2 = file)
    encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")
    
    if [ ! "$1" == "${encoding}" ]
    then
        # Encodings differ, we have to recode
        echo "recoding from ${encoding} to $1 file : $2"
        recode "${encoding}..$1" "$2"
    fi
    

    You can use it this way:

    recodeifneeded utf-8 file.txt
    

    So, if you want to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

    find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
    
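    One small assumption here: find can only execute the script if it is executable and resolvable (on the PATH, or referenced by an explicit path such as ./recodeifneeded), so you may need this first:

    chmod +x recodeifneeded
    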

    I hope this helps.