linux - pscp and UTF-8 characters

  • lucek

    I'm having trouble copying files with UTF-8 characters in their names using pscp on Windows. I'm using the command line, with the following commands:

    chcp 65001
    pscp -scp -p -pw {pass} -batch "user@remote_host:/Справочник/file.txt" "E:\Справочник\file.txt"
    scp: E:/??????????/file.txt: Cannot create file
    

    As shown, I get an scp: E:/??????????/file.txt: Cannot create file error. How can I transfer files with UTF-8 characters in their paths?

  • Answers
  • STTR

    Set a Unicode-capable (TrueType) console font for the application in the registry (the example below uses cmd.exe).

    The same approach works for any console application; if you run pscp directly, use the path to pscp.exe instead.

    Way 1, PowerShell:

    The registry key name is the application path, with each \ changed to _:

    AppUTF8Font.ps1:

    # Key name = application path with each '\' changed to '_'
    $app='.\%SystemRoot%_system32_cmd.exe'
    SL HKCU:\Console;NI $app;SL $app   # SL = Set-Location, NI = New-Item
    
    New-ItemProperty . FaceName -t STRING -va "Lucida Console"   # TrueType font
    New-ItemProperty . FontFamily -t DWORD -va 0x00000036        # 0x36 marks a TrueType font
    


    You need to allow local PowerShell scripts to run:

    powershell -command "Set-ExecutionPolicy RemoteSigned"
    

    And run:

    powershell .\AppUTF8Font.ps1
    

    Way 2, reg file:

    Alternatively, save the following as Cmd_UTF8.reg and import it:

    REGEDIT4
    
    [HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe]
    "FaceName"="Lucida Console"
    "FontFamily"=dword:00000036
    

    Command line:

    REG IMPORT Cmd_UTF8.reg
    

    To remove the setting again, save this as Cmd_UTF8_Delete.reg:

    REGEDIT4
    
    [-HKEY_CURRENT_USER\Console\%SystemRoot%_system32_cmd.exe]
    

    Command line:

    REG IMPORT Cmd_UTF8_Delete.reg
    
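    To verify that the key was created (or removed), you can list the per-application console keys; a small check, assuming reg.exe is available:

    REG QUERY HKCU\Console
    

    Note that console font settings are read when a window is created, so open a new cmd window after importing.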

  • Related Question

    linux - How to recode to UTF-8 conditionally?
  • Jonik

    I'm unifying the encoding of a large bunch of text files, gathered over time on different computers. I'm mainly going from ISO-8859-1 to UTF-8. This nicely converts one file:

    recode ISO-8859-1..UTF-8 file.txt
    

    I of course want to do automated batch processing for all the files, and simply running the above for each file has the problem that files already encoded in UTF-8 will have their encoding broken. (For instance, the character 'ä', originally in ISO-8859-1, will appear like this when viewed as UTF-8, if the above recode is done twice: � -> ä -> Ã¤)
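    To see the corruption concretely, here is a minimal sketch (assuming bash's printf, which understands \xHH escapes, recode acting as a filter on stdin/stdout, and xxd for the hex dump): 'ä' is byte E4 in ISO-8859-1; one recode yields the valid UTF-8 bytes C3 A4, and a second recode reinterprets those as ISO-8859-1 and produces C3 83 C2 A4, i.e. "Ã¤":

    printf '\xe4' | recode ISO-8859-1..UTF-8 | recode ISO-8859-1..UTF-8 | xxd
    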

    My question is, what kind of script would run recode only if needed, i.e. only for files that weren't already in the target encoding (UTF-8 in my case)?

    From looking at the recode man page, I couldn't figure out how to do something like this. So I guess this boils down to how to easily check the encoding of a file, or at least whether it's UTF-8 or not. This answer implies you could recognise valid UTF-8 files with recode, but how? Any other tool would be fine too, as long as I could use the result in a conditional in a bash script...


  • Related Answers
  • Jonik

    This script, adapted from harrymc's idea, recodes a file conditionally (based on the existence of certain UTF-8-encoded Scandinavian characters) and seems to work for me tolerably well.

    $ cat recode-to-utf8.sh
    
    #!/bin/sh
    # Recodes the specified file to UTF-8, except if it seems to be UTF-8 already
    
    result=`grep -c '[åäöÅÄÖ]' "$1"`
    if [ "$result" -eq "0" ]
    then
        echo "Recoding $1 from ISO-8859-1 to UTF-8"
        recode ISO-8859-1..UTF-8 "$1" # overwrites the file
    else
        echo "$1 was already UTF-8 (probably); skipping it"
    fi
    

    (Batch processing files is then a simple matter of e.g. for f in *.txt; do ./recode-to-utf8.sh "$f"; done.)

    NB: this totally depends on the script file itself being UTF-8. And as this is obviously a very limited solution suited to the kind of files I happen to have, feel free to add better answers which solve the problem in a more generic way.

  • harrymc

    ISO-8859-1 and UTF-8 are identical for the first 128 characters, so your problem is really how to detect files that contain funny characters, meaning bytes with values above 127.

    If the number of funny characters is not excessive, you could use egrep to scan the files and find out which ones need recoding.
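    For instance, a minimal sketch of that scan (assuming GNU grep; LC_ALL=C makes the character classes match raw bytes, so anything that is neither printable ASCII nor whitespace is flagged):

    LC_ALL=C egrep -l '[^[:print:][:space:]]' *.txt
    

    The files it lists contain bytes above 127 (or stray control characters) and are the candidates for recoding.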

  • user46971

    UTF-8 has strict rules about which byte sequences are valid. This means that if data could be UTF-8, you'll rarely get false positives if you assume that it is.

    So you can do something like this (in Python):

    def convert_to_utf8(data):
        # 'data' is a byte string (bytes in Python 3, str in Python 2)
        try:
            data.decode('UTF-8')
            return data  # already valid UTF-8; leave it untouched
        except UnicodeError:
            # not valid UTF-8: assume ISO-8859-1 and transcode
            return data.decode('ISO-8859-1').encode('UTF-8')
    

    In a shell script, you can use iconv to perform the conversion, but you'll need a means of detecting UTF-8. One way is to use iconv with UTF-8 as both the source and destination encodings: if the file is valid UTF-8, the output will be the same as the input, and iconv fails on anything that isn't.
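    A minimal sketch of that check (assuming GNU iconv and a shell variable f holding the file name); since iconv exits non-zero on invalid input, testing the exit status is enough:

    if iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        echo "$f is already valid UTF-8; skipping"
    else
        iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
    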

  • Pierre FABIER

    This message is quite old, but I think I can still contribute to this problem. First, create a script named recodeifneeded:

    #!/bin/bash
    # Find the current encoding of the file ($1 = target encoding, $2 = file)
    encoding=$(file -i "$2" | sed "s/.*charset=\(.*\)$/\1/")
    
    if [ ! "$1" == "${encoding}" ]
    then
        # Encodings differ, we have to recode
        echo "recoding from ${encoding} to $1 file : $2"
        recode "${encoding}..$1" "$2"
    fi
    

    You can use it this way:

    recodeifneeded utf-8 file.txt
    

    So, if you want to run it recursively and change the encoding of all *.txt files to (let's say) utf-8:

    find . -name "*.txt" -exec recodeifneeded utf-8 {} \;
    
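    One small assumption here: find can only execute the script if it is executable and resolvable (on the PATH, or referenced by an explicit path such as ./recodeifneeded), so you may need this first:

    chmod +x recodeifneeded
    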

    I hope this helps.