windows 7 - Arabic/Urdu text scrambled

07
2014-07
  • HappyDev

    I created a file in Notepad++, set its encoding via Character sets -> Arabic -> ISO-8859-6, pasted in some Arabic text, and then closed the file.

    But when I reopened it, all the text had turned into weird characters, something like:

    Ê?æ??? åèÇÕäÇÊ? æØÇå¬

    I also opened the file with Microsoft Word and chose the encoding Arabic (Windows), but that didn't work either.

    I really need this data back. I would be really grateful if anyone could tell me how to get proper text back.

  • Answers
  • Jukka K. Korpela

    The file hasn’t been scrambled. It’s just in ISO-8859-6 encoding, and Notepad++ cannot read it, even though it wrote it. Notepad++ can work with a few encodings only; the large menu for setting encoding is for output only.

    Microsoft Word can read the file, but you need to specify the encoding as Arabic (ISO) when opening it. This means ISO-8859-6, which is different from the Windows Arabic encoding, windows-1256.
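
    To see why the choice between the two Arabic encodings matters, here is a minimal Python sketch (Python is an assumption here, not something the answer itself used). It takes one word from the garbled sample in the question, recovers the underlying bytes on the assumption that the viewer displayed them as Latin-1, and decodes them both ways. The helper name `redecode` is hypothetical.

```python
def redecode(garbled, real_encoding, shown_as="latin-1"):
    """Undo a wrong decoding: recover the raw bytes, then decode them properly."""
    raw = garbled.encode(shown_as)  # the bytes as stored in the file (assumed Latin-1 view)
    return raw.decode(real_encoding, errors="replace")


fragment = "åèÇÕäÇÊ"                        # one word from the garbled sample
as_iso = redecode(fragment, "iso-8859-6")   # Arabic (ISO): readable Arabic letters
as_windows = redecode(fragment, "cp1256")   # Arabic (Windows): a different mapping

# The two code pages disagree on several of these bytes, so the results differ.
print(as_iso == as_windows)
```

    Only the ISO-8859-6 result is the text that was originally pasted, which is consistent with Arabic (Windows) not working in Word.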

    Alternatively, you could edit the file in Notepad++ and add the following lines at the start:

    <!doctype html>
    <title>Test</title>
    <meta charset=iso-8859-6>
    

    Then save it with the .html extension and open it in a web browser. You should now see Arabic text, which you can copy and paste.

    As yet another option, download and install the BabelPad editor. Its Open command lets you select the encoding of the file being opened, with ISO-8859-6 as one of the alternatives.

    Note: There might be three odd-looking characters at the start of the file, namely the UTF-8 encoded Byte Order Mark (BOM), resulting from the way the file was written. This reflects the shortcomings of Notepad++.

    In general, it is best to work with UTF-8 throughout if possible. This wastes some bytes but saves trouble.
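
    If a scripting language is at hand, the move to UTF-8 can also be done programmatically. The following is a rough Python sketch, not a tested recovery tool: the function name `convert_to_utf8` and the file names in the usage note are hypothetical, and it assumes the file really contains ISO-8859-6 bytes, possibly preceded by the stray UTF-8 BOM mentioned above.

```python
import codecs
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="iso-8859-6"):
    """Rewrite a legacy-encoded text file as UTF-8, leaving line endings untouched."""
    raw = Path(src).read_bytes()
    # Drop a stray UTF-8 byte order mark (EF BB BF) if one sits at the start.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode(src_encoding)
    # Write bytes rather than text so that CRLF line endings are not translated.
    Path(dst).write_bytes(text.encode("utf-8"))
```

    A call such as `convert_to_utf8("arabic.txt", "arabic-utf8.txt")` would leave the original file untouched and write the converted copy next to it. Characters that were already saved as literal question marks cannot be restored this way.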


  • Related Question

    windows - Parsing text files
  • Joe Philllips

    I encountered a situation tonight where I wanted to parse a text file. I had a very, very long word list that contained English words delimited by lines. I wanted to get rid of every word (or line) that was longer than 7 characters. This would be simple in Linux but I can't seem to find a simple solution in Windows XP. I tried using Notepad++ regular expression search, but that was a huge failure. I tried using the expression .{6,} without finding any matches. I'm really at a loss because I thought this sort of thing would be extremely easy and there would be tons of tools to accomplish a task like this. It seems like Notepad++ supports every other feature in the world except the very basic ones that seem the most obvious.

    Another one of my goals was to put some code before and after the word on each line.

    aardvark
    apple
    azolio
    

    would turn into

    INSERT INTO Words (word) VALUES ('aardvark');
    INSERT INTO Words (word) VALUES ('apple');
    INSERT INTO Words (word) VALUES ('azolio');
    

    What suggestions/tools/tips do you have to accomplish tasks similar to this in Windows XP?


  • Related Answers
  • bobbymcr

    To add the SQL text, you could try this command prompt one-liner:

    (for /f %i in (words.txt) do @echo INSERT INTO Words ^(word^) VALUES ^('%i'^)) > words.sql

    To filter out lines in a text file longer than 7 characters, you could use another command line tool, findstr:

    findstr /v /r ^.........*$ words.txt > shorter-words.txt

    The /r option specifies that you want to use regex matching, and the /v option tells it to print lines that do not match. (Since it appears that findstr doesn't allow you to specify a character count range, I faked it with the "8 or more" pattern and the "do not match" option.)

  • John T

    Perl, for sure. Simply paste this script into a file and run it in the same directory as the word list. Rename your word list to words.txt or alter the name in the script. You can redirect the output to a new file like so:

    words.pl > list.txt
    

    Without further ado (I whipped it together quickly, so it can be trimmed down a fair bit):

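    use strict;
    use warnings;

    # Read words.txt line by line and print words of 7 characters or fewer.
    open(my $fh, '<', 'words.txt') or die $!;

    while (my $word = <$fh>) {
        chomp $word;    # drop the newline so it is not counted in the length
        print "$word\n" if length($word) <= 7;
    }

    close $fh;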
    
  • nik

    You can get the GNUWin32 sed for Windows XP.
    Similarly AWK and Perl too.
    That is if you are used to Unix scripting (if so also consider Cygwin).

    Otherwise there is also PowerShell.

  • Peter Mortensen

    gVim is a worthy editing tool that has its origins in the venerable vi used on Unix systems. You will want to use the substitute command to do global search/replacements for each word.

    AWK and Perl are very powerful tools, but overkill for what you need. You'll enjoy gVim since it is an editor first and foremost. The thing that rocks with gVim is that you are only one keystroke away from giving it a search/substitute/replace command which can be specified with the robust regular expression format.
    Good luck.

  • Eli Bendersky

    Maybe this is better suited for StackOverflow, because the best advice I can give you is to learn one of the scripting languages to make such tasks easier. It's much better to know one powerful tool than dozens of little ones, IMHO, and it's an investment that pays off.

    Downloading Python and going through the tutorial will take a few hours, but afterwards such tasks will seem very easy to you. Better yet, you will learn to recognize tasks "looking for some programming" in other fields as well, and it will increase your productivity tenfold.
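
    To give an idea of what that looks like for the task in the question, here is a minimal Python sketch. It is one possible approach, not a definitive solution: the function names and the file names words.txt and words.sql are assumptions, and the 7-character limit and the INSERT template are taken from the question.

```python
def short_words(lines, max_len=7):
    """Yield stripped words that are at most max_len characters long."""
    for line in lines:
        word = line.strip()
        if word and len(word) <= max_len:
            yield word


def to_insert(word, table="Words", column="word"):
    """Build one INSERT statement, doubling single quotes so the SQL stays valid."""
    escaped = word.replace("'", "''")
    return f"INSERT INTO {table} ({column}) VALUES ('{escaped}');"


def convert(src="words.txt", dst="words.sql"):
    """Read the word list and write one INSERT statement per short word."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for word in short_words(fin):
            fout.write(to_insert(word) + "\n")


# Small in-memory demonstration using the words from the question.
sample = ["aardvark\n", "apple\n", "azolio\n"]
for w in short_words(sample):
    print(to_insert(w))
```

    With the sample above, "aardvark" (8 characters) is dropped and the other two words become INSERT statements. Calling `convert()` would do the same for a whole file.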

  • Yar

    Massively underestimated as a development tool is Microsoft Excel (or OpenOffice Spreadsheets). There is a max number of lines, but you might be able to take advantage of one of these tools.

    Then you can just use the left, mid, if, etc. functions in the Spreadsheet in formulas that go to the right of your lines. They will automatically get copied with relative references.

    Many times it's a lot easier than coding, unless you're a coder :) From there you can import, export, and do a lot of cool things even with text.

  • Umber Ferrule

    I would use TextPad for this.

    I've used it extensively for regular expressions in the past.

    I'd try finding something like:

      ^[[:alpha:]]{8,}\n
    

    And replacing with nothing.

  • Joel Coehoorn

    Your expression is wrong. To match the lines you want to keep (7 characters or fewer), you want this:

    ^.{0,7}$
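
    The effect of the `^` and `$` anchors can be checked outside the editor. This is a small sketch using Python's `re` module (an assumption; the question was about Notepad++, whose regex support may differ). The helper name `max_length_pattern` is hypothetical, and the length bound is left as a parameter.

```python
import re


def max_length_pattern(max_len):
    """Compile a pattern matching whole lines of at most max_len characters."""
    return re.compile(rf"^.{{0,{max_len}}}$")


words = ["apple", "azolio", "aardvark"]

# Anchored: the whole line must fit within the bound.
keep = max_length_pattern(7)
print([w for w in words if keep.search(w)])

# Unanchored, as in the question: matches any line with 6 or more characters.
unanchored = re.compile(r".{6,}")
print([w for w in words if unanchored.search(w)])
```

    The anchored pattern selects the short lines to keep, while the unanchored `.{6,}` selects lines of six or more characters, so the two are not interchangeable.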

  • iDevlop

    I second using Excel for this.

    Put all your words in column A.

    Put this formula in column B:

    =IF(LEN(A1)>7,"",CONCATENATE("INSERT INTO Words (word) VALUES ('",A1,"');"))

    Copy the formula to all rows.

    Each row in column B will contain your SQL INSERT command when the word is 7 characters or shorter. Otherwise it will be blank.

    If you want to remove the blank lines, copy and paste as values column B to another column, then just sort the column. The blank lines will be pushed to the bottom.

  • Peter Mortensen

    This can be done with a Perl one-liner (getting rid of every word longer than 7 characters):

    perl -nle "print if length($_) <= 7" "D:\temp2\input.txt" > ShortWords.txt
    

    Put this in a BAT file or execute directly from a command line window (Run/cmd).

    Perl must be installed. I use ActivePerl; it is very easy to install, as it has a normal Windows installer.

    For the second part of your question (generating the SQL commands): it is just an extension of the first Perl one-liner:

    perl -nle "print 'INSERT INTO Words (word) VALUES (\'' . $_ . '\');' if length($_) <= 7" "D:\temp2\input.txt" > SQLcommands.txt
    

    If it gets more complicated, then it is probably better done with a normal Perl script, as suggested by John T.

  • ccpizza

    Believe it or not, Microsoft Word has regular expressions too: Ctrl+H > More > Use wildcards. The search expression will probably be something like ?{8,} (in Word's wildcard syntax, ? stands for any single character). Press F1 while the Find and Replace dialog is shown to see a description of Word's wildcard syntax.