32 bit - Why is sort.exe truncating large output on 32-bit Windows?

2014-07
  • Nick P

    We have a data file from a client which is 1,443,777,659 bytes in size.

    The sorted output has lines missing and is only 1,269,801,985 bytes in size.

    Sample command: sort -k 1,10 -T . -s -i file_to_sort.txt -o out.txt

    We've tried on 32-bit Win 7 and XP systems.

    We've tried the sort.exe that ships with Windows, as well as binaries from UnxUtils and GNU coreutils.

    None gives an error, but all produce exactly the same output size. I've tried another freeware utility that works but is much slower.

    I believe this may be due to a 32-bit limitation, but the file size isn't near any of the usual suspects, and these programs work by writing and merging smaller files, none of which approaches 2 GB in size.

    Any tips on how to get to the bottom of this? Thanks.

  • Answers
  • Nick P

    OK, so the issue was not related to the size of the file at all. It seems that the file is opened in text mode and contains a 0x1A character (^Z, which Windows text mode treats as EOF) near the end.

    Once sort hits this character during input, it stops reading. There's no way around this within sort itself, as there is no flag to open the file as binary.
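One possible workaround outside of sort, not part of the original answer, is to make a cleaned copy of the input with the 0x1A bytes removed and sort that copy instead. The sketch below assumes Python is available and that dropping the byte is acceptable for the data. The function name `strip_ctrl_z` and the chunk size are illustrative choices.

```python
CTRL_Z = b"\x1a"  # the byte that text-mode reads on Windows treat as end-of-file


def strip_ctrl_z(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path in binary mode, dropping every 0x1A byte.

    The file is streamed in chunks so a multi-gigabyte input never has to
    fit in memory. Returns the number of bytes that were removed.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = chunk.replace(CTRL_Z, b"")
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Calling `strip_ctrl_z("file_to_sort.txt", "file_to_sort.clean.txt")` and then running sort on the cleaned file should avoid the early stop, provided 0x1A was the only cause.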

    I should have found this quicker, but it's not so easy to dig around a 1.5GB file :)
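For digging through a file of this size, a small script that reads in binary mode and reports where 0x1A occurs can save some time. This is a minimal sketch assuming Python is available. `find_byte_offsets` and its parameters are hypothetical names, not part of any tool mentioned here.

```python
def find_byte_offsets(path, needle=0x1A, chunk_size=1 << 20, limit=10):
    """Return up to `limit` byte offsets at which `needle` occurs in the file.

    The file is opened in binary mode, so 0x1A is treated as ordinary data
    rather than as an end-of-file marker.
    """
    target = bytes([needle])
    offsets = []
    position = 0  # absolute offset of the first byte of the current chunk
    with open(path, "rb") as f:
        while len(offsets) < limit:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            start = 0
            while len(offsets) < limit:
                idx = chunk.find(target, start)
                if idx == -1:
                    break
                offsets.append(position + idx)
                start = idx + 1
            position += len(chunk)
    return offsets
```

An empty result from `find_byte_offsets("file_to_sort.txt")` would suggest that the truncation has some other cause.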

    Related query: http://stackoverflow.com/questions/13582804/why-can-windows-not-read-beyond-the-0x1a-eof-character-but-unix-can

  • Ярослав Рахматуллин

    You don't want to ignore non-printable characters if the file contains them. Drop the -i option and run with LC_ALL=C.

    e.g. (in a Unix-style shell; in cmd.exe, use set LC_ALL=C instead of export)

    export LC_ALL=C
    sort -k 1,10 -s <file_to_sort.txt >out.txt
    

  • Related Question

    linux - Gnu Tools for Windows
  • Madhur Ahuja

    I was looking for Gnu Tools for Windows and came across two links:

    http://unxutils.sourceforge.net/

    and

    http://gnuwin32.sourceforge.net/

    Does anyone know what the difference between them is, and which one has the more comprehensive or better tools?


  • Related Answers
  • tapped-out

    Some solutions:

    Finally, if you want comprehensive, you're after Cygwin, which is the "standard" method of getting GNU tools on Windows, but is ... rather bulky. And the moment you have some third-party software installed, where the Windows binary was built using Cygwin and which bundles the DLL, you enter DLL hell.

  • bryan

    In a comment, Madhur Ahuja asks whether Cygwin is portable. The answer is no and yes. A standard Cygwin install is NOT portable, because it relies on a large set of files.

    BUT if you only need a few of the tools that Cygwin provides, for example sed, gawk and grep, you can put those and the files they depend on onto a USB drive and they will work.

    The files listed below, all located in Cygwin's /bin directory, will let you run find, gawk, grep, ls and sed from a USB drive.

    cyggcc_s-1.dll
    cygicons-0.dll
    cygiconv-2.dll
    cygintl-8.dll
    cygpcre-0.dll
    cygreadline7.dll
    cygsigsegv-2.dll
    cygwin1.dll
    find.exe
    gawk.exe
    grep.exe
    ls.exe
    sed.exe
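
Copying the files listed above can be scripted. The following is a rough sketch, assuming Python is available. The function name `copy_portable_tools` is hypothetical, and the DLL versions on a given Cygwin install may differ from this list, so the function reports anything it could not find.

```python
import os
import shutil

# File names taken from the list above; versions may differ between installs.
PORTABLE_FILES = [
    "cyggcc_s-1.dll", "cygicons-0.dll", "cygiconv-2.dll", "cygintl-8.dll",
    "cygpcre-0.dll", "cygreadline7.dll", "cygsigsegv-2.dll", "cygwin1.dll",
    "find.exe", "gawk.exe", "grep.exe", "ls.exe", "sed.exe",
]


def copy_portable_tools(cygwin_bin, dest_dir, names=PORTABLE_FILES):
    """Copy each named file from cygwin_bin into dest_dir.

    Returns a (copied, missing) pair of lists of file names.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied, missing = [], []
    for name in names:
        src = os.path.join(cygwin_bin, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(dest_dir, name))
            copied.append(name)
        else:
            missing.append(name)
    return copied, missing
```

The first argument would be the Cygwin /bin directory and the second a folder on the USB drive. Both paths depend on the local setup.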
    
  • lang2

    I recently came across Gow [1]. I installed it and it works quite well. Cygwin is too heavy.

    1. http://wiki.github.com/bmatzelle/gow/
  • jet

    MobaXterm is a feature-rich FTP/SSH/VNC/RDP/Telnet/rsh client and X server, and it also has a lot of GNU tools built in for interactive use.