linux - Extract a tar archive in place

  • Charlie Somerville

    I have a little dilemma here...

    I needed to move about 70 GB worth of files from one of my servers to the other, so I decided that tarring them up and sending the archive would be the fastest way.

    However, the receiving server only has 5 GB of space left after receiving the tar archive.

    Is there some way I can extract the tar 'in-place'? I don't need to keep the archive after it has been extracted, so I was wondering if it is possible to do this.

    Edit: It should be noted that the archive has already been sent, and I'd like to avoid resending via a different method.

  • Answers
  • akira
    % tar czf - stuff_to_backup | ssh backupmachine tar xvzf -
    

    This translates to:

    • Tar and compress 'stuff_to_backup' to stdout
    • Log in to 'backupmachine' via ssh
    • Run tar on the 'backupmachine' and untar the stream coming in from stdin

    Personally I would use rsync over ssh to transfer the files, because you can resume the transfer if the connection breaks:

    % rsync -ar --progress -e 'ssh' 'stuff_to_backup' user@backupmachine:/backup/
    

    which will transfer everything from 'stuff_to_backup' to the 'backup' folder on the 'backupmachine'. If the connection breaks, just repeat the command. If some files in 'stuff_to_backup' change, repeat the command; only the difference will be transferred.

  • YuppieNetworking

    If the other machine has ssh, I would recommend rsync as another alternative that does not use a tar file:

    rsync -avPz /some/dir/ user@machine:/some/other/dir/
    

    And be careful with the trailing / on the source: with it, rsync copies the contents of the directory; without it, it copies the directory itself.
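
    To make the slash behavior concrete, a minimal sketch (paths are placeholders):

    # With a trailing slash on the source, the *contents* of dir land in /some/other/dir/
    rsync -avPz /some/dir/ user@machine:/some/other/dir/
    # Without it, the directory itself is recreated: /some/other/dir/dir/...
    rsync -avPz /some/dir user@machine:/some/other/dir/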

    Edit:

    Well, I see this is now a real pickle: the archive has already been sent, so you cannot simply delete it and start over with rsync. I would probably try a selective extract and delete from the tar.

    selective extract:

    $ tar xvf googlecl-0.9.7.tar googlecl-0.9.7/README.txt
    googlecl-0.9.7/README.txt
    

    selective delete:

    $ tar --delete --file=googlecl-0.9.7.tar googlecl-0.9.7/README.txt
    

    However, it seems that you will spend a lot of time coding a script for this...
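
    For what it's worth, a minimal sketch of such a script (untested; note that tar --delete rewrites the archive for every member, so this would be painfully slow on a 70 GB archive -- test it on a dummy copy first):

    #!/bin/sh
    archive="archive.tar"
    # Snapshot the member list first, since the archive gets rewritten below.
    tar -tf "$archive" > members.txt
    while IFS= read -r member; do
        # Extract one member, then drop it from the archive to free space.
        tar -xf "$archive" "$member" \
        && tar --delete --file="$archive" "$member"
    done < members.txt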

  • Georges Dupéron

    Basically, what you need is a way to pipe the file into tar, 'lopping off' the front as you go.

    On Stack Overflow, somebody asked how to truncate a file at the front, but it seems it isn't possible. You could still fill the beginning of the file with zeroes in a special way, so that the file becomes a sparse file, but I don't know how to do this. We can truncate the end of the file, though. But tar needs to read the archive forwards, not backwards.
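
    (As an aside: on a modern Linux filesystem that supports hole punching, such as ext4 or XFS, fallocate(1) from util-linux can deallocate already-read blocks at the front without changing the file size, which produces exactly that kind of sparse file. A sketch, assuming $offset bytes have already been consumed:)

    # Punch a hole over the first $offset bytes; the file keeps its size,
    # but the underlying blocks are freed. Needs filesystem support.
    fallocate --punch-hole --offset 0 --length "$offset" archive.tar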

    Solution 1

    A level of indirection solves every problem. First reverse the file in place, then read it backwards (which amounts to reading the original file forwards), truncating the end of the reversed file as you go.

    You'll need to write a program (C, Python, whatever) to exchange the beginning and the end of the file, chunk by chunk, then pipe those chunks to tar while truncating the file one chunk at a time. This is the basis for Solution 2, which is maybe simpler to implement.
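
    The truncate-as-you-go half of that program might look like the sketch below (the in-place reversal itself is not shown, and its chunk boundaries must agree with the reader's, here full chunks measured from the end of the file):

    # Emit an already-reversed file chunk by chunk from its end backwards,
    # shrinking it as we go; the output is the original archive, forwards.
    f="reversed.tar"
    chunksize=1048576
    {
        size=$(wc -c < "$f")
        while [ "$size" -gt 0 ]; do
            # Start of the final chunk (a full chunk where possible):
            offset=$(( size > chunksize ? size - chunksize : 0 ))
            tail -c +$(( offset + 1 )) "$f"     # emit the final chunk
            truncate -s "$offset" "$f"          # then chop it off
            size=$offset
        done
    } | tar -xf -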

    Solution 2

    Another method is to split the file into small chunks in place, then delete those chunks as we extract them. The code below uses a chunk size of one megabyte; adjust it to your needs. Bigger is faster, but takes more intermediate space during splitting and extraction.

    Split the file archive.tar:

    archive="archive.tar"
    chunkprefix="chunk_"
    # 1-Mb chunks :
    chunksize=1048576
    
    totalsize=$(wc -c "$archive" | cut -d ' ' -f 1)
    currentchunk=$(((totalsize-1)/chunksize))
    while [ $currentchunk -ge 0 ]; do
        # Print current chunk number, so we know it is still running.
        echo -n "$currentchunk "
        offset=$((currentchunk*chunksize))
        # Copy end of $archive to new file
        tail -c +$((offset+1)) "$archive" > "$chunkprefix$currentchunk"
        # Chop end of $archive
        truncate -s $offset "$archive"
        currentchunk=$((currentchunk-1))
    done
    

    Pipe those files into tar (note that we need the chunkprefix variable in the second terminal):

    mkfifo fifo
    # In one terminal :
    (while true; do cat fifo; done) | tar -xf -
    # In another terminal :
    chunkprefix="chunk_"
    currentchunk=0
    while [ -e "$chunkprefix$currentchunk" ]; do
        cat "$chunkprefix$currentchunk" && rm -f "$chunkprefix$currentchunk"
        currentchunk=$((currentchunk+1))
    done > fifo
    # When second terminal has finished :
    # flush caches to disk :
    sync
    # wait 5 minutes so we're sure tar has consumed everything from the fifo.
    sleep 300
    rm fifo
    # And kill (ctrl-C) the tar command in the other terminal.
    

    Since we use a named pipe (mkfifo fifo), you don't have to pipe all the chunks at once: each time the writer loop finishes, the 'while true; do cat fifo; done' wrapper in the first terminal re-opens the fifo, so tar never sees end-of-file. This can be useful if you're really tight on space. You can follow these steps:

    • Move, say, the last 10 GB worth of chunks to another disk,
    • Start the extraction with the chunks you still have,
    • When the while [ -e … ]; do cat "$chunk…; done loop has finished (second terminal):
    • do NOT stop the tar command and do NOT remove the fifo (first terminal), but you can run sync, just in case,
    • Move some extracted files which you know are complete (i.e. tar is not stalled partway through extracting them) to another disk,
    • Move the remaining chunks back,
    • Resume extraction by running the while [ -e … ]; do cat "$chunk…; done lines again.

    Of course, this is all high-wire acrobatics: you'll want to check that everything is OK on a dummy archive first, because if you make a mistake, it's goodbye data.

    You'll never know for sure whether the first terminal (tar) has actually finished processing the contents of the fifo, so if you prefer you can run this instead, though you lose the possibility of seamlessly exchanging chunks with another disk:

    chunkprefix="chunk_"
    currentchunk=0
    while [ -e "$chunkprefix$currentchunk" ]; do
        cat "$chunkprefix$currentchunk" && rm -f "$chunkprefix$currentchunk"
        currentchunk=$((currentchunk+1))
    done | tar -xf -
    

    Disclaimer

    Note that for all this to work, your shell, tail and truncate must handle 64-bit integers correctly (you don't need a 64-bit computer or operating system for that). Mine do, but if you run the above script on a system that doesn't meet these requirements, you'll lose all the data in archive.tar.
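
    A rough way to sanity-check all three (a sketch, not an exhaustive test):

    echo $(( 1024 * 1024 * 1024 * 1024 ))   # shell arithmetic: should print 1099511627776
    truncate -s 1099511627776 bigtest       # should create a sparse 1 TiB file
    tail -c +1099511627776 bigtest | wc -c  # should print 1
    rm bigtest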

    And if anything else goes wrong along the way, you'll lose all the data in archive.tar anyway, so make sure you have a backup of your data.


  • Related Question

    unix - tar: extract, discarding directory structure
  • Benji XVI

    unzip has a nifty option -j, whereby the directory structure of the archive is discarded, and all files are extracted into the same directory.

    Is there a way of making tar work in the same way? Nothing in the man page seems to indicate so.

    So, is there an alternative, preferably Free Software, tool that will do that?


  • Related Answers
  • quack quixote

    You can do it fairly easily in two steps. Adapt as necessary:

    $ mkdir /tmp/dirtree
    $ tar xfz /path/to/archive -C /tmp/dirtree
    $ find /tmp/dirtree -type f -exec mv -i {} . \;
    $ rm -rf /tmp/dirtree
    
  • mario

    GNU tar lives on featuritis, so naturally it also has some options for that:
    http://www.gnu.org/software/tar/manual/html_node/transform.html

    If you just want to remove a few path segments, then --strip-components=n or --strip=n will often do:

     tar xvzf tgz --strip=1
    

    But it's also possible to regex-rewrite the filenames as they are extracted (the flags are --transform or --xform, and they accept extended regular expressions with the x modifier):

     tar xvzf tgz --xform='s#^[^/]+#.#x'
    

    For listing a tar you need the additional --show-transformed-names option:

     tar tvzf tgz --show-transformed-names --strip=1 --xform='s/abc/xyz/x'
    

    I believe the rewriting options also work for packing, not just for extracting. But pax obviously has a nicer syntax.
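
    For instance, a minimal sketch of rewriting at pack time (prefix/ is just an illustrative path):

     tar cvzf out.tgz --transform='s,^,prefix/,' somedir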

  • hfs

    pax can do it:

    pax -v -r -s '/.*\///p' < archive.tar
    

    or

    zcat archive.tar.gz | pax -v -r -s '/.*\///p'
    

    The substitution strips everything up to and including the last /; you can check the name replacement first by omitting the -r option, which just lists the resulting names.

  • DaveParillo

    A possible solution that doesn't require installing anything.

    1. Use tar tvf to list all the files in the tarball
    2. Extract those files individually: have tar extract each one to stdout and redirect it to its basename

      # List the archive, skip directory entries, and reassemble the filename
      # from fields 6..NF of the verbose listing (note: runs of consecutive
      # spaces in filenames will be collapsed), then extract each member to
      # stdout and redirect it to its basename in the current directory.
      tar -tvf "$1" | grep -v "^d" | \
                    awk '{for(i=6;i<=NF;i++) {printf "%s ",$i};print ""}' | \
                    while read filename
                    do
                       tar -O -xf "$1" "$filename" > "$(basename "$filename")"
                    done
      

    Save as extract.sh and run it as extract.sh myfile.tar. Note that it will silently overwrite any duplicate filenames encountered across directories in the tarball.