linux - How to Remove the Last 2 Lines of a Very Large File

04
2013-09
  • Russ

    I have a very large file ~400G, and I need to remove the last 2 lines from it. I tried to use sed, but it ran for hours before I gave up. Is there a quick way of doing this, or am I stuck with sed?

  • Answers
  • Dennis Williamson

    I haven't tried this on a large file to see how fast it is, but it should be fairly quick.

    To use the script to remove lines from the end of a file:

    ./shorten.py 2 large_file.txt
    

    It seeks to the end of the file, checks to make sure the last character is a newline, then reads each character one at a time going backwards until it's found three newlines and truncates the file just after that point. The change is made in place.

    Edit: I've added a Python 2.4 version at the bottom.

    Here is a version for Python 2.5/2.6:

    #!/usr/bin/env python2.5
    from __future__ import with_statement
    # also tested with Python 2.6
    
    import os, sys
    
    if len(sys.argv) != 3:
        print sys.argv[0] + ": Invalid number of arguments."
        print "Usage: " + sys.argv[0] + " linecount filename"
        print "to remove linecount lines from the end of the file"
        exit(2)
    
    number = int(sys.argv[1])
    file = sys.argv[2]
    count = 0
    
    with open(file,'r+b') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        while f.tell() > 0:
            f.seek(-1, os.SEEK_CUR)
            char = f.read(1)
            if char != '\n' and f.tell() == end:
                print "No change: file does not end with a newline"
                exit(1)
            if char == '\n':
                count += 1
            if count == number + 1:
                f.truncate()
                print "Removed " + str(number) + " lines from end of file"
                exit(0)
            f.seek(-1, os.SEEK_CUR)
    
    if count < number + 1:
        print "No change: requested removal would leave empty file"
        exit(3)
    

    Here's a Python 3 version:

    #!/usr/bin/env python3.0
    
    import os, sys
    
    if len(sys.argv) != 3:
        print(sys.argv[0] + ": Invalid number of arguments.")
        print ("Usage: " + sys.argv[0] + " linecount filename")
        print ("to remove linecount lines from the end of the file")
        exit(2)
    
    number = int(sys.argv[1])
    file = sys.argv[2]
    count = 0
    
    with open(file,'r+b', buffering=0) as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        while f.tell() > 0:
            f.seek(-1, os.SEEK_CUR)
            char = f.read(1)
            if char != b'\n' and f.tell() == end:
                print ("No change: file does not end with a newline")
                exit(1)
            if char == b'\n':
                count += 1
            if count == number + 1:
                f.truncate()
                print ("Removed " + str(number) + " lines from end of file")
                exit(0)
            f.seek(-1, os.SEEK_CUR)
    
    if count < number + 1:
        print("No change: requested removal would leave empty file")
        exit(3)
    

    Here is a Python 2.4 version:

    #!/usr/bin/env python2.4
    
    import sys
    
    if len(sys.argv) != 3:
        print sys.argv[0] + ": Invalid number of arguments."
        print "Usage: " + sys.argv[0] + " linecount filename"
        print "to remove linecount lines from the end of the file"
        sys.exit(2)
    
    number = int(sys.argv[1])
    file = sys.argv[2]
    count = 0
    SEEK_CUR = 1
    SEEK_END = 2
    
    f = open(file,'r+b')
    f.seek(0, SEEK_END)
    end = f.tell()
    
    while f.tell() > 0:
        f.seek(-1, SEEK_CUR)
        char = f.read(1)
        if char != '\n' and f.tell() == end:
            print "No change: file does not end with a newline"
            f.close()
            sys.exit(1)
        if char == '\n':
            count += 1
        if count == number + 1:
            f.truncate()
            print "Removed " + str(number) + " lines from end of file"
            f.close()
            sys.exit(0)
        f.seek(-1, SEEK_CUR)
    
    if count < number + 1:
        print "No change: requested removal would leave empty file"
        f.close()
        sys.exit(3)
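
    A possible speed-up of the same idea, offered as a minimal sketch rather than a tested replacement for the scripts above: read the file backwards in blocks instead of one byte at a time. The function name truncate_last_lines and the block_size parameter are illustrative assumptions, not part of the original answer, and the code assumes Python 3.

```python
import os


def truncate_last_lines(path, number, block_size=64 * 1024):
    """Remove the last `number` lines of `path` in place.

    Returns the new file size, or None if the file was left unchanged
    (empty file, no trailing newline, or too few lines).
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        if pos == 0:
            return None
        # Same precondition as the scripts above: the file must end
        # with a newline.
        f.seek(pos - 1)
        if f.read(1) != b"\n":
            return None
        needed = number + 1  # newlines to find, counting the final one
        seen = 0
        while pos > 0:
            start = max(0, pos - block_size)
            f.seek(start)
            block = f.read(pos - start)
            # Walk this block from its end towards its start.
            idx = len(block)
            while True:
                idx = block.rfind(b"\n", 0, idx)
                if idx == -1:
                    break
                seen += 1
                if seen == needed:
                    # Keep everything up to and including this newline.
                    new_size = start + idx + 1
                    f.truncate(new_size)
                    return new_size
            pos = start
    return None
```

    Like the original scripts, this sketch refuses to act if the removal would leave an empty file. It only ever reads the tail of the file, so the cost depends on the length of the removed lines, not on the 400G total.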
    
  • user31894

    You can try GNU head, which prints everything except the last two lines (redirect the output to a new file):

    head -n -2 file
    
  • Zac Thompson

    The problem with sed is that it is a stream editor -- it will process the entire file even if you only want to make modifications near the end. So no matter what, you are creating a new 400GB file, line by line. Any editor that operates on the whole file will probably have this problem.

    If you know the number of lines, you could use head, but again this creates a new file instead of altering the existing one in place. You might get speed gains from the simplicity of the action, I guess.

    You might have better luck using split to break the file into smaller pieces, editing the last one, and then using cat to combine them again, but I'm not sure if it will be any better. I would use byte counts rather than lines, otherwise it will probably be no faster at all -- you're still going to be creating a new 400GB file.

  • timday

    I see my Debian Squeeze/testing systems (but not Lenny/stable) include a "truncate" command as part of the "coreutils" package.

    With it you could simply do something like

    truncate --size=-160 myfile
    

    to remove 160 bytes from the end of the file (obviously you need to figure out exactly how many characters you need to remove).
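
    One way to work out that byte count without reading the whole file, as a rough sketch: inspect only the tail. The helper name bytes_in_last_lines and the window parameter are made up for illustration, and the code assumes Python 3, a file that ends with a newline, and last lines that fit inside the window.

```python
import os


def bytes_in_last_lines(path, number=2, window=1024 * 1024):
    """Return how many bytes the last `number` lines occupy, newlines included.

    Only the final `window` bytes of the file are read. Raises ValueError
    if the file does not end with a newline or if the window does not
    contain enough lines.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - window))
        tail = f.read()
    if not tail.endswith(b"\n"):
        raise ValueError("file does not end with a newline")
    idx = len(tail) - 1  # index of the final newline
    for _ in range(number):
        # Step back to the newline that ends the previous line.
        idx = tail.rfind(b"\n", 0, idx)
        if idx == -1:
            raise ValueError("not enough lines inside the window")
    # Everything after that newline belongs to the lines being dropped.
    return len(tail) - (idx + 1)
```

    The returned number is what you would pass, with a minus sign, to truncate --size.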

  • leeand00

    Try Vim. I'm not sure whether it will do the trick, as I've never used it on such a big file, but I've used it on smaller large files in the past. Give it a try.

  • Blackbeagle

    What kind of file is it, and in what format? It may be easier to use something like Perl, depending on what kind of file it is: text, graphics, binary? How is it formatted: CSV, TSV...?

  • timday

    If you know the size of the file to the byte (400000000160 say) and you know that you need to remove exactly 160 characters to strip the last two lines, then something like

    dd if=originalfile of=truncatedfile ibs=1 count=400000000000
    

    should do the trick. It's been ages since I used dd in anger though; I seem to remember things go faster if you use a bigger block size, but whether you can do that depends on whether the lines you want to drop are at a nice multiple.

    dd has some other options to pad text records out to a fixed size which might be useful as a preliminary pass.

  • timday

    If the "truncate" command isn't available on your system (see my other answer), look at "man 2 truncate" for the system call that truncates a file to a specified length.

    Obviously you need to know how many characters you need to truncate the file to (size minus the length of the problem two lines; don't forget to count any cr/lf characters).

    And make a backup of the file before you try this!
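
    If Python 3.3 or later is available, os.truncate is a thin wrapper over that system call on Unix. A minimal sketch, assuming you already know how many bytes to drop (the helper name drop_tail_bytes is hypothetical):

```python
import os


def drop_tail_bytes(path, nbytes):
    """Shorten `path` by `nbytes` bytes in place and return the new size."""
    size = os.path.getsize(path)
    if not 0 <= nbytes <= size:
        raise ValueError("nbytes must be between 0 and the file size")
    new_size = size - nbytes
    # Calls truncate(2) on Unix: the file is cut in place, nothing is copied.
    os.truncate(path, new_size)
    return new_size
```

    For the example above, drop_tail_bytes("myfile", 160) would cut the last 160 bytes. The truncation cannot be undone, so the advice about a backup still applies.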

  • Justin Smith

    #!/bin/sh
    
    ed "$1" << HERE
    $
    d
    d
    w
    HERE
    

    Changes are made in place. This is simpler and more efficient than the Python script.

  • tponthieux

    Modified the accepted answer to solve a similar problem. Could be tweaked a little bit to remove n lines.

    import os
    
    def clean_up_last_line(file_path):
        """
        cleanup last incomplete line from a file
        helps with an unclean shutdown of a program that appends to a file
        if \n is not the last character, remove the line
        """
        with open(file_path, 'r+b') as f:
            f.seek(0, os.SEEK_END)
    
            while f.tell() > 0: ## current position is greater than zero
                f.seek(-1, os.SEEK_CUR)
    
                if f.read(1) == b'\n':  ## bytes literal: the file is opened in binary mode
                    f.truncate()
                    break
    
                f.seek(-1, os.SEEK_CUR) ## read(1) moved the position forward one byte, so step back again before the next iteration
    

    And the corresponding test:

    import os
    import unittest

    import utils  ## module that holds clean_up_last_line (name taken from the call below)
    
    class CommonUtilsTest(unittest.TestCase):
    
        def test_clean_up_last_line(self):
            """
            remove the last incomplete line from a huge file
            a line is incomplete if it does not end with a line feed
            """
            file_path = '/tmp/test_remove_last_line.txt'
    
            def compare_output(file_path, file_data, expected_output):
                """
                run the same test on each input output pair
                """
                with open(file_path, 'w') as f:
                    f.write(file_data)
    
                utils.clean_up_last_line(file_path)
    
                with open(file_path, 'r') as f:
                    file_data = f.read()
                    self.assertTrue(file_data == expected_output, file_data)        
    
            ## test a multiline file
            file_data = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b
    1362358458954466,2013-03-03 16:54:18,34.5,3.0,b
    1362358630923094,2013-03-03 16:57:10,34.5,50.0,b
    136235"""
    
            expected_output = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b
    1362358458954466,2013-03-03 16:54:18,34.5,3.0,b
    1362358630923094,2013-03-03 16:57:10,34.5,50.0,b
    """        
            compare_output(file_path, file_data, expected_output)
    
            ## test a file with no line break
            file_data = u"""1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"""
            expected_output = "1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"
            compare_output(file_path, file_data, expected_output)
    
            ## test a file with a leading line break
            file_data = u"""\n1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"""
            expected_output = "\n"
            compare_output(file_path, file_data, expected_output)
    
            ## test a file with one line break
            file_data = u"""1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b\n""" 
            expected_output = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b\n""" 
            compare_output(file_path, file_data, expected_output)
    
            os.remove(file_path)
    
    
    if __name__ == '__main__':
        unittest.main()
    

  • Related Question

    linux - Removing newlines from an RTF file using sed
  • Spidey

    I have an RTF file which is formatted like so:

        Lorem ipsum dolor sit amet, consectetur adipiscing elit.\par
    Nullam vitae sem porttitor urna pellentesque gravida. Nulla\par
    consequat purus vel est vehicula porttitor.\par
        Maecenas pharetra metus in enim sollicitudin sollicitudin.\par
    Etiam et odio tellus, eget placerat enim. Aliquam sem purus,\par
    gravida sed feugiat eget, consectetur quis nisl.\par
    

    (\par added for brevity)

    As you can see, newlines have been inserted to fit a page's width. The problem arises when I try to read the text on my iPhone, which has a different line length. The lines break and readability is hindered.

    The ideal solution would be one that converts the file to a single line for each paragraph, while keeping the newline and indent for new paragraphs.

    So far I've tried parsing the file with sed but was unable to create a multiline regex. Ideally, I want to replace all "\r\n"s with " ", unless the next line begins with a space.

    Is there a better solution for this? If not, how can I do it using sed?


  • Related Answers
  • Peter Boughton

    This regex will match what you want:

    \r\n(?! )
    


    So to use that with sed:

    sed 's/\r\n(?! )/ /g' filename.rtf
    


    Except, it appears that sed doesn't support negative lookahead, and requires backslashed parens, so you can instead use:

    sed 's/\r\n\([^ ]\)/ \1/g' filename.rtf
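
    For comparison, Python's re module does support negative lookahead, so the first regex can be used unchanged there. A small sketch, assuming the whole file fits in memory and uses CRLF line endings (the function name join_wrapped_lines is illustrative):

```python
import re


def join_wrapped_lines(text):
    """Replace each CRLF with a space unless the next line starts with a space."""
    return re.sub(r"\r\n(?! )", " ", text)
```

    When reading the file in Python 3, open it with newline="" so the \r\n pairs are not translated before the regex sees them. Note that a CRLF at the very end of the text is also replaced, because nothing follows it.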
    
  • Spidey

    The solution lay in a tool I hadn't given serious thought to: awk

    awk 'BEGIN { FS="\\\\par" } ; /^    / {print "\\par" $1} /^[^ ]/ {print " " $1}'
    

    This will go over the file, with \par as the field separator, and will print a \par before any line that starts with 4 spaces (which marks the beginning of a new paragraph), and remove (or simply won't print) it when it starts with anything but a space.

    Now what we have is a file with \par only where legal line breaks should be. The next step would be to remove all newlines altogether, to get rid of rogue line breaks:

    tr -d '\r\n'
    

    And then feed the result to sed to replace \par with \par\r\n, practically adding a newline where a \par is.

    sed 's/\\par/\\par\r\n/g'
    

    And done.
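
    The three steps above can also be sketched as a single Python function. This is an approximation written for illustration, not the commands that were actually run; the name reflow_paragraphs is made up, and it assumes the same layout: new paragraphs start with four spaces and every line ends in \par.

```python
def reflow_paragraphs(text):
    r"""Join hard-wrapped lines so that \par appears only at paragraph starts."""
    pieces = []
    for line in text.splitlines():
        body = line.split("\\par")[0]  # awk's $1 with \par as the separator
        if line.startswith("    "):
            pieces.append("\\par" + body)  # new paragraph: mark it with \par
        elif line and not line.startswith(" "):
            pieces.append(" " + body)  # continuation line: join with a space
        # Any other line (empty, or 1-3 leading spaces) is dropped, as in awk.
    joined = "".join(pieces)  # the tr -d '\r\n' step
    return joined.replace("\\par", "\\par\r\n")  # the final sed step
```

    As with the shell pipeline, lines matching neither pattern are dropped, so the RTF header would need the same manual repair.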

    The only real issue I've found with this method is that it ruined the RTF header. No problem, I just copied over the header from the original file.

    Another smaller issue was that chapter titles were being printed inline with previous paragraphs. This is because chapter titles do not start with a space yet should be considered a paragraph. In my case, chapters were marked like so:

    CHAPTER THIRTY-TWO
    Chapter's Name

    So a quick sed took care of them:

    sed 's/\s*\(CHAPTER [[:upper:]-]* \)\(.*\\par\)/\\par\r\n\\par\r\n\\par\r\n\1\\par\r\n\2\\par\r\n/'
    

    I now have my book in proper format, which makes it readable on other devices (such as my iPod).