grep - How to remove lines from large text file using bash

2014-04
  • forestclown

    I have a huge text file (a log file) on my CentOS system from which I would like to remove the top part, probably a couple of thousand lines each day. (Or perhaps just split it into two.)

    I have searched this site and found that most answers use grep or sed to remove the lines but output to another file. Is it possible, using a shell script (bash), to update the file in place instead of:

    sed ... currentfile > newfile
    mv newfile currentfile
    

    Thanks!

  • Answers
  • Eroen

    sed --in-place "$filter" "$file"
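
    For example, to drop the first 2,000 lines of a log in place (the filename and line count here are illustrative):

    sed --in-place '1,2000d' logfile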

  • jfgagne

    There is no simple way to remove lines from the beginning of a file!

    Even when using sed -i, you create a new file, as shown by the following commands (> is my prompt):

    > echo "Helo World" > toto
    > ls -i toto
    147543 toto
    > sed -i -e 's/Helo/Hello/' toto
    > ls -i toto
    147292 toto
    

    Notice that the inode number is not the same. This means that you create a new file with the same name, not that you modify the file in place.

    This is important if your log file is held open by a program while you perform this operation. If it is, sed will create a new file while the program holding the original keeps writing to the old one. To show this, let's try the following:

    for f in $(seq 1 100); do date; echo $f; sleep 1; done > file1&
    ln file1 file2
    sleep 5
    sed -i -e '1,10d' file1
    ls -l file1 file2
    sleep 5
    ls -l file1 file2
    

    The second ls will show the same size for file1 and a growing size for file2. If I had not done a ln before executing sed, the original file would have kept growing without being accessible via the file system hierarchy. This would result in space being used on disk as shown by df but not shown by du.
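
    One way to spot such deleted-but-open files, assuming lsof is available on the system:

    # list open files whose link count is zero (unlinked but still written to)
    lsof +L1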

    Log rotation is your friend here, but it cannot be done without help from the logging program. There should be a way to tell the program to close and reopen its log file so that the new file is used, but lines written between the start of the sed and the reopening of the file could be lost. If you do not want to lose logs, you can copy the file first, ask the program to reopen its log file, and then modify the copied file. This is what logrotate allows you to do with minimal scripting, as sketched below.
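
    A minimal logrotate configuration in that spirit, assuming a hypothetical log at /var/log/myapp.log; copytruncate copies the log and then truncates the original in place, so the program can keep writing to the same open file (a few lines can still slip through between the copy and the truncate):

    /var/log/myapp.log {
        daily
        rotate 7
        compress
        copytruncate
    }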

    You can read more on this subject in the log rotation documentation for Apache 1.3, Apache 2.4, and BIND 9.

  • technosaurus

    I have a huge text file (a log file) on my CentOS system from which I would like to remove the top part

    You can use tail to generate a new file containing only the last N lines:

    # keep only the last N lines (1000 here is illustrative)
    tail -n 1000 logfile > newlogfile
    # archive the old log compressed, then replace it with the trimmed copy
    gzip -c logfile > "$(date +%Y%m%d)logfile.gz" && mv -f newlogfile logfile
    

    , probably a couple of thousand lines each day. (Or perhaps just split it into two.)

    You can get the number of lines in the file with:

    NUMLINES=$(awk 'END{print NR}' logfile)
    #do some integer math and split with head and tail
    
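    For instance, a sketch of the split-in-two idea; the part-file names are illustrative:

    NUMLINES=$(awk 'END{print NR}' logfile)
    HALF=$((NUMLINES / 2))
    # first half to one file, the remainder to the other
    head -n "$HALF" logfile > logfile.part1
    tail -n "$((NUMLINES - HALF))" logfile > logfile.part2
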

    I have searched this site and found that most answers use grep or sed to remove the lines but output to another file. Is it possible, using a shell script (bash), to update the file in place?

    Yes, you can use sed to delete the first n lines in place:

    #remove the first 10 lines
    sed -i '1,10d' logfile
    
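    Tying the two pieces together, a sketch that trims the file in place down to its last 1000 lines (the count is illustrative):

    NUMLINES=$(awk 'END{print NR}' logfile)
    # delete everything before the last 1000 lines, if the file is longer than that
    [ "$NUMLINES" -gt 1000 ] && sed -i "1,$((NUMLINES - 1000))d" logfile
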
  • Kaz

    Set up a cron job to rotate the log? Hmm?

    http://linuxcommand.org/man_pages/logrotate8.html
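
    Most distributions already run logrotate daily from cron; if yours does not, a crontab entry along these lines would do (the logrotate path is the common default, but verify it on your system):

    # run logrotate once a day at 00:30
    30 0 * * * /usr/sbin/logrotate /etc/logrotate.conf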


  • Related Question

    sed - grepping a substring from a grep result
  • user17245

    Given a log file, I will usually do something like this:

    grep 'marker-1234' filter_log
    

    What is the difference in using '' or "" or nothing in the pattern?

    The above grep command will yield many thousands of lines, which is what I desire. Within those lines, there is usually one chunk of data I am after. Sometimes I use awk to print out the fields I need, but in this case the log format changes, so I can't rely on position exclusively; not to mention that the actual logged data can push the position forward.

    To make this understandable, let's say the log line contained an IP address, and that was all I was after, so I can later pipe it to sort and uniq and get some tally counts.

    An example may be:

    2010-04-08 some logged data, indeterminate chars - [marker-1234] (123.123.123.123) from: sender@example.com to recipient@example.com [stat-xyz9876]
    

    The first grep command will give me many thousands of lines like the above; from there, I want to pipe it to something, probably sed, which can pull out a pattern within and print only the pattern.

    For this example, extracting the IP address would suffice. I tried. Is sed not able to understand [0-9]{1,3}\. as a pattern? I had to use [0-9][0-9][0-9]\., which yielded strange results until the entire pattern was spelled out.

    This is not specific to an IP address, the pattern will change, but I can use that as a learning template.

    Thank you all.


  • Related Answers
  • Chris S

    I don't know what OS you're on, but on FreeBSD 7.0+ grep has a -o option to return only the part that matches the pattern. So you could:

    grep "marker-1234" filter_log | grep -oE "[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

    This returns a list of just the IP addresses from filter_log...

    This works on my system, but again, I don't know what your version of grep supports.

  • user31894

    You can do all of this in just one awk command; no need for any other tools:

    $ awk '/marker-1234/{for(o=1;o<=NF;o++){if($o~/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)print $o }  }' file
    (123.123.123.123)
    
  • Dennis Williamson

    You can shorten the second grep a little like this:

    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}'
    
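    Since the goal was tally counts, the whole pipeline might look like this sketch; sort | uniq -c | sort -rn counts each distinct address and lists the most frequent first:

    grep 'marker-1234' filter_log | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c | sort -rn
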

    To answer your first question, double quotes allow the shell to do various things like variable expansion, but protect some metacharacters from needing to be escaped. Single quotes prevent the shell from doing those expansions. Using no quotes leaves things wide open.

    $ empty=""
    $ text1="some words"
    $ grep $empty some_file
    (It seems to hang, but it's just waiting for input since it thinks "some_file" is 
    the pattern and no filename was entered, so it thinks input is supposed to come
    from standard input. Press Ctrl-d to end it.)
    $ grep "$empty" some_file
    (The whole file is shown since a null pattern matches everything.)
    $ grep $text1 some_file
    grep: words: No such file or directory
    some_file:something
    some_file:some words
    (It sees the contents of the variable as two words, the first is seen as the 
    pattern, the second as one file and the filename as a second file.)
    $ grep "$text1" some_file
    some_file:some words
    (Expected results.)
    $ grep '$text1' some_file
    (No results. The variable isn't expanded and the file doesn't contain a
    string that consists of literally those characters (a dollar sign followed
    by "text1"))
    

    You can learn more in the "QUOTING" section of man bash.

  • Josh K

    Look up the xargs command. You should be able to do something like:

    grep 'marker-1234' filter_log | xargs grep "(" | cut -c1-15

    This may not be it exactly, but xargs is the command you want to use.