linux - How to search a text file for strings between two tokens in Ubuntu terminal and save the output?

08
2014-07
  • Blue

    How can I search a text file for this pattern in Ubuntu terminal and save the output as a text file?

    I'm looking for everything between the string "abc" and the string "cde" in a long list of data.

    For example:

    blah blah abc fkdljgn cde blah
    blah blah blah blah blah abc skdjfn cde blah
    

    In the example above I would be looking for an output such as this:

    fkdljgn
    skdjfn
    

    It is important that I can also save the data output as a text file.

    Can I use grep or agrep and if so, what is the format?

  • Answers
  • terdon

    To get the output you show, you could run

    grep -Po 'abc \K.*(?= cde)'  file.txt > outfile.txt
    

    The -P activates Perl Compatible Regular Expressions, which support lookarounds and \K, which means "discard anything matched up to this point". The -o causes grep to print only the matched portion of the line so, combined with the positive lookahead (?= cde) and the \K, it will print only the characters between the abc and cde. The > outfile.txt will save the result in the file outfile.txt.
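
    As a quick check, the command can be tried on the two sample lines from the question. This is only a sketch: sample.txt and extracted.txt are placeholder names, and it assumes a GNU grep built with PCRE support (needed for -P).

```shell
# Recreate the two example lines from the question (sample.txt is a placeholder name).
printf '%s\n' \
  'blah blah abc fkdljgn cde blah' \
  'blah blah blah blah blah abc skdjfn cde blah' > sample.txt

# -P enables PCRE, -o prints only the matched text.
# \K drops "abc " from the match; (?= cde) requires " cde" without consuming it.
grep -Po 'abc \K.*(?= cde)' sample.txt > extracted.txt

cat extracted.txt
```

    With the sample lines above, extracted.txt should contain fkdljgn and skdjfn, one per line.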

    Some other approaches:

    • sed

      sed -r 's/.*abc (.+) cde.*/\1/' file.txt > outfile.txt
      

      Here, the parentheses capture the pattern and you can then refer to it as \1. The 's/source/replacement/' is the substitution operator and it replaces source with replacement. In this case, it will simply delete everything except whatever is between abc and cde.

    • perl

      perl -pe 's/.*abc (.+) cde.*/$1/' file.txt > outfile.txt
      

      Same as above really. The -p means "read the input file line by line, apply the script given as -e and print each line".

    • awk

       awk -F'abc|cde' '{print $2}' file.txt > outfile.txt
      

      The idea here is to set the field delimiters to either abc or cde. Assuming these strings are unique in each line, the 2nd field will be the one between the two. This, however, includes leading and trailing spaces. To remove them, pass the output through another awk:

      awk -F'abc|cde' '{print $2}' file | awk '{print $1}'
      
    • GNU awk (gawk). The above works perfectly in gawk as well; I am including this in case you want to do something more complex and need to be able to capture patterns.

      gawk '{print gensub(/.*abc (.*) cde.*/,"\\1", "g",$0);}' file.txt > outfile.txt
      

      This is the same basic idea as the perl and sed ones but using gawk's gensub() function.
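
    One thing to keep in mind when choosing between these: the sed, perl and gawk substitutions print lines that do not contain both tokens unchanged, and the awk version prints an empty or partial field for them, whereas grep -o simply skips them. A minimal sketch of restricting the sed variant to matching lines, assuming a sed that accepts -E (GNU and BSD sed both do) and placeholder file names:

```shell
# Sample input: the middle line contains neither token (mixed.txt is a placeholder name).
printf '%s\n' \
  'blah blah abc fkdljgn cde blah' \
  'a line without either token' \
  'blah abc skdjfn cde blah' > mixed.txt

# -n suppresses sed's automatic printing; the trailing p prints a line
# only when the substitution actually happened on it.
sed -nE 's/.*abc (.+) cde.*/\1/p' mixed.txt > mixed_out.txt

cat mixed_out.txt
```

    Here the line without tokens is dropped, so only fkdljgn and skdjfn end up in mixed_out.txt.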

  • Professor FartSparkle

    You want to use a regular expression for that. I'm not that experienced with UNIX regex, but something like this should work:

    grep -Po '(?<=abc ).*(?= cde)' test.txt > output.txt

    Edit: The syntax error came from missing quotes. The old suggestion didn't work, though; what you want is (?<=xxx), which is called a zero-width look-behind assertion (without the < it is a look-ahead). -P activates Perl-style regex and -o prints only the matches.

    I tried this and it works fine with a text file containing abc mymatch cde.
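
    Note that .* is greedy, so if a line holds more than one abc ... cde pair, the match runs from the first abc to the last cde. A hedged sketch of the non-greedy variant .*?, which stops at the nearest cde instead (pairs.txt is a made-up sample file, and GNU grep with -P support is assumed):

```shell
# One line containing two token pairs (pairs.txt is a placeholder name).
printf '%s\n' 'x abc one cde y abc two cde z' > pairs.txt

# Greedy: a single match spanning from the first "abc " to the last " cde".
grep -Po '(?<=abc ).*(?= cde)' pairs.txt > greedy.txt

# Non-greedy: .*? stops at the nearest " cde", so each pair is reported separately.
grep -Po '(?<=abc ).*?(?= cde)' pairs.txt > lazy.txt

cat greedy.txt lazy.txt
```

    The greedy run yields the single line "one cde y abc two", while the non-greedy run yields "one" and "two" on separate lines.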


  • Related Question

    linux - Under *nix, how can I find a string within a file within a directory?
  • Questioner

    I'm using Ubuntu Linux, and I use bash from within a terminal emulator every day for many tasks.

    I would like to know how to find a string or a substring within a file that is within a particular directory.

    If I knew the file that contained my target substring, I would just cat the file and pipe it through grep, thus:

    cat file | grep mysubstring
    

    But in this case, the pesky substring could be anywhere within a known directory.

    How do I hunt down my substring?


  • Related Answers
  • outis

    Use a shell wildcard:

    grep mysubstring *
    

    If you want to search subdirectories, use the -r option to recurse into them:

    grep -r mysubstring .
    
  • Marcelo Cantos
    find -type f | xargs grep mysubstring
    

    These commands (find, xargs and grep) have lots of options, so you can tune this operation substantially.
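
    As one example of such tuning: plain xargs splits its input on whitespace, so file names containing spaces break the pipeline. A sketch using NUL-separated names, assuming a find and xargs that support -print0 and -0 (GNU and BSD both do); the demo_tree directory and its file names are invented for illustration:

```shell
# Build a tiny tree with a space in one file name (all names are placeholders).
mkdir -p demo_tree/sub
printf 'contains mysubstring here\n' > 'demo_tree/sub/my notes.txt'
printf 'nothing relevant\n' > demo_tree/other.txt

# -print0 / -0 pass file names NUL-terminated, so spaces are handled safely.
# -l lists only the names of files that contain a match.
find demo_tree -type f -print0 | xargs -0 grep -l mysubstring > found.txt

cat found.txt
```

    Only the file with the space in its name contains the substring, so found.txt should list demo_tree/sub/my notes.txt as a single entry.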

  • Zhang Yining

    Say I want to find all the Python code files that contain the text "wiki" under the directory "~/projects". Here is the script:

    grep -lir "wiki" ~/projects/**/*.py
    

    Adjust the script to your specific requirements.
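
    One caveat: in bash, ** only descends into subdirectories when the globstar option is enabled; otherwise it behaves like a single *. A sketch of an alternative that lets grep do both the recursion and the file-name filtering, assuming a grep that supports --include (GNU grep does); py_demo is a stand-in for ~/projects:

```shell
# A stand-in for ~/projects (py_demo and the file names are placeholders).
mkdir -p py_demo/a/b
printf 'import Wiki\n' > py_demo/a/b/tool.py
printf 'wiki notes\n'  > py_demo/a/readme.txt
printf 'print("hi")\n' > py_demo/main.py

# -r recurses, -i ignores case, -l lists matching file names,
# --include limits the search to files whose name matches the glob.
grep -rli --include='*.py' wiki py_demo > py_hits.txt

cat py_hits.txt
```

    Only py_demo/a/b/tool.py is listed: readme.txt matches the text but not the *.py filter, and main.py does not contain the text.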

  • Andy Lester

    No matter how you do it, don't cat files into grep. Your original version of

    cat file | grep mysubstring
    

    is more correctly done as

    grep mysubstring file
    
  • geek

    If you don't need to do it in batch mode, you could install midnight commander (mc), it can search for strings in files.

  • Sverre Marvik

    If you want to find all files containing a certain string, searching recursively from the current directory, use:

    find . -type f -exec grep -l mysubstring {} \;
    

    (should work on most *nix systems)
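
    The \; form starts one grep process per file. A sketch of the {} + form, which POSIX find also specifies and which passes many file names to each grep invocation; exec_demo and its files are invented for illustration:

```shell
# Small test tree (exec_demo and the file names are placeholders).
mkdir -p exec_demo/deep
printf 'mysubstring is in here\n' > exec_demo/deep/hit.txt
printf 'no match in this one\n'   > exec_demo/miss.txt

# {} + batches file names into as few grep invocations as possible.
find exec_demo -type f -exec grep -l mysubstring {} + > exec_hits.txt

cat exec_hits.txt
```

    The result is the same list of matching files as with \;, here just exec_demo/deep/hit.txt, but with far fewer processes on large trees.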