linux - How do you remove all occurrences of values in one list from another list?

06
2013-09
  • barrrista

    I have a list of symbols such as...

    wer
    sfe
    efo
    

    How do I remove all instances of these (unique) symbols from another list of (non-unique) symbols?

    So in the following list, both lines starting with wer and the one line starting with sfe would be removed:

    wer-alskjdfi
    efr-4siosejf
    rte-alskjdfs
    wer-alskjsef
    sfe-ooskjdfi
    

    Every other line should be untouched, with the symbol and characters after "-" remaining:

    efr-4siosejf
    rte-alskjdfs
    

    I need to do this using sed/awk/grep/bash or other command line tools. I know how to write a sed command to search and remove one value at a time, but how do I do this for 100k+ values?

  • Answers
  • Scott

    What if file 2 has characters after each of those symbols?  I want to do the same but keep the trailing characters.

    OK, make a copy of file2 that has only the field that you want to filter on.  And, if the current file2 has the “non-unique symbol” immediately followed by the “trailing characters” (e.g., efr-42, rte-17, etc.), make another copy of file2 where they are space-separated.  Here are example commands based on the example data you provided:

    sed 's/\(...\).*/\1/'        file2.sorted > file2.symbol_only
    sed 's/\(...\)\(.*\)/\1 \2/' file2.sorted > file2.separated
    

    or

    sed 's/\([^-]*\)-.*/\1/'        file2.sorted > file2.symbol_only
    sed 's/\([^-]*\)\(-.*\)/\1 \2/' file2.sorted > file2.separated
    

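    … based on the new data that you added to your question.  Then use comm as before, collapsing repeated symbols with uniq so that join does not pair every repeated symbol with every matching line:

    comm -13 file1.sorted file2.symbol_only | uniq > file2.no_match
    

    … and join the symbols up with the trailing characters: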

    join file2.no_match file2.separated
    

    If necessary, use another sed to remove the spaces you added.
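
    As a rough end-to-end sketch of the steps above, run on the sample data from the question: the scratch directory, the LC_ALL=C setting, and the final sed that deletes the added space are assumptions made for this illustration. uniq keeps join from emitting one line for every pairing of a repeated symbol.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' wer sfe efo > file1
printf '%s\n' wer-alskjdfi efr-4siosejf rte-alskjdfs wer-alskjsef sfe-ooskjdfi > file2

# Assumption: LC_ALL=C so that sort, comm and join agree on one ordering.
export LC_ALL=C
sort file1 file1 > file1.sorted   # two copies, because "wer" occurs twice in file2
sort file2 > file2.sorted

# Split each line into "symbol" and "symbol -rest".
sed 's/\([^-]*\)-.*/\1/'        file2.sorted > file2.symbol_only
sed 's/\([^-]*\)\(-.*\)/\1 \2/' file2.sorted > file2.separated

# Symbols found only in file2, each listed once.
comm -13 file1.sorted file2.symbol_only | uniq > file2.no_match

# Re-attach the trailing characters, then delete the space that was added.
result=$(join file2.no_match file2.separated | sed 's/ //')
printf '%s\n' "$result"
```

    With this input the two surviving lines are efr-4siosejf and rte-alskjdfs.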


    It occurs to me that you could build on this trick to get the output file back into file2’s original order.

    1. Produce a copy of the original file2 with line numbers.
    2. Shuffle the line numbers to the right of the symbols.
    3. (the above, starting with the sort commands)
    4. Sort the output on the original line number.
    5. Strip out the line numbers.
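
    The five steps can be sketched roughly as follows. This is a variation rather than a literal transcript: it builds the numbered copy with awk, deduplicates the symbols with uniq so a single copy of file1 is enough, and assumes the lines contain no whitespace. Two of the question's sample lines are swapped here so that the original order differs from sorted order.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: one collation order for sort, comm and join

printf '%s\n' wer sfe efo > file1
# Sample lines from the question, with rte and efr swapped for illustration.
printf '%s\n' wer-alskjdfi rte-alskjdfs efr-4siosejf wer-alskjsef sfe-ooskjdfi > file2

# Steps 1-2: emit "symbol line_number original_line", sorted by symbol.
awk -F- '{print $1, NR, $0}' file2 | sort -k1,1 > file2.numbered

# Step 3: symbols that occur only in file2 (comm and join need sorted input).
sort -u file1 > file1.sorted
cut -d' ' -f1 file2.numbered | uniq > file2.symbols
comm -13 file1.sorted file2.symbols > file2.no_match

# Steps 3-5: join back, sort on the original line number, drop helper columns.
restored=$(join file2.no_match file2.numbered | sort -k2,2n | cut -d' ' -f3-)
printf '%s\n' "$restored"
```

    The output keeps file2's order: rte-alskjdfs first, then efr-4siosejf.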

    Let me know if you need help with this.

  • glenn jackman

    Assuming your lists reside in files

    awk -F- 'NR==FNR {exclude[$1]++; next} !($1 in exclude)' list_of_symbols filename
    

    grep is also an option

    grep -v -f <(sed 's/^/^/' list_of_symbols) filename
    

    The sed bit adds a regexp anchor to the beginning of each line.
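
    A small run of both approaches on the question's sample data might look like this. The grep variant below is a slight modification, not the exact command above: it writes the patterns to a file (hypothetical name patterns) instead of using process substitution, and it appends "-" to each pattern so that a symbol such as wer cannot also remove a line starting with werx-. It assumes the symbols contain no regular-expression metacharacters.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question.
printf '%s\n' wer sfe efo > list_of_symbols
printf '%s\n' wer-alskjdfi efr-4siosejf rte-alskjdfs wer-alskjsef sfe-ooskjdfi > filename

# awk: remember every symbol from the first file, then print only those lines
# of the second file whose first "-"-separated field was not remembered.
awk_out=$(awk -F- 'NR==FNR {exclude[$1]++; next} !($1 in exclude)' list_of_symbols filename)

# grep: turn each symbol into the anchored pattern "^symbol-" and drop matches.
sed 's/.*/^&-/' list_of_symbols > patterns
grep_out=$(grep -v -f patterns filename)

printf '%s\n' "$awk_out"
```

    Both variables end up holding the same two lines, and both approaches preserve the order of the second file.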

  • Scott

    Do you need to retain the order of your second file?  Can you state a maximum number of times that a line can be repeated?  If the answers to both questions are “no”, I’d suggest comm:

    sort file1 file1 > file1.sorted     sort file2 > file2.sorted
    -------------------------------     -------------------------
    efo                                 efr
    efo                                 rte
    sfe                                 sfe
    sfe                                 wer
    wer                                 wer
    wer
    
    comm -13 file1.sorted file2.sorted
    efr
    rte
    

    Include enough copies of file1 in file1.sorted to cover the maximum number of occurrences of any string in file2.
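
    If the maximum repeat count is not known in advance, it could be measured first. The following is only a sketch under that assumption: uniq -c counts the repeats in the sorted file2, and file1.repeated is a hypothetical helper file that receives that many copies of file1.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: one collation order for sort and comm

printf '%s\n' wer sfe efo > file1
printf '%s\n' wer efr rte wer sfe > file2
sort file2 > file2.sorted

# Largest number of times any single line occurs in file2.
max=$(uniq -c file2.sorted | awk '$1 > m {m = $1} END {print m + 0}')

# Concatenate file1 that many times, then sort the result.
: > file1.repeated
i=0
while [ "$i" -lt "$max" ]; do
    cat file1 >> file1.repeated
    i=$((i + 1))
done
sort file1.repeated > file1.sorted

# Lines that appear only in file2.
only_in_file2=$(comm -13 file1.sorted file2.sorted)
printf '%s\n' "$only_in_file2"
```

    Here max is 2 because wer occurs twice, so file1.sorted holds two copies of each symbol.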

  • Fred

    Without knowing anything about sed etc., the basic design in my personal pseudocode is:

    sort the list of strings to be removed (List A)

    sort the list of strings which contains items to be removed (List B)

    For each item in List A

    Repeat until Item (List B) > Item (List A)
        if the Item (List B) equals Item (List A) 
            remove item (List B)
        next Item (List B)
    Next Item (List A)
    

    Note: "Removing" an item might be problematical - better to replace this line with one adding the item to a new
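
    One possible rendering of this sorted-merge idea, written as an awk program called from the shell. It is a sketch with a few liberties: the outer loop runs over List B rather than List A, non-matching items are printed to a new list instead of being deleted in place, and the file names list_a.sorted and list_b.sorted are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: sort and awk then compare strings the same way

printf '%s\n' wer sfe efo | sort > list_a.sorted
printf '%s\n' wer efr rte wer sfe | sort > list_b.sorted

# Walk both sorted lists once. For each List B item, advance through List A
# while its current item is smaller, then keep the List B item only if the
# two differ. Appending "" forces string (not numeric) comparison.
kept=$(awk -v list_a=list_a.sorted '
    BEGIN { have_a = ((getline a < list_a) > 0) }
    {
        while (have_a && (a "") < ($0 ""))
            have_a = ((getline a < list_a) > 0)
        if (!have_a || (a "") != ($0 ""))
            print
    }
' list_b.sorted)
printf '%s\n' "$kept"
```

    Because each list is read only once, the work after sorting grows linearly with the combined length of the two lists.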


  • Related Question

    bash - how to use grep, sed, and awk to parse tags?
  • mechko

    I want to write a script that finds an open/close tag pair in a text file and prepends a fixed string to each line between the pair. I figure I'd use grep to find the tag line numbers and either awk or sed to insert the prefix, but I'm not sure exactly how to do it.

    Can someone help?


  • Related Answers
  • mpez0

    In awk:

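    # BEGIN (not START) is awk's pattern for code that runs before any input
    BEGIN                  {noprefix="true"}
    # the close tag switches prefixing off before the line itself is printed
    /<close tag regex>/    {noprefix="true"}
    noprefix=="false"      {print "prefix", $0}
    noprefix=="true"       {print $0}
    # the open tag switches prefixing on only after the line itself is printed
    /<open tag regex>/     {noprefix="false"}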
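
    To make the rule order concrete, here is a small sketch of the same state-flag approach, using BEGIN for the initialization. The tags <note> and </note>, the file name input.txt, and the literal prefix are all hypothetical stand-ins chosen for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: two lines sit between the open and close tags.
printf '%s\n' 'before' '<note>' 'first' 'second' '</note>' 'after' > input.txt

# Rule order matters: the close tag turns prefixing off before its own line
# is printed, and the open tag turns it on only after its own line is printed,
# so neither tag line receives the prefix.
prefixed=$(awk '
    BEGIN               { noprefix = "true" }
    /^<\/note>$/        { noprefix = "true" }
    noprefix == "false" { print "prefix", $0 }
    noprefix == "true"  { print $0 }
    /^<note>$/          { noprefix = "false" }
' input.txt)
printf '%s\n' "$prefixed"
```

    Only first and second come out as "prefix first" and "prefix second"; the tag lines and the lines outside the pair are printed unchanged.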
    
  • 9tat

    This should be done with one of the traditional syntax-aware tools (yacc etc.). Doing it with grep and the like may be okay for specific cases, but regular expressions are simply not powerful enough to catch the subtleties of HTML.

  • user18151

    You should consider using yacc for it. It is NOT possible to do this with sed, awk or grep without a considerable amount of effort. As for learning yacc, it wouldn't take more time than it did for learning sed/awk/grep. And it will be really easy that way.