linux - How do you remove all occurrences of values in one list from another list?
2013-09
I have a list of symbols such as...
wer
sfe
efo
How do I remove all instances of these (unique) symbols from another list of (non-unique) symbols?
So in the following list, the lines starting with wer would be removed twice, and sfe once:
wer-alskjdfi
efr-4siosejf
rte-alskjdfs
wer-alskjsef
sfe-ooskjdfi
Every other line should be untouched, with the symbol and characters after "-" remaining:
efr-4siosejf
rte-alskjdfs
I need to do this using sed/awk/grep/bash or other command line tools. I know how to write a sed command to search and remove one value at a time, but how do I do this for 100k+ values?
What if file 2 has characters after each of those symbols? I want to do the same but keep the trailing characters.
OK, make a copy of file2 that has only the field that you want to filter on. And, if the current file2 has the “non-unique symbol” immediately followed by the “trailing characters” (e.g., efr-42, rte-17, etc.), make another copy of file2 where they are space-separated.
Here are example commands based on the example data you provided:
sed 's/\(...\).*/\1/' file2.sorted > file2.symbol_only
sed 's/\(...\)\(.*\)/\1 \2/' file2.sorted > file2.separated
or
sed 's/\([^-]*\)-.*/\1/' file2.sorted > file2.symbol_only
sed 's/\([^-]*\)\(-.*\)/\1 \2/' file2.sorted > file2.separated
… based on the new data that you added to your question.
Then use comm as before:
comm -13 file1.sorted file2.symbol_only > file2.no_match
… and join the symbols up with the trailing characters:
join file2.no_match file2.separated
If necessary, use another sed to remove the spaces you added.
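The steps above can be strung together roughly as follows. This is only a sketch under some assumptions: `filter_symbols` is a made-up helper name, no line contains whitespace, and "-" always ends the symbol. It also deduplicates both symbol lists with `sort -u` instead of repeating file1, which is a variation on the answer, so that `join` emits each line of file2 exactly once.

```shell
#!/bin/sh
# Sketch: comm/join pipeline for lines of the form SYMBOL-TRAILING.
# Assumes no whitespace in either file and that "-" ends the symbol.
export LC_ALL=C   # sort, comm, and join must agree on collation

# filter_symbols is a hypothetical helper name.
# $1 = file of symbols to remove, $2 = file of SYMBOL-TRAILING lines
filter_symbols() {
    work=$(mktemp -d) || return 1
    sort -u "$1" > "$work/file1.sorted"
    # Unique symbols that occur in file2
    sed 's/\([^-]*\)-.*/\1/' "$2" | sort -u > "$work/file2.symbol_only"
    # "SYMBOL -TRAILING", sorted on the symbol field for join
    sed 's/\([^-]*\)\(-.*\)/\1 \2/' "$2" | sort -k1,1 > "$work/file2.separated"
    # Symbols in file2 that are not in file1
    comm -13 "$work/file1.sorted" "$work/file2.symbol_only" > "$work/file2.no_match"
    # Re-attach the trailing characters and drop the added space
    join "$work/file2.no_match" "$work/file2.separated" | sed 's/ //'
    rm -rf "$work"
}

# Demo on the sample data from the question
demo=$(mktemp -d)
printf '%s\n' wer sfe efo > "$demo/file1"
printf '%s\n' wer-alskjdfi efr-4siosejf rte-alskjdfs wer-alskjsef sfe-ooskjdfi > "$demo/file2"
filter_symbols "$demo/file1" "$demo/file2"
rm -rf "$demo"
```

On the sample data this prints efr-4siosejf and rte-alskjdfs. The output comes out sorted by symbol, not in file2's original order.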
It occurs to me that you could build on this trick to get the output file back into file2’s original order.
- Produce a copy of the original file2 with line numbers.
- Shuffle the line numbers to the right of the symbols.
- (the above, starting with the sort commands)
- Sort the output on the original line number.
- Strip out the line numbers.
Let me know if you need help with this.
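One possible reading of that list, as a sketch rather than the answerer's own script: `filter_keep_order` is a hypothetical name, the input is assumed to contain no whitespace, and "-" is assumed to end the symbol. Here awk writes the symbol, the line number, and the original line as three space-separated fields, so the line number already sits to the right of the symbol.

```shell
#!/bin/sh
# Sketch: same filter, but the output keeps file2's original line order.
# Assumes no whitespace in the input lines and that "-" ends the symbol.
export LC_ALL=C   # sort, comm, and join must agree on collation

# filter_keep_order is a hypothetical helper name.
# $1 = file of symbols to remove, $2 = file of SYMBOL-TRAILING lines
filter_keep_order() {
    work=$(mktemp -d) || return 1
    sort -u "$1" > "$work/remove"
    # "SYMBOL LINENO ORIGINAL-LINE", sorted on the symbol
    awk -F- '{ print $1, NR, $0 }' "$2" | sort -k1,1 > "$work/numbered"
    # Unique symbols present in file2, then those not in the removal list
    cut -d' ' -f1 "$work/numbered" | uniq > "$work/symbols"
    comm -13 "$work/remove" "$work/symbols" > "$work/keep"
    # Re-attach the lines, sort on the original line number,
    # then strip the symbol and line number again
    join "$work/keep" "$work/numbered" | sort -k2,2n | cut -d' ' -f3-
    rm -rf "$work"
}
```

Calling `filter_keep_order file1 file2` would then print the surviving lines in the order they had in file2.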
Assuming your lists reside in files:
awk -F- 'NR==FNR {exclude[$1]++; next} !($1 in exclude)' list_of_symbols filename
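For readers new to the `NR==FNR` idiom, here is the same one-liner spread over several lines with comments. The wrapper name `remove_listed` is made up for illustration; the awk program itself is unchanged.

```shell
#!/bin/sh
# remove_listed is a hypothetical wrapper around the awk one-liner above.
# $1 = list of symbols, $2 = file of SYMBOL-TRAILING lines
remove_listed() {
    awk -F- '
        # First file only (NR equals FNR there): remember each symbol, print nothing
        NR == FNR { exclude[$1]++; next }
        # Second file: print lines whose first "-"-separated field was not seen
        !($1 in exclude)
    ' "$1" "$2"
}
```

This reads the symbol list into memory once and keeps file2's order, so it should cope with 100k+ symbols. One caveat: if the symbol list is empty, `NR == FNR` also holds for every line of the second file and nothing is printed.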
grep is also an option:
grep -v -f <(sed 's/^/^/' list_of_symbols) filename
The sed bit adds a regexp anchor to the beginning of each line.
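Note that a pattern like `^wer` also matches a line such as `werx-123`. A slightly stricter variant, sketched below, anchors both ends of the symbol by appending the "-" separator. It assumes the symbols contain no regex metacharacters, and `grep_remove` is a hypothetical name. It writes the patterns to a temporary file instead of using `<( )`, so it also runs in a plain POSIX shell.

```shell
#!/bin/sh
# grep_remove is a hypothetical helper name.
# $1 = list of symbols (assumed free of regex metacharacters), $2 = data file
grep_remove() {
    patterns=$(mktemp) || return 1
    # Turn "wer" into "^wer-" so that "werx-..." is not removed by accident
    sed 's/.*/^&-/' "$1" > "$patterns"
    grep -v -f "$patterns" "$2"
    rm -f "$patterns"
}
```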
Do you need to retain the order of your second file?
Can you state a maximum number of times that a line can be repeated?
If the answers to both questions are “no”, I’d suggest comm:
sort file1 file1 > file1.sorted     sort file2 > file2.sorted
-------------------------------     -------------------------
efo                                 efr
efo                                 rte
sfe                                 sfe
sfe                                 wer
wer                                 wer
wer
comm -13 file1.sorted file2.sorted
efr
rte
Include enough copies of file1 in file1.sorted to cover the maximum number of occurrences of any string in file2.
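If the maximum is not known in advance, it can be counted first. The following is a sketch of that idea, not part of the original answer: `comm_remove` is a hypothetical name, and whole lines are compared, as in the example above.

```shell
#!/bin/sh
export LC_ALL=C   # sort and comm must agree on collation

# comm_remove is a hypothetical helper name.
# $1 = symbols to remove, $2 = list to filter (whole lines are compared)
comm_remove() {
    work=$(mktemp -d) || return 1
    sort "$2" > "$work/file2.sorted"
    # Highest number of times any single line occurs in file2
    max=$(uniq -c "$work/file2.sorted" | awk '$1 > m { m = $1 } END { print m + 0 }')
    # Concatenate that many copies of file1, then sort the result
    i=0
    : > "$work/file1.repeated"
    while [ "$i" -lt "$max" ]; do
        cat "$1" >> "$work/file1.repeated"
        i=$((i + 1))
    done
    sort "$work/file1.repeated" > "$work/file1.sorted"
    # Keep only the lines that are unique to file2
    comm -13 "$work/file1.sorted" "$work/file2.sorted"
    rm -rf "$work"
}
```

With 100k+ symbols and a large maximum, the repeated file grows quickly, so this is mainly useful when repeats are rare.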
Without knowing anything about SED etc, the basic design in my personal pseudocode is:
sort the list of strings to be removed (List A)
sort the list of strings which contains items to be removed (List B)
For each item in List A
    Repeat until Item (List B) > Item (List A)
        if the Item (List B) equals Item (List A)
            remove item (List B)
        next Item (List B)
Next Item (List A)
Note: "Removing" an item might be problematical - better to replace this line with one adding the item to a new
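A rough shell rendering of this merge idea, offered as one interpretation of the pseudocode and not a definitive implementation. It loops over List B and advances through List A alongside it, which rearranges the loops but walks both sorted lists once, and it writes the kept items to standard output instead of deleting anything. The name `merge_remove` is hypothetical, and whole lines are compared.

```shell
#!/bin/sh
export LC_ALL=C   # keep sort order and awk string comparison consistent

# merge_remove is a hypothetical helper name.
# $1 = List A (strings to remove), $2 = List B (list to filter)
merge_remove() {
    work=$(mktemp -d) || return 1
    sort -u "$1" > "$work/a.sorted"
    sort "$2" > "$work/b.sorted"
    awk -v listA="$work/a.sorted" '
        # Prime the walk with the first item of List A
        BEGIN { have = ((getline a < listA) > 0) }
        {
            # Advance List A until its current item is >= the List B item
            # (appending "" forces string comparison)
            while (have && (a "") < ($0 ""))
                have = ((getline a < listA) > 0)
            # Print the List B item unless it equals the current List A item
            if (!(have && (a "") == ($0 "")))
                print
        }
    ' "$work/b.sorted"
    rm -rf "$work"
}
```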
I want to write a script that finds an open/close tag pair in a text file and prepends a fixed string to each line between the pair. I figure I could use grep to find the tag line numbers and either awk or sed to place the tags; however, I'm not sure exactly how to do it.
Can someone help?
In awk:
BEGIN {noprefix="true"}
/<close tag regex>/ {noprefix="true"}
noprefix=="false" {print "prefix", $0}
noprefix=="true" {print $0}
/<open tag regex>/ {noprefix="false"}
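To make this concrete, here is a small sketch with the placeholders filled in. The `<pre>` and `</pre>` tags and the name `prefix_between` are only examples, and each tag is assumed to sit on a line of its own. The tag lines themselves are printed without the prefix, as in the awk program above.

```shell
#!/bin/sh
# prefix_between is a hypothetical helper name.
# $1 = prefix string, $2 = input file
# <pre> and </pre> stand in for the real open/close tags,
# each assumed to sit on a line of its own.
prefix_between() {
    awk -v prefix="$1" '
        # Close tag: stop prefixing, starting with this line
        /<\/pre>/ { inside = 0 }
        # Print every line, prefixed only while between the tags
        { if (inside) print prefix $0; else print }
        # Open tag: start prefixing with the next line
        /<pre>/ { inside = 1 }
    ' "$2"
}
```

For example, `prefix_between '# ' input.txt` would put "# " in front of every line between the two tags. Because the prefix is passed with `-v`, awk interprets backslash escapes in it.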
It should be done with one of the traditional syntax-aware tools (yacc etc.). Doing it with grep and the like may be okay for specific cases, but regular expressions simply are not powerful enough to catch the subtleties of HTML.
You should consider using yacc for it. It is NOT possible to do this with sed, awk or grep without a considerable amount of effort. As for learning yacc, it wouldn't take more time than it did for learning sed/awk/grep. And it will be really easy that way.