linux - Listing all words in a text file and finding the most frequent word
2014-04
I have a file that contains lines of text.
How do I write a script that will find and print out every word in the file, one word per line?
It should then find and print out the most frequent word (case sensitive) and the number of occurrences of that word in the file.
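#!/bin/bash
# Read stdin once and split it into one word per line
words=$(tr -s '[:space:]' '\n' | sed '/^$/d')
# Print every word, then a blank line
printf '%s\n\n' "$words"
# Count each distinct word and list the most frequent first
printf '%s\n' "$words" | sort | uniq -c | sort -nr
This simple script acts as a word frequency counter just by piping sort and uniq together. It reads stdin only once and keeps the words in a variable, because input that has already been consumed by one command (such as cat) cannot be read again by the next. First it uses tr to put each word on its own line and prints the words to show the input. Then it prints a blank line. Lastly it sorts the words so that identical ones are adjacent, counts each distinct word with uniq -c, then sorts the list again with the n and r options to order it numerically and in reverse, so that the most frequent words appear first. Since it reads from the standard input stream, call it like this: script < inputfile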
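The counting pipeline above lists every distinct word with its count. As a minimal sketch, assuming words are separated by whitespace, the same pipeline can be cut down to the single most frequent word. The function name most_frequent_word is made up for illustration, and ties are broken arbitrarily by sort.

```shell
#!/bin/bash
# most_frequent_word: hypothetical helper (not from the original answer).
# Reads text on stdin and prints "<word> <count>" for the most frequent
# whitespace-separated word. Matching is case sensitive.
most_frequent_word() {
    tr -s '[:space:]' '\n' |   # put each word on its own line
        sed '/^$/d' |          # drop the empty line left by leading whitespace
        sort |                 # bring identical words together
        uniq -c |              # prefix each distinct word with its count
        sort -nr |             # highest count first
        head -n 1 |            # keep only the top entry
        awk '{ print $2, $1 }' # reorder to "word count"
}
```

Call it the same way as the script: most_frequent_word < inputfile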
That should give you a start to work with:
#!/usr/bin/perl
use strict;
use warnings;

# Read the file (the path is taken from the first command-line argument)
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $in, '<', $file or die "Cannot open $file: $!";
my @lines = <$in>;
close $in;

# Split the lines of the file into an array of words
my @words;
foreach my $line (@lines) {
    # Split on runs of non-word characters and skip empty fields
    push @words, grep { length } split(/\W+/, $line);
}

# Count the occurrences of each word (to evolve into a MapReduce fashion if the file is tremendously big)
my %word_count;
foreach my $word (@words) {
    $word_count{$word}++;
}

# Find the word with the most occurrences
my $most_frequent_word = '';
my $max = 0;
foreach my $word (keys %word_count) {
    if ($word_count{$word} > $max) {
        $max = $word_count{$word};
        $most_frequent_word = $word;
    }
}

# Print results
print "Most frequent word : $most_frequent_word\n";
print "Occurrences : $max\n";
#!/bin/bash
file=$1
declare -A count
for word in $(< "$file"); do
    echo "$word"
    (( count[$word]++ ))
done
max=0
max_word=
for word in "${!count[@]}"; do
    if (( ${count[$word]} > $max )); then
        max=${count[$word]}
        max_word=$word
    fi
done
echo "most seen word: '$max_word', seen $max times"
Notes:
- $(<file) is a bash shorthand for $(cat file): it returns the contents of the file
- because $(<file) is not itself double-quoted, the shell will split it into words, and the for loop will iterate over the words
- you need bash version 4 for associative arrays
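To make the word-splitting note concrete, here is a small sketch that counts how many times a for loop runs over the same file with and without the double quotes. Both function names are hypothetical and exist only for this demonstration; it assumes bash, since $(<file) is a bash feature.

```shell
#!/bin/bash
# Hypothetical demo helpers (not from the original answer).

# Unquoted: the shell splits the file contents on whitespace,
# so the loop body runs once per word.
count_iterations_unquoted() {
    local n=0 item
    for item in $(< "$1"); do
        n=$((n + 1))
    done
    echo "$n"
}

# Quoted: the contents stay a single string, so the loop body
# runs exactly once no matter how many words the file holds.
count_iterations_quoted() {
    local n=0 item
    for item in "$(< "$1")"; do
        n=$((n + 1))
    done
    echo "$n"
}
```

For a file containing "one two" on the first line and "three" on the second, the unquoted version reports 3 iterations and the quoted version reports 1.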
<opinion>
I don't know why people complain about perl syntax being ugly: do you see how you have to handle arrays in bash?</opinion>
A shell one-liner:
cat file.txt | sed -r 's/[[:space:]]+/\n/g' | sed '/^$/d' | sort | uniq -c | sort -n | tail -n1
I need to run ./pythonScript keyword once for each keyword in a text file. How can I do this from a GNOME terminal, without having to modify pythonScript?
pseudo code:
for each keyword in file:
./pythonScript keyword
waitfor(pythonScript to finish)
Assuming there is one keyword per line, here's a pure shell, portable solution:
while IFS= read -r line; do
    ./pythonScript "$line"
done <file
Here's a slightly simpler Linux solution:
<file xargs -d '\n' -n 1 ./pythonScript
Both solutions allow any character other than newline to appear in a keyword.
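To try the loop without touching the real script, here is a minimal sketch that wraps it in a function taking the keyword file and the command to run. The name run_per_keyword is made up for illustration. It assumes one keyword per line, and it redirects the command's stdin from /dev/null so that a command which reads stdin cannot swallow the remaining keywords.

```shell
#!/bin/bash
# run_per_keyword FILE COMMAND [ARGS...]
# Hypothetical wrapper (not from the original answers): runs COMMAND once
# per line of FILE, passing the line as the last argument.
run_per_keyword() {
    local file=$1
    shift
    local keyword
    # IFS= keeps leading and trailing whitespace; -r keeps backslashes literal.
    # The "|| [ -n ... ]" test also handles a last line without a newline.
    while IFS= read -r keyword || [ -n "$keyword" ]; do
        # Each call finishes before the next keyword is read.
        "$@" "$keyword" < /dev/null
    done < "$file"
}
```

For the question above, the call would be: run_per_keyword file ./pythonScript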
Is there one keyword per line in the file? If so,
while read -r keyword
do
    ./pythonScript "$keyword"
done < file
If you have GNU Parallel (http://www.gnu.org/software/parallel/) installed, you can do this:
cat file | parallel ./pythonScript
This will run the jobs in parallel, which can be very useful on a multicore machine.
Watch the intro video for GNU Parallel to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ