linux - Listing all words in a text file and finding the most frequent word

linux shell-script

06
2014-04

SIlent pain

I have a file which contains lines.

How do I write a script that will find and print out every word in the file, one word per line?

The script should then find and print the most frequently occurring word (case-sensitive) and the number of times that word occurs in the file.

Answers

Roberto Gomez

#!/bin/bash
# Split stdin into one word per line and keep the result in a variable,
# because stdin can only be read once.
words=$(tr -s '[:space:]' '\n' | sed '/^$/d')
printf '%s\n\n' "$words"
printf '%s\n' "$words" | sort | uniq -c | sort -nr

This simple script acts as a word frequency counter by piping sort and uniq together. First, tr puts each word on its own line and the result is stored in a variable, because stdin can only be read once. The script then prints the word list followed by a blank line. Finally, it sorts the words, counts each distinct word with uniq -c, and sorts the result again with the -n and -r options, which order the list numerically and in reverse so that the most frequent words appear first. Since it reads from the standard input stream, call it like this: script < inputfile.
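If you only want the single most frequent word and its count, the same sort and uniq -c idea can be wrapped in a small function. This is just a sketch: the name top_word is made up for illustration, and it assumes words are separated by whitespace. Ties are broken by sort order, which depends on your locale.

```shell
#!/bin/bash
# Hypothetical helper: print the most frequent word on stdin, then its count.
top_word() {
    tr -s '[:space:]' '\n' |   # one word per line
        sed '/^$/d' |          # drop empty lines
        sort |                 # group identical words together
        uniq -c |              # prefix each distinct word with its count
        sort -k1,1nr -k2 |     # highest count first, ties in sorted word order
        head -n 1 |
        awk '{ print $2, $1 }' # reorder to "word count"
}
```

Call it as top_word < inputfile. The comparison is case-sensitive, so "The" and "the" are counted separately.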

Kwaio

This should give you a starting point:

#!/usr/bin/perl
use strict;
use warnings;

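#Read the file (assumption: the path is the first command-line argument)
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my @lines = <$in>;
close $in;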

#Split the lines of the file into an array of words
my @words;
foreach my $line (@lines)
{
    push @words, grep { length } split(/\W+/, $line);
}

#Count the occurrences of each word (to evolve into a MapReduce fashion if the file is tremendously big)
my %word_count;
foreach my $word (@words)
{
    $word_count{$word}++;
}

#Find the word with the most occurrences
my $most_frequent_word='';
my $max=0;
foreach my $word (keys %word_count)
{
    if ($word_count{$word} > $max)
    {
        $max = $word_count{$word};
        $most_frequent_word = $word;
    }
}
#Print results
print "Most frequent word : $most_frequent_word\n";
print "Occurrences : $max\n";

glenn jackman

#!/bin/bash
file=$1
declare -A count
for word in $(< "$file"); do
    echo "$word"
    (( count[$word]++ ))
done

max=0
for word in "${!count[@]}"; do
    if (( ${count[$word]} > $max )); then
        max=${count[$word]}
        max_word=$word
    fi
done
echo "most seen word: '$max_word', seen $max times"

Notes:

$(<file) is a bash shorthand for $(cat file) -- it returns the contents of the file
because $(<file) is not itself double-quoted, the shell will split it into words (and also expand any glob characters such as *), and the for loop will iterate over the words.
you need bash version 4 for associative arrays
<opinion> I don't know why people complain about perl syntax being ugly: do you see how you have to handle arrays in bash? </opinion>
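One caveat with relying on an unquoted $(<file): besides word splitting, the shell also performs pathname expansion on the result, so a word like *.txt in the input could be replaced by matching file names. Here is a sketch of the same counting approach that reads words with read -ra instead, which splits on whitespace without globbing. The function name most_seen is invented for this example, and it still needs bash 4 or later for the associative array.

```shell
#!/bin/bash
# Hypothetical helper: print the most frequent word on stdin and its count.
# Requires bash 4+ for associative arrays.
most_seen() {
    local -A count=()
    local -a words
    local word max=0 max_word=''
    # read -ra splits each line on whitespace without expanding globs;
    # the || test picks up a final line that has no trailing newline.
    while read -ra words || (( ${#words[@]} )); do
        for word in "${words[@]}"; do
            count[$word]=$(( ${count[$word]:-0} + 1 ))
        done
    done
    # Scan the counts for the largest value.
    for word in "${!count[@]}"; do
        if (( ${count[$word]} > max )); then
            max=${count[$word]}
            max_word=$word
        fi
    done
    printf '%s %d\n' "$max_word" "$max"
}
```

Use it as most_seen < "$file". If several words share the highest count, which one is reported is unspecified, because bash does not guarantee an iteration order for associative array keys.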

eldering

A shell oneliner:

cat file.txt | sed -r 's/[[:space:]]+/\n/g' | sed '/^$/d' | sort | uniq -c | sort -n | tail -n1

Related Answers

Gilles

Assuming there is one keyword per line, here's a pure shell, portable solution:

while read -r line; do
    ./pythonScript "$line"
done <file

Here's a slightly simpler Linux solution:

<file xargs -d '\n' -n 1 ./pythonScript

Both solutions allow any character other than newline to appear in a keyword.
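To also keep leading and trailing blanks in each line, the loop can clear IFS for the read. A minimal sketch follows; the wrapper name for_each_line is hypothetical, and redirecting the command's stdin from /dev/null is my own precaution so the command cannot swallow the remaining input lines.

```shell
#!/bin/bash
# Hypothetical helper: run a command once per input line, passing the line
# as a single argument.
for_each_line() {
    local line
    # IFS= keeps leading/trailing blanks; -r keeps backslashes literal.
    # The || test handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # stdin comes from /dev/null so the command cannot consume the input
        "$@" "$line" </dev/null
    done
}
```

For the case above this would be called as for_each_line ./pythonScript <file.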

medina

Is there a keyword per line from the file? If so,

while IFS= read -r keyword
do
    ./pythonScript "$keyword"
done < file

Ole Tange

If you have GNU Parallel (http://www.gnu.org/software/parallel/) installed, you can do this:

cat file | parallel ./pythonScript

This runs the jobs in parallel, which can be very useful on a multicore machine.

Watch the intro video for GNU Parallel to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
