linux - Listing all words in a text file and finding the most frequent word

06
2014-04
  • SIlent pain

    I have a file which contains lines.

    How do I write a script that finds and prints every word in the file, one word per line?

    The script should then find and print the most frequently occurring word (case-sensitive) and the number of times that word occurs in the file.

  • Answers
  • Roberto Gomez
    #!/bin/bash
    # Split stdin into one word per line, dropping empty lines
    words=$(tr -s '[:space:]' '\n' | sed '/^$/d')
    printf '%s\n\n' "$words"
    # Count each distinct word, most frequent first
    printf '%s\n' "$words" | sort | uniq -c | sort -nr
    

    This script acts as a word frequency counter using only tr, sed, sort, and uniq piped together. It first splits standard input into one word per line with tr, drops empty lines with sed, and prints that list followed by a blank line. It then sorts the words, collapses duplicates with uniq -c (which prefixes each distinct word with its count), and sorts again with the -n and -r options so the list is ordered numerically in reverse and the most frequent word appears first. Because it reads from standard input, call it like this: script < inputfile.
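
    If only the top word is needed, the sort | uniq -c | sort -nr pipeline described above can be cut down to its first line. The following is a minimal sketch, not part of the original answer: the function name most_frequent_word is hypothetical, and words are assumed to be separated by whitespace.

```shell
#!/bin/bash
# most_frequent_word: hypothetical helper. Reads text on stdin and prints
# "<word> <count>" for the most frequent whitespace-separated word.
most_frequent_word() {
    tr -s '[:space:]' '\n' |   # one word per line
        sed '/^$/d' |          # drop empty lines
        sort |                 # group identical words together
        uniq -c |              # prefix each distinct word with its count
        sort -k1,1nr |         # highest count first
        head -n 1 |            # keep only the top entry
        awk '{ print $2, $1 }' # reorder to "<word> <count>"
}
```

    Call it as most_frequent_word < inputfile. If several words share the highest count, which one is printed depends on the sort order.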

  • Kwaio

    This should give you a starting point:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
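    #Read the file (path taken from the first command-line argument)
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @lines = <$in>;
    close $in;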
    
    #Split the lines of the file into an array of words
    my @words;
    foreach my $line (@lines)
    {
        push @words, grep { length } split(/\W+/, $line);
    }
    
    #Count the occurrences of each word (for a very large file, consider a MapReduce-style approach)
    my %word_count;
    foreach my $word (@words)
    {
        $word_count{$word}++;
    }
    
    #Find the word with the most occurrences
    my $most_frequent_word='';
    my $max=0;
    foreach my $word (keys %word_count)
    {
        if ($word_count{$word} > $max)
        {
            $max = $word_count{$word};
            $most_frequent_word = $word;
        }
    }
    #Print results
    print "Most frequent word : $most_frequent_word\n";
    print "Occurrences : $max\n";
    
  • glenn jackman
    #!/bin/bash
    file=$1
    declare -A count
    for word in $(< "$file"); do
        printf '%s\n' "$word"
        count[$word]=$(( ${count[$word]:-0} + 1 ))
    done
    
    max=0
    for word in "${!count[@]}"; do
        if (( ${count[$word]} > $max )); then
            max=${count[$word]}
            max_word=$word
        fi
    done
    echo "most seen word: '$max_word', seen $max times"
    

    Notes:

    • $(<file) is a bash shorthand for $(cat file) -- it returns the contents of the file
    • because $(<file) is not itself double-quoted, the shell will split it into words, and the for loop will iterate over the words.
    • you need bash version 4 for associative arrays
    • <opinion> I don't know why people complain about perl syntax being ugly: do you see how you have to handle arrays in bash? </opinion>
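
    One caveat about the unquoted expansion described in the notes: besides word splitting, the shell also performs filename expansion on the result, so a word such as *.txt in the input could be replaced by matching file names. A minimal sketch of one way to avoid that, assuming the default IFS: turn globbing off with set -f inside a subshell. The function name print_words is hypothetical.

```shell
#!/bin/bash
# print_words: hypothetical helper. Prints each whitespace-separated word
# read from stdin on its own line.
# The function body is a subshell, so "set -f" does not leak to the caller.
print_words() (
    set -f                    # disable filename expansion (globbing)
    for word in $(cat); do    # unquoted on purpose: the shell splits into words
        printf '%s\n' "$word"
    done
)
```

    Call it as print_words < file.txt.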
  • eldering

    A shell one-liner:

    cat file.txt | sed -r 's/[[:space:]]+/\n/g' | sed '/^$/d' | sort | uniq -c | sort -n | tail -n1
    

  • Related Question

    linux - How to loop a script execution with param from a text file?
  • ldabl

    I need to run ./pythonScript keyword once for each keyword in a text file. How can I do this from a GNOME terminal, without having to modify pythonScript?

    pseudo code:

    for each keyword in file:
      ./pythonScript keyword
      waitfor(pythonScript to finish)
    

  • Related Answers
  • Gilles

    Assuming there is one keyword per line, here's a pure shell, portable solution:

    while IFS= read -r line; do
        ./pythonScript "$line"
    done <file
    

    Here's a slightly simpler Linux solution:

    <file xargs -d '\n' -n 1 ./pythonScript
    

    Both solutions allow any character other than newline to appear in a keyword.
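
    To see how a line survives unchanged, here is a minimal sketch of the read loop wrapped in a reusable function. It assumes one keyword per line. The function name for_each_line is hypothetical, and redirecting the command's stdin from /dev/null is an added precaution so that the command cannot consume the remaining keywords.

```shell
#!/bin/bash
# for_each_line: hypothetical helper. Runs the given command once per line of
# stdin, passing the line as a single final argument, and waits for each run
# to finish before starting the next.
for_each_line() {
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal;
    # the [ -n ... ] test handles a last line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        "$@" "$line" </dev/null
    done
}
```

    Call it as for_each_line ./pythonScript < file.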

  • medina

    Is there one keyword per line in the file? If so:

    while read -r keyword
    do
        ./pythonScript "$keyword"
    done < file
    
  • Ole Tange

    If you have GNU Parallel (http://www.gnu.org/software/parallel/) installed, you can do this:

    cat file | parallel ./pythonScript
    

    This runs the jobs in parallel, which can be very useful on a multicore machine.

    Watch the intro video for GNU Parallel to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ