Computes the most frequently used words in a text file

cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'|sort -nr | cat -n | head -n 30
Using cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | sort -S16M | uniq -c |sort -nr | cat -n | head -n 30 ("sort -S1G" - Linux/GNU sort only) will also do the job, but it has some drawbacks for bigger files (caused by the space/time complexity of sorting).
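To make the comparison between the two variants easier to try out, here is a minimal sketch that wraps each pipeline in a shell function reading from standard input. The function names `top_words` and `top_words_sorted`, the default of 30 lines, and the `NF` guard that skips empty lines are illustrative additions, not part of the original one-liner.

```shell
# top_words [N]: read text on stdin, print the N most frequent words.
# Hypothetical helper name; N defaults to 30 as in the one-liner above.
top_words() {
    tr -cs '[:alnum:]' '\n' |          # one word per line
    tr '[:lower:]' '[:upper:]' |       # fold case
    awk 'NF { count[$1]++ } END { for (w in count) print count[w], w }' |
    sort -nr |                         # highest count first
    head -n "${1:-30}"
}

# top_words_sorted [N]: same result via sort | uniq -c.
# The trailing awk only strips the padding that uniq -c adds.
top_words_sorted() {
    tr -cs '[:alnum:]' '\n' |
    tr '[:lower:]' '[:upper:]' |
    sort | uniq -c |
    awk 'NF == 2 { print $1, $2 }' |
    sort -nr |
    head -n "${1:-30}"
}
```

Usage would be something like `top_words 30 < WAR_AND_PEACE_By_LeoTolstoi.txt | cat -n`. The awk variant keeps one counter per distinct word, while the sort variant has to sort every single word occurrence before counting, which is where the extra cost on big files comes from.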
Sample Output
# get some input http://www.gutenberg.org
$ cat WAR_AND_PEACE_By_LeoTolstoi.txt | tr -cs "[:alnum:]" "\n"| tr "[:lower:]" "[:upper:]" | awk '{h[$1]++}END{for (i in h){print h[i]" "i}}'|sort -nr | cat -n | head -n 30 
     1  34720 THE
     2  22300 AND
     3  16753 TO
     4  15007 OF
     5  10608 A
     6  10004 HE
     7  9036 IN
     8  8204 THAT
     9  7984 HIS
    10  7359 WAS
    11  5710 WITH
    12  5617 IT
    13  5365 HAD
    14  4725 HER
    15  4697 NOT
    16  4637 HIM
    17  4547 AT
    18  4524 I
    19  4414 S
    20  4054 BUT
    21  4035 AS
    22  4014 ON
    23  3871 YOU
    24  3555 FOR
    25  3488 SHE
    26  3347 IS
    27  2842 SAID
    28  2813 ALL
    29  2709 FROM
    30  2458 BY

By: cp
2010-07-05 06:39:20

What Others Think

I think there's a database of the most commonly used words in English, so we could remove those words from this list and see which words are used frequently by Tolstoi specifically.
alperyilmaz · 419 weeks and 6 days ago
Right; do that, it is fun! C.
cp · 419 weeks and 5 days ago
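The stop-word idea from the comment above could be sketched roughly as follows. It assumes a plain text file with one common word per line; the function name `top_words_filtered` and the file itself are hypothetical, and no particular word database is implied.

```shell
# top_words_filtered STOPFILE [N]: like the one-liner, but drops every
# word listed in STOPFILE (one word per line, case-insensitive) first.
# Hypothetical helper; STOPFILE is an assumed, user-supplied list.
top_words_filtered() {
    stopfile=$1
    tr -cs '[:alnum:]' '\n' |
    grep -vixF -f "$stopfile" |        # remove whole-line stop-word matches
    tr '[:lower:]' '[:upper:]' |
    awk 'NF { count[$1]++ } END { for (w in count) print count[w], w }' |
    sort -nr |
    head -n "${2:-30}"
}
```

For example, `top_words_filtered common_words.txt 30 < WAR_AND_PEACE_By_LeoTolstoi.txt`, where `common_words.txt` is whatever list of common English words you have at hand.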
Very similar to my own script: http://l0b0.wordpress.com/2010/05/31/tag-cloud-shell-script/
l0b0 · 419 weeks and 5 days ago
The awk part is needlessly complicated. You can use "uniq -c" for that.
inof · 419 weeks and 4 days ago
uniq -c cannot do this! Have a look at my description.
cp · 419 weeks and 4 days ago
