Count occurrences of each word in the novel David Copperfield

wget -q -O- http://www.gutenberg.org/dirs/etext96/cprfd10.txt | sed '1,419d' | tr "\n" " " | tr " " "\n" | perl -lpe 's/\W//g;$_=lc($_)' | grep "^[a-z]" | awk 'length > 1' | sort | uniq -c | awk '{print $2"\t"$1}'
This command might not be useful for most of us; I just wanted to share it to show the power of the command line. It downloads the plain-text version of the novel David Copperfield from Project Gutenberg and then generates a single column of words, after which the occurrences of each word are counted by the sort | uniq -c combination. The command also strips numbers and single characters from the count. I'm sure you can write a shorter version.
Sample Output
aback	1
abandon	6
abandoned	13
abase	1
abased	1
abashed	7
abated	1
abatement	1
...

-4
2009-05-04 16:00:39

These Might Interest You

  • Count the occurrences of the word 'Berlekamp' in the DJVU files in the current directory, printing file names from the one with the fewest occurrences to the one with the most.


    0
    find ./ -iname "*.djvu" -execdir perl -e '@s=`djvutxt \"$ARGV[0]\"\|grep -c Berlekamp`; chomp @s; print $s[0]; print " $ARGV[0]\n"' '{}' \;|sort -n
    unixmonkey4437 · 2010-04-07 11:15:26 0
  • Bases the word count on the generated PDF file, so make sure to regenerate it first. The PDF file also includes references and the output of any macros.


    0
    pdftotext file.pdf - | wc -w
    computermacgyver · 2013-06-01 16:29:04 0
  • Faster than the other method using wget. To obtain all commands, use:
    nu=`curl http://www.commandlinefu.com/commands/browse |grep -o "Terminal - All commands -.*results$" | grep -oE "[[:digit:],]{4,}" | sed 's/,//'`; curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-"$nu":25] | grep -vE "_curl_|\.com by David" > clf-ALL.txt
    For a file named after the current command count:
    nu=`curl http://www.commandlinefu.com/commands/browse |grep -o "Terminal - All commands -.*results$" | grep -oE "[[:digit:],]{4,}" | sed 's/,//'`; curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-"$nu":25] | grep -vE "_curl_|\.com by David" > clf-ALL_"$nu".txt
    You can also download the result directly from my Dropbox. My Dropbox invitation link is http://db.tt/sRdJWvQq . Use it and get a free 2.5 GB of space.


    2
    curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-9000:25] | grep -vE "_curl_|\.com by David" > clf-ALL.txt
    totti · 2011-11-08 12:19:48 0
  • Requires wdiff. Prints the word-by-word diff with the old version highlighted in red, and the new in green. Change the colors by altering 41m and 42m. 45m is more of a magenta and may be easier to read.


    0
    wdiff -n -w $'\033[30;41m' -x $'\033[0m' -y $'\033[30;42m' -z $'\033[0m' oldversion.txt newversion.txt
    abracadabra · 2011-11-10 18:35:41 0
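The `41m`/`42m`/`45m` numbers in the wdiff command above are ordinary ANSI SGR color codes, so you can preview a color combination before wiring it into wdiff. A quick sketch (assuming a terminal that interprets ANSI escapes):

```shell
# 30 = black foreground; 41 = red, 42 = green, 45 = magenta background;
# 0 = reset. These are the same sequences passed to wdiff via -w/-x/-y/-z.
printf '\033[30;41mold text\033[0m \033[30;42mnew text\033[0m\n'
printf '\033[30;45mmagenta variant\033[0m\n'
```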

What Others Think

Is this a joke? curl http://www.gutenberg.org/dirs/etext96/cprfd10.txt|awk -v RS='[^a-zA-Z0-9]' /./'{a[$1]++}END{for (i in a) print a[i], i|"sort -n"}'
point_to_null · 476 weeks and 2 days ago
I like the posted command better than the one by point_to_null. While point_to_null's is simpler and shorter, it does not strip out numbers and single characters. The download stats are nice, but not really an improvement.
jestin · 476 weeks and 2 days ago
@point_to_null: wow! I wouldn't have imagined this could be done with a command as short as yours. You must be a command-line guru.
alperyilmaz · 476 weeks and 2 days ago
Nice one! point_to_null gets points for doing it with fewer pipes. alperyilmaz gets points for using more tools though (in order: wget, sed, tr, tr, perl, grep, awk, sort, uniq, awk) showing people what piping really means.
bwoodacre · 476 weeks and 2 days ago
what the...
linuxrawkstar · 476 weeks and 2 days ago
...exactly. This doesn't make much sense. Any explanation why someone would want to count the words in this novel?
Alanceil · 476 weeks and 1 day ago
@Alanceil something like this could be modified to find occurrences of words or letters following each other to create a Markov chain modeled after a given text. That would be useful for a text generator if you needed one...
leon · 476 weeks ago
tag clouds, semantic analysis, you know, stuff like that.
mondotofu · 438 weeks and 3 days ago
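leon's Markov-chain suggestion only needs counts of word *pairs* rather than single words. A sketch of that bigram-counting variant, with sample text inlined instead of downloading the novel:

```shell
# Count adjacent word pairs (bigrams): awk remembers the previous word
# and prints each "prev current" pair, which sort | uniq -c then tallies.
# Pair frequencies like these are the raw material for a Markov-chain
# text generator.
printf 'the cat sat on the mat the cat ran\n' \
  | tr ' ' '\n' \
  | awk 'NR > 1 {print prev, $0} {prev = $0}' \
  | sort | uniq -c | sort -rn
```

Here "the cat" appears twice and tops the list; every other pair occurs once.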

