Count occurrences of each word in the novel David Copperfield

wget -q -O- http://www.gutenberg.org/dirs/etext96/cprfd10.txt | sed '1,419d' | tr "\n" " " | tr " " "\n" | perl -lpe 's/\W//g;$_=lc($_)' | grep "^[a-z]" | awk 'length > 1' | sort | uniq -c | awk '{print $2"\t"$1}'
This command might not be useful for most of us; I just wanted to share it to show the power of the command line. It downloads the plain-text version of the novel David Copperfield from Project Gutenberg and then generates a single column of words, after which the occurrences of each word are counted by the sort | uniq -c combination. The command also strips numbers and single characters from the count. I'm sure you can write a shorter version.
Sample Output
aback	1
abandon	6
abandoned	13
abase	1
abased	1
abashed	7
abated	1
abatement	1
...

-4
2009-05-04 16:00:39

These Might Interest You

  • Count the occurrences of the word 'Berlekamp' in the DJVU files in the current directory, printing file names from the one with the fewest occurrences to the one with the most.


    0
    find ./ -iname "*.djvu" -execdir perl -e '@s=`djvutxt \"$ARGV[0]\"\|grep -c Berlekamp`; chomp @s; print $s[0]; print " $ARGV[0]\n"' '{}' \;|sort -n
    unixmonkey4437 · 2010-04-07 11:15:26 0
  • Bases the word count on the generated PDF file, so make sure to regenerate it first. The PDF file also includes references and the output of any macros.


    0
    pdftotext file.pdf - | wc -w
    computermacgyver · 2013-06-01 16:29:04 0
  • Faster than the other method using wget. To obtain all commands, use:
    nu=`curl http://www.commandlinefu.com/commands/browse |grep -o "Terminal - All commands -.*results$" | grep -oE "[[:digit:],]{4,}" | sed 's/,//'`; curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-"$nu":25] | grep -vE "_curl_|\.com by David" > clf-ALL.txt
    For a file named after the current command count:
    nu=`curl http://www.commandlinefu.com/commands/browse |grep -o "Terminal - All commands -.*results$" | grep -oE "[[:digit:],]{4,}" | sed 's/,//'`; curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-"$nu":25] | grep -vE "_curl_|\.com by David" > clf-ALL_"$nu".txt
    You can also download the result directly from my Dropbox. My Dropbox invitation link is http://db.tt/sRdJWvQq . Use it and get a free 2.5 GB of space.


    2
    curl http://www.commandlinefu.com/commands/browse/sort-by-votes/plaintext/[0-9000:25] | grep -vE "_curl_|\.com by David" > clf-ALL.txt
    totti · 2011-11-08 12:19:48 0
  • Requires wdiff. Prints the word-by-word diff with the old version highlighted in red, and the new in green. Change the colors by altering 41m and 42m. 45m is more of a magenta and may be easier to read.


    0
    wdiff -n -w $'\033[30;41m' -x $'\033[0m' -y $'\033[30;42m' -z $'\033[0m' oldversion.txt newversion.txt
    abracadabra · 2011-11-10 18:35:41 0
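The `41m`/`42m`/`45m` numbers in the wdiff command above are ordinary ANSI SGR color codes, so you can preview a color combination before wiring it into wdiff. A quick sketch (assuming a terminal that interprets ANSI escapes):

```shell
# 30 = black foreground; 41 = red, 42 = green, 45 = magenta background;
# 0 = reset. These are the same sequences passed to wdiff via -w/-x/-y/-z.
printf '\033[30;41mold text\033[0m \033[30;42mnew text\033[0m\n'
printf '\033[30;45mmagenta variant\033[0m\n'
```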

What Others Think

Is this a joke? curl http://www.gutenberg.org/dirs/etext96/cprfd10.txt|awk -v RS='[^a-zA-Z0-9]' /./'{a[$1]++}END{for (i in a) print a[i], i|"sort -n"}'
point_to_null · 476 weeks and 2 days ago
I like the posted command better than the one by point_to_null. While point_to_null's is simpler and shorter, it does not strip out numbers and single characters. The download stats are nice, but not really an improvement.
jestin · 476 weeks and 2 days ago
@point_to_null: wow! I wouldn't have imagined this could be done with a command as short as yours. You must be a command-line guru.
alperyilmaz · 476 weeks and 2 days ago
Nice one! point_to_null gets points for doing it with fewer pipes. alperyilmaz gets points for using more tools though (in order: wget, sed, tr, tr, perl, grep, awk, sort, uniq, awk) showing people what piping really means.
bwoodacre · 476 weeks and 2 days ago
what the...
linuxrawkstar · 476 weeks and 2 days ago
...exactly. This doesn't make much sense. Any explanation why someone would want to count the words in this novel?
Alanceil · 476 weeks and 1 day ago
@Alanceil something like this could be modified to find occurrences of words or letters following each other to create a Markov chain modeled after a given text. That would be useful for a text generator if you needed one...
leon · 476 weeks ago
tag clouds, semantic analysis, you know, stuff like that.
mondotofu · 438 weeks and 3 days ago
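leon's Markov-chain suggestion only needs counts of word *pairs* rather than single words. A sketch of that bigram-counting variant, with sample text inlined instead of downloading the novel:

```shell
# Count adjacent word pairs (bigrams): awk remembers the previous word
# and prints each "prev current" pair, which sort | uniq -c then tallies.
# Pair frequencies like these are the raw material for a Markov-chain
# text generator.
printf 'the cat sat on the mat the cat ran\n' \
  | tr ' ' '\n' \
  | awk 'NR > 1 {print prev, $0} {prev = $0}' \
  | sort | uniq -c | sort -rn
```

Here "the cat" appears twice and tops the list; every other pair occurs once.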

