Fast grepping (avoiding UTF overhead)

export LANG=C; grep string longBigFile.log
Greps in the C locale, where input is treated as single-byte characters, skipping the overhead of matching multibyte UTF-8 characters. Some stats:

    $ export LANG=C; time grep -c Quit /var/log/mysqld.log
    7432
    real 0m0.191s
    user 0m0.112s
    sys 0m0.079s
    $ export LANG=en_US.UTF-8; time grep -c Quit /var/log/mysqld.log
    7432
    real 0m13.462s
    user 0m9.485s
    sys 0m3.977s

Try strace-ing grep with and without LANG=C.
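A minimal sketch for reproducing the comparison on your own machine. It assumes a POSIX-style shell with grep and mktemp available; the sample data and the helper name count_in_locale are made up for illustration. It sets LC_ALL rather than LANG because LC_ALL takes precedence over LANG and the other LC_* variables, so an LC_ALL already present in the environment cannot silently defeat the test. Whether any speed difference shows up depends on the grep version and distribution.

```shell
#!/bin/sh
# Hypothetical helper: count matching lines with grep under a given locale.
# $1 = locale name, $2 = pattern, $3 = file
count_in_locale() {
    LC_ALL=$1 grep -c -- "$2" "$3"
}

# Build a small throwaway sample file (illustrative data, not a real log).
sample=$(mktemp)
printf '%s\n' 'Connect' 'Quit' 'Query' 'Quit' > "$sample"

# Byte-oriented matching via the C locale.
c_count=$(count_in_locale C Quit "$sample")
echo "C locale: $c_count"

# To compare timings on a real log, prefix each run with time, e.g.:
#   time env LC_ALL=C grep -c Quit /var/log/mysqld.log
#   time env LC_ALL=en_US.UTF-8 grep -c Quit /var/log/mysqld.log
rm -f "$sample"
```

Run each timing more than once so the file is already in the page cache; otherwise the first run mostly measures disk reads.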

2009-07-14 12:48:02

These Might Interest You

  • This invokes tar on the remote machine and pipes the resulting tar stream over the network via ssh, where it is saved on the local machine. This is useful for making a one-off backup of a directory tree with zero storage overhead on the source. Variations include compressing on the source with 'tar czfp', or compressing at the destination via ssh user@host "cd dir; tar cfp - *" | gzip - > file.tar.gz

    ssh user@host "cd targetdir; tar cfp - *" | dd of=file.tar
    bwoodacre · 2009-03-18 07:43:22 3
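  • Purges every installed linux-* package whose name does not match the running kernel's version. The sed script keeps only installed packages with /^ii/!d, drops lines matching the current kernel version with /'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d, reduces each line to the package name with s/^[^ ]* [^ ]* \([^ ]*\).*/\1/, and drops names without a version number with /[0-9]/!d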

    dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d' | xargs sudo apt-get -y purge
    plasticdoc · 2009-06-19 10:11:00 0
  • I needed a way to find all files in a web directory that contain a certain string and replace that string with another. In the example, I search for "askapache" and replace it with "htaccess". I wanted this to run as a cron job, as fast as possible but without hogging the CPU, since the machine is a server. The whole command is therefore meant to run under nice at niceness 19 (prefix it with nice -n 19), the lowest scheduling priority. The -P5 option to xargs runs up to 5 grep processes and up to 5 sed processes at the same time, which is much faster than running them one after another. You may want -P0, which runs as many processes as possible, if you aren't worried about spawning too many or about background process killers. The -m1 option to grep stops reading a file after the first match, which also saves time.

    nice -n 19 sh -c 'S=askapache R=htaccess; find . -mount -type f|xargs -P5 -iFF grep -l -m1 "$S" FF|xargs -P5 -iFF sed -i -e "s%${S}%${R}%g" FF'
    AskApache · 2009-10-02 05:03:10 0
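  • Lists the installed linux headers, image, or modules packages whose names do not match the running kernel's version. The sed script keeps only installed packages with /^ii/!d, drops lines matching the current kernel version with /'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d, reduces each line to the package name with s/^[^ ]* [^ ]* \([^ ]*\).*/\1/, and drops names without a version number with /[0-9]/!d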

    dpkg -l 'linux-*' | sed '/^ii/!d;/'"$(uname -r | sed "s/\(.*\)-\([^0-9]\+\)/\1/")"'/d;s/^[^ ]* [^ ]* \([^ ]*\).*/\1/;/[0-9]/!d'
    plasticdoc · 2009-06-19 10:23:38 1
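The search-and-replace pipeline from the AskApache entry above breaks on filenames containing spaces, because find prints one name per line and xargs splits on whitespace. A hedged sketch of the same idea with NUL-delimited names follows. It assumes GNU find, grep, sed and xargs (-print0, -Z, -i, -0 and -r are not all portable), and the temporary directory and file names are made up for illustration.

```shell
#!/bin/sh
# Search and replacement strings, taken from the entry above.
S=askapache
R=htaccess

# Throwaway directory with sample files (hypothetical data).
dir=$(mktemp -d)
printf 'visit askapache today\n' > "$dir/a.txt"
printf 'nothing to see\n' > "$dir/b.txt"
printf 'askapache again\n' > "$dir/name with space.txt"

# find emits NUL-terminated names; grep -lZ prints only the files that
# contain $S (also NUL-terminated); sed -i rewrites just those files.
# -r skips the command when there is no input; -n 64 caps the files per
# batch so that -P5 can run up to 5 batches in parallel on large trees.
find "$dir" -type f -print0 |
    xargs -0 -r -n 64 -P5 grep -lZ -- "$S" |
    xargs -0 -r -n 64 -P5 sed -i -e "s%${S}%${R}%g"

# Capture the results, then clean up.
result_a=$(cat "$dir/a.txt")
result_b=$(cat "$dir/b.txt")
result_c=$(cat "$dir/name with space.txt")
rm -rf "$dir"
echo "$result_a"
```

Passing many files to each grep and sed invocation also means far fewer processes are started than with one process per file.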

What Others Think

Tried, saw no effect.
penpen · 466 weeks and 5 days ago
No difference here either in Fedora 11, or Ubuntu 9.04.
flatcap · 466 weeks and 5 days ago
Maybe you're still in LANG=C ;) Try setting another LANG and grep.
ioggstream · 466 weeks and 5 days ago
I tried setting LANG to both values, and once the cache is hot I see no difference under Ubuntu.
bwoodacre · 466 weeks and 4 days ago
Test made on RHEL. The same applies to many *nixes. It seems fixed on Ubuntu 9.04.
ioggstream · 466 weeks and 4 days ago
You don't actually have to do the export. If you remove the export and the semicolon around LANG=C, the LANG environment variable is set to C only for as long as the grep command runs:

    echo $LANG gives en_US.utf8
    LANG=C grep 'foo' /var/log/whatever.log.0 runs in a quicker mode on some distros
    echo $LANG still gives en_US.utf8
coffeeaddict_nl · 462 weeks and 5 days ago
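The per-command assignment described in the comment above can be checked with a short sketch. It assumes a POSIX-style shell; the input text and the variable names before, after and matches are illustrative only.

```shell
#!/bin/sh
# Remember the shell's current setting (empty if LANG is unset).
before=${LANG-}

# The assignment prefix applies to this single grep process only.
matches=$(printf 'foo\nbar\nfoo\n' | LANG=C grep -c foo)

# The shell's own LANG is untouched afterwards.
after=${LANG-}

echo "matches=$matches"
if [ "$before" = "$after" ]; then
    echo "LANG unchanged in the calling shell"
fi
```

If LC_ALL is set in the environment it overrides LANG, so in that case LC_ALL=C grep is the form that actually changes grep's locale.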
