List all duplicate directories

find . -type d| while read i; do echo $(ls -1 "$i"|wc -m) $(du -s "$i"); done|sort -s -n -k1,1 -k2,2 |awk -F'[ \t]+' '{ idx=$1$2; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}'
Very quick! It relies only on the content sizes and the character counts of filenames. If both numbers are equal, then two (or more) directories are most likely identical. If in doubt, run:

diff -rq path_to_dir1 path_to_dir2

AWK function taken from here:
Sample Output
215 1320 ./hwebcam048.kopija
215 1320 ./hwebcam048
24 16 ./ac3dlx/lib/tk8.5/ttk/CVS
24 16 ./ac3dlx/lib/tk8.4/CVS
24 16 ./ac3dlx/tcl/CVS
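
Since the two numbers only flag candidates, each pair in the output still needs the diff -rq check mentioned above. A minimal sketch of that check as a shell function follows; the name same_dir is hypothetical and not part of the original command.

```shell
# same_dir DIR1 DIR2
# Succeeds (exit status 0) only when diff -rq finds no differences
# between the two directory trees. Any report from diff is discarded;
# only the exit status is used.
same_dir() {
    diff -rq -- "$1" "$2" >/dev/null
}

# Usage, with two candidates taken from the sample output:
#   same_dir ./hwebcam048 ./hwebcam048.kopija && echo "identical"
```

Note that diff compares file contents and names, not ownership, permissions or timestamps.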

By: knoppix5
2014-02-25 22:50:09

What Others Think

*flatcap shrieks in terror*

After looking at the command for five minutes, I ran it on my work directory. There are no duplicate dirs in my work area, but your command says otherwise. Most of the false dupes were .git dirs.

Problem 1: Counting the number of chars in filenames (in the root) is a very poor measure of "sameness". My git repos all look the same and so do many of my test dirs (example.c Makefile). Risk of false positives (high).

Problem 2: "du -s" is bad in both directions. It will match different dirs of the same size AND it WON'T match identical dirs in some circumstances. Create a directory with 1000 files in it, then delete 999 of them. Now copy that directory. Most likely "du -s dir_one dir_two" will show different sizes. Risk of false positives (high). Risk of false negatives (medium).

Problem 3: awk uses $1$2 as its index. This means that $1=1 $2=23 will match $1=12 $2=3. This is unlikely to be seen in sorted numbers, but it is possible. Risk of false positives (very low).

Now to the commands. The sort command can be simplified:

sort -sn -k1,2

You're echoing the results of ls and du, therefore the numbers will be space-delimited. So a plain space as the field separator is OK:

awk -F' '

I mentioned the index being risky, so this is safer:

{ idx=$1"."$2;

A simple dot to separate the two numbers. awk doesn't care about the type of idx.

Now the rest of the awk program. It looks like it was copied from the awk one-liner to "uniq" the input. It's storing every line in array, when it only needs to keep the previous one (the input is sorted).

{ new=$1"."$2; if (new == old) { if (oldline) { print oldline; oldline = ""; } print; } else { old = new; oldline = $0; } }

This does the same thing, but using less memory.

new = numbers from current line
old = numbers from previous line (empty to start)
oldline = copy of previous entire line (in case the new one matches)

To save space :-) it can be condensed to:

{n=$1"."$2;if(n==o){if(l){print l;l="";}print;}else{o=n;l=$0;}}

Leaving the new command:

find . -type d|while read i; do echo $(ls -1 "$i"|wc -m) $(du -s "$i"); done|sort -sn -k1,2|awk -F' ' '{n=$1"."$2;if(n==o){if(l){print l;l="";}print;}else{o=n;l=$0;}}'

Enjoy :-)
flatcap · 394 weeks and 3 days ago
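
The index collision described in Problem 3 is easy to reproduce. The sketch below uses two made-up input lines (the paths ./a and ./b are placeholders) whose first two fields differ but concatenate to the same string.

```shell
# Two different (character count, size) pairs: 1/23 and 12/3.
sample='1 23 ./a
12 3 ./b'

# Key built as in the original command: both lines yield "123".
printf '%s\n' "$sample" | awk '{ print $1$2 }'

# Key with a dot between the fields: "1.23" and "12.3" stay distinct.
printf '%s\n' "$sample" | awk '{ print $1"."$2 }'
```

Any separator that cannot appear inside the numbers would work equally well here; the dot is simply the one suggested in the comment.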
Thank you, flatcap, for your constructive criticism. I noticed that there was no command for finding duplicate directories. The command above only gives a rough starting point for picking candidates, which can then be tested more accurately with:

diff -rq path_to_dir1 path_to_dir2

This helps when disk space is about to run out and replacing duplicates with hard links is being considered.
knoppix5 · 394 weeks and 3 days ago
I had a good think about how I'd solve the problem. But I didn't come up with any reliable solutions that didn't involve a lot of processing power. I'll keep thinking...
flatcap · 394 weeks and 3 days ago
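
For comparison, here is one processing-heavy approach of the kind mentioned above. It is a sketch under stated assumptions, not something proposed in this thread: it assumes GNU sha256sum is available, and the function names dir_fingerprint and list_duplicate_dirs are hypothetical. Each directory is fingerprinted by hashing the sorted list of (checksum, relative path) pairs of all regular files below it, so every file is read in full.

```shell
# dir_fingerprint DIR
# Print one hash summarising the relative paths and contents of all
# regular files below DIR. Metadata, symlinks and empty subdirectories
# are ignored, so all directories without regular files share one value.
dir_fingerprint() {
    ( cd "$1" &&
      find . -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d' ' -f1 )
}

# list_duplicate_dirs [ROOT]
# Print "fingerprint path" for every directory under ROOT whose
# fingerprint is shared with at least one other directory.
list_duplicate_dirs() {
    find "${1:-.}" -type d | while IFS= read -r d; do
        printf '%s %s\n' "$(dir_fingerprint "$d")" "$d"
    done | LC_ALL=C sort | awk '
        $1 == prev { if (held != "") { print held; held = "" } print; next }
        { prev = $1; held = $0 }'
}

# Usage: list_duplicate_dirs .
```

Files are re-hashed once for every ancestor directory, so this is far slower than the original command. Subdirectories of two identical directories are reported as duplicates as well.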
Any filesystem property that can be expressed as a number will do. What about:

time ...echo $(ls -1 "$i"|wc -m)...|wc -l
159
real 0m1.104s
user 0m0.088s
sys 0m0.112s

time ...echo $(ls "$i"| tee >(egrep -c a)...|wc -l
109
real 0m2.105s
user 0m0.060s
sys 0m0.144s

109 lines of output vs 159, so the risk of false positives is lower(?).
knoppix5 · 394 weeks and 2 days ago
Actually:

echo $(ls "$i"| tee >(egrep -c a) >(egrep -c e)|tail -2|tr -d '\n')
knoppix5 · 394 weeks and 2 days ago
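
In the same spirit of using a cheap numeric property of the file names, a checksum of the name list is another option. This is an assumption of this sketch rather than something tested in the thread: the helper name names_key is hypothetical, and it uses ls -A, so dot files are included, unlike the plain ls in the original command.

```shell
# names_key DIR
# Print the CRC (from POSIX cksum) of the directory's entry names,
# one per line, in C-locale order. Directories whose names merely have
# the same total length get different keys, unlike with "wc -m".
names_key() {
    ls -1A -- "$1" | LC_ALL=C sort | cksum | cut -d' ' -f1
}

# Usage inside the loop of the original command, in place of the
# "ls -1 ... | wc -m" part:
#   echo $(names_key "$i") $(du -s "$i")
```

This still looks only at the names directly inside each directory, so identical layouts such as .git dirs would keep matching.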
