Commands tagged duplicate files (3)

  • Finds duplicates based on MD5 sum. Compares only files with the same size. Performance improvements on: find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate The new version takes around 3 seconds where the old version took around 17 minutes. The bottle neck in the old command was the second find. It searches for the files with the specified file size. The new version keeps the file path and size from the beginning.


    2
    find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate
    fobos3 · 2014-10-19 02:00:55 10
  • * Find all file sizes and file names from the current directory down (replace "." with a target directory as needed). * sort the file sizes in numeric order * List only the duplicated file sizes * drop the file sizes so there are simply a list of files (retain order) * calculate md5sums on all of the files * replace the first instance of two spaces (md5sum output) with a \0 * drop the unique md5sums so only duplicate files remain listed * Use AWK to aggregate identical files on one line. * Remove the blank line from the beginning (This was done more efficiently by putting another "IF" into the AWK command, but then the whole line exceeded the 255 char limit). >>>> Each output line contains the md5sum and then all of the files that have that identical md5sum. All fields are \0 delimited. All records are \n delimited.


    0
    find . -type f -not -empty -printf "%-25s%p\n"|sort -n|uniq -D -w25|cut -b26-|xargs -d"\n" -n1 md5sum|sed "s/ /\x0/"|uniq -D -w32|awk -F"\0" 'BEGIN{l="";}{if(l!=$1||l==""){printf "\n%s\0",$1}printf "\0%s",$2;l=$1}END{printf "\n"}'|sed "/^$/d"
    alafrosty · 2013-10-22 13:34:19 7
  • To allow recursivity : find -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41- Display only filenames : find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c43- Show Sample Output


    0
    find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41-
    MaDCOw · 2017-02-09 11:36:31 18

What's this?

commandlinefu.com is the place to record those command-line gems that you return to again and again. That way others can gain from your CLI wisdom and you from theirs too. All commands can be commented on, discussed and voted up or down.

Share Your Commands


Check These Out

Replicate a directory structure dropping the files

diff files while disregarding indentation and trailing white space
**NOTE** Tekhne's alternative is much more succinct and its output conforms to the files actual contents rather than with white space removed My command on the other hand uses bash process substitution (and "Minimal" Perl), instead of files, to first remove leading and trailing white space from lines, before diff'ing the streams. Very useful when differences in indentation, such as in programming source code files, may be irrelevant

find out how many days since given date
You can also do this for seconds, minutes, hours, etc... Can't use dates before the epoch, though.

Graphically show percent of mount space used
Automatically drops mount points that have non-numeric sizes (e.g. /proc). Tested in bash on Linux and AIX.

Adequately order the page numbers to print a booklet
Useful if you don't have at hand the ability to automatically create a booklet, but still want to. F is the number of pages to print. It *must* be a multiple of 4; append extra blank pages if needed. In evince, these are the steps to print it, adapted from https://help.gnome.org/users/evince/stable/duplex-npage.html.en : 1) Click File ▸ Print. 2) Choose the General tab. Under Range, choose Pages. Type the numbers of the pages in this order (this is what this one-liner does for you): n, 1, 2, n-1, n-2, 3, 4, n-3, n-4, 5, 6, n-5, n-6, 7, 8, n-7, n-8, 9, 10, n-9, n-10, 11, 12, n-11... ...until you have typed n-number of pages. 3) Choose the Page Setup tab. - Assuming a duplex printer: Under Layout, in the Two-side menu, select Short Edge (Flip). - If you can only print on one side, you have to print twice, one for the odd pages and one for the even pages. In the Pages per side option, select 2. In the Page ordering menu, select Left to right. 4) Click Print.

Sort files by date
Show you the list of files of current directory sorted by date youngest to oldest, remove the 'r' if you want it in the otherway.

Convert CSV to JSON
Replace 'csv_file.csv' with your filename.

reverse-i-search: Search through your command line history
"What it actually shows is going to be dependent on the commands you've previously entered. When you do this, bash looks for the last command that you entered that contains the substring "ls", in my case that was "lsof ...". If the command that bash finds is what you're looking for, just hit Enter to execute it. You can also edit the command to suit your current needs before executing it (use the left and right arrow keys to move through it). If you're looking for a different command, hit Ctrl+R again to find a matching command further back in the command history. You can also continue to type a longer substring to refine the search, since searching is incremental. Note that the substring you enter is searched for throughout the command, not just at the beginning of the command." - http://www.linuxjournal.com/content/using-bash-history-more-efficiently

Get the list of local files that changed since their last upload in an S3 bucket
Can be useful to granulary flush files in a CDN after they've been changed in the S3 bucket.

print DateTimeOriginal from EXIF data for all files in folder
see output from `identify -verbose` for other keywords to filter for (e.g. date:create, exif:DateTime, EXIF:ExifOffset).


Stay in the loop…

Follow the Tweets.

Every new command is wrapped in a tweet and posted to Twitter. Following the stream is a great way of staying abreast of the latest commands. For the more discerning, there are Twitter accounts for commands that get a minimum of 3 and 10 votes - that way only the great commands get tweeted.

» http://twitter.com/commandlinefu
» http://twitter.com/commandlinefu3
» http://twitter.com/commandlinefu10

Subscribe to the feeds.

Use your favourite RSS aggregator to stay in touch with the latest commands. There are feeds mirroring the 3 Twitter streams as well as for virtually every other subset (users, tags, functions,…):

Subscribe to the feed for: