Commands tagged duplicate files (3)

  • Finds duplicates based on MD5 sum. Compares only files with the same size. Performance improvements on: find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate The new version takes around 3 seconds where the old version took around 17 minutes. The bottle neck in the old command was the second find. It searches for the files with the specified file size. The new version keeps the file path and size from the beginning.


    2
    find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate
    fobos3 · 2014-10-19 02:00:55 1
  • * Find all file sizes and file names from the current directory down (replace "." with a target directory as needed). * sort the file sizes in numeric order * List only the duplicated file sizes * drop the file sizes so there are simply a list of files (retain order) * calculate md5sums on all of the files * replace the first instance of two spaces (md5sum output) with a \0 * drop the unique md5sums so only duplicate files remain listed * Use AWK to aggregate identical files on one line. * Remove the blank line from the beginning (This was done more efficiently by putting another "IF" into the AWK command, but then the whole line exceeded the 255 char limit). >>>> Each output line contains the md5sum and then all of the files that have that identical md5sum. All fields are \0 delimited. All records are \n delimited.


    0
    find . -type f -not -empty -printf "%-25s%p\n"|sort -n|uniq -D -w25|cut -b26-|xargs -d"\n" -n1 md5sum|sed "s/ /\x0/"|uniq -D -w32|awk -F"\0" 'BEGIN{l="";}{if(l!=$1||l==""){printf "\n%s\0",$1}printf "\0%s",$2;l=$1}END{printf "\n"}'|sed "/^$/d"
    alafrosty · 2013-10-22 13:34:19 0
  • To allow recursivity : find -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41- Display only filenames : find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c43- Show Sample Output


    0
    find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41-
    MaDCOw · 2017-02-09 11:36:31 0

What's this?

commandlinefu.com is the place to record those command-line gems that you return to again and again. That way others can gain from your CLI wisdom and you from theirs too. All commands can be commented on, discussed and voted up or down.

Share Your Commands


Check These Out

Output system statistics every 5 seconds with timestamp
See man vmstat for information about the statistics. This does the same thing without the timestamp: $vmstat 5

Slow Down Command Output
( Or $ ls -lat|lolcat -a if you like it in technicolor - apt install lolcat if needed )

Use the builtin ':' bash command to increment variables
I just found another use for the builtin ':' bash command. It increments counters for me in a loop if a certain condition is met... : [arguments] No effect; the command does nothing beyond expanding arguments and performing any specified redirections. A zero exit code is returned.

Mount a Windows share on the local network (Ubuntu) with user rights and use a specific samba user

Start screen with name and run command
Runs an instance of screen with name of "name_me" and command of "echo "hi"" To reconnect to screen instance later use: screen -r name_me

Quick and dirty RSS
runs an rss feed through sed replacing the closing tags with newlines and the opening tags with white space making it readable.

Prevent shell autologout
Unset TMOUT or set it to 0 in order to prevent shell autologout. TMOUT is the number of seconds after which the present shell will be killed if it has been idle for that long.

Press enter and take a WebCam picture.
This command takes a 1280x1024 p picture from the webcam. If prefer it smaller, try changing the -s parameter: qqvga is the tiniest, vga is 640x480, svga is 800x600 and so on. Get your smile on and press enter! :)

delete all leading and trailing whitespace from each line in file

Find duplicate UID in /etc/passwd
You can use only awk


Stay in the loop…

Follow the Tweets.

Every new command is wrapped in a tweet and posted to Twitter. Following the stream is a great way of staying abreast of the latest commands. For the more discerning, there are Twitter accounts for commands that get a minimum of 3 and 10 votes - that way only the great commands get tweeted.

» http://twitter.com/commandlinefu
» http://twitter.com/commandlinefu3
» http://twitter.com/commandlinefu10

Subscribe to the feeds.

Use your favourite RSS aggregator to stay in touch with the latest commands. There are feeds mirroring the 3 Twitter streams as well as for virtually every other subset (users, tags, functions,…):

Subscribe to the feed for: