This can be much faster than downloading one or both trees to a common server and comparing the files there. Afterwards, only the files that differ need to be copied down for deeper comparison.
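A minimal sketch of the idea (the host names and paths are placeholders, not part of the original entry): hash each tree where it lives, diff the two listings locally, then pull down only the files that differ.

ssh hostA 'cd /data/tree && find . -type f -exec md5sum {} +' | sort -k2 > a.md5
ssh hostB 'cd /data/tree && find . -type f -exec md5sum {} +' | sort -k2 > b.md5
diff a.md5 b.md5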
This is mostly for my own notes, but this command will compute an md5 message digest from the command line. You can also replace md5sum with other checksum commands (e.g., sha1sum).
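A couple of hedged examples (the string and file name are placeholders):

echo -n "hello world" | md5sum    # digest of a string (-n avoids hashing the trailing newline)
md5sum somefile.iso               # digest of a file
sha1sum somefile.iso              # same idea with a different digest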
An improvement of the command "Find Duplicate Files (based on size first, then MD5 hash)" for searching for duplicate files in a directory containing a Subversion working copy. This way the (multiple) duplicates in the meta-information directories are ignored.
Can easily be adapted for other VCSs as well. For CVS, for example, change ".svn" into "CVS":
find -type d -name "CVS" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name "CVS" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
an alternative
Generates the md5 hash without the trailing " -" and with the output "broken" into pairs of hex digits.
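A hedged sketch of one way to get that output (the file name is a placeholder; the actual entry may do it differently):

md5sum somefile | awk '{print $1}' | sed 's/../& /g; s/ $//'    # keep only the hash, space after every two hex chars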
md5 checksum check
Same result with a simpler regular expression.
Extracts the path to each md5 checksum file, then, for each path, cds to it, checks the md5sums, and cds back ("cd -") to the starting directory. The grep at the end removes cd's chatter about the current directory.
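A hedged sketch of that loop (the *.md5 glob is an assumption about how the checksum files are named):

find . -name '*.md5' | while read -r f; do
  cd "$(dirname "$f")"           # go to where the checksum file lives
  md5sum -c "$(basename "$f")"   # verify the files it lists
  cd -                           # back to the starting directory (prints the path)
done | grep -v '^/'              # drop the paths that "cd -" chatters out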
This is a modified version of the OP, wrapped into a bash function. This version handles newlines and other whitespace correctly, the original has problems with the thankfully rare case of newlines in the file names. It also allows checking an arbitrary number of directories against each other, which is nice when the directories that you think might have duplicates don't have a convenient common ancestor directory.
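The poster's function isn't reproduced here, so this is only a minimal sketch of the same idea (the function name is made up): it takes any number of directories, and the NUL-delimited find/xargs keeps odd file names from breaking the pipeline.

finddupes() {
  # hash every file under the given directories, then print groups of
  # entries whose first 32 characters (the md5) match
  find "$@" -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
}
# usage: finddupes ~/photos /mnt/backup/photos /srv/dump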
Replace "user/sbin/sshd" with the file you would like to check. If you are doing this due to intrusion, you obviously would want to check size, last modification date and md5 of the md5sum application itself. Also, note that "/var/lib/dpkg/info/*.md5sums" files might have been tampered with themselves. Neither to say, this is a useful command. Show Sample Output
* Find all file sizes and file names from the current directory down (replace "." with a target directory as needed).
* Sort the file sizes in numeric order.
* List only the duplicated file sizes.
* Drop the file sizes so there is simply a list of files (retaining order).
* Calculate md5sums on all of the files.
* Replace the first instance of two spaces (md5sum output) with a \0.
* Drop the unique md5sums so only duplicate files remain listed.
* Use AWK to aggregate identical files on one line.
* Remove the blank line from the beginning (this was done more efficiently by putting another "IF" into the AWK command, but then the whole line exceeded the 255-char limit).

Each output line contains the md5sum and then all of the files that have that identical md5sum. All fields are \0 delimited. All records are \n delimited.
To allow recursion:
find -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41-
Display only filenames:
find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c43-
stat -c %n: list the files (a find command would also work). tee: pass stdout along while also resending it to the next command; you can chain tee ad infinitum. xargs: use the file names to run the md5 and sha digests.
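A hedged sketch of that chain (output file names are made up, the plain xargs assumes file names without spaces, and the >( ) process substitution needs bash):

stat -c %n * | tee >(xargs md5sum > md5sums.txt) | xargs sha1sum > sha1sums.txt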
This example is taken from Cygwin running on Win7Ent-64. Device names will vary by platform. Both commands resulted in identical files per the output of md5sum, and ran in the same time down to the second (2m45s), less than 100ms apart. I timed the commands with 'time', which, added before 'dd' or 'readom', gives the execution time after the command completes. See 'man time' for more info; it can be found on any Unix or Linux newer than 1973, which means everywhere.

readom is supposed to guarantee good reads and supports flags for bypassing bad blocks, where dd will either fail or hang. readom's verbosity gave more interesting output than dd's. On Cygwin, my attempt with 'readom' from the first answer actually ended up reading my hard drive; both attempts got to 5GB before I killed them, seeing as that is past any CD or standard DVD.

dd: 'bs=1M' says "read 1MB into RAM from the source, then write that 1MB to the output." I also tested 10MB, which shaved the time down to 2m42s. 'if=/dev/scd0' selects Cygwin's representation of the first CD-ROM drive. 'of=./filename.iso' simply means "create filename.iso in the current directory."

readom: '-v' says "be a little noisy (verbose)"; the man page implies more verbosity with more 'v's, e.g. -vvv. dev='D:' in Cygwin explicitly specifies the D: drive; I tried other entries, like '/dev/scd0' and '2,0', but both read from my hard drive instead of the CD-ROM. I imagine my LUN-foo (2,0) was off for my system, but on Cygwin 'D:' sort of "cut to the chase" and did the job. f='./filename.iso' specifies the output file. speed=2 simply sets the speed at which the CD is read; I also tried 4, which ran in the exact same 2m45s. retries=8 means try reading a block up to 8 times before giving up; this is useful for damaged media (scratches, glue lines, etc.), allowing you to automatically "get everything that can be copied" so you at least have most of the data.
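A hedged reconstruction of the two commands being compared, pieced together from the options described above (the device and file names are the ones mentioned; yours will differ):

time dd if=/dev/scd0 of=./filename.iso bs=1M
time readom -v dev='D:' f='./filename.iso' speed=2 retries=8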
Nothing special about the hashing here; only the use of cut, I think, might result in fewer keystrokes.
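For instance, a hedged guess at the cut variant (the file name is a placeholder):

md5sum somefile | cut -d' ' -f1    # keep just the hash, drop the file name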