Re: Searching a drive and copying files



in message <336A5DA6-5A43-44C0-8961-139C81702AB3@xxxxxxxxxxxxxxxxx>,
wrote Joshua Lewis thusly...

I need to search my drive for all pictures on my system and copy
them to a networked system using sftp or ssh or what not. There
will be duplicate names on the drive so I was hoping to have dups
placed in a separate folder.

Unison, net/unison port, should be able to handle the duplicates
based on file checksum. (I personally have not used it much, so i
cannot answer any other queried about it; refer to its fine man
page.)


Due to my for lack of a better term stupidity when I first got
my camera I will probably have instances when there will be three
or four duplicates. can help me out with that it would be great.
...
My goal is to find all my pictures and compare them then delete
the dups that don't look that good. A daunting task as I have 20
GB of data. I bet 10 GB are dups.

A checksum-based management of duplicates will help with the files
with identical contents, but not with files that differ even a bit.

Perl program below -- a modified version of Randal Schwartz's
version[0] -- uses md5(1) to identify duplicates (as in identical
files), failing that, Image::Magick based on fuzz factor. When it
finds duplicates, it asks to enter the item number from the file
list to be deleted.

[0] Article "Finding similar images",
http://www.stonehenge.com/merlyn/LinuxMag/col50.html


To be able to run, it needs Image::Magick (graphics/ImageMagick
port), Cache::FileCache (devel/p5-Cache-Cache), List::Util
(lang/p5-Scalar-List-Utils), File::Copy & File::Path.

Mind that it, rather Image::Magick, may consume all of your memory
and/or temporary fs if you run it on all the files at once.

If you are good in Perl, you could modify the program to move the
duplicates in a directory (instead of deleting), and possibly not to
ask to take the particular action (if as you say you would have a
boat load of duplicates).

Without further interruptions, program follows ...

#!perl

# This is a modified version of Randal Schwartz's ...
#
# http://www.stonehenge.com/merlyn/LinuxMag/col50.html
#
# ... as it uses checksum (MD5 for now) to detect identical files, failing that
# uses Image::Magick.

use warnings; use strict;

$|++;

use Image::Magick;
use Cache::FileCache;
use File::Copy qw( move );
use File::Path qw( mkpath );
use List::Util qw( reduce );

use Carp qw(carp);

use Getopt::Long qw( :config gnu_compat no_ignore_case no_debug );

# User option; permitted average deviation in the vector elements.
my $fuzz = 15;

# User option; if defined, rename corrupt images into this dir.
my $corrupt_dir = "CORRUPT";
{
my $usage;
GetOptions
(
'h|usage|help' => \$usage
, 'f|fuzz=i' => \$fuzz
, 'c|corrupt=s' => \$corrupt_dir
, 'nc|nocorrupt' => sub { undef $corrupt_dir; }
)
or usage( 1 );

usage( 0 ) if $usage;

# Check if any arguments remain which will be file names
usage( 1, "No file(s) or directory(ies) given." ) unless scalar @ARGV;
}

sub warnif;

my $cache = Cache::FileCache->new
( {
namespace => 'image.cache'
, cache_root => ( glob( "~/log/misc" ) )[ 0 ]
}
);

my @buckets;

FILE: while ( @ARGV )
{
my $file = shift;
next FILE if -l $file;
if ( -d $file )
{
opendir DIR, $file or next FILE;
unshift @ARGV, map { m/^\./ ? () : "$file/$_"; } sort readdir DIR;
next FILE;
}

next FILE unless -f _ or -d _;

my ( @stat ) = stat _ or die "should not happen: $!";

# dev/ino/mtime
my $key = "@stat[ 0, 1, 9 ]";

my @vector;

#print "$file ";
if ( my $data = $cache->get( $key ) )
{
#print "... is cached\n";
@vector = @$data;
}
else
{
my $image = Image::Magick->new;
if ( my $x = $image->Read( $file ) )
{
if ( defined $corrupt_dir and $x =~ m/corrupt|unexpected end-of-file/i )
{
print "$file ";
print "... renaming into $corrupt_dir\n";

-d $corrupt_dir
or mkpath $corrupt_dir, 0, 0700
or die "Cannot mkpath $corrupt_dir: $!";

move $file, $corrupt_dir or warn "Cannot rename: $!";
}
else
{
print "$file ";
print "... skipping ( $x )\n";
}
next FILE;
}

#print "is ", join( "x", $image->Get( 'width', 'height' ) ), "\n";
warnif $image->Normalize();
warnif $image->Resize( geometry => '4x4!' );
warnif $image->Set( magick => 'rgb' );
@vector = unpack "C*", $image->ImageToBlob();
$cache->set( $key, [ @vector ] );
}
BUCKET: for my $bucket ( @buckets )
{
my $error = 0;
INDEX: for my $index ( 0 .. $#vector )
{
$error += abs( $bucket->[ 0 ][ $index ] - $vector[ $index ] );
next BUCKET if $error > $fuzz * @vector;
}
push @$bucket, $file;

#print "linked ", join( ", ", @$bucket[ 1 .. $#$bucket ] ), "\n";
next FILE;
}
push @buckets, [ [ @vector ], $file ];
}

# Connect images only, no interactive process
#exit;

for my $bucket ( @buckets )
{
my @names = @$bucket;
shift @names; # first element is vector
next unless @names > 1; # skip unique images
my $images = Image::Magick->new;
$images->Read( @names );

compare_as_text( $images );

my $sums = collect_md5sum( $images );
{
# Silence warning about single use of $b.
no warnings 'once';
compare_as_image( $images )
unless reduce { $a eq $b ? $a : 0 } @$sums;
}

print "Delete? [picture number] ";

my $img_count = scalar @{ $images };
my @dead;
chomp( my $dead = <STDIN> );
@dead =
$dead =~ m/^ \s* [*+] $/x ? ( 1 .. $img_count )
: $dead =~ m/^ \s* - \d+ $/x ? ( $img_count + $dead + 1 .. $img_count )
: grep { $_ >= 1 and $_ <= $img_count } $dead =~ /(\d+)/g;

for ( @dead )
{
my $dead_name = $images->[ $_ - 1 ]->Get( 'base-filename' );

warn "rm $dead_name\n";
unlink $dead_name or warn "Cannot rm $dead_name: $!";
warn "\n";
}
}

sub compare_as_text
{
my $images = shift;

my $frmt = "%d: %s\n -- %dx%d %0.3f kB\n";

foreach my $img ( 0 .. scalar @$images - 1 )
{
printf $frmt , ( $img + 1 ), $images->[ $img ]->Get( 'base-filename' )
, $images->[ $img ]->Get( 'width' ), $images->[ $img ]->Get( 'height' )
, ( $images->[ $img ]->Get( 'filesize' ) / 1024 )
;
}
}

sub collect_md5sum
{
my $images = shift;
my @md5;
foreach ( 0 .. scalar @$images - 1 )
{
my $name = $images->[ $_ ]->Get( 'base-filename' );
push @md5, ( split ' ', qx/ md5 $name / )[ 3 ];
}
return [ @md5 ];
}

sub compare_as_image
{
my $images = shift;
my $montage =
$images->Montage
( geometry => '370x500' , tile => '2x2' , label => "[%p] %i %wx%h %b" );

print "processing...\n";
$montage->Display();
}

sub warnif
{
my $value = shift;
carp $value if $value;
}

sub usage
{
my ( $exit, $message ) = @_;

print STDERR $message, "\n" if $exit && $message;

my $old_fd = select( $exit == 0 ? \*STDOUT : \*STDERR );
print <<"_USAGE_";
similar-image - Keep|Delete similar looking images

similar-image [ -fuzz <avg dev> ]
[ -corrupt <directory> | -nocorrupt ]
< files directories >

This program takes the following options ...

-f | -fuzz Permitted average deviation in the vector
elements; (set value: $fuzz).

-c | -corrupt Move corrupt images into this directory; (set
value: $corrupt_dir).

-nc | -nocorrupt Do not define a corrupted-image directory (so that
files are not moved).
_USAGE_

select $old_fd;
exit( $exit );
}
__END__


- Parv

--

_______________________________________________
freebsd-questions@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscribe@xxxxxxxxxxx"



Relevant Pages

  • Re: Umlaut/accented character problem with perl/Tk
    ... sub get_encode_file{ ... my $filename = shift; ... However, the images were not ...
    (comp.lang.perl.tk)
  • Re: simplify this if loop
    ... > shift @group; ... > there is duplicates and I feel there is better way to write a faster ... > more cleaner code. ... sub cal_avg { ...
    (comp.lang.perl.misc)
  • Re: Having Problem With Access Developers Handbook Form Scaling Tool
    ... frmImage now gets the focus when it is initially displayed. ... The table has three small bitmap images in it. ... >> Private Sub Form_Open ... On Error GoTo HandleErrors ...
    (microsoft.public.access.formscoding)
  • Re: Page Break Question
    ... Sub In_Line ... Options.PictureWrapType = wdWrapMergeFront ... Word MVP web sitehttp://word.mvps.org ... can paste more images into the document. ...
    (microsoft.public.word.docmanagement)
  • Re: More elegant code
    ... I just didn't move duplicates but it would be easy to add a -something after ... It's an image database. ... BrowseForSourceFolder creates the New Images list. ... folder is browsed to to find the collection folder for the project. ...
    (microsoft.public.access.formscoding)

Quantcast