Question

You sometimes hear it said about Perl that there might be 6 different ways to approach the same problem. Good Perl developers usually have well-reasoned insights for making choices between the various possible methods of implementation.

So an example Perl problem:

A simple script which recursively iterates through a directory structure, looking for files which were modified recently (after a certain date, which would be variable). Save the results to a file.

The question, for Perl developers: What is your best way to accomplish this?

Solution

This sounds like a job for File::Find::Rule:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;  # Causes built-ins like open to succeed or die.
              # You can 'use Fatal qw(open)' if autodie is not installed.

use File::Find::Rule;
use Getopt::Std;

use constant SECONDS_IN_DAY => 24 * 60 * 60;

our %option = (
    m => 1,        # -m switch: days ago modified, defaults to 1
    o => undef,    # -o switch: output file, defaults to STDOUT
);

getopts('m:o:', \%option);

# If we haven't been given directories to search, default to the
# current working directory.

if (not @ARGV) {
    @ARGV = ( '.' );
}

print STDERR "Finding files changed in the last $option{m} day(s)\n";


# Convert our time in days into a timestamp in seconds from the epoch.
my $last_modified_timestamp = time() - SECONDS_IN_DAY * $option{m};

# Now find all the regular files, which have been modified in the last
# $option{m} days, looking in all the locations specified in
# @ARGV (our remaining command line arguments).

my @files = File::Find::Rule->file()
                            ->mtime(">= $last_modified_timestamp")
                            ->in(@ARGV);

# $out_fh will store the filehandle where we send the file list.
# It defaults to STDOUT.

my $out_fh = \*STDOUT;

if ($option{o}) {
    open($out_fh, '>', $option{o});
}

# Print our results.

print {$out_fh} join("\n", @files), "\n";
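Assuming the script above is saved as recent_files.pl (a name invented for this example), a typical run might be "perl recent_files.pl -m 7 -o recent.txt /var/www /home", which lists files modified in the last week under both trees and writes the list to recent.txt.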

OTHER TIPS

Where the problem is solved mainly by standard libraries, use them.

File::Find in this case works nicely.

There may be many ways to do things in Perl, but where a standard library exists to do something, it should be used unless it has problems of its own.

#!/usr/bin/perl

use strict;
use warnings;
use File::Find ();

File::Find::find( { wanted => \&wanted }, "." );

sub wanted {
    my $time    = time();
    my $max_age = 5 * 60 * 60 * 24;   # five days, expressed in seconds

    my @stat = stat($_);

    # Print regular files modified within the last five days.
    if ( -f $_ && ( $time - $stat[9] ) <= $max_age ) {
        print "$File::Find::name\n";
    }
}

There aren't six ways to do this: there's the old way and the new way. The old way is with File::Find, and you already have a couple of examples of that. File::Find has a pretty awful callback interface; it was cool 20 years ago, but we've moved on since then.

Here's a real life (lightly amended) program I use to clear out the cruft on one of my production servers. It uses File::Find::Rule, rather than File::Find. File::Find::Rule has a nice declarative interface that reads easily.

Randal Schwartz also wrote File::Finder, as a wrapper over File::Find. It's quite nice but it hasn't really taken off.

#! /usr/bin/perl -w

# delete temp files on agr1

use strict;
use File::Find::Rule;
use File::Path 'rmtree';

for my $file (

    File::Find::Rule->new
        ->mtime( '<' . days_ago(2) )
        ->name( qr/^CGItemp\d+$/ )
        ->file()
        ->in('/tmp'),

    File::Find::Rule->new
        ->mtime( '<' . days_ago(20) )
        ->name( qr/^listener-\d{4}-\d{2}-\d{2}-\d{4}\.log$/ )
        ->file()
        ->maxdepth(1)
        ->in('/usr/oracle/ora81/network/log'),

    File::Find::Rule->new
        ->mtime( '<' . days_ago(10) )
        ->name( qr/^batch[_-]\d{8}-\d{4}\.run\.txt$/ )
        ->file()
        ->maxdepth(1)
        ->in('/var/log/req'),

    File::Find::Rule->new
        ->mtime( '<' . days_ago(20) )
        ->or(
            File::Find::Rule->name( qr/^remove-\d{8}-\d{6}\.txt$/ ),
            File::Find::Rule->name( qr/^insert-tp-\d{8}-\d{4}\.log$/ ),
        )
        ->file()
        ->maxdepth(1)
        ->in('/home/agdata/import/logs'),

    File::Find::Rule->new
        ->mtime( '<' . days_ago(90) )
        ->or(
            File::Find::Rule->name( qr/^\d{8}-\d{6}\.txt$/ ),
            File::Find::Rule->name( qr/^\d{8}-\d{4}\.report\.txt$/ ),
        )
        ->file()
        ->maxdepth(1)
        ->in('/home/agdata/redo/log'),

) {
    if (unlink $file) {
        print "ok $file\n";
    }
    else {
        print "fail $file: $!\n";
    }
}

{
    my $now;
    sub days_ago {
        # days as number of seconds
        $now ||= time;
        return $now - (86400 * shift);
    }
}
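For comparison, here is a rough sketch of what a similar query might look like with the File::Finder module mentioned above, assuming its find(1)-style predicates (where '-1' means "modified less than one day ago"); this is an illustration, not a drop-in replacement for the program:

use File::Finder;

# Collect regular files modified less than one day ago, find(1)-style.
my @recent = File::Finder->type('f')->mtime('-1')->in('.');
print "$_\n" for @recent;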

File::Find is the right way to solve this problem. There is no point reimplementing what already exists in other modules, and reimplementing something that ships in a standard module should be especially discouraged.

Others have mentioned File::Find, which is the way I'd go, but you asked for an iterator, which File::Find isn't (nor is File::Find::Rule). You might want to look at File::Next or File::Find::Object, which do have iterative interfaces. Mark Jason Dominus goes over building your own in chapter 4.2.2 of Higher Order Perl.
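For instance, a minimal sketch using File::Next's documented files() iterator; the one-day cutoff is an arbitrary choice for this example:

use File::Next;

my $cutoff = time - 86400;   # one day ago
my $iter   = File::Next::files( '.' );

# The iterator returns one file path per call, and undef when exhausted.
while ( defined( my $file = $iter->() ) ) {
    print "$file\n" if ( stat $file )[9] >= $cutoff;
}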

My preferred method is to use the File::Find module as so:

use File::Find;
find( \&checkFile, $directory_to_check_recursively );

sub checkFile {
    # Examine each file in here. The filename is in $_, and you have been
    # chdir'ed into its directory. The directory is also available in
    # $File::Find::dir, and the full path in $File::Find::name.
}

There's my File::Finder, as already mentioned, but there's also my iterator-as-a-tied-hash solution from Finding Files Incrementally (Linux Magazine).

I wrote File::Find::Closures as a set of closures that you can use with File::Find so you don't have to write your own. There are a couple of mtime functions that should handle this:

use File::Find;
use File::Find::Closures qw(:all);

my( $wanted, $list_reporter ) = find_by_modified_after( time - 86400 );
#my( $wanted, $list_reporter ) = find_by_modified_before( time - 86400 );

File::Find::find( $wanted, @directories );

my @modified = $list_reporter->();

You don't really need to use the module, because I mostly designed it so that you could look at the code and steal the parts you wanted. In this case it's a little trickier, because all the subroutines that deal with stat depend on a second subroutine. You'll quickly get the idea from the code, though.
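In that spirit, here is a minimal hand-rolled version of the same idea using only core File::Find; $cutoff, @found, and $wanted are names invented for this sketch:

use File::Find;

my $cutoff = time - 86400;   # one day ago
my @found;

# Build the callback as a closure over $cutoff and @found.
my $wanted = sub {
    push @found, $File::Find::name
        if -f $_ && ( stat $_ )[9] >= $cutoff;
};

find( $wanted, '.' );
print "$_\n" for @found;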

Good luck,

Using standard modules is indeed a good idea, but out of interest here is my back-to-basics approach using no external modules. I know the code style here might not be everyone's cup of tea.

It could be improved to use less memory by providing iterator-style access (the input list could temporarily be put on hold once it reaches a certain size), and the conditional check could be generalized via a callback ref (see the sketch after the code).

sub mfind {
    my %done;
    my $find;

    # Use a code ref rather than a named inner sub, so that %done is
    # properly shared between recursive calls.
    $find = sub {
        my $last_mod = shift;
        my $path = shift;

        return unless defined $path;

        # determine physical target if symlink
        $path = readlink($path) || $path;

        # return if already processed, otherwise mark as processed
        return if $done{$path}++;

        # DFS recursion
        return grep { $_ } @_
            ? ( $find->($last_mod, $path), $find->($last_mod, @_) )
            : -d $path
                ? $find->( $last_mod, glob("$path/*") )
                : -f $path && (stat($path))[9] >= $last_mod
                    ? $path
                    : undef;
    };

    return $find->(@_);
}

print join "\n", mfind(time - 1 * 86400, "some path");
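One possible shape of the callback-ref generalization suggested above, sketched with a worklist instead of deep recursion; mfind_cb and $want are names invented for this example:

sub mfind_cb {
    my ( $want, @queue ) = @_;
    my ( %seen, @hits );

    while ( defined( my $path = shift @queue ) ) {
        $path = readlink($path) || $path;   # resolve symlinks
        next if $seen{$path}++;             # skip already-visited paths

        if    ( -d $path ) { push @queue, glob("$path/*"); }
        elsif ( -f $path ) { push @hits, $path if $want->($path); }
    }
    return @hits;
}

# e.g. files modified within the last day:
print join "\n", mfind_cb( sub { ( stat $_[0] )[9] >= time - 86400 }, "some path" );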

I write a subroutine that reads a directory with readdir, throws out the "." and ".." entries, recurses if it finds a new directory, and examines the files for what I'm looking for (in your case, you'll want stat, or the -M file test, to read modification times). By the time the recursion is done, every file should have been examined. A sketch of that approach follows.
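A minimal sketch of that hand-rolled recursion; find_recent and $cutoff are names invented for this example:

use strict;
use warnings;

sub find_recent {
    my ( $dir, $cutoff ) = @_;
    my @found;

    opendir( my $dh, $dir ) or die "Cannot open $dir: $!";
    for my $entry ( readdir $dh ) {
        next if $entry eq '.' || $entry eq '..';   # throw out . and ..

        my $path = "$dir/$entry";
        if ( -d $path ) {
            push @found, find_recent( $path, $cutoff );   # recurse
        }
        elsif ( -f $path && ( stat $path )[9] >= $cutoff ) {
            push @found, $path;   # modified after the cutoff
        }
    }
    closedir $dh;
    return @found;
}

print "$_\n" for find_recent( '.', time - 86400 );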

I think all the functions you'd need for this script are described briefly here: http://www.cs.cf.ac.uk/Dave/PERL/node70.html

The semantics of input and output are a fairly trivial exercise which I'll leave to you.

At the risk of getting downvoted: IMHO the 'ls' command (with appropriate parameters) does this in the best-known performant way. In this case it might be quite a good solution to pipe 'ls' output from the shell into the Perl code, collecting the results into an array or hash.

Edit: 'find' could also be used, as proposed in the comments.
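A minimal sketch of that shell-out approach using find(1), which handles both the recursion and the mtime filter natively ('-mtime -1' means modified less than one day ago):

#!/usr/bin/perl
use strict;
use warnings;

# Capture find(1)'s output: one path per line, newlines stripped below.
my @recent = `find . -type f -mtime -1`;
chomp @recent;
print "$_\n" for @recent;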

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow