Perl script to find all unowned files and directories on Unix - How can I optimize further? [closed]

https://stackoverflow.com/questions/7867686

11-02-2021

Question

Following my findings and the suggestions in my other post, "How to exclude a list of full directory paths in find command on Solaris", I have decided to write a Perl version of this script and see how I could optimize it to run faster than a native find command. So far, the results are impressive!

The purpose of this script is to report all unowned files and directories on a Unix system for audit compliance. The script has to accept a list of directories and files to exclude (either by full path or wildcard name), and must take as little processing power as possible. It is meant to be run on hundreds of Unix systems that we (the company I work for) support, and has to be able to run on all those Unix systems (multiple OS, multiple platforms: AIX, HP-UX, Solaris and Linux) without us having to install or upgrade anything first. In other words, it has to run with the standard libraries and binaries we can expect on all systems.

I have not yet made the script argument-aware, so all arguments are hard-coded in the script. I plan on having the following arguments in the end and will probably use getopts to do it:

-d = comma delimited list of directories to exclude by path name
-w = comma delimited list of directories to exclude by basename or wildcard
-f = comma delimited list of files to exclude by path name
-i = comma delimited list of files to exclude by basename or wildcard
-t:list|count = Defines the type of output I want to see (list of all findings, or summary with count per directory)
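
A minimal sketch of how the options listed above could be parsed with the core Getopt::Std module. The option letters come from the list; the helper name parse_args, the keys of the returned hash, the default of 'list' for -t, and the reading of "-t:list|count" as "-t takes the value list or count" are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse the planned options from a list of arguments
# and return a hash reference describing the exclusions and output type.
sub parse_args {
    local @ARGV = @_;    # getopts() reads its input from @ARGV
    my %opt;
    getopts('d:w:f:i:t:', \%opt)
        or die "usage: $0 [-d dirs] [-w patterns] [-f files] [-i patterns] [-t list|count]\n";

    # Defaulting to 'list' is an assumption, not part of the original plan
    my $type = defined $opt{t} ? $opt{t} : 'list';
    die "-t must be 'list' or 'count'\n"
        unless $type eq 'list' || $type eq 'count';

    # Split a comma-delimited value; an absent option gives an empty list
    my $split = sub { defined $_[0] ? split(/,/, $_[0]) : () };

    my %cfg = (
        # Full paths become hash keys so each later lookup is a single step
        exclude_dirs  => { map { $_ => 1 } $split->($opt{d}) },
        exclude_files => { map { $_ => 1 } $split->($opt{f}) },
        # Basenames and wildcards stay as ordered lists of patterns
        dir_patterns  => [ $split->($opt{w}) ],
        file_patterns => [ $split->($opt{i}) ],
        type          => $type,
    );
    return \%cfg;
}
```

Calling parse_args(@ARGV) at the top of the script would then replace the hard-coded values. Getopt::Std ships with Perl, so nothing extra needs to be installed on the target systems.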

Here is the source I have done so far:

#! /usr/bin/perl
use strict;
use File::Find;

# Full paths of directories to prune
my @exclude_dirs = ('/dev','/proc','/home');

# Basenames or wildcard names of directories I want to prune
my $exclude_dirs_wildcard = '.svn';

# Full paths of files I want to ignore
my @exclude_files = ('/tmp/test/dir3/.svn/svn_file1.txt','/tmp/test/dir3/.svn/svn_file2.txt');

# Basenames or wildcard names of files I want to ignore
my $exclude_files_wildcard = '*.tmp';
my %dir_globs = ();
my %file_globs = ();

# Results will be stored in this hash
my %found = ();

# Used for storing uid's and gid's present on system
my %uids = ();
my %gids = ();

# Callback function for find
sub wanted {
    my $dir = $File::Find::dir;
    my $name = $File::Find::name;
    my $basename = $_;

    # Ignore symbolic links
    return if -l $name;

    # Search for wildcards if dir was never searched before
    if (!exists($dir_globs{$dir})) {
        @{$dir_globs{$dir}} = glob($exclude_dirs_wildcard);
    }
    if (!exists($file_globs{$dir})) {
        @{$file_globs{$dir}} = glob($exclude_files_wildcard);
    }

    # Prune directory if present in exclude list
    if (-d $name && in_array(\@exclude_dirs, $name)) {
        $File::Find::prune = 1;
        return;
    }

    # Prune directory if present in dir_globs
    if (-d $name && in_array(\@{$dir_globs{$dir}},$basename)) {
        $File::Find::prune = 1;
        return;
    }

    # Ignore excluded files
    return if (-f $name && in_array(\@exclude_files, $name));
    return if (-f $name && in_array(\@{$file_globs{$dir}},$basename));

    # Check ownership and add to the hash if unowned (uid or gid does not exist on system)
    my ($dev,$ino,$mode,$nlink,$uid,$gid) = stat($name);
    if (!exists $uids{$uid} || !exists($gids{$gid})) {
        push(@{$found{$dir}}, $basename);
    } else {
        return
    }
}

# Standard in_array perl implementation
sub in_array {
    my ($arr, $search_for) = @_;
    my %items = map {$_ => 1} @$arr;
    return (exists($items{$search_for}))?1:0;
}

# Get all uid's that exists on system and store in %uids
sub get_uids {
    while (my ($name, $pw, $uid) = getpwent) {
        $uids{$uid} = 1;
    }
}

# Get all gid's that exists on system and store in %gids
sub get_gids {
    while (my ($name, $pw, $gid) = getgrent) {
        $gids{$gid} = 1;
    }
}

# Print a list of unowned files in the format PARENT_DIR,BASENAME
sub print_list {
    foreach my $dir (sort keys %found) {
        foreach my $child (sort @{$found{$dir}}) {
            print "$dir,$child\n";
        }
    }
}

# Prints a list of directories with the count of unowned children in the format DIR,COUNT
sub print_count {
    foreach my $dir (sort keys %found) {
        print "$dir,".scalar(@{$found{$dir}})."\n";
    }
}

# Call it all
&get_uids();
&get_gids();

find(\&wanted, '/');
print "List:\n";
&print_list();

print "\nCount:\n";
&print_count();

exit(0);

If you want to test it on your system, simply create a test directory structure with generic files, chown the whole tree to a test user you create for this purpose, and then delete the user.

I'll take any hints, tips or recommendations you could give me.

Happy reading!

Solution

Try starting with these, then see if there's anything more you can do.

Use hashes instead of the arrays that need to be searched using in_array(). This is so you can do a direct hash lookup in one step instead of converting the entire array to a hash for every iteration.
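Without the follow option, File::Find does not descend into symlinked directories, so the -l test is not needed to avoid traversing them. Note that wanted is still called for each symlink entry itself, so keep the test (ideally as -l _ after a single lstat) if symlinks should stay out of the report.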
Maximise your use of _; avoid repeating IO operations. _ is a special filehandle where the file status information is cached whenever you call stat() or any file test. This means you can call stat _ or -f _ instead of stat $name or -f $name. (Calling -f _ is more than 1000x faster than -f $name on my machine because it uses the cache instead of doing another IO operation.)
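
The suggestions above could be combined roughly as follows. This is a sketch, not a drop-in replacement: it keeps the names wanted, %uids, %gids and %found from the question, turns the two full-path exclusion lists into hashes, and leaves out the wildcard handling. It also keeps a symlink test, because wanted still sees the symlink entries themselves, but runs it against the cached lstat result.

```perl
use strict;
use warnings;
use File::Find;

# Exclusion lists as hashes: one lookup per test, no per-call map
my %exclude_dirs  = map { $_ => 1 } ('/dev', '/proc', '/home');
my %exclude_files = map { $_ => 1 } ('/tmp/test/dir3/.svn/svn_file1.txt');

my (%uids, %gids, %found);

# Collect every uid and gid known to the system (field 2 in both lists)
while (my @pw = getpwent) { $uids{ $pw[2] } = 1 }
while (my @gr = getgrent) { $gids{ $gr[2] } = 1 }

sub wanted {
    # One lstat per entry. File::Find has already chdir'ed into the
    # directory, so the basename in $_ is enough. The result is cached
    # in the special filehandle _ for the tests below.
    my ($uid, $gid) = (lstat $_)[4, 5];
    return unless defined $uid;    # entry vanished or cannot be read
    return if -l _;                # skip symlinks, no extra I/O

    if (-d _) {
        if ($exclude_dirs{$File::Find::name}) {
            $File::Find::prune = 1;    # do not descend into this directory
            return;
        }
    }
    elsif (-f _) {
        return if $exclude_files{$File::Find::name};
    }

    # Unowned: the uid or the gid does not exist on this system
    push @{ $found{$File::Find::dir} }, $_
        unless exists $uids{$uid} && exists $gids{$gid};
}
```

Running find(\&wanted, '/') then fills %found exactly as in the question, with one system call per entry for the ownership and type checks.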

Use the Benchmark module to test out different optimisation strategies to see if you actually gain anything. E.g.

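use Benchmark;
# Prime the _ cache once; myfile.txt must exist
stat 'myfile.txt';
timethese(100_000, {
    a => sub {-f _},              # reuses the cached stat result
    b => sub {-f 'myfile.txt'},   # performs a new stat on every call
});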
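
The same approach can be used to check the first suggestion. The sketch below compares the question's in_array() against a direct hash lookup using cmpthese from the Benchmark module. The 50 directory paths are made up for the test, and the results will depend on the list size and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up exclusion list, stored both as an array and as a hash
my @exclude_dirs = map { "/data/dir$_" } 1 .. 50;
my %exclude_dirs = map { $_ => 1 } @exclude_dirs;

# in_array() as written in the question: rebuilds a hash on every call
sub in_array {
    my ($arr, $search_for) = @_;
    my %items = map { $_ => 1 } @$arr;
    return exists($items{$search_for}) ? 1 : 0;
}

# A negative count tells Benchmark to run each case for at least
# that many CPU seconds instead of a fixed number of iterations
cmpthese(-1, {
    in_array => sub { in_array(\@exclude_dirs, '/data/dir50') },
    hash     => sub { exists $exclude_dirs{'/data/dir50'} },
});
```

cmpthese prints a table of rates and relative differences, which makes it easy to see whether a change is worth keeping.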

A general principle of performance tuning is to find out exactly where the slow parts are before you try to tune them (because the slow parts might not be where you expect them to be). My recommendation is to use Devel::NYTProf, which can generate an HTML profile report for you. From the synopsis, here is how to use it from the command line:
```
# profile code and write database to ./nytprof.out
perl -d:NYTProf some_perl.pl

# convert database into a set of html files, e.g., ./nytprof/index.html
# and open a web browser on the nytprof/index.html file
nytprofhtml --open
```

Licensed under: CC-BY-SA with attribution
