Perl File Find Breadth and Depth First Search Algorithms

Every administrator on the Unix command line faces the task of searching the file system based on some criteria. For example, you might want to find all non-directory files under /home that are larger than 10MB. You would simply run this command:

find /home -size +10240000c -type f

But what if you wanted to perform multiple tasks on every file over 10MB? Perhaps you'd want to compress them, rename them and then copy them to another server. You would write a Bash script that does all of that and then call find like this:

find /home -size +10240000c -type f -exec gz_mv_scp.bash {} \;

The find command calls gz_mv_scp.bash, passing each file above 10MB in size as its argument. That works, but as the number of files and tasks grows it turns into a mess, and any change you want to make only adds to your problems.

I just don't like this solution. The find command is great for simple tasks, but anything beyond that requires you to fully grasp all of its details. For example, what if you wanted to find all files above 10MB in size but skip any directory whose name starts with the letters abc? Is there an exclusion list, as there is in tar and rsync? Getting exclusion lists to work with those commands can be tricky too: you had better test, and hope that what you think should match a path to exclude is what tar and rsync also think.
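
For the record, GNU find can express that particular exclusion with -prune, but the idiom is famously unintuitive. This is only a sketch of the usual pattern; test it against your own layout:

find /home -type d -name 'abc*' -prune -o -type f -size +10240000c -print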

Let's make things even more complicated. What if each of those large files has a different destination, chosen based on the file's name? I've also been in situations where an scp command would hang forever because a firewall along its path caused it to stall. If you have 24 files to copy and the 6th one hangs forever, you want to kill that one transfer and still copy the rest successfully. Doing that requires some advanced knowledge of parallel processing in Unix, which I'll write about soon.

I hope I've made the case for moving away from Unix's find command for tasks that are not simple. At the end of this article you'll see how the same approach lets you scrape a remote FTP server's directory structure and recreate it locally, something the FTP protocol does not support. I actually had to do this once while migrating a piece of software from one server to another. Update: that article is done and published: FTP retrieve all files and subdirectories

This is where Perl's File Find (File::Find) module comes in. It is meant to be simple to use, and there's even a utility, find2perl, that translates (or I should say attempts to translate) your Unix find command into File::Find syntax. I tried it and realized that I was flying both blind and deaf.
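
For reference, here is roughly what the File::Find approach looks like for our 10MB example. This is a minimal sketch: find() calls the wanted callback once per entry, $File::Find::name holds the full path, and $_ holds the basename because the module chdirs into each directory by default.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Called once for every file and directory found under /home.
sub wanted {
    return unless -f $_;                    # regular files only
    return unless -s _ > 10 * 1024 * 1024;  # over 10MB (reuses the stat from -f)
    print "$File::Find::name\n";
}

find(\&wanted, '/home');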

So I decided to write the obvious for myself: a Perl script that does what Unix's find command does. I don't need all of its options up front. What I need is a way to traverse the directories. You get total control: in each directory you read in the regular files and do whatever you like with them.

How to traverse a directory tree is quite simple. There are two methods: Depth First Search (DFS) and Breadth First Search (BFS). In textbooks these are sometimes called preorder traversal and level-order traversal, respectively; I prefer the former terms as they are also used in graph theory. You can view a directory tree as a directed graph in which every vertex has exactly one entry edge and zero or more exiting edges (leaves have no exiting edge, and the root has no entry edge). Edges from a vertex to itself are also forbidden. Let's look at a directory structure represented graphically:

[Figure: DFS.png — the example directory tree used throughout this article]
#!/usr/bin/perl
use strict;
use warnings;

# Breadth First Search: visit all siblings before any of their children.
sub BFS {
    my $start_dir = shift;
    my @queue = ($start_dir);
    while (@queue) {
        my @next_level;
        foreach my $dir (@queue) {
            print "$dir\n";
            my ($files, $dirs) = get_dirs_files($dir);
            process_file($_) for @$files;
            push @next_level, @$dirs;
        }
        @queue = @next_level;   # descend one level
    }
}

# Depth First Search: follow each branch to the bottom before moving
# on to a sibling. The LIFO stack is what makes this depth-first.
sub DFS {
    my $start_dir = shift;
    my @stack = ($start_dir);
    while (@stack) {
        my $dir = pop @stack;
        print "$dir\n";
        my ($files, $dirs) = get_dirs_files($dir);
        push @stack, @$dirs;
        process_file($_) for @$files;
    }
}

# Return the regular files and subdirectories of $sdir as two array refs.
sub get_dirs_files {
    my $sdir = shift;

    opendir(my $dh, $sdir) || die "can't opendir $sdir : $!";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir $dh;
    @entries = map { "$sdir/$_" } @entries;   # prepend the parent path

    my @files = grep { -f $_ } @entries;
    my @dirs  = grep { -d $_ } @entries;
    return (\@files, \@dirs);
}

# This is your custom subroutine to perform on each file.
sub process_file {
    my $f = shift;
    print "processing file $f\n";
}

BFS('1');
DFS('1');
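
To turn this skeleton into the gzip-rename-scp task from the beginning of the article, only process_file needs to change. Here is one hedged sketch; the 10MB threshold, the /^log/ naming rule and the loghost/backup destinations are made-up placeholders for illustration, not part of the original script:

use File::Basename;

sub process_file {
    my $f = shift;
    return unless -f $f && -s $f > 10 * 1024 * 1024;   # only files over 10MB

    system('gzip', $f) == 0 or do { warn "gzip failed on $f: $?"; return };
    my $gz = "$f.gz";

    # Pick a destination based on the file's name, as in the scenario
    # above (hypothetical hosts).
    my $dest = basename($gz) =~ /^log/ ? 'loghost:/archive/'
                                       : 'backup:/archive/';
    system('scp', $gz, $dest) == 0 or warn "scp failed on $gz: $?";
}

Similarly, skipping every directory whose name starts with abc is just a one-line grep over @$dirs before it is pushed onto the queue or stack: no -prune gymnastics required.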

A quick note on recursion. Textbooks will always show the elegant recursive solutions to these algorithms, but in practice you should prefer the iterative one unless recursion is the only reasonable fit. Each recursive call adds a new frame to the call stack (it does not create a new process image), so a very deep directory tree costs memory for every level of depth and can, in the worst case, exhaust the stack. The iterative versions above keep that bookkeeping in an explicit array instead.
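
For comparison, the recursive version of DFS is only a few lines. This sketch reuses get_dirs_files and process_file from the listing above and visits directories in the same preorder:

sub DFS_recursive {
    my $dir = shift;
    print "$dir\n";
    my ($files, $dirs) = get_dirs_files($dir);
    process_file($_) for @$files;
    DFS_recursive($_) for @$dirs;   # one new stack frame per level of depth
}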

There you have the code showing both the BFS and DFS methods of traversing the directory structure starting at directory 1. Those print "$dir\n" statements show the progress of the traversal. You can confirm the level-order and preorder nature of subs BFS and DFS by creating a directory structure that exactly matches the graph above and running the script on it. Let's do that now.

mkdir -p 1/2/3/4    1/2/5/6     1/2/5/7     1/2/8
mkdir -p 1/9/10     1/9/11/12
mkdir -p 1/13/14/15     1/13/14/16     1/13/14/17/18/19     1/13/14/17/20     1/13/14/21

To make the mkdirs quick, use the -p option and spell out the path down to a leaf node each time. Now run BFS('1') and you'll see output like this:
1
1/13
1/2
1/9
1/13/14
1/2/3
1/2/5
1/2/8
1/9/10
1/9/11
1/13/14/15
1/13/14/16
1/13/14/17
1/13/14/21
1/2/3/4
1/2/5/6
1/2/5/7
1/9/11/12
1/13/14/17/18
1/13/14/17/20
1/13/14/17/18/19

You can see that all siblings are visited first, and then their children. Under 1, the directories 13, 2 and 9 are visited first. Then their children. Then their children's children. We don't know which sibling within a level is visited first, and it doesn't matter: that order is simply whatever readdir returns, carried through the array and the foreach loop. What is important is that all SIBLINGS are visited BEFORE any CHILDREN.
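
If you do want a deterministic order among siblings, sort the entries as they come out of readdir; a one-line change inside get_dirs_files is enough:

my @entries = sort grep { $_ ne '.' && $_ ne '..' } readdir($dh);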

In the case of DFS, note that all of a node's children are visited before any of its siblings:
1
1/9
1/9/11
1/9/11/12
1/9/10
1/2
1/2/8
1/2/5
1/2/5/7
1/2/5/6
1/2/3
1/2/3/4
1/13
1/13/14
1/13/14/21
1/13/14/17
1/13/14/17/20
1/13/14/17/18
1/13/14/17/18/19
1/13/14/16
1/13/14/15

This output matches the picture above, except that 1/9 is visited first where our graph shows 1/2 first. That is not a problem, because what matters is this: all children of 1/9 are visited BEFORE 9's siblings 1/2 and 1/13. Once 1/2 is visited, all of its children are visited BEFORE the remaining sibling 1/13. The same rule applies as the algorithm descends further. Notice that once 1/13/14/17 is reached, its entire subtree (1/13/14/17/20, 1/13/14/17/18 and 1/13/14/17/18/19) is visited before its remaining siblings 1/13/14/16 and 1/13/14/15. Whereas in the BFS algorithm the opposite is true: once a node is visited, its siblings come first, then its children.

If you run the Unix find command on the root directory node 1, you'll see output similar to our DFS. That's because the Unix find command uses the Depth First Search algorithm. DFS requires slightly less memory to execute (no second per-level queue) and has no inner loop. Not that you'll run out of memory using BFS: in most cases you'll be traversing the entire tree anyway, so it makes little difference which one you use.
