Perl Binary Heap

Introduction:

Using Perl to implement the binary heap, one of computer science's most beautiful concepts, is not as hard as one might think. It is a good contrast to last weekend's Halloween celebrations. I looked on CPAN and, of course, the Heap is already implemented. But then I thought about using it. I'm sure the author(s) made it as easy as possible for me, but I quickly got annoyed trying to figure out whether what I think this Heap is doing is what it really does.

Furthermore, what if I want to change the heap so that the data it stores are not simple integers but references to other objects? Since I know how binary heaps work, I can create and manipulate my own rather than try to figure out someone else's implementation and then change that class to suit my needs. So let's write our own binary heap. Minimum or maximum doesn't matter, as the code is almost identical; the only difference is flipping the comparison criterion.

The binary heap is in every computer science textbook. At university we learned it using C++. Since, sadly and disappointedly, I do not work in C++, it makes sense for me to translate the same code into Perl. The book is Algorithms and Data Structures with C++ by Weiss.

It is assumed that you are already familiar with binary heaps. This article is not intended to teach what they are (there are textbooks for that) but to show an implementation in Perl, using an unusual example from a systems administrator's or application support engineer's point of view. The place where binary heaps are mainly used is in event-driven simulators.

Quickly: a minimum binary heap is a structure that stores the smallest value at the root of the tree. This means you can always retrieve the minimum value in the list in constant time, O(1): you don't have to search for it, because it is always at the root position. A maximum binary heap is the opposite; the largest value is at the root position. Inserting a value into the heap takes O(log N) time in the worst case, which is quite fast. This is done by inserting the new value at the bottom of the tree and moving it up (percolating up). If the new value is a new minimum, it ends up at the root in O(log N) time. When we delete the minimum value (the root), we replace it by taking the last element in the tree, placing it at the root, and percolating it down. The purpose of all this percolating up or down is to maintain the heap order so that the algorithm works.
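The heap-order property described above can be checked mechanically: every node must be less than or equal to both of its children. Here is a minimal sketch of such a check (the sub name is_min_heap is my own; the 1-based array layout matches the representation used later in this article):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the 1-indexed array satisfies the min-heap order:
# every node is <= both of its children. Index 0 is unused.
sub is_min_heap {
    my ($h, $size) = @_;
    for my $i (2 .. $size) {
        my $parent = int($i / 2);
        return 0 if $h->[$i] < $h->[$parent];
    }
    return 1;
}

my @heap   = (undef, 3, 5, 9, 8, 7);  # valid: every parent <= its children
my @broken = (undef, 3, 2, 9);        # 2 is smaller than its parent 3
print is_min_heap(\@heap, 5),   "\n"; # 1
print is_min_heap(\@broken, 3), "\n"; # 0
```

Note that the check only compares each node with its parent; the heap order says nothing about the relationship between siblings.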

State the problem:

Find the top 20 largest files.

Obvious and easy solution:

This is really easy:   
du -kx * | sort -n | tail -20

But that's really bad. When du sees a directory, it sums up the contents of everything under it instead of crawling and looking at individual files only. So let's try the find command:

find . -type f -ls | sort -n -k7 | tail -20

What if you have 50 million directories containing 100 trillion files (if you've got the inodes for it)? I bet that's peanuts at Google Inc. But that's not the point: with the above solution, sort has to hold all 100 trillion file sizes and their paths (spilling to temporary files when they don't fit in memory) before it can emit a single line, because the last line of find's output could sort first. Admittedly sort is extremely fast and it's written in C, but it's not the most efficient way memory-wise. It will most likely work, and you probably won't run out of memory, as memory is really cheap today. When the time comes I'll do a memory and CPU comparison to see.
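As a side note, GNU find can print the size and path directly, which avoids parsing the multi-column -ls output. A small self-contained sketch (assumes GNU find's -printf and coreutils; the sample files are made up for the demo):

```shell
# Build a tiny sample tree, then list its 3 largest files.
dir=$(mktemp -d)
mkdir -p "$dir/sub"
head -c 10   /dev/zero > "$dir/small.dat"
head -c 500  /dev/zero > "$dir/sub/medium.dat"
head -c 9000 /dev/zero > "$dir/big.dat"

# One "size path" line per file; largest files come out last.
find "$dir" -type f -printf '%s %p\n' | sort -n | tail -3

rm -rf "$dir"
```

The memory profile is the same as before (sort still buffers everything); this only cleans up the pipeline itself.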

Still, the problem that I have is that there's no control over what's going on. You run the command and let it run wild. What if you're doing this on a production server and have to go really slow? The closest that you can come to throttling shell commands is to renice them (see man renice). So let's look at another solution that will use the smallest amount of memory and gives you the flexibility to modify, add exceptions, throttle, or do anything else you desire. I came up with this and want to document it here.

Combine DFS with a Minimum Binary Heap

In the article directory tree traversals using DFS or BFS, I showed how you can search a directory structure and have full control over each file you come across. We'll use DFS here (BFS would work just as well) to look at each and every file, and we'll use a 20-element minimum binary heap to store the 20 largest files.

We will simply keep inserting into the heap whenever we visit a file that is larger than the current minimum. If the heap size is less than 20, we just toss the file into the heap. If the heap size is already 20, we do the insert by deleting the root node and inserting the new one. Insertion happens by adding the new node at the end (bottom) of the heap and percolating it up.

I know that if there are 100 files of exactly the same size that also happen to be the largest, then keeping the top 20 will ignore the other 80 that are equally large. That's something to keep in mind while limiting our heap size to 20. But in real life, if an administrator finds a lot of files of exactly the same byte size, that alone already supports significant conclusions about what's happening with the application.

Let's explain how the heap is represented in the computer's memory. It is really an array. In Perl an array is zero-indexed, as you already know, but the heap's representation starts at index 1. Each node of the heap lives at an array index i such that:

The array index 0 is unused.
The root node is always at i=1  (where the minimum value is)
For every node i, its left child is at position 2i
For every node i, its right child is at position 2i+1
For every node i, its parent is at position i/2
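These index rules translate directly into Perl's integer arithmetic (the helper names parent_of, left_of, and right_of are my own, introduced just for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# 1-based heap index arithmetic; index 0 of the array stays unused.
sub parent_of { my $i = shift; return int($i / 2); } # integer division
sub left_of   { my $i = shift; return $i * 2;      }
sub right_of  { my $i = shift; return $i * 2 + 1;  }

# Node 3's children sit at 6 and 7; node 7's parent is int(7/2) = 3.
print left_of(3), " ", right_of(3), " ", parent_of(7), "\n"; # 6 7 3
```

The int() matters: both children i=4 and i=5 must map back to parent 2, which is exactly what truncating division gives you.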

Let's show the Perl class that implements the Heap and explain it after.

package Heap;
use strict; 
use warnings;
sub new {
    my $class = shift;
    my $self = { "_Heap" => [], "OrderOK" => 1 , CurrentSize => 0 };
    bless($self, $class); # two-argument bless, so $class is actually used
    return $self;
}

sub FindMin {
    my $self = shift;
    $self->FixHeap() if $self->{OrderOK} == 0;
    return $self->{_Heap}->[1]; # an array ref, or undef if the heap is empty
}

sub Toss {
    my $self = shift;
    my $node_aref = shift;

    # remember that idx=0 is unused. So can't call perl's push.
    my $idx = ++$self->{CurrentSize} ;
    $self->{_Heap}->[$idx] = $node_aref;
  
    my $parent_aref = $self->{_Heap}->[ int($idx / 2) ];
    return unless ($parent_aref); #may be the root, so no parent defined.
    my $parent_file_sz =  $parent_aref->[0];
    my $newfile_sz = $node_aref->[0];
 
    $self->{OrderOK}= 0 if ($newfile_sz < $parent_file_sz); #flip this for a maximum heap
}
sub Insert{
    my $self= shift;
    my $x = shift; # percolate x up as needed
    my $newfile_sz = $x->[0];

    # If Order is not OK, just toss it.
    if ($self->{OrderOK} == 0){
           $self->Toss($x); return;
    }

    # Create a new empty (Hole) position at the end of Heap 
    my $Hole =  ++$self->{CurrentSize};
    for( ; ($Hole > 1) && 
            ($newfile_sz < $self->{_Heap}->[ int($Hole /2) ]->[0]); $Hole = int($Hole / 2) ){
                    $self->{_Heap}->[ $Hole ] = $self->{_Heap}->[ int($Hole / 2) ]; 
    }
    $self->{_Heap}->[$Hole] = $x;
}

sub DeleteMin{
    my $self = shift;
    return if $self->{CurrentSize} == 0;        # nothing to delete
    $self->FixHeap() if $self->{OrderOK} == 0;  # the root is only the minimum if order holds

    # move the last entry in the Heap to the root position and shrink
    $self->{_Heap}->[1] = $self->{_Heap}->[ $self->{CurrentSize} ];
    pop @{ $self->{_Heap} }; # drop the now-duplicated last slot (delete() on an array element is discouraged)
    --$self->{CurrentSize};
    $self->PercolateDown(1) if $self->{CurrentSize} > 0; # fix heap order
}

sub PercolateDown{
    my $self = shift;
    my $Hole = shift;
    my $tmp_aref = $self->{_Heap}->[$Hole];
    my $tmp_sz = $self->{_Heap}->[$Hole]->[0]; #this is what our comparison is based on

    my $Child;
    for(; ($Hole * 2) <= $self->{CurrentSize}; $Hole = $Child){
          $Child = $Hole * 2; #get left Child
          my $Child_sz = $self->{_Heap}->[$Child]->[0];

          # if $Child is not the last element, check if the right child value is smaller.
          if( ($Child != $self->{CurrentSize}) && 
              ($self->{_Heap}->[$Child + 1]->[0] < $Child_sz)){
                        $Child++; # become right child since it exists and it is smaller.
          } 
          if( $self->{_Heap}->[$Child]->[0] < $tmp_sz){
                $self->{_Heap}->[$Hole] = $self->{_Heap}->[$Child];
          }else{
                last;
          }
    }
    $self->{_Heap}->[$Hole] = $tmp_aref;
}

sub FixHeap{
    my $self = shift;
    for(my $i = int(($self->{CurrentSize}) / 2);   $i > 0; $i--) {
         $self->PercolateDown($i);
    }
    $self->{OrderOK} = 1;
}
sub print_heap{
    my $self = shift;
    for(my $i = 1; $i <= $self->{CurrentSize}; $i++){
           print "\t @{ $self->{_Heap}->[$i] } \n";
    }
}
1;

Isn't seeing that famous 1; at the end a relief? First, let me show a visualization of the above Heap class.

heap_example.png
In this picture you can visualize the hash key {_Heap} that is defined in sub new. Each node of the heap tree is drawn as a circle, and each node is a reference to an array with idx=0 holding the file's byte size and idx=1 holding the file path.

Our decision criterion for moving nodes around to maintain the heap order is the byte size of each file. Knowing this, together with the picture, should make the class code easier to read.

Let me now show the directory search, a DFS, that uses this heap class to keep the 20 largest files in the heap. Once it is done, it simply prints out the entire heap array, showing its treasure.

#!/usr/bin/perl
use strict;
use warnings;
use Heap;

my $Heap = Heap->new(); # arrow syntax avoids indirect-object ambiguity
my $HeapMaxSize = 20;

DFS('/home/farhad'); 
$Heap->print_heap();

sub DFS{
   my $start=shift;
   my @queue = ($start);
   while(scalar(@queue) > 0){
       my $dir = pop(@queue); 
       my ($files, $dirs)= get_dirs_files($dir);

       push @queue, @$dirs;
       process_file($_) for @$files; # map in void context wastes a result list
   }
}

sub get_dirs_files{
    my $sdir = shift;
    opendir(my $dh, $sdir) || die "can't opendir $sdir : $!";
    my @entries = grep {!( /^\.$/ || /^\.\.$/)} readdir($dh);

    @entries =  map { "$sdir/$_" } @entries;
    closedir $dh;
    my @files = grep { -f $_ && ! -l $_ } @entries;
    my @dirs  = grep { -d $_ && ! -l $_ } @entries; # skip symlinks so a link cycle can't loop the DFS forever
    return (\@files,\@dirs);
}
sub process_file{
   my $f = shift;
   my $file_sz = (stat($f))[7];
   return unless defined $file_sz; # stat can fail, e.g. on a permission error

   # Create the node that you might Toss or Insert into the Heap
   my $node = [ $file_sz , $f ];

   if ( $Heap->{CurrentSize} < $HeapMaxSize ) {
        $Heap->Toss($node);
   }else{
        my $min_node = $Heap->FindMin(); #undef if Heap is empty

        if (defined $min_node && $file_sz > $min_node->[0] ){
              $Heap->DeleteMin();
              $Heap->Insert($node);
        } 
   }
}

The heap utilization takes place inside sub process_file(). Our $HeapMaxSize is 20, and the first 20 files we see are simply tossed into the heap without caring whether the heap order is maintained. Starting with the 21st file, we start comparing. If the new file's size ($file_sz) is bigger than the minimum in the heap, we need to get rid of that minimum and replace it with the new one. So we call DeleteMin(), which deletes the root of the array @{$Heap->{_Heap}}, and then Insert() the new node.
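To see the keep-top-N logic in isolation, here is a minimal sketch of the same idea on plain numbers, using a linear scan for the minimum as a stand-in for the heap (the sub name top_n is hypothetical; a real heap makes the replace step O(log N) instead of the O(N) scan used here):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min first);

# Keep the $n largest values seen in a stream, mirroring process_file():
# toss until full, then replace the minimum only when a new value beats it.
sub top_n {
    my ($n, @stream) = @_;
    my @kept;
    for my $x (@stream) {
        if (@kept < $n) {
            push @kept, $x;                            # not full yet: just toss
        } else {
            my $m = min(@kept);
            next unless $x > $m;                       # too small: ignore
            my $i = first { $kept[$_] == $m } 0 .. $#kept;
            $kept[$i] = $x;                            # "DeleteMin then Insert"
        }
    }
    return sort { $a <=> $b } @kept;
}

print join(",", top_n(3, 5, 1, 9, 2, 8, 7)), "\n"; # 7,8,9
```

Swapping the linear scan for the Heap class above is exactly what turns this from an O(n*N) pass into the O(N log n) pass the article is after.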


About this Entry

This page contains a single entry by Farhad Saberi published on November 7, 2010 10:48 AM.
