Swapping two variables in place without using a temporary variable (storage) seems to be a favourite question for some eccentric interviewers. I've been asked it on three separate occasions. The first two times I didn't know the answer and didn't bother looking it up. The third time I realized that perhaps I should figure this out.

The actual answer will be in the context of reversing all the words in a sentence without using a temporary variable. For example, the sentence "last night the DJ saved my life" should be reversed to "life my saved DJ the night last".

The swapping of two variables using no extra storage is done with a technique called the XOR swap algorithm. It is explained in many places online; Wikipedia's version is here: http://en.wikipedia.org/wiki/XOR_swap_algorithm

So if A = 1 0 1 1 0 1 0 0 1 1 and B = 1 1 0 1 1 0 1 0 0 0, then the swap happens in three XOR operations:

A = A ^ B;
B = A ^ B;
A = A ^ B;

The caret, ^, is Perl's XOR operator. Walking through the three XOR operations:

# First step is to XOR the two variables and store the result in A.
A = A ^ B = 1 0 1 1 0 1 0 0 1 1 ^ 1 1 0 1 1 0 1 0 0 0 = 0 1 1 0 1 1 1 0 1 1;

# Now use the new A and XOR it with B, replacing B.
B = A ^ B = 0 1 1 0 1 1 1 0 1 1 ^ 1 1 0 1 1 0 1 0 0 0 = 1 0 1 1 0 1 0 0 1 1;

# Finally, XOR A and B again, storing the result in A.
A = A ^ B = 0 1 1 0 1 1 1 0 1 1 ^ 1 0 1 1 0 1 0 0 1 1 = 1 1 0 1 1 0 1 0 0 0;

Now A and B are swapped. The reversing of the words in a sentence happens in two steps: first, reverse all characters in the sentence; second, reverse the characters within each word. For the above example, the first step produces this string:

e f i l   y m   d e v a s   J D   e h t   t h g i n   t s a l

Reversing each word gives the desired output "life my saved DJ the night last".

I ignore leading and trailing whitespace and only swap non-whitespace characters. To delimit each word I use two variables as cursors, i and j, where the integer i marks the starting position of the reversal and j the ending position. So a reversal happens between i and j, and i should never be greater than j.

#!/usr/bin/perl

use strict;
use warnings;

sub reverse_ {
    my $s = shift;
    my $i = shift;
    my $j = shift;

    for ( ; $i < $j ; $i++, $j-- ) {
        # XOR-swap the two characters in place.
        $s->[$i] = $s->[$i] ^ $s->[$j];
        $s->[$j] = $s->[$i] ^ $s->[$j];
        $s->[$i] = $s->[$i] ^ $s->[$j];
    }
}

sub main {
    my $s = $ARGV[0] || 'last night the DJ saved my life';

    my @s = split(//, $s);
    my $ssize = scalar(@s);

    # ignore leading white spaces
    my $i = 0;
    while (1) {
        if ($i > $#s) { # string is all white spaces
            print "string is empty: '@s'\n";
            exit;
        } elsif ($s[$i] =~ /\s/) {
            $i++;
        } else {
            last;
        }
    }

    # ignore trailing white spaces
    my $j = $#s;
    while (1) {
        if ($s[$j] =~ /\s/) {
            $j--;
        } else {
            last;
        }
    }

    # reverse the whole string between the first and last non-space characters
    reverse_(\@s, $i, $j) if ($j > $i);
    print "all letters reversed: @s\n";

    while ($i < scalar(@s)) {
        if ($s[$i] =~ /\s/) {
            $i++;
            next;
        }
        if ($i == $#s) { last; }
        $j = $i + 1;

        while (1) {
            if ($j == $#s) {
                last;
            } elsif ($s[$j] =~ /\s/) {
                $j--;
                last;
            } else {
                $j++;
            }
        }
        # If the scan stopped on the last character and it is whitespace,
        # step back off the whitespace.
        $j-- if ($s[$j] =~ /\s/);

        reverse_(\@s, $i, $j) if ($j > $i);
        $i = $j + 1;
    }

    foreach my $w (@s) {
        print "$w";
    }
    print "\n";
}

main();


farhadsaberi$ ./reverse_words.pl "it's the economy stupid"
all letters reversed: d i p u t s   y m o n o c e   e h t   s ' t i
stupid economy the it's

farhadsaberi$ ./reverse_words.pl "   the yellow     balloon "
all letters reversed:       n o o l l a b           w o l l e y   e h t  
    balloon     yellow the 


The second example shows that leading and trailing whitespace, as well as the whitespace between words, is preserved.

Perl detach process daemon


This is how a running process can detach itself from the shell and become a daemon. Not just from the shell, but sometimes from another parent process such as an HTTP server. One characteristic of a daemon is that its parent process ID is one, that of the init process. It is a good idea to check that your PPID is not already one before attempting to daemonize yourself.
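A minimal sketch of that guard, assuming we simply refuse to detach twice (getppid is a core Perl built-in):

# If our parent is already init (PID 1), we've presumably been detached
# once before; don't daemonize again.
if (getppid() == 1) {
    die "already a daemon (parent is init)\n";
}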


#!/usr/bin/perl
use POSIX 'setsid';

defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent runs this block
    exit(0);
} else {
    close STDIN;
    close STDOUT;
    close STDERR;
    setsid or die "Can't start a new session: $!";
    umask(0027); # create files with perms -rw-r-----
    chdir '/' or die "Can't chdir to /: $!";

    open STDIN,  '<', '/dev/null' or die $!;
    open STDOUT, '>', '/dev/null' or die $!;
    open STDERR, '>>', '/tmp/mysql_install.log';

    defined (my $grandkid = fork) or die "Cannot fork: $!\n";
    if ($grandkid) {
        exit(0);
    } else {
        # my daemon is here.
        sleep(60);
        system("rpm -e --nodeps mysql-5.0.77-4.el5_4.1");
        system("yum install -y MySQL-client-5.5.22-1.linux2.6");
        system("yum install -y MySQL-devel-5.5.22-1.linux2.6");
        system("yum install -y MySQL-shared-compat-5.5.22-2.linux2.6");
    }
}



But in the above code, the parent is unaware of the detached process's ID. The daemon's parent will be init, and init will be responsible for cleaning up after it. But if we still want to know its PID, so that we can explicitly send it a signal from the grandparent, the detached grandchild will have to somehow communicate its process ID back to the grandparent.


#!/usr/bin/perl
use POSIX 'setsid';

pipe(READER, WRITER) || die "pipe failed: $!\n";
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent runs this block
    close(WRITER);
    my $line = <READER>;
    chomp $line;
    my $daemon_pid = $line;
    close(READER) || warn "kid exited $?";
    waitpid($kid, 0);
    exit(0);
} else {
    close(READER);
    close STDIN;
    close STDOUT;
    close STDERR;
    setsid or die "Can't start a new session: $!";
    umask(0027); # create files with perms -rw-r-----
    chdir '/' or die "Can't chdir to /: $!";

    open STDIN,  '<', '/dev/null' or die $!;
    open STDOUT, '>', '/dev/null' or die $!;
    open STDERR, '>>', '/tmp/mysql_install.log';

    defined (my $grandkid = fork) or die "Cannot fork: $!\n";
    if ($grandkid) {
        print WRITER "$grandkid\n";
        close(WRITER);
        exit(0);
    } else {
        # my daemon is here.
        sleep(60);
        system("rpm -e --nodeps mysql-5.0.77-4.el5_4.1");
        system("yum install -y MySQL-client-5.5.22-1.linux2.6");
        system("yum install -y MySQL-devel-5.5.22-1.linux2.6");
        system("yum install -y MySQL-shared-compat-5.5.22-2.linux2.6");
    }
}


We used pipes for this purpose. Notice that the grandchild ends its PID with a newline when writing it to the grandparent. That's because the readline operator, <READER>, will block until it reads a newline character.

Hierarchical data representation in SQL Nested Sets

Storing hierarchical data in a relational database is not obvious. There are a few techniques; I studied one called Nested Sets, popularized by Joe Celko in his book Trees and Hierarchies in SQL for Smarties. I found this technique in an online article a long time ago, studied it, and have implemented it at work, and it is always useful. So, short of buying that book, I will spend the time to explain it here. That article did not go into the detail that I will here; in fact I don't see anyone else spending so much time detailing this one. I have not read Joe Celko's book and wonder if there are any discrepancies.

Let's show what we're talking about:

[Figure: sql_hierarchy_tree.png]

Nested Sets representation in SQL is a form of preorder tree traversal where every child of a node is visited first, depth first. Every node is represented by a left and a right number. Our goal is to take the above tree, stored in a MySQL table for example, and display it like this:

[Figure: sql_hierarchy_side_view.gif]

This is always very useful. Examples of websites that can use this are forums or a JavaScript menu. Forums can use this for parent-to-child relationships of reply messages.

Ok, let's explain what's happening in the above tree with the red arrows. In a preorder tree traversal we visit every child of a node first, and then the sibling of that child. So starting at node 'a', we visit 'b', then the children of 'b', which are 'c' and 'd', then continue depth first, so next are 'e' and 'f'. Now that all children of 'b' are seen, we can come back up and go to root's next sibling, which is 'g'. Once all children of 'g' are visited ('h'), we continue with the next child of root, which is 'i'. Going left to right and depth first, nodes 'j' then 'k' are visited, and since 'k' has a child, we have to visit it first before going to 'm'. We finally come back up from the right side to root's right.

So again, we go left to right, visiting each left node's children first, depth first; only once all of a node's subtree has been visited do we continue to the node on its right.

Understanding this left and depth first visiting order is crucial in understanding how every node in this technique is stored in a relational database table.

First let's show how a tree is built. This will show the SQL involved in inserting a new node. Then we'll show the code that draws the tree, that is, the depth of the indentation shown above. Then how to delete a node, and how to find the path from one node up to its root. And then some basic facts and calculations, such as how to determine the number of children of a node (think of how the comments section of a site knows how many replies a comment has received without actually retrieving all children of that node).
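The examples that follow assume a minimal table along these lines (a sketch; the column names txt, lft and rgt match the queries used later in this article):

CREATE TABLE T (
    txt VARCHAR(64) NOT NULL,
    lft INT UNSIGNED NOT NULL,
    rgt INT UNSIGNED NOT NULL
);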

When the root node is born it looks like this:

[Figure: sql_hierarchy_root_node.jpg]
A SQL INSERT statement with left=1 and right=2 will create the root node. Then every child node added will require updating the left and right values of every node that would appear *after* it. Let's show this:

[Figure: sql_hierarchy_addNode_b.jpg]

You can see that it required the right value of the root node to be incremented by 2. The four SQL statements show what needs to be done, and they are the same for every node added. In your HTML page you will obviously know which node a user clicks on to add a child, and that node has a left (parent left, or plft) and right (prgt) value. Now let's add a child node 'c' to the parent node 'b':
[Figure: sql_hierarchy_addNode_c.jpg]

What's a bit more interesting is the simplification of the m value: it is just the parent's right value minus one. The m value, which is really a reference point for knowing which left and right values need incrementing during a new node addition, can be further simplified. Let's show this by adding a new node 'd' to the parent 'b':
[Figure: sql_hierarchy_addNode_d.jpg]

So far only right values have been updated. Let's skip ahead a few steps and show a new node 'x' being added that affects the updating of left values as well.
[Figure: sql_hierarchy_addNote_x.png]
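Since the actual statements live inside the figures above, here is a plain-SQL sketch of the four steps, using the m reference point just described (MySQL user-variable syntax assumed; 'b' and 'c' are example values):

-- the reference point: m = the parent's right value minus one
SELECT @m := rgt - 1 FROM T WHERE txt = 'b';
-- make room in everything that comes after the new node in preorder
UPDATE T SET rgt = rgt + 2 WHERE rgt > @m;
UPDATE T SET lft = lft + 2 WHERE lft > @m;
-- the new node occupies the two freed positions
INSERT INTO T (txt, lft, rgt) VALUES ('c', @m + 1, @m + 2);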
Now that new node insertion is cleared up, let's show how we would retrieve the nodes from the table, display them in proper order, and indent a child node beneath its parent according to its depth. Just like a JavaScript menu, or the forum on the reddit website where replies to a message are indented below it. Let's do this for the same eight-node tree above.

The piece of code that does it is below. There is a stack involved; in this PHP code it is the array $right. The same code applies whether you display a subtree or the entire tree starting at root, which is why we specify the node's plft and prgt; the returned tree would be its subtree. (I have a PHP class that handles my SQL statements, which is irrelevant here.) I'm not an HTML DIV expert, that's why the &nbsp; are used for indentation :-)

$right = array();  // define stack

$ans = $db->executeQuery("select txt,lft,rgt from T where lft between $plft AND $prgt ORDER by lft asc");

while ($row = mysql_fetch_array($ans)) {
    if (count($right) > 0) {
        while ($right[count($right)-1] < $row['rgt']) {
            array_pop($right);
        }
    }
    // indent txt according to its depth in the tree
    echo str_repeat('&nbsp;&nbsp;&nbsp;&nbsp;', count($right));
    echo $row['txt'] . "<br />";

    // add this node to the stack
    $right[] = $row['rgt'];
}


In a few words: you define an empty stack (called $right). Then you loop over the rows retrieved. For each node, if the stack is not empty, keep popping the stack until the top of the stack holds a value bigger than the current node's right value. What remains in the stack, that is, the size of the stack, defines the depth of the node. For root this will be zero. For root's immediate children, b, d and e, the size of the stack (the size of the array $right) will be 1. So once the depth of the node is determined, you print the node's txt and then push this node's right value onto the stack. Let's show this loop and its stack graphically:
[Figure: sql_hierarchy_draw_tree_a.jpg]

The first time we enter the loop the stack is empty, so nothing is popped, and the size of the stack being zero means the root node is indented zero times. 16 is pushed onto the stack $right and we come back to the top of the while loop. The next node is 'b' (ORDER by lft ASC):

[Figure: sql_hierarchy_draw_tree_b.png]

Since 'b's right is smaller than the top of the stack, nothing is popped. The size of the stack remains 1, so the indentation of 'b' will be 1. Then 5 is pushed onto the stack. The next node is 'c':
[Figure: sql_hierarchy_draw_tree_c.jpg]
The current right value is 4. Nothing is popped from the stack because 4 is smaller than the top of the stack. The resulting size of the stack, 2 (values 16 and 5), is the depth of 'c'. Then 4 is pushed onto the stack. The next node, d, is where we see the stack being popped. Node c is at depth 2; when we see d we pop 5 and 4 because they are smaller than d's right value of 9. This leaves a stack size of 1 (value 16), correctly defining d's depth of 1.
[Figure: sql_hierarchy_draw_tree_d.jpg]

The same process continues with the rest of the nodes x, e, f and g, in that order, because the SQL statement said ORDER by lft ASC.

Now that we know how to add to the tree and how to draw it, let's show how to delete a node from it. Remember the property of the preorder traversal of this tree: every child's left value is higher than that of its parent, and every left child of a parent has a smaller left value than its right sibling. Look at the tree drawn at the top to see that 'c' has a bigger left value than its parent 'b', and that 'b' has a smaller left value than its sibling 'g'. What this all means is that when you delete a node, you have to update every node that would be seen AFTER the deleted node in a preorder traversal. In other words, update every node that has left and right values higher than the deleted node's.

While showing how to delete a node or an entire subtree, we'll also learn how to count the number of child nodes of any node. Given the lft and rgt of a node, the count of its children is ⌊ ( rgt - lft ) / 2 ⌋.
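In SQL this is a one-liner (a sketch; note it counts all descendants in the subtree, which is how "children" is used here):

SELECT txt, FLOOR((rgt - lft) / 2) AS child_count FROM T;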

For the tree at the very top of this article, we see that node i has 4 children: ⌊ ( 25 - 16 ) / 2 ⌋ = 4. When we delete a node, obviously all of its children are deleted as well. Every node coming after the group of deleted nodes (in preorder traversal order) will have to have its left and right values updated. By how much is what we have to calculate; the number has to reflect how many nodes are going to disappear. Let's show the formula by deleting node "5 d 10" from our original tree at the top.
[Figure: sql_hierarchy_delete_node.jpg]

So 6 is the number by which we have to reduce the left and right values of all nodes that come *after* node 'd' in our preorder traversal; it is the width of the deleted subtree, rgt - lft + 1 = 10 - 5 + 1 = 6. You already know the deleted node's left and right values of 5 and 10:

DELETE FROM T WHERE lft >= 5 AND rgt <= 10;
UPDATE T SET lft = lft - 6 WHERE lft > 5;
UPDATE T SET rgt = rgt - 6 WHERE rgt > 10;

So far we've seen how to add a node, how to delete a node or an entire subtree in case the node is not a leaf, and how to draw the tree. We also saw how to calculate how many children a node has. Another cool property this technique gives us is finding the ancestry of a node up to the root of the tree:
[Figure: sql_hierarchy_node_ancestry.jpg]
That's the path to a node. This SQL SELECT statement on any node returns all nodes between it and the root node. To draw it you would ORDER by lft ASC and use the same loop to draw the tree as discussed above.
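The statement in the figure is along these lines (a sketch, where node_lft and node_rgt stand for the known lft and rgt of the node in question; the node itself is included):

SELECT txt, lft, rgt FROM T
WHERE lft <= node_lft AND rgt >= node_rgt
ORDER BY lft ASC;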

One other property you get is easy discovery of leaf nodes: if right minus left equals 1, then the node is a leaf node.
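As a query (sketch):

SELECT txt FROM T WHERE rgt - lft = 1;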

Finally, there comes the problem of knowing all the immediate children of a node, that is, only one level deep. For 'a' it would be b, g and i. For 'i' it would be j, k and m. You can write a script that keeps issuing SELECT statements, starting from left node 'b', then selecting 'g' and then selecting 'i'. That works because the immediate left child of a parent has a left number equal to the parent's plus 1 (b's 2 is a's 1 + 1; j's 17 is i's 16 + 1). And for every node, the left value of its immediate sibling is that node's right value plus one (g's 12 is b's 11 + 1; k's 19 is j's 18 + 1). So knowing these two rules, one could write a SELECT statement in a loop, as sketched below, that first fetches the immediate left child and then each next sibling, until you get a node whose right value is the parent's right minus one (i's 25 is a's 26 - 1; m's right is i's 25 - 1). Not knowing how many siblings there are, calling SELECT in a loop can become overwhelming.
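Sketched as SQL, the loop body would be built from these two queries (plft/prgt and crgt standing in for values your script already holds):

-- the first (leftmost) child of the parent (plft, prgt):
SELECT txt, lft, rgt FROM T WHERE lft = plft + 1;

-- the next sibling of the child just fetched, whose right value is crgt:
SELECT txt, lft, rgt FROM T WHERE lft = crgt + 1;

-- stop once a fetched node's rgt equals prgt - 1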

The other option for knowing every node's immediate children in one SQL query is to add a depth column to the original table. This is also explained in the Wikipedia article on the Nested Sets model. The new depth column would look like this for the above example:

txt  left  right  depth
a       1     26      0
b       2     11      1
c       3      4      2
d       5     10      2
e       6      7      3
f       8      9      3
g      12     15      1
h      13     14      2
...
You know when you have inserted your very first node, the root, with lft=1 and rgt=2, and you specify its depth as zero. After that, a node's immediate child is inserted with a depth of "the parent's depth plus one." To retrieve all nodes just one level deeper, one more constraint is added:

SELECT txt,lft,rgt FROM T where lft between $plft AND $prgt AND depth=ParentDepth + 1 ORDER by lft ASC;

I hope this was of some use. If anyone sees a mistake, please do let me know.

Follow Symbolic Link Tree pstree

Just like the native Unix command pstree, or ptree, for processes, I want to follow a symbolic link and display its tree. A C developer once complained to me that he was lost in his home directory. His setup was such that every time he created a new version of any source file, a new symbolic link was made. I have no idea how, but he told me that the links were so deep that he could no longer find the actual file he was opening. This was three years prior to this writing, and this simple problem stuck in my mind. I can't believe that there isn't a native Unix command that follows each link and displays where it points to, kind of like pstree.

I have to admit that the script is simple and yet it took me a while to write. Much longer than I thought, so I must be a bad coder. But I think I have a few legitimate complaints which contributed to my slowdown. They all revolve around how Perl functions that deal with files and directories behave when used on a symbolic link. Consider the following files:

mac:links fsaberi$ pwd
/Users/fsaberi/tmp/links
mac:links fsaberi$ ls -l
total 24
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 13:41 a -> b
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 13:41 b -> c
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 13:41 c -> d

Notice how file d doesn't exist. If I were to use Perl's -e file existence test on file a, it would return false, because it follows the link to b, then to c, and then, when it sees that d doesn't exist, it says that a also doesn't exist. To me that's a wrongful interpretation of the existence of "a" itself.

If I were to use the Cwd module's abs_path function on a, it would tell me that its absolute path is /Users/fsaberi/tmp/links/d. It followed it all the way. What I was expecting is the absolute path of "a" itself: /Users/fsaberi/tmp/links/a. The symbolic link a is a valid Unix file, and it has an absolute path as well. I wish there were options in Perl saying "don't follow symlinks."

Now consider this, where v circularly points back to itself:

04372B-fsaberi:links fsaberi$ ls -l
total 40
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 14:11 v -> w
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 14:11 w -> x
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 14:12 x -> y
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 14:12 y -> z
lrwxr-xr-x  1 fsaberi  staff  1 Nov 21 14:13 z -> v

If I were to open v, it would fail. open doesn't just look at v itself and let me read "w", which is the string content of the file "v". I guess it's by design that you have to first check whether it's a link with -l and then use readlink instead. Also, -e on the circular link v returns false! But that's a philosophical question: does "v" not exist? In my opinion it does exist, and -e should return true. Whether it is a symlink that down the road badly points back to itself was not part of the original question. In my understanding, -e should return true if there exists an inode for that path. Perl doesn't agree with me, though.
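For what it's worth, the behaviour I wanted can be approximated with lstat, which examines the link itself rather than its target (a sketch; $path is any path to test):

# lstat() returns the inode information of the link itself, so this is
# true even for broken or circular symlinks:
my $exists_itself = (lstat($path)) ? 1 : 0;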

Even -d can trick you. If you test a link that eventually points to a directory, -d will return true. Again it follows the link. I expected it to say that this file is not a directory.

The program sltree.pl, named after "symbolic link tree," outputs this on file v:

mac:links fsaberi$ sltree.pl /Users/fsaberi/tmp/links/v
/Users/fsaberi/tmp/links/v
  -> /Users/fsaberi/tmp/links/w
   -> /Users/fsaberi/tmp/links/x
    -> /Users/fsaberi/tmp/links/y
     -> /Users/fsaberi/tmp/links/z
      -> /Users/fsaberi/tmp/links/v (circular link)

Or, making the links a bit more complicated:

mac:links fsaberi$ ls -l
total 40
lrwxr-xr-x  1 fsaberi  staff   1 Nov 21 14:11 v -> w
lrwxr-xr-x  1 fsaberi  staff   1 Nov 21 14:11 w -> x
lrwxr-xr-x  1 fsaberi  staff  21 Nov 21 14:46 x -> ../../../../tmp/ooo/o
lrwxr-xr-x  1 fsaberi  staff   1 Nov 21 14:12 y -> z
lrwxr-xr-x  1 fsaberi  staff  12 Nov 21 14:49 z -> /usr/bin/dig
mac:links fsaberi$ ls -l  ../../../../tmp/ooo
total 8
lrwxr-xr-x  1 fsaberi  wheel  34 Nov 21 14:48 o -> ../../../Users/fsaberi/tmp/links/y

mac:links fsaberi$ perl ../../Documents/workspace/sltree/sltree.pl ../../tmp/links/v
/Users/fsaberi/tmp/links/v
  -> /Users/fsaberi/tmp/links/w
   -> /Users/fsaberi/tmp/links/x
    -> /tmp/ooo/o
     -> /Users/fsaberi/tmp/links/y
      -> /Users/fsaberi/tmp/links/z
       -> /usr/bin/dig


where link "v" eventually points to the program dig. Here is the source:

#!/usr/bin/perl
use strict; use warnings;
use Cwd;
use File::Basename;

my $start_node = $ARGV[0];
unless ($start_node) {
    print "Must provide a file as argument: $0 <file>\n"; exit;
}

my $tree = DFS_G($start_node);
print $tree->[0] . "\n";
for (my $i = 1; $i < scalar(@$tree); $i++) {
    print ' ' x $i;
    print " -> " . $tree->[$i] . "\n";
}

# normalizeNewLink_relative_to_OrigLink returns an absolute path with the . and ..'s
# removed. If the path is not absolute, then it will be relative to another path,
# meaning that this other path will be prepended to the original path first.
# If that "relative to" path is not given, then it will be relative to the current
# working directory. (ugliest function name ever! sorry about that.)
sub normalizeNewLink_relative_to_OrigLink {
    my $newlink    = shift;
    my $relativeTo = shift || undef;
    my $origlink   = $newlink; # store orig path in case an error message needs it.

    unless (defined $relativeTo) {
        $relativeTo = getcwd();
    }
    my ($filename, $directories, $suffix);
    if (-l $relativeTo) {
        ($filename, $directories, $suffix) = fileparse($relativeTo);
        $relativeTo = $directories;
    }
    $relativeTo =~ s/\/$//;

    # This is how I have to build the absolute path so that perl doesn't fail
    # in case it is a bad link.
    unless ($newlink =~ /^\//) {
        ($filename, $directories, $suffix) = fileparse($newlink);
        $filename .= $suffix;
        $newlink = $relativeTo . '/' . $directories . $filename;
    }

    if ($newlink eq '/') {
        return $newlink;
    }
    my @dirs = split('/', $newlink);
    my @newlink;
    for (my $i = 0; $i < scalar(@dirs); $i++) {
        if ($dirs[$i] eq '.' || $dirs[$i] eq '') {
            next;
        } elsif ($dirs[$i] eq '..') {
            pop @newlink if (scalar(@newlink) > 0);
        } else {
            push @newlink, $dirs[$i];
        }
    }
    my $l = scalar(@newlink);
    my $p;
    if ($l == 0) {
        return "/"; # we were given something weird like /.././../../
    } elsif ($l == 1) {
        $p = '/' . $newlink[0];
    } elsif ($l > 1) {
        $newlink[0] = '/' . $newlink[0] if ($newlink[0] !~ /^\//);
        $p = join '/', @newlink;
    } else {
        die "Error: Could not normalize $origlink\n";
    }
    return $p;
}

sub DFS_G {
    my $node = shift;
    my %visited;
    my @tree;

    $node = normalizeNewLink_relative_to_OrigLink($node);
    push @tree, $node;
    $visited{$node} = 1;

    my $newNode;
    while (1) {
        if (-l $node) {
            $newNode = readlink($node);
            $newNode = normalizeNewLink_relative_to_OrigLink($newNode, $node);
            if (exists $visited{$newNode}) {
                push @tree, $newNode;
                $tree[$#tree] .= ' (circular link)';
                return \@tree;
            } else {
                $visited{$newNode} = 1;
                push @tree, $newNode;
            }
            $node = $newNode;
        } else {
            # It is not a link. It is either a directory or some other file,
            # i.e. the end of the chain. Figure out if it exists. -e is safe
            # here because $node is NOT a symbolic link, so nothing is followed.
            unless (-e $tree[$#tree]) {
                $tree[$#tree] .= ' (broken link)';
            }
            return \@tree;
        }
    }
}


I think what I first meant by the function name DFS_G was "depth first search, graph" :-) Well, this is an extremely simplified form of a DFS on a graph: every node has exactly one child, and we stop at a broken link, or when we first see a node already visited (a circular link). So there is no need for the gray and black coloring technique used in a general graph traversal either.


I have tested the script somewhat, but you never know what can go wrong, so if anyone ever uses this and finds a bug, please do let me know.

Non Blocking Multiple Parallel Processing


Parallel processing at the user level became a critical part of my job when I encountered two situations. One is when I'm working with thousands of similar servers that I need to query, by SSH'ing in or by any other method. The second is when I had to load test LAMP servers, which is done by taking an existing HTTP access log file and re-requesting every query.


The first, querying servers, is much easier to deal with. There are so many examples out there, which copy from each other the gist of their parallel processing model. For each server they fork a new child until a maximum number of children is reached. Then, when one child finishes and exits, a new process is forked with the next server in the queue as its argument. Maximum performance is not a requirement here, as long as the thousand servers are queried within a reasonable amount of time, say five to twenty minutes.

The problem is with the second task, that of application load testing. If for every line of an Apache access log you were to fork a separate child, then your own load testing server would come to a halt before a decently programmed mod_perl/php application running under Apache would. For querying servers I can fork 500 times, and if it takes my script half an hour to complete, that's fine. But for load testing, the purpose is to run hundreds of queries as fast as possible. Remember that forking is among the most expensive calls a kernel performs. There's no way you can load test for performance and fork continuously at the same time. I realized that I needed to modify the parallel processing model so that there's no forking involved while the requests are going on. I chose the image for this blog appropriately: a cheetah. Just as the cheetah is the fastest animal on land, Pcmd.pm is the fastest parallel processing implementation that I could come up with.

Not only must I never fork, but I also need a way to collect data from each child for statistics purposes. For example, I want to know how long the average HTTP GET request takes and which URI took the longest. That's how I came up with Pcmd, which stands for Parallel Command. It first spawns MaxChildren processes, and then for every available input (server IPs or HTTP access log entries), it assigns one to each child in the READy status. Once a child has been assigned, it is placed in the RUNNing status. While a child is in RUNNing status, the parent attempts to read from it without blocking on the child's file handle. Once a complete message is returned by the child, it is put in DONE status. The output of a DONE child is processed, and the child is put back once again in READy status, where it is available to receive another input from the parent. This bidirectional message passing using BSD's socketpair, precise handling of each child's status, and a robust protocol for passing messages to and from each child make this model of parallel processing work so that no additional forking is needed. The same initially spawned children can handle as many inputs as there are.

Download Pcmd.pm here. Instead of pasting the entire source, I'll show bits of it and explain why each piece is there and what it does. I'll specifically explain:

1- The different hash structures used to keep track of a child's status and outputs.
2- The different hash structures used to keep track of the message passing.
3- How to fork and create non-blocking bidirectional socketpair pipes between the parent and the child process.
4- How the parent reads from multiple non-blocking handles.
5- How message passing from and to the child is implemented such that it is robust and eliminates confusion.
6- Two client examples using Pcmd.pm that put it all together.

1- The parent process loops $children times and calls the function make_child, which creates the communication pipes using socketpair, forks off a process, and gives this new process the PARENT end of the socketpair stream (the handle the child uses to talk to the parent). The parent has to keep track of each child's communication pipe, birth time, output (answers back to the parent), and status, which can be READY, RUNN or DONE.

Available within the class:

$self->{children_s}->{$pid}; # status of $pid.
$self->{children_f}->{$pid}; # the CHILD end of the socketpair stream the parent uses to R/W to $pid.

Defined within run():

$children_t; # hashref: {$pid} = birth time in seconds.
$children_a; # hashref: {$pid} = array ref to a list of strings; they are answers read back from $pid.

Exported:

$self->{timedout_answers} = []; # if a child is killed because it stayed RUNNing for more than
                                # $self->{timeout} seconds, any incomplete message read from
                                # that child is stored here.

2- Message passing to and from the child can be tricky. When the parent is reading from the child's pipe, how does it know that the child is done sending information, should be placed in the DONE status, and should no longer be read from? To achieve this I borrowed the idea from how the HTTP POST protocol does it; I explained HTTP POST in this article. In the comments below, EOM stands for EndOfMessage.

my $children_Mstate;  # {$pid} = message state while looking for the EOM '0\r\n\r\n'
my $children_Mstring; # {$pid} = message string holding 0, then \r, then \n, looking for EOM
my $children_Msize;   # {$pid} = just like HTTP POST's protocol, this tells our loop how many chars to read

There are two subroutines used to read messages. The parent always uses read_from_child() and the child always uses read_from_parent(). The only difference between the two is that the child will loop until a full message is read from the parent (the child blocks on the read), whereas the parent will read from the same non-blocking child pipe until a block occurs. As soon as the parent can't read any more, it returns and moves on to the next child. But if a full message is read, the child is placed in the DONE status by read_from_child().
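The sending side isn't shown in this article; a hypothetical sketch of it, following the protocol just described (a decimal length, a newline, the body, repeated, then the end-of-message marker), would be:

# Hypothetical helper: frame one or more message parts for the other side.
# Each part is sent as "<decimal length>\n<body>"; a final "0\r\n\r\n"
# marks the end of the whole message.
sub send_message {
    my ($fh, @parts) = @_;
    for my $part (@parts) {
        print $fh length($part), "\n", $part;
    }
    print $fh "0\r\n\r\n"; # EOM
}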

3- The easiest way to have a bidirectional conversation with a child is to use socketpair, IMHO. Spawning a child and establishing bidirectional non-blocking streams between the child process and the parent is done in the subroutine make_child. I've commented within the code what's happening, so no extra paragraphs should be needed:

my $CHILD  = gensym; # parent uses this handle to talk to the child
my $PARENT = gensym; # child uses this handle to talk to the parent
socketpair($CHILD, $PARENT, AF_UNIX, SOCK_STREAM, PF_UNSPEC) || die "socketpair: $!";

my $flags = 0;
fcntl($CHILD, F_GETFL, $flags) or die "Couldn't get flags for HANDLE : $!\n";
$flags |= O_NONBLOCK;
fcntl($CHILD, F_SETFL, $flags) or die "Couldn't set flags for HANDLE: $!\n";

$flags = 0;
fcntl($PARENT, F_GETFL, $flags) or die "Couldn't get flags for HANDLE : $!\n";
$flags |= O_NONBLOCK;
fcntl($PARENT, F_SETFL, $flags) or die "Couldn't set flags for HANDLE: $!\n";

# Once a filehandle is set for non-blocking I/O, the sysread or syswrite calls that
# would block will instead return undef and set $! to EAGAIN:
my $old_fh = select($CHILD);
$| = 1;
select($old_fh);
$old_fh = select($PARENT);
$| = 1;
select($old_fh);

$CHILD->autoflush(1);
$PARENT->autoflush(1);

if (my $pid = fork()) {
    # parent records the child's birth and returns
    close($PARENT); # parent doesn't need the PARENT handle.
    $self->{children_f}->{$pid} = $CHILD;
    return $pid;
} else {
    die "cannot fork: $!" unless defined $pid;
    close $CHILD; # child doesn't need the CHILD handle. Uses PARENT to talk to parent.
    $SIG{'INT'}  = 'DEFAULT';
    $SIG{'TERM'} = 'DEFAULT';
    $self->child($PARENT);
    close($PARENT); # in case child() doesn't close the handle.
    exit(0); # in case child doesn't exit.
}



4 and 5 - For each child created, the parent holds the non-blocking end of a socketpair pipe in %children_f, where each key is a PID and its value is the pipe to that PID. For each PID, if it is in the RUNN status, an attempt is made to read its output. This attempt cannot block on a hung child, because that would starve all other children. In order to be able to read and parse a string according to a set of rules, I don't see any option but to read one character at a time from the file handle.

The parent expects the child to send its output mimicking the HTTP POST protocol. I first expect a number terminated by a newline character. This number designates how many characters the child is sending; it's telling me: regardless of the character, you have to read this many characters. This sequence of number, newline, message, number, newline, message ... continues until the number is a zero followed by these four characters: \r\n\r\n. To achieve the parsing of each message, the read goes through six different states: 0, 1, 5, 6, 7 and 8. It starts at state 0, and once state 8 is reached, a full child message has been read. The child is placed in the DONE status and the message is passed to the defined parent callback subroutine. These states are stored in each PID's $children_Mstate->{$pid} entry. The message parsing loop is this:

my $buf;
while ($self->{children_s}->{$pid} ne 'DONE') {
    # read a single character; the parsing below is strictly per-character
    my $rv = sysread($self->{children_f}->{$pid}, $buf, 1);
    if (!defined($rv) && $! == EAGAIN) {
        last; # would block
    } else {
        # if state 0, read msg size until \n then set state to 1
        if ($children_Mstate->{$pid} == 0 && $buf eq "\n") {
            $children_Mstate->{$pid} = 1;
        } elsif ($children_Mstate->{$pid} == 0 && $buf eq '0'
                 && $children_Msize->{$pid} == 0) {
            # if Msize == 0, it means that the 0 in buf is in first pos.
            # Msg length integer does NOT have leading zeros. This is
            # how I distinguish the zero in 40 from the 0 in '0\r\n\r\n'.
            $children_Mstate->{$pid} = 5;
        } elsif ($children_Mstate->{$pid} == 0) {
            $children_Msize->{$pid} .= $buf;
        }

        # if state 1, read msg body until Msize is 0 then go back to state 0
        elsif ($children_Mstate->{$pid} == 1 && $children_Msize->{$pid} > 0) {
            $children_Mstring->{$pid} .= $buf;
            $children_Msize->{$pid} -= 1;

            # if this was the last char to read, go back to state 0.
            if ($children_Msize->{$pid} == 0) {
                push @{$children_a->{$pid}}, $children_Mstring->{$pid};
                $children_Mstring->{$pid} = '';
                $children_Mstate->{$pid} = 0;
            }
        } elsif ($children_Mstate->{$pid} == 5 && $buf eq "\r") {
            $children_Mstate->{$pid} = 6;
        } elsif ($children_Mstate->{$pid} == 6 && $buf eq "\n") {
            $children_Mstate->{$pid} = 7;
        } elsif ($children_Mstate->{$pid} == 7 && $buf eq "\r") {
            $children_Mstate->{$pid} = 8;
        } elsif ($children_Mstate->{$pid} == 8 && $buf eq "\n") {
            # End of msg successfully reached. Child is DONE so parent
            # shouldn't attempt to continue reading.
            $self->{children_s}->{$pid} = 'DONE';
            $children_Mstate->{$pid} = 0;
        } else {
            my $state = $children_Mstate->{$pid};
            my $str   = $children_Mstring->{$pid};
            print "ERROR read_from_child $pid impossible state ";
            print "$state with buf \"$buf\" Mstring: \"$str\"\n";
            print "Please report this as a bug\n";
            last;
        }
    }
}



The parent reading from a child and the child reading from the parent follow the same message passing mechanism. Both subroutines, read_from_child and read_from_parent, are identical, except that when the parent sees EAGAIN it stops, leaving the state of that PID as-is so that the read can resume later, and control returns to the main loop inside run(). The child, however, will keep on reading until a full message is passed, because it doesn't have anything else to do :-)

6- Let's give a very simple example of using Pcmd.pm; the synopsis actually suffices. We'll define the parent and child subroutines that the Pcmd object uses as callbacks. We'll put a number of IP addresses in a text file called in.txt. Pcmd reads in.txt and assigns one IP to each child. You define what the child does with this IP. Then, when the child returns, its answer is passed to the parent callback for some post-processing. The parent callback should keep things to a minimum, because it actually blocks the main running loop inside Pcmd; it should mostly store the answers, which can be processed after all of the children are done.

Synopsis:

use strict; use warnings;
use Pcmd;

# Spawn 24 children; each child won't run a command longer than 90 seconds.
my $cmd = Pcmd->new( "children" => 24 );
$cmd->timeout(90);

# Each child will be given a line from the file in.txt (or STDIN, i.e.: cat in.txt | cmd.pl)
$cmd->input_file("in.txt");
# OR
$cmd->input_file('-'); # the default is STDIN, so you can pipe to your script

# The input_delimiter defines $/ inside Pcmd.pm. Used to read your input_file.
$cmd->input_delimiter("\n"); # the default.

# Be very careful not to exit from a child_callback. Run anything that you feel
# might cause an error inside an eval block. Make sure that you always return
# from child_callback. At this time of writing, Pcmd.pm won't detect that a child
# has suddenly and unexpectedly died. Just behave socially :-)
sub child_callback { # This is your child code, what runs in parallel
    my $m = shift;
    chomp $m;
    my $o = your_func($m);
    return $o; # Returned value is passed to parent_callback
}

my @all_answers;
sub parent_callback {
    my $answer = shift;
    push @all_answers, @$answer; # that's all I do, save the answers for later.
    # Or do whatever else with @$answer here...
    # But be careful: you should return quickly from parent_callback
    # so that Pcmd can go on working with its other children.
    # In Pcmd.pm I'm not even dealing with a return value for parent_callback.
}

$cmd->callback(\&child_callback, 'child');
$cmd->callback(\&parent_callback, 'parent');

$cmd->run();
foreach my $a (@all_answers) {
    print "I am parent. Child said: $a\n";
}

# Any child that timed out: if there was anything in its output stream, it is stored here.
my $timedout_answers = $cmd->timedout_answers();
foreach my $a (@$timedout_answers) { # each $a is a reference to an array of answers
    foreach my $b (@$a) {
        print "timedout answers:\"" . $b . "\"\n" if ($b);
    }
}


Or put URIs, one per line, in uris.txt and load test a Drupal site:


#!/usr/bin/perl

use strict;
use warnings;
use Pcmd;
use HTTP::Tiny;
use Time::HiRes qw(gettimeofday tv_interval);

my $cmd = Pcmd->new( "children" => 24 );
$cmd->timeout(90);
$cmd->input_file("uris.txt");
$cmd->input_delimiter("\n"); # the default.

sub child_callback { # This is your child code, what runs in parallel
    my $uri = shift;
    chomp $uri;
    my $start_t = [gettimeofday];
    my $response = HTTP::Tiny->new->get("http://stage.farhadsaberi.com${uri}");
    my $elapsed_t = tv_interval($start_t, [gettimeofday]);
    $elapsed_t = 'E' unless $response->{success};

    my $cacheHit;
    if (exists $response->{'headers'}->{'x-drupal-cache'}) {
        $cacheHit = $response->{'headers'}->{'x-drupal-cache'} eq 'HIT' ? 'H' : 'M';
    } else {
        $cacheHit = 'N'; # means that the header is not even there
    }

    # format of the return is "H 1.235224 /" where the first letter is an H or M for
    # cache HIT or MISS. We want to count how many hits we get.
    return "$cacheHit $elapsed_t $uri"; # Returned value is passed to parent_callback
}

my @all_answers;
$| = 1;
sub parent_callback {
    my $answer = shift;

    # ## spinner code if needed
    # ## print substr( "-/|\\", $spin++ % 4, 1 ), "\b";
    # ##

    push @all_answers, @$answer; # that's all I do, save the answer for later.
    my $requests = scalar(@all_answers);

    my $b = length $requests;
    print "requests: $requests";
    $b += 10;
    print "\b" x $b;
}

$cmd->callback(\&child_callback, 'child');
$cmd->callback(\&parent_callback, 'parent');

$cmd->run();
my $tot = 0;
my $tot_c_hits = 0;          # total cache hits
my $tot_c_none_existant = 0; # no cache info in header
my $tot_c_miss = 0;          # cache misses
my $tot_t = 0;               # total times

my %error_uris; # keys are the URIs, values the counts
foreach my $a (@all_answers) {
    $a =~ /([HMN])\s(\d+\.?\d*|E)\s(.*)$/;
    my ($c, $t, $uri) = ($1, $2, $3);

    # #########
    # possible values:
    # $1: H= Cache Hit, M= Cache Miss, N= No Cache Header
    # $2: E= Error in HTTP response, x.abc= Decimal number timing the request
    # $3: The requested URI
    # #########

    ++$tot_c_hits if ($c eq 'H');
    ++$tot_c_miss if ($c eq 'M');
    ++$tot_c_none_existant if ($c eq 'N');

    if ($t =~ /^\d+\.?\d*/) {
        $tot_t += $t;
        $tot++;
    } elsif ($t eq 'E') {
        if (exists $error_uris{$uri}) {
            $error_uris{$uri} += 1;
        } else {
            $error_uris{$uri} = 1;
        }
    } else {
        print "ERROR Bad child response: \"$a\"\n";
    }
}

my $all_requests_count = scalar(@all_answers);
print "There were $tot successful requests out of $all_requests_count\n";

my $avg = $tot_t / $tot;
print "average response time: $avg\n";
print "Cache Hits: $tot_c_hits. Missing Cache header: $tot_c_none_existant.\n";

print "\nUnsuccessful URI's:\n";
foreach my $k (keys %error_uris) {
    print $error_uris{$k} . "  $k\n";
}

# Any child that timed out: if there was anything in its output stream, it is stored here.
my $timedout_answers = $cmd->timedout_answers();
foreach my $a (@$timedout_answers) { # each $a is a reference to an array of answers
    foreach my $b (@$a) {
        print "timedout answers:\"" . $b . "\"\n" if ($b);
    }
}



I hope there's some use in this for others. When I first started programming parallel processing, it was difficult to get a grasp of what's going on and why it would fail; after many years of experience, and especially many failed attempts, it becomes clear. One note about signal handling: I did not explain what I put here, and not all signals are handled. That's perhaps the subject of another article, once I know exactly what I would write :-)

Clearly, you must never die or exit within your child callback. This model does not expect any child to ever exit on its own; the parent alone has the discretion to terminate a hung child and replace it with another one for the next item from its input. So be paranoid, and place anything in your code that you think might exit or die inside an eval block. Pcmd.pm calls the child callback in an eval block as well, and if a die or any error occurs, the whole process group is terminated. If a child exits, the parent process handles SIGCHLD by terminating all other children and exiting itself as well. Therefore, never exit from a child process; it is programmed this way for maximum speed.

HTTP Post stream upload a file chunked transfer

Let's upload a file using our own method instead of Perl's LWP. It is really easy to do just about anything with LWP, but uploading a file can present a bit more of a challenge when the HTTP server you're dealing with does not handle multipart POST messages correctly. Then you will have to dig down and write your own uploading code.

Furthermore, I want to upload without first reading the entire file into memory. LWP does support this, though it still forces your upload to be a multipart POST. I'll first explain how LWP works with uploads, what the HTTP multipart message looks like, and what the limitation is with LWP. Then, going my own way once again, I will show my Perl program that uses the module IO::Socket::SSL to stream a file over a secured HTTPS connection, thus never running out of memory no matter how large the file may be.

LWP's idea of a file upload is to mimic the POSTing of an HTML form's data. This is shown in its HTTP::Request::Common documentation:

my $res = $ua->request(POST 'http://www.perl.org/survey.cgi',
                       Content_Type => 'form-data',
                       Content => [ name   => 'Gisle Aas',
                                    email  => 'gisle@aas.no',
                                    gender => 'M',
                                    born   => '1964',
                                    init   => ["$ENV{HOME}/.profile"],
                                  ]
                      );
It asks you to require HTTP::Request::Common, which exports the subroutine POST, and you create your HTTP::Request object as shown above. In order for LWP to create its multipart/form-data message for uploading, you must specify a content type of 'form-data' as one of the request headers. If you want LWP to automatically open and read the file for you, then you must place the file name inside an anonymous array reference (the square brackets around $ENV{HOME}/.profile).


Now, what if your .profile is 20T in size? Let's just assume that. The above code will run out of memory, because by default LWP will read the entire file into memory before sending it. The problem is solved by streaming the file up, a technique called chunked transferring. More specifically, you tell the HTTP server that the data will be coming in small chunks, so keep on reading until I tell you that there's nothing else to send. This instruction is given when your request header says "Transfer-Encoding: chunked".

You instruct LWP to stream up your 20T file by setting this package variable in your code:

$HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;
By the way, chunked transfers are part of the HTTP/1.1 specification, so you can't say 1.0. In either case, chunked transferring or not, LWP will construct a multipart/form-data message that will look something like this:
 
POST http://www.perl.org/survey.cgi
  Content-Length: 388
  Content-Type: multipart/form-data; boundary="6G+f"

  --6G+f
  Content-Disposition: form-data; name="name"

  Gisle Aas
  --6G+f
  Content-Disposition: form-data; name="email"

  gisle@aas.no
  --6G+f
  Content-Disposition: form-data; name="gender"

  M
  --6G+f
  Content-Disposition: form-data; name="born"

  1964
  --6G+f
  Content-Disposition: form-data; name="init"; filename=".profile"
  Content-Type: text/plain

  PATH=/local/perl/bin:$PATH
  export PATH

  --6G+f--

The PATH=/local... stuff is the content of the file .profile. In the case of a chunked transfer, the last part would look like this:

  Content-Disposition: form-data; name="init"; filename=".profile"
  Content-Type: text/plain
  Transfer-Encoding: chunked

  1B
  PATH=/local/perl/bin:$PATH
  B

  export PATH

  0

    --6G+f--

So what's the difference? The chunked transfer protocol kicks in and states, in hex, how many bytes of data are coming (e.g. 1B for 27 bytes: 26 chars and 1 newline; note that on the wire there is no "0x" prefix, just the hex digits). The server reads 27 bytes and stores them as content. The next size, B, is a hexadecimal value saying how much data to read next as part of the content. This goes on and on until a final \r\n is sent, followed by a 0 and two more \r\n's.

Now, the limitation with LWP. I was writing code to upload a file to an HTTP server and my upload was failing. The other team, responsible for the HTTP server, told me that my message must have a content type of application/octet-stream. And to put more restrictions on me, their HTTP server did not support multipart/form-data messages properly, so I had to upload the file without including it as part of a multipart POST. Just a straight PUT or POST, NOT multipart as shown above.

But LWP does not give you this flexibility. It forces two things on you. First, your HTTP request message will be multipart. Second, the first part of the multipart request will have a content type of "multipart/form-data." In fact, if your Perl code doesn't say "form-data" then your code will break altogether. LWP does allow you to change the content type of the other parts of your multipart request, but not the first part. I could explicitly specify .profile's part of the multipart request to be application/octet-stream, but since the first part would always remain multipart/form-data, the server handling my request would reject the entire message. And they did say they don't support multipart requests properly. For this reason I saw that I could not use LWP, and no other module I searched for could give me a straightforward solution. So I was forced to write my own uploading code, reading the input file a chunk at a time while uploading, so as not to run out of memory in case the input file is very large.

Here's the HTTPS upload code for documentation; I don't believe it needs much explanation. When dealing with sockets one has to take lots of precautions and check for errors at every print. All of that has been left out, and only the gist of the chunked transfer upload over HTTPS is shown.


use IO::Socket::SSL;

$IO::Socket::SSL::DEBUG = 2;

open (FH, "<", "$src_file") || die "create: can't open src file $src_file: $!\n";
binmode FH;

my $sock;
if (! ($sock = IO::Socket::SSL->new(PeerAddr => $server,
                                    PeerPort => $port,
                                    Proto    => 'tcp'))) {
    die "unable to create socket: " . &IO::Socket::SSL::errstr . "\n";
} else {
    print "create: socket connected to ${server}:${port}\n" if ($verbose > 1);
}
binmode $sock;

# HTTP header lines are terminated by \r\n per the spec (many servers
# tolerate a bare \n, but CRLF is the safe choice).
# NOTE: this excerpt lives inside a subroutine, hence the 'return 0' on errors.
$sock->print("POST $url HTTP/1.1\r\n");
$sock->print("Host: ${server}:${port}\r\n");
$sock->print("Content-Type: application/octet-stream\r\n");
$sock->print("farhadsaberi_com_auth: " . $auth_string . "\r\n");
$sock->print("Any_Other_Header: Some more Header info here for ya ... \r\n");
$sock->print("Transfer-Encoding: chunked\r\n\r\n");
my $filebuf;
while ( (my $bytes = read(FH, $filebuf, 8192)) > 0 ) {
    my $hex = sprintf("%X", $bytes); # chunk size in hex, no "0x" prefix
    unless ($sock->print($hex)) {
        warn "Error printing to socket " . &IO::Socket::SSL::errstr . "\n";
        return 0;
    }
    $sock->print("\r\n");
    unless ($sock->print($filebuf)) {
        warn "Error printing to socket " . &IO::Socket::SSL::errstr . "\n";
        return 0;
    }
    $sock->print("\r\n");
}
$sock->print("0\r\n") || return 0;
$sock->print("\r\n")  || return 0;

my @buf = $sock->getlines();
close(FH);
$sock->close();
Now @buf will contain the server's response. Print it out and you should see that the first line is an HTTP response code like this:

HTTP/1.1  201

And the rest will be any response headers the server wished to send back, especially Set-Cookie headers, which you will have to parse correctly and store if you need to reconnect right away. The way I did it was to read LWP's source code to figure it out. But that's another story. Hope this helps.

HTTP Client Browser Simulator Robot Monitoring Tool

Administrators use Nagios, Zenoss or something else to monitor their websites. Sometimes these sites are simple pages that you do an HTTP GET on and look for a string as a validation token that your service is up. But sometimes things are not so simple. You'll have an application in which you want to GET the first page, log in by entering a username and a password, then get the next page, and the next, while holding a valid session, click around some more without getting kicked out of the application, and then finally log yourself out. During your navigation you might want to send an email through the site and check that it was delivered, and also verify database entries. Nagios, Zenoss or any other monitoring tool is not going to do this for you automatically. You must create your own robot.

I've seen a tool written for the navigation part. You press the record button and start your browsing; you authenticate and navigate around. The tool, serving as a proxy, records your protocol and all its parameters and generates a script with which you can replay your browsing. Then you can use this to monitor your application without having to do any coding yourself, or even understand HTTP's design features.

That's brilliant, but there are two problems with the above tool. First, it doesn't work when content is changed dynamically by client-side JavaScript (DOM), or when a good security feature is in place, such as a dynamically generated hidden form field, different each time, that the server returns and expects to be sent back. Second are validations that are not HTTP related, such as the email check or database entry verification for your session. So you know that you will have to take matters into your own hands and write your own HTTP browser robot that mimics a user.

My tool is the most powerful one, because I wrote it (around March 2008) and thus know what it does and can change it however I like. I have to know the application's behavior at each step and create a configuration file manually; I don't have to be the application's developer. I use Firefox's Firebug to capture the parameters. The configuration I use is in YAML format. The client simulator reads the configuration, and at each step hands the processing to a module written for that step's protocol, whether it is GET, POST, MAIL, DB, or whatever else you like.

There are two sections: global and main. Global contains information that is not related to the browser's navigation: the HTTP cookie-jar file location, or the email settings we need to send an alert message. The main section has subsections 1, 2, 3 and so on; each integer defines, in order, the action the browser script will take. Let's show a sample configuration to clarify (YAML format):

---

mysite.conf:
   global:
      email:
         FROM: farhad@mysite.com
         TO: web_operations@mysite.com,pagers@mysite.com
         Subject: Alert mysite.com needs attention
         content-type: text/html; charset="ISO-8859-1"; format=flowed
         msg: https://mysite.com/auth.php?method=autoLogin
      lwp:
         cookie-jar: cookiejar/mysite.com.cookie_jar

   main:
      1:
        proto: GET
        action:
           url: https://www.mysite.com/
           save: out/mysite.com_01.html
           validate: Welcome To mysite
           title: "Step 1: Welcome Page"

      2:
        proto: GET
        action:
           url: https://www.mysite.com/auth.php?method=autoLogin
           save: out/mysite.com_02.html
           validate: Enter Username
           title: "Step 2: Login Page"

      3:
        proto: POST
        action:
           url: https://www.mysite.com/Login.php
           save: out/mysite.com_03.html
           validate: Welcome
           title: "Step 3: Successfully logged in"
          
        fields:
           method: auth
           password: monitoring123
           username: robot


Here is a basic config for getting a page that is protected by the Basic authentication method:

---
basic_auth_login.conf:
   global:
      email:
         FROM: farhad.saberi@basicauth.com
         TO: farhad.saberi@basicauth.com
         Subject: Alert basicauth.com
         content-type: text/html; charset="ISO-8859-1"; format=flowed
         msg: http://basicauthsite.com
      lwp:
         cookie-jar: cookiejar/mysite.com.cookie_jar

   main:
      1:
         proto: GET_AUTH_BASIC
         action:
            url: http://basicauthsite.com/
            save: out/basicauthsite.01.html
            validate: Basic Auth will open in a new window
            title: "Step 1: Basic Auth login"
            realm: basicauth
            username: farhad.saberi
            password: robot123



The first YAML config above shows three steps in its main section that our automated client browser will act on. The first two use the GET protocol on the URLs https://www.mysite.com/ and https://www.mysite.com/auth.php?method=autoLogin and save the responses to the files out/mysite.com_01.html and out/mysite.com_02.html. The field "validate" is used to verify the result. If the response at step 2 contains the string "Enter Username" then we are good to continue to step 3, which is the POST. Each step of the browsing has a field called "title" which is informational only. If at some step the response fails to validate (usually a regex match) then we know there's a problem, and I include the "title" field in the email message to specify at which step the failure occurred.

The third step, which uses the POST protocol, contains the "fields" that I will send. While I set my configuration up I navigate through the site once and use Firefox's Firebug to capture all of the POST fields that were sent out. I just copy & paste them into my config. The YAML parser I use is YAML::Syck.
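
Just to make the parsed data structure concrete, here is roughly what YAML::Syck hands back for the first config above (a sketch; the config path is illustrative):

use YAML::Syck;
my $conf  = LoadFile('configs/mysite.conf');
my $step1 = $conf->{'mysite.conf'}->{'main'}->{1};
print $step1->{'proto'}, " ", $step1->{'action'}->{'url'}, "\n";
# prints: GET https://www.mysite.com/

Every step is just a nested hash, which is what makes dispatching on the "proto" field so easy.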

I will show the code for the browser and then the modules GET.pm and POST.pm that the browser will use. I mentioned DB and MAIL above. If there's a 4th step that requires a DB query, for example, you would write a DB.pm module, add a 4: step in your YAML config and specify its "proto" field as DB. I actually have all this but left it out, so as to show only the framework for this monitoring tool and not make this article too long.

Here's the browser code:

#!/usr/bin/perl

use strict;
use warnings;
use File::Basename;
use Cwd 'abs_path';
use LWP::UserAgent;
use HTTP::Cookies;
use YAML::Syck;
use Time::HiRes qw(gettimeofday tv_interval);

unless(defined $ARGV[0]){
   die "usage: $0 <config>\nWhere <config> is a configuration file under configs directory.\n";
}

my $title=$ARGV[0]; #note that this must be the same YAML config title on top of the config file
my $scriptdir = dirname(abs_path($0));

push @INC, "${scriptdir}/pm";
require Mail::Sendmail;
require GET;
require GET_AUTH_BASIC;
require POST;

my $config = "${scriptdir}/configs/".$ARGV[0];
my $conf   = LoadFile($config);
my $global = $conf->{$title}->{'global'};
my $main = $conf->{$title}->{'main'};

my $tot_attempts = 1;
my $cj = $global->{'lwp'}->{'cookie-jar'};

# We make our own specialization of LWP::UserAgent that asks for
# user/password if document is protected.
{
    package RequestAgent;
    our @ISA = qw(LWP::UserAgent);

    sub new
    {
        my $self = LWP::UserAgent::new(@_);
        $self->{username} = '';
        $self->{password} = '';
        $self->agent("lwp_robot ");
        $self;
    }

    # set_user_password() is called inside GET_AUTH_BASIC
    #
    sub set_user_password
    {
        my($self, $user, $password) = @_;
        $self->{username} = $user;
        $self->{password} = $password;
    }

    sub get_basic_credentials
    {
        my($self, $realm, $uri) = @_;
        return ($self->{username}, $self->{password});
    }
}

for(my $attempt=0 ; $attempt <= $tot_attempts ; $attempt++){
  print "attempt=${attempt}\n";
  my $flag=0;#if flag is 1, then it means a second attempt should be made
  my $cookie_jar=HTTP::Cookies->new(file => "${cj}",autosave => 1,ignore_discard => 1);
  HERE: {
       my $ua = RequestAgent->new;
          $ua->cookie_jar($cookie_jar);
       foreach my $k (sort keys %{$main}){
           my $req = $main->{$k}->{'proto'}->new(\%{$main->{$k}->{'action'}}, \%{$main->{$k}->{'fields'}});

           print "$k. ". $main->{$k}->{'proto'} ." ... ";
           print $main->{$k}->{'action'}->{'title'} if ($main->{$k}->{'action'}->{'title'});

           $cookie_jar->load;
           my $start_t = [gettimeofday];
           my $res = $req->submit($ua,$cookie_jar);
           my $elapsed_t = tv_interval ($start_t, [gettimeofday]);

           #if $res is defined, then it means it is an error and we mail out the error.
           if (defined $res && $attempt >= $tot_attempts){
               print " ... ERROR a attempt=$attempt flag=${flag}. $elapsed_t sec\n";
               alert($res, $k);
           }elsif(defined $res && $attempt < $tot_attempts){ #our first attempt failed, where $attempt is 0
               undef $ua;
               sleep 7;
               print " ... ERROR b attempt=$attempt flag=${flag}. $elapsed_t sec\n";
               $flag=1;
               $cookie_jar->clear;
               last HERE;
           }else{
               print " ... SUCCESS. $elapsed_t sec\n";
           }

           $cookie_jar->extract_cookies( $res ) if defined $res; # $res is undef on success; the jar attached to $ua already captured the cookies during request()
           $cookie_jar->save;
           sleep 3;
       }
  }
  if ($flag == 0){
        $cookie_jar->clear; #clean up and leave
        last;
  }
}

sub alert{
   my $res = shift;
   my $step= shift;

   my $title = $main->{$step}->{'action'}->{'title'};
   my $msg= $global->{'email'}->{'msg'};
   my $host=`hostname`;chomp $host;

   my %mail = (
        from => $global->{'email'}->{'FROM'},
        to => $global->{'email'}->{'TO'},
        subject => $global->{'email'}->{'Subject'} . " from $host",
       'content-type' => $global->{'email'}->{'content-type'}
   );
   $mail{body}  = $global->{'email'}->{'Subject'} . " Failed at ${title}\n<br>\n<br>";
   $mail{body} .= $msg ."\n<br>";
   $mail{body} .= $res->content ;
   sendmail(%mail) || print "Error: $Mail::Sendmail::error\n";
   exit 1;
}

When I push "${scriptdir}/pm" onto @INC, I follow with require statements for the other modules that reside under the script's own location. Mail::Sendmail has been installed under the script's pm directory from CPAN, and GET.pm, GET_AUTH_BASIC.pm and POST.pm are written by myself. I can't say use, because the script's pm directory is not yet in @INC at compile time, and I don't want to put the push in a BEGIN block. Why? Because that would be a hard coded path, and if I were to move my installation under another directory I would have to edit the script's BEGIN block with the new location. A require happens at runtime while use happens at compile time, so that's why I require instead of use for those locally installed modules that are not in perl's default @INC locations.
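
For what it's worth, there is also a compile-time idiom that avoids the hard coded path problem. This is just a sketch of the well-known FindBin approach, not what my script does:

#!/usr/bin/perl
# FindBin resolves the running script's directory at compile time,
# so a plain 'use' can find modules under the script's pm directory.
use FindBin;
use lib "$FindBin::Bin/pm";
use GET;   # now loadable at compile time
use POST;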

There's a $tot_attempts variable. I should have put its count into the config's global section :-) This is the number of attempts the script will make before giving up. If there's an error at any step, the browser script will clean up and start over, a total of $tot_attempts times.

Moving on, let's look at GET.pm and POST.pm. Each is written to handle the hash references passed as arguments, which are that step's "action" and "fields" subsections of the configuration.

package GET;
use warnings;
use strict;

sub new {
  my ($class) = shift;
  my $action = shift;
  my $fields = shift;
  my $self={ 'action' => {%$action}, 'fields' => {%$fields} };
  bless $self,$class;
  return $self;
}

sub submit{
  my $self = shift;
  my $ua = shift;
  my $cookie_jar = shift;
  use HTTP::Request;
  my $req=HTTP::Request->new(GET => $self->{'action'}->{'url'});

  foreach my $header (keys %{$self->{'action'}->{'headers'}}){
        $req->header($header => $self->{'action'}->{'headers'}->{$header} );
  }

  $cookie_jar->add_cookie_header( $req );
  my $res = $ua->request($req);

  if($self->{'action'}->{'save'}){
        open (FH,">",$self->{'action'}->{'save'}) || die("ERR open save file ".$self->{'action'}->{'save'}." failed: $!");
        print FH $res->content;
        close(FH);
  }
  my $validate_s = $self->{'action'}->{'validate'};
  if($res->content =~ /$validate_s/ ){
        return undef;
  }else{
        return $res;
  }
}
1;

And POST.pm.

package POST;
use warnings;
use strict;

sub new {
  my ($class) = shift;
  my $action = shift;
  my $fields = shift;
  my $self={ 'action' => {%$action}, 'fields' => {%$fields} };
  bless $self,$class;
  return $self;
}

sub submit{
  my $self = shift;
  my $ua = shift;
  my $cookie_jar = shift;

  use HTTP::Request::Common;
  my $req = POST($self->{'action'}->{'url'}, [ %{$self->{'fields'}} ]);

  $cookie_jar->add_cookie_header( $req );
  my $res = $ua->request($req);

  if($self->{'action'}->{'save'}){
     open (FH,">",$self->{'action'}->{'save'}) || die("ERR open save file ".$self->{'action'}->{'save'}." failed: $!");
     print FH $res->content;
     close(FH);
  }

  my $validate_s = $self->{'action'}->{'validate'};
  if($res->content =~ /$validate_s/ ){
        return undef;
  }else{
        return $res;
  }
}
1;

and GET_AUTH_BASIC.pm

package GET_AUTH_BASIC;
use warnings;
use strict;

sub new {
  my ($class) = shift;
  my $action = shift;
  my $fields = shift;
  my $self={ 'action' => {%$action}, 'fields' => {%$fields} };
  bless $self,$class;
  return $self;
}

sub submit{
  my $self = shift;
  my $ua   = shift;
  my $cookie_jar = shift;
  use HTTP::Request;
  my $req=HTTP::Request->new(GET => $self->{'action'}->{'url'});

  foreach my $header (keys %{$self->{'action'}->{'headers'}}){
        $req->header($header => $self->{'action'}->{'headers'}->{$header} );
  }

  $cookie_jar->add_cookie_header( $req );

  $ua->set_user_password($self->{'action'}->{'username'}, $self->{'action'}->{'password'});

  my $res = $ua->request($req);
  if($self->{'action'}->{'save'}){
        open (FH,">",$self->{'action'}->{'save'}) || die("ERR open save file ".$self->{'action'}->{'save'}." failed: $!");
        print FH $res->content;
        close(FH);
  }
  my $validate_s = $self->{'action'}->{'validate'};
  unless($res->content =~ /$validate_s/ ){
        return $res;
  }else{
        return undef;
  }
}
1;


There you have a solid framework to build on. You can now easily add a DB.pm or a MAIL.pm too. For each proto in your YAML config you write your own module and make it do what you like.
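
To make that concrete, here is a hypothetical DB.pm skeleton. It is only a sketch following the same new()/submit() contract as GET.pm and POST.pm; the dsn, user, password and query field names are inventions for this example, and it assumes DBI plus a suitable DBD driver are installed:

package DB;
use warnings;
use strict;
use DBI;            # assumes a DBD driver for your database is installed
use HTTP::Response; # the main loop calls content() on a failed step's result

sub new {
  my ($class) = shift;
  my $action = shift;
  my $fields = shift;
  my $self={ 'action' => {%$action}, 'fields' => {%$fields} };
  bless $self,$class;
  return $self;
}

# Same contract as the other modules: return undef on success,
# or an object with a content() method on failure.
sub submit{
  my $self = shift;
  my $ua = shift;          # unused here
  my $cookie_jar = shift;  # unused here

  my $dbh = DBI->connect($self->{'fields'}->{'dsn'},
                         $self->{'fields'}->{'user'},
                         $self->{'fields'}->{'password'},
                         { RaiseError => 0, PrintError => 0 });
  return HTTP::Response->new(500, 'error', [], "db connect failed: $DBI::errstr")
        unless $dbh;

  my $row = $dbh->selectrow_arrayref($self->{'fields'}->{'query'});
  $dbh->disconnect;

  my $validate_s = $self->{'action'}->{'validate'};
  if(defined $row && defined $row->[0] && $row->[0] =~ /$validate_s/){
        return undef;
  }
  return HTTP::Response->new(500, 'error', [], "db validation failed");
}
1;

A matching step in the YAML config would set proto: DB and carry those keys under its fields: subsection. (Note that Perl reserves the DB package for its debugger hooks, so in practice a name like DBQUERY, with proto: DBQUERY, might be safer.)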

Let me be explicit that this browser is not overly complicated, and it shouldn't be unless it has to be. Got that? :-) What I mean is this. Assume that a connection hangs forever. What if you add an SSH check in there? SSH can hang on you going through a certain firewall, literally forever. I've seen it sit there for 20 hours! You would use signals, and most likely fork(), with a timeout setting so the call returns if your SSH attempt stalls for too long. But if you stick to the basic HTTP stuff then you won't need to do that. I hope that this article was helpful.
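
If I ever did need a timeout around a hanging step, the classic alarm idiom is where I'd start. A minimal sketch, not part of the tool above:

my $res = eval {
    local $SIG{ALRM} = sub { die "timeout\n" }; # the \n keeps die from appending "at line..."
    alarm 30;                                   # give the step 30 seconds
    my $r = $req->submit($ua, $cookie_jar);     # the potentially hanging call
    alarm 0;                                    # cancel the timer on success
    $r;
};
if ($@ && $@ eq "timeout\n") {
    print "step timed out\n";                   # treat it like any other failed step
}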

Infix to Postfix Conversion Stack calculator

Let's implement a Perl program that reads an infix mathematical expression and translates it into a postfix expression that a calculator could use. As an application support person you might not find this very useful in your daily work activities. But culturally it is very important to get an insight into what a compiler does when you write an infix expression such as

6 - 2 * 4 + 15 / 3 - 1

How will the compiler know that 2 * 4 must be evaluated before "6 - 2"? In math, + and - have a lower precedence than * and /, so 2 * 4 and 15 / 3 must be evaluated BEFORE the additions and subtractions. There are several techniques that compilers apply to figure this out, and one of them is to translate the above infix expression to its postfix representation:

6 2 4 * - 15 3 / + 1 -

The real purpose of this article is to write a program that translates infix notation into postfix with Perl. It's just a fun exercise, and the textbook I'm reading has left it as one. The hard part was understanding on paper how to do it, which I won't go into detail explaining here. Every textbook on compilers should have it explained.

Remember that the operators are +, -, *, / and **, which is the exponentiation notation in perl (double star), and the operands are numbers 1, 2, 3 ... Also, the operators +, -, / and * are left associative, meaning that you go from left to right: 2 + 9 - 3 equals 8 because you went from the left and did the addition first, and 6 / 2 equals 3 because you went from left to right. Exponentiation, however, is right associative, meaning that 3 ** 2 ** 4 in math is 3 ** ( 2 ** 4 ), which is 3 ** 16, which equals 43046721. We went from right to left and did 2 ** 4 first. That's just how it is in math.
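
Perl's own ** operator is right associative too, so you can verify this from the command line:

$ perl -le 'print 3 ** 2 ** 4'
43046721

$ perl -le 'print((3 ** 2) ** 4)'
6561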

Here are the rules the algorithm obeys as it reads the infix notation one word at a time from left to right:

1- If an operand, print it immediately
2- If a closing parenthesis, pop stack symbols until an open parenthesis is seen.
3- If an operator, pop all stack symbols until we see a symbol on the stack that is of lower precedence. If the symbol on the stack is the same then pop it if it is left associative.
4- End of input, pop all remaining stack symbols.

For rule 3 to work we need to establish a precedence for every operator: just an integer that defines which operator is higher. Furthermore, an operator must have one precedence value for when it is sitting on the stack and another for when it is read from the input.

Let me explain. If you read a + from the input and there's a + on the stack, then you have to pop the stack, meaning that a + on the stack has a higher precedence than a + from the input. However, we have to assign the exponentiation operator (**) a lower precedence when it is on the stack than when it comes from the input. This makes the second part of rule 3 work: if the symbol is the same, pop it only if it is left associative. Since ** is right associative, we give it a lower precedence on the stack so that a ** from the input does not cause a pop. For example, while reading 3 ** 2 ** 4, the second ** from the input (precedence 15) meets the first ** on the stack (precedence 14) and does not pop it, which is exactly the right-associative behaviour we want. To map every symbol to its precedence I use two hashes: %stackOpPrec and %inputOpPrec.

Stacks are the only data structures needed. I use two: one for storing the operators (just called @stack) and one (@parenStack) for the balanced symbol checker, which in our case handles parentheses. Integers don't need to be stored; rule 1 above tells us to just print them out. I won't go into detail on the balanced symbol checker technique. It's really simple: you push every opening parenthesis onto a stack, and every time you see a closing one, the top of the stack has to be a matching opening one, which you pop. At the very end, if the stack is empty, you know that your '(' and ')' symbols were properly balanced.

Now, take a look at this state machine and then let me explain it.
[figure: infix_postfix_state_machine.gif — the state machine used to validate the infix expression]
To be able to validate the infix expression I came up with the above state machine, which I use in the algorithm. You can start an infix expression with a '(' or an operand. Since we can always put a '(' in front of any operand, we will just start with '(' or legally assume one is there. The diagram above shows that once you are at '(', the next input must be another '(', a ')' or an operand. If your input is an operand, then your next input must be either an operator or a ')'. An operator can only be followed by a '(' or an operand. The double circles denote the acceptable end states: you must be at one of those states when you finish. If you finish on a state other than ')' or an operand then the input was invalid.

So let me explain how the state machine is implemented programmatically. I define the following states:

state 0: operand, (
state 1: operator, )
state 2: operand, ) , (

I start reading the input while being at state 0. Meaning that my input must either be an operand or a '('. And depending on which I see, I will change my state accordingly. This snippet shows the state change rule after each input:

state=0
foreach input from left to right {
  if  input == '(' 
         state = 2
  if  input == ')'  or input == operand
         state = 1
  if input == operator
         state = 0
}

I error out of the program if at a certain state I encounter an input that is not allowed in that state.
The operator parsing algorithm is presented here:

#!/usr/bin/perl
use warnings;
use strict;
die "usage: $0 \"<infix expression>\"\n" unless defined $ARGV[0];
my @input = split(/\s+/,$ARGV[0]);

my %inputOpPrec = ( '(' => 99, #highest precedence. Nothing would be popped
                    '+' => 4, 
                    '-' => 5 ,
                    '*' => 9 ,#pop everything on the stack higher than 9
                    '/' => 10,
                    '**' => 15 );

my %stackOpPrec = ( '(' => 0, #pop nothing and push on stack
                    '+' => 6,
                    '-' => 7, 
                    '*' => 11,
                    '/' => 12,
                    '**' => 14,#since right associative, stack prec is lower than input prec 
                   );

my @parenStack; # used to perform the parenthesis balance checker
my @stack; # used for storing operators from input

my $state = 0;
for(my $i = 0 ; $i < scalar(@input); $i++){
    my $inputSymbol = $input[$i];

    if($inputSymbol eq '(' && ($state == 0 || $state == 2) ){
        $state = 2;

        push @parenStack, '(';
        push @stack , '(';
    }
    elsif ($inputSymbol eq ')' && ($state == 1 || $state == 2)){
        $state = 1;

        # For every closing parenthesis, the rule of balancing tells us that
        # the top of parenStack must be an opening parenthesis. If it is, then
        # pop the matching opening paren; if it is not, error out as unbalanced.
        unless (defined $parenStack[$#parenStack] && $parenStack[$#parenStack] eq '('){
               die "unbalanced parenthesis in infix expression\n";
        }else{
               pop @parenStack;
        }

        # rule 2: pop (and print) stack symbols until the matching '(' is seen
        while (scalar(@stack) > 0){
             my $sym = pop @stack;
             last if ($sym eq '(');
             print "$sym ";
        }
    }
   elsif (exists $inputOpPrec{$inputSymbol} && $state == 1){ #is it an operator?
           $state = 0;

           #if stack is empty, push symbol onto stack and continue
           if(scalar(@stack) == 0 ){
               push @stack, $inputSymbol;
           }else{
               #from the top of the stack keep popping while the stack symbol has higher precedence
               while (scalar(@stack) > 0){
                   my $stackSymbol = $stack[$#stack]; #top of stack
                   if( $stackOpPrec{$stackSymbol} > $inputOpPrec{$inputSymbol} ){
                       print pop(@stack), " ";
                   }else{
                       last;
                   }
               }
               #we're done popping higher precedence symbols off the stack;
               #push the input symbol onto the stack and continue with the next input symbol
               push @stack, $inputSymbol;
           }
    }
    elsif($inputSymbol =~ /^-?\d+$/ && ($state == 0 || $state == 2)){
        $state = 1;
        print "$inputSymbol "; # it is an operand

    }
    else{
        die "Error in expression at position $i, near symbol '$inputSymbol'\n";
    }
}
if (scalar(@parenStack) > 0) {
   die "infix expression has unbalanced parantheses\n";
}
#rule 4: End of input. Pop all remaining operators
for(my $j = $#stack; $j >= 0; $j--){
   print $stack[$j]." ";
}
print "\n";

If you noticed, I did not define a precedence for ')' on the stack (it's missing in %stackOpPrec). That's because it is never pushed onto it. We only push '(', until we see a ')' which ends up popping the '('.

I didn't bother elaborating on the input parsing mechanism. I think I would have to create a state machine for that as well and tokenize the input, but that wasn't the purpose of this study. To run the infix to postfix program, just make sure that all input symbols are separated by a space.

$./infixPostfix.pl "3 + 4"
3 4 +

$./infixPostfix.pl "3 - 4 * 5"
3 4 5 * -

$./infixPostfix.pl "3 + -4 - -5"
3 -4 + -5 -

$./infixPostfix.pl "2 ** 5 - 1"
2 5 ** 1 -

$./infixPostfix.pl "3 * 2 ** 5 - 1"
3 2 5 ** * 1 -

$./infixPostfix.pl  "( 5 - 2 ) * 6"
5 2 - 6 *

$./infixPostfix.pl "( 5 + 6 ) * ( 6 - ( 5 + 4 ) )"
5 6 + 6 5 4 + - * 

$./infixPostfix.pl "3 + 2 ** 2 ** 2"
3 2 2 2 ** ** +

$./infixPostfix.pl "3 + ( 2 ** 2 ) ** 2"
3 2 2 ** 2 ** +

$./infixPostfix.pl "3 / ( 2 + 6 ) ** 2 - 5 * ( 10 / 5 )"
3 2 6 + 2 ** / 5 10 5 / * -

$./infixPostfix.pl "( ( ( ( 3 ) ) + ( ( -4 ) ) ) )"
3 -4 +

$./infixPostfix.pl  "( ( -3 + 74 ) - -4 ) ** ( -2 + 5 ) / ( ( 2 ) ) - -1"
-3 74 + -4 - -2 5 + ** 2 / -1 -

$./infixPostfix.pl "1 - 2 - 3 * 4 ** 5 * 6 / 7 ** 2 ** 2"
1 2 - 3 4 5 ** * 6 * 7 2 2 ** ** / -

$./infixPostfix.pl "1 - ( ( 2 + 4 ) * 8 ) / ( 2 / 2 ** 5 ) - ( 60 + 5 ) ** 3"
1 2 4 + 8 * 2 2 5 ** / / - 60 5 + 3 ** -
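
Since the whole point of postfix is that a calculator can evaluate it with a single stack, here is a small companion sketch (my own illustration, separate from the converter; the script name is made up) that evaluates the converter's output:

#!/usr/bin/perl
# postfixCalc.pl: push operands; an operator pops its two arguments
# and pushes the result back onto the stack.
use warnings;
use strict;

die "usage: $0 \"<postfix expression>\"\n" unless defined $ARGV[0];

my @stack;
foreach my $sym (split /\s+/, $ARGV[0]) {
    if ($sym =~ /^-?\d+$/) {
        push @stack, $sym;
    } else {
        my $right = pop @stack; # note the order: the right operand is on top
        my $left  = pop @stack;
        die "malformed postfix expression\n"
            unless defined $left && defined $right;
        push @stack, $sym eq '+'  ? $left + $right
                   : $sym eq '-'  ? $left - $right
                   : $sym eq '*'  ? $left * $right
                   : $sym eq '/'  ? $left / $right
                   : $sym eq '**' ? $left ** $right
                   : die "unknown operator '$sym'\n";
    }
}
die "malformed postfix expression\n" unless scalar(@stack) == 1;
print "$stack[0]\n";

Feeding it the converter's output closes the loop:

$ ./postfixCalc.pl "6 2 4 * - 15 3 / + 1 -"
2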

I have not read any compiler's source code to figure out how it's done there. I truly should, and hope to do so sometime in 2011. This is my version of converting infix to postfix. If there's a better way of doing it, or if there's a compiler developer out there who can teach me a thing or two, then I'm all ears.

Perl Binary Heap

Introduction:

Using Perl to implement the binary heap, one of computer science's most beautiful concepts, is not as hard as one might think. It is a good contrast to last weekend's Halloween celebrations. I looked on CPAN and, of course and gladly, the Heap is implemented there. But then I thought about using it. I'm sure the author(s) made it as easy as possible for me, but I quickly get annoyed trying to figure out whether what I think a Heap module is doing is what it really does.

Furthermore, what if I want to change the Heap so that the data it stores are not simple integers but references to other objects? Since I know about binary heaps, I can create and manipulate my own rather than try to figure out someone else's implementation and perhaps change that class to suit my need. So let's write our own Binary Heap. Minimum or maximum doesn't matter, as the code will be almost identical; the only difference is flipping the comparison criteria.

The Binary Heap is in every computer science textbook. At my university we learned it using C++. Since I sadly and disappointedly do not work in C++, it makes sense that I translate the same code into Perl. The book is Algorithms and Data Structures with C++ by Weiss.

It is assumed that you are already familiar with binary heaps. This article is not intended to teach what a heap is, as there are textbooks for that, but to show its implementation in Perl using an unusual example from the systems administrator's or application support engineer's point of view. The place where binary heaps are mainly used is in event driven simulators.

Quickly: a minimum binary heap is a structure in which you store the smallest value at the root of the tree. This means you can always retrieve the minimum value in the list in constant, O(1), time (i.e. you don't have to search for it; you know it is always at the root position). A maximum binary heap is the opposite: the largest value is at the root position. Inserting a value into the heap takes O(log N) time in the worst case, which is quite fast. This is done by inserting the new value at the bottom of the tree and going up (percolating up). If the new value is a new minimum then it will end up at the root in O(log N) time. When we delete the minimum value (the root), we replace it by taking the last element in the tree, placing it at the root and percolating it down. The purpose of all this percolating down or up is to keep the integrity of the heap order so that the algorithm works.

State the problem:

Find the top 20 largest files.

Obvious and easy solution:

This is really easy:   
du -kx * | sort -n | tail -20

But that's really bad. When du sees a directory it sums up the contents of everything under it instead of crawling and looking at individual files only. So let's try the find command:

find . -type f -ls | sort -n -k7 | tail -20

What if you have 50 million directories containing 100 trillion files (if you've got the inodes for it)? I bet that's peanuts at Google Inc. But that's not the point: with the above solution you have to push all 100 trillion file sizes and their paths through the pipe, and sort must hold the entire list (in memory or in temporary files) before it can emit a single line. Admittedly sort is extremely fast and it's written in C. But that's not the most efficient way memory wise. It will most likely work, and you probably won't run out of memory as memory is really cheap today. When the time comes I'll do a memory and CPU comparison to see.

Still, the problem I have is that there's no control over what's going on. You run the command and let it run wild. What if you're doing this on a production server and have to go really slow? The closest you can come to throttling shell commands is to renice them (see man renice). So let's look at another solution that uses the smallest amount of memory and gives you the flexibility to modify, add exceptions, throttle, or do anything else you desire. I came up with this and want to document it here.

Combine DFS with a Minimum Binary Heap

In the article directory tree traversals using DFS or BFS I showed how you can search a directory structure and have full control over each file you come across. We'll use DFS here (or BFS, it doesn't matter which) to look at each and every file and we'll use a 20 element Minimum Binary Heap to store the 20 largest files.

We will simply keep inserting into the heap when we visit a file that is larger than the current minimum. If the heap size is less than 20 we'll just toss the file in. If the heap size is already 20 then we'll do the insert by deleting the root node and inserting the new one. Insertion happens by adding the new node at the end (bottom) of the heap and percolating it up.

I know that if there are 100 files of exactly the same size that happen to be the largest as well, then keeping the top 20 will ignore the other 80 that are equally large. That's something to keep in mind while limiting our heap size to 20. But in real life, as an administrator, if you find a lot of files that are exactly the same byte size then you can already draw significant conclusions about what's happening with the application.

Let's explain how the heap is represented in the computer's memory. It is really an array. In perl an array is zero indexed, as you already know, but the heap's representation starts at index 1. Each node of the heap lives at an array index i, as the short sketch after this list illustrates:

The array index 0 is unused.
The root node is always at i=1 (where the minimum value is)
For every node i, its left child is at position 2i
For every node i, its right child is at position 2i+1
For every node i, its parent is at position int(i/2) (integer division)
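
A tiny self-contained check of that arithmetic (the values are made up, but they do form a valid minimum heap):

#!/usr/bin/perl
use strict;
use warnings;

my @heap = (undef, 3, 9, 5, 14, 11); # index 0 unused; the minimum, 3, sits at index 1
my $i = 2;
print "node=",   $heap[$i],
      " left=",  $heap[2 * $i],      # 2i   -> 14
      " right=", $heap[2 * $i + 1],  # 2i+1 -> 11
      " parent=",$heap[int($i / 2)], # i/2  -> 3, the root
      "\n";                          # prints: node=9 left=14 right=11 parent=3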

Let's show the Perl class that implements the Heap and explain it after.

package Heap;
use strict; 
use warnings;
sub new {
    my $class = shift;
    my $self = { "_Heap" => [], "OrderOK" => 1 , CurrentSize => 0 };
    bless($self, $class); return $self;
}

sub FindMin {
    my $self = shift;
    return undef if $self->{CurrentSize} == 0; #empty heap
    $self->FixHeap() if $self->{OrderOK} == 0;
    return $self->{_Heap}->[1]; #returns an array ref.
}

sub Toss {
    my $self = shift;
    my $node_aref = shift;

    # remember that idx=0 is unused. So can't call perl's push.
    my $idx = ++$self->{CurrentSize} ;
    $self->{_Heap}->[$idx] = $node_aref;
  
    my $parent_aref= $self->{_Heap}->[ int($self->{CurrentSize} / 2) ] ;
    return unless ($parent_aref); #may be the root, so no parent defined.
    my $parent_file_sz =  $parent_aref->[0];
    my $newfile_sz = $node_aref->[0];
 
    $self->{OrderOK}= 0 if ($newfile_sz < $parent_file_sz); #flip this for a maximum heap
}
sub Insert{
    my $self= shift;
    my $x = shift; # percolate x up as needed
    my $newfile_sz = $x->[0];

    # If Order is not OK, just toss it.
    if ($self->{OrderOK} == 0){
           $self->Toss($x); return;
    }

    # Create a new empty (Hole) position at the end of Heap 
    my $Hole =  ++$self->{CurrentSize};
    for( ; ($Hole > 1) && 
            ($newfile_sz < $self->{_Heap}->[ int($Hole /2) ]->[0]); $Hole = int($Hole / 2) ){
                    $self->{_Heap}->[ $Hole ] = $self->{_Heap}->[ int($Hole / 2) ]; 
    }
    $self->{_Heap}->[$Hole] = $x;
}

sub DeleteMin{
    my $self=shift;
    
    # remove the last entry in Heap and put it at root position
    $self->{_Heap}->[1] = $self->{_Heap}->[ $self->{CurrentSize} ];
    pop @{ $self->{_Heap} }; # drop the now-duplicated last entry
    --$self->{CurrentSize};
    $self->PercolateDown(1); # fix heap order
}

sub PercolateDown{
    my $self = shift;
    my $Hole = shift;
    my $tmp_aref = $self->{_Heap}->[$Hole];
    my $tmp_sz = $self->{_Heap}->[$Hole]->[0]; #this is what our comparison is based on

    my $Child;
    for(; ($Hole * 2) <= $self->{CurrentSize}; $Hole = $Child){
          $Child = $Hole * 2; #get left Child
          my $Child_sz = $self->{_Heap}->[$Child]->[0];

          # if $Child is not the last element, check if the right child value is smaller.
          if( ($Child != $self->{CurrentSize}) && 
              ($self->{_Heap}->[$Child + 1]->[0] < $Child_sz)){
                        $Child++; # become right child since it exists and it is smaller.
          } 
          if( $self->{_Heap}->[$Child]->[0] < $tmp_sz){
                $self->{_Heap}->[$Hole] = $self->{_Heap}->[$Child];
          }else{
                last;
          }
    }
    $self->{_Heap}->[$Hole] = $tmp_aref;
}

sub FixHeap{
    my $self = shift;
    for(my $i = int(($self->{CurrentSize}) / 2);   $i > 0; $i--) {
         $self->PercolateDown($i);
    }
    $self->{OrderOK} = 1;
}
sub print_heap{
    my $self = shift;
    for(my $i=1; $i <  scalar @{$self->{_Heap}}; $i++){
           print "\t @{ $self->{_Heap}->[$i] } \n";
    }
}
1;

Isn't seeing that famous 1; at the end a relief? I want to first show a visualization of the above Heap class.

[figure: heap_example.png — the {_Heap} array, each element a reference to a [size, path] array]
In this picture you can visualize the hash key {_Heap} that is defined in sub new. Each node of the heap tree is shown by a circle, and each node is a reference to an array with idx=0 holding the byte size of the file and idx=1 holding the file path.

Our decision criterion for moving nodes around to maintain the heap order is the byte size of each file. Knowing this, and the picture, should make it easier to read the class code.

Let me now show the post order directory search algorithm, or DFS, that utilizes this heap class in order to keep the 20 largest files in the heap. Once it is done, it will just print out the entire heap array showing its treasure.

#!/usr/bin/perl
use strict;
use warnings;
use Heap;

my $Heap = Heap->new();
my $HeapMaxSize = 20;

DFS('/home/farhad'); 
$Heap->print_heap();

sub DFS{
   my $start=shift;
   my @queue = ($start);
   while(scalar(@queue) > 0){
       my $dir = pop(@queue); 
       my ($files, $dirs)= get_dirs_files($dir);

       push @queue, @$dirs;
       map { &process_file($_) } @$files;
   }
}

sub get_dirs_files{
    my $sdir = shift;
    opendir(my $dh, $sdir) || die "can't opendir $sdir : $!";
    my @entries = grep {!( /^\.$/ || /^\.\.$/)} readdir($dh);

    @entries =  map { "$sdir/$_" } @entries;
    closedir $dh;
    my @files =  grep( -f $_ , @entries);
    my @dirs = grep(-d $_, @entries);
    return (\@files,\@dirs);
}
sub process_file{
   my $f = shift;
   my $file_sz = (stat($f))[7];
 
   # Create the node that you might Toss or Insert into Heap 
   my $node = [ $file_sz , $f ];

   if ( $Heap->{CurrentSize} < $HeapMaxSize ) {
        $Heap->Toss($node);
   }else{
        my $min_node = $Heap->FindMin(); #undef if Heap is empty

        if (defined $min_node && $file_sz > $min_node->[0] ){
              $Heap->DeleteMin();
              $Heap->Insert($node);
        } 
   }
}

The Heap utilization takes place inside sub process_file(). Our HeapMaxSize is 20, and the first 20 files we see we just toss into the heap without really caring whether the heap order is maintained. But starting with the 21st file we start comparing. If the new file has a size ($file_sz) bigger than the minimum in the heap, then we need to get rid of the minimum and replace it with the new one. So we call DeleteMin(), which deletes the root of the array @{$Heap->{_Heap}}, and then Insert() to insert the new node.

Perl FTP Retrieve All Files and Sub Directories

In this article I want to tackle everyone's old friend who refuses to die: the File Transfer Protocol, or FTP. I personally believe that FTP will be around forever, especially on internal networks isolated from any possible outside intrusion. FTP servers are not feature rich. You can do basic file transfers, rename files or move files. But one thing that everyone at some point tries to do is retrieve all the files and directories, basically scraping the remote file system and bringing it over locally. Borrowing from the SCP world, we want to achieve the equivalent of this for FTP:

scp  -r  remote_host.domain.com:/*  /local_directory

In my last article, Perl Find file with BFS and DFS, I explained the two algorithms we can use to traverse a directory structure. Combining the Breadth First Search algorithm with perl's awesome Net::FTP::File module, we can do just that. Get the module from CPAN.

Here's a picture of the traversal of the same directory structure as the one in the previous article, but this time showing the visit order that BFS produces.
[figure: BFS.png — BFS traversal order of the directory tree]
Let's show the code that will do all the work and then explain it.

#!/usr/bin/perl
use strict;
use warnings;
use Net::FTP::File;

my  $ftp = Net::FTP->new( '10.1.10.50', Debug => 0) 
                                     || die "Cannot connect: $@\n";
      $ftp->login("anonymous", "anonymous") 
                                     || die "FTP: Cannot login ". $ftp->message."\n";

BFS('1'); # or '.' as the remote root directory where the crawling begins
$ftp->quit;
exit(0);

sub BFS{
     my $root=shift;
     my @queue = ($root);

     mkdir_locally($root);
     while (scalar(@queue) > 0 ){
          my @tmp_queue; 
          foreach my $remotedir (@queue){
                print "$remotedir\n";
                my($remotefiles,$remotedirs) = ftp_dirs_files($remotedir);
                map { &mkdir_locally($_);} @$remotedirs;
                map { &ftp_get_file($_);}  @$remotefiles;
                push @tmp_queue,@$remotedirs;
          }
          @queue = @tmp_queue;
     }
}

sub mkdir_locally{
    my $local_dir = shift;
    $local_dir =~ s/\/$//;
    my @dirs = split(/\//,$local_dir);
    my $dir;
    for(my $i=0; $i < scalar(@dirs) ; $i++) {
          $dir .= $dirs[$i]."/"; 
          unless(-d $dir){
                mkdir($dir) || die("can't mkdir $dir: $!"); # parentheses needed: without them the || binds to $dir and die can never fire
          }
    }
}

sub ftp_dirs_files{
     my $dir=shift;
     my $dirinfo_href= $ftp->dir_hashref($dir);
     my (@remotedirs, @remotefiles);
     foreach my $remotefile (keys %$dirinfo_href){
          next if ($remotefile =~ /^\.$/ || $remotefile =~ /^\.\.$/);
          $remotefile = "${dir}/${remotefile}";
          if ($ftp->isfile($remotefile)){
                push(@remotefiles,$remotefile);
          }elsif($ftp->isdir($remotefile)){
                push(@remotedirs,$remotefile);
          }
     }
     #return lists of remote files names and directories names
     return(\@remotefiles,\@remotedirs); 
}

sub ftp_get_file{
    my $remotefile = shift;
    print "remotefile: $remotefile\n";
    $ftp->binary(); # binary mode so images and other binary files are not mangled
    $ftp->get("$remotefile","$remotefile")
          || warn "get $remotefile failed: ".$ftp->message;
}

There are a few subroutines here, so I'll explain each briefly. First we connect to the FTP server and immediately call BFS('1'). The argument '1' is the remote directory called '1' that we are going to fetch, along with all its subdirectories and files. If you want to retrieve all files and subdirectories starting at the root, just say BFS('.'). In Unix, a dot means "the current directory."

The sub mkdir_locally() takes as its argument a single directory path, which may look like '1/2/3'. Since you cannot create directory 2 before first creating 1, we split the path into its parts on '/' and create them in order from parent to child.

The sub ftp_dirs_files() takes a remote directory name as its argument. It then populates two arrays with the file names and directory names it finds under that remote directory. You might have noticed that I always construct a full path so that I don't have to chdir. The two arrays of file and directory names are returned to sub BFS(), which creates those directories locally by calling mkdir_locally() on each element of @remotedirs. Then ftp_get_file() is called for each entry in @remotefiles to actually perform an "ftp get" command and retrieve the file.

At every step of the descent you can choose what to do with each remote directory and file you see. You have full control over what you wish to perform, and are not at the mercy of bundled ftp client software.
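
For example, here is a sketch of a drop-in variant of ftp_get_file() that filters by name, skips huge files and throttles itself; the .jpg filter and the 10 MB limit are arbitrary illustrations:

sub ftp_get_file{
    my $remotefile = shift;
    return unless $remotefile =~ /\.jpg$/i;  # only fetch what we care about
    my $size = $ftp->size($remotefile);      # uses the FTP SIZE command
    return if defined $size && $size > 10 * 1024 * 1024;
    $ftp->binary();                          # don't mangle binary content
    $ftp->get($remotefile, $remotefile)
          || warn "get $remotefile failed: ".$ftp->message;
    sleep 1;                                 # go easy on a production server
}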

Just as in that article, we create directories with names that represent the tree structure above, and run our program ftp.pl against them to demonstrate the BFS traversal:

mkdir  -p  1/2/5/11  1/2/6/12  1/2/6/13  1/2/7
mkdir  -p  1/3/8  1/3/9/14
mkdir  -p  1/4/10/15  1/4/10/16  1/4/10/17  1/4/10/18
mkdir  -p  1/4/10/17/19/21  1/4/10/17/20

If you run this program you'll get an output which shows the BFS nature of the traversal.
 $ ./ftp.pl
1
1/4
1/3
1/2
1/4/10
1/3/8
1/3/9
1/2/6
1/2/7
1/2/5
1/4/10/18
1/4/10/16
1/4/10/17
1/4/10/15
1/3/9/14
remotefile: 1/3/9/14/image.jpg
1/2/6/13
1/2/6/12
1/2/5/11
1/4/10/17/19
1/4/10/17/20
1/4/10/17/19/21

Refer back to that article for an explanation of this output. In short, BFS visits all siblings of a child directory before it visits any grandchild.

I put a file under 1/3/9/14 just to show, with a print statement, when its retrieval happens. I hope that you have found this article useful.