Parallel processing fork exit vfork _exit

| 0 Comments | 0 TrackBacks
The only way a new process is created by the Unix Kernel is when an existing process calls the fork function. And forking is the most expensive operation the kernel performs.

I don't mean threading. True parallel processing is achieved in Unix when the processes are independent from each other. By that we mean separate file descriptors, data segments, stack, heap, signal handlers, user ID and everything else.

The title of this article mentions exit, _exit and vfork. I will talk about these as well as they are relevant and important. But first we need to dig into the original fork itself and fully understand what happens when a program calls it. I will give examples in C and then cover a bit about Perl as there are differences between the two.

Every program has a text segment which contains the machine instructions executed by the CPU. This is sometimes shared between programs so that only a single copy needs to be in memory for frequently 
Hornbill.jpg
executed programs. A bunch of different processes could share the same text segment and save on memory usage when they all mean to execute the same set of instructions.

There are also other memory segments reserved for a process, such as a heap, a stack and an initialized data area. The heap is the space where dynamically allocated memory goes, such as when malloc  or calloc are called within a running program. This is not a fixed size area as malloc can reserve different sizes each time based on what the input of the program is. The initialized data area is the place where any declarations outside of any function call are stored. These are variable and values defined before any processing actually takes place. Such as global variables. Then there's the stack. Each process has a stack onto which it pushes information relating to each function calls, such as addresses of where to return to once a function has finished. Automatic variables (those that are locally defined and cease to exist once the function returns) are also stored onto the stack.

The above memory layout description is very general though accurate enough amongst different flavors of Unix. Each OS does it its own way more and less. But it is good enough for us to generalize in such a way that will explain to us the behaviors of fork and vfork. Especially accurate is the description of the stack. Let's start with fork() by considering this program:

#include <stdio.h>
#include <stdlib.h>

int global=5;
int main(void){
    int pid;
    int local=33;

    printf("before fork\n");
    pid = fork();
    if ( pid < 0 ){
        printf("Failed to fork!\n");
        exit(1);
    }else if (pid == 0) {
        printf("I am child, my pid is %d\n",getpid());
        global++;
        local++;
    }else {
        printf("I am parent, pid is %d and I spawned child %d\n",getpid(), pid);
        sleep(5);
    }
    printf("global=%d local=%d pid=%d\n",global,local,getpid());
    exit(0);
}
This is the proper way fork should be called. First, if it fails it will return a negative integer and we should check for that. Else if it returned zero then we are in the newly created process, the child. Else the returned value was a positive integer which is the PID of this other process that was created and we are the original process, the parent. The printf's make all this clear.

Let's compile and run the above code:

farhadsa@farhadsaberi.com [~/tmp]# gcc fork.c 
farhadsa@farhadsaberi.com [~/tmp]# ./a.out 
before fork
I am parent, my pid is 15905 and I spawned child 15907
I am child, my pid is 15907
global=6 local=34 pid=15907
global=5 local=33 pid=15905
farhadsa@farhadsaberi.com [~/tmp]# 

Before I show what just happened pictorially, I want to say something about buffering because it's important for understanding fork. Let's run the same program but this time redirect its output into a file and see what happens.

farhadsa@farhadsaberi.com [~/tmp]# ./a.out > output
farhadsa@farhadsaberi.com [~/tmp]# cat output 
before fork
I am child, my pid is 21709
global=6 local=34 pid=21709
before fork
I am parent, my pid is 21708 and I spawned child 21709
global=5 local=33 pid=21708
farhadsa@farhadsaberi.com [~/tmp]# 

Did you see the difference? I'll explain in a minute why the line "before fork" was printed twice when I redirected the output to a file. Let's look at what happened pictorially.

fork.png
As we can see in the picture, the text segment of the new process is identical to the text segment of the existing process that created it. The kernel created a new process and copied over the text segment. It also created a new heap for the new process and copied over what was in the parent process's heap. It also created a file descriptor table for the new process and copied over all open file descriptors from the parent to the child. For example if the parent had a file opened with an offset 312, the child will also have a file descriptor pointing to the same file with an offset 312. That is why when the child calls printf() on STDOUT to say "I am child, my pid is ...," we see it on the same terminal on which we started the parent process. In fact, anything you can think of that the parent had, the child has it as well.

Getting back to my executing the above C program with and without redirecting its output to a file ... we noticed that the line "before fork" was printed twice when redirected to a file. The reason is that when the new process is created, the buffered IO segment of the parent is also copied over to the child. Really, nothing was left out. In Unix, in general, the Standard IO Library is buffered differently based on what kind of file its destination is. When its destination is a file of type Terminal then it is "line buffered" while if it is a regular file then it is "fully buffered." Meaning that while printing to a Terminal, every time the "\n" character is seen the buffer is flushed. So when the output was to a regular file, fork() was called while the words "before fork" were still in the buffer. After fork the parent flushed (wrote) its buffer and the child did the same. But when it was printing to the terminal, the original process's standard IO library's printf() call was already flushed to the terminal because it was newline terminated (and hence nothing was there to copy over to the child).

It's important to mention the sleep(5) in the above code where the parent process executes. It is purposefully meant that the parent sleeps for a while so that the child finishes (exits) first. CPU time is given to each of these processes in a manner completely unpredictable to us. There's no synchronization between the two and this is termed as asynchronously executing. This is a topic for another article where I will show how we can synchronize them. But for now let's sleep in the parent so that the child exits first.

exit vs _exit

I have to explain the two exit functions available before explaining vfork: exit() performs program cleanups and then calls _exit(). It is then _exit() that closes file descriptors and blushes IO buffers before returning to the kernel. When a process calls exit() or return() inside main(), if there were any functions registered with atexit(), then those functions are invoked. Once a registered function returns it is then removed from the registered functions on the stack. Meaning that the same atexit() registered function cannot be invoked twice within the same process.

If you _exit() however, registered atexit() functions are ignored. And this is the main difference. With exit() the atexit() registered functions are called and then implicitly _exit() is invoked. You can bypass exit() by calling _exit() directly. Consider this simple program:
#include <stdio.h>
#include <stdlib.h> 

static void myexit1(void), myexit2(void);
int main(void){
    printf("begin program\n");

    if(atexit(myexit2) != 0) {
             printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}
    if(atexit(myexit1) != 0) {
             printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}
    _exit(EXIT_SUCCESS); /* note that it is not exit() */
}
static void myexit2(void){
    printf("call to myexit2, second exit handler\n");
}
static void myexit1(void){
    printf("call to myexit1, first exit handler\n");
}
farhadsa@farhadsaberi.com [~/tmp]# ./a.out
begin program
farhadsa@farhadsaberi.com [~/tmp]#

The functions myexit2() and myexit1() were skipped and the program returned to the kernel.

COW (Copy On Write)

Final note before vfork. Today most flavors of Unix implement what's called the Copy On Write optimization of the fork() function. Meaning that the parent's stack is not copied over to the child. Since it is assumed that possibly the child() will not touch any variables previously declared before calling fork(). It is assumed that the child will exec(). Since the child does not need the parent's stack then why waste resources copying it over. However if the child does try to write to those variables, then the kernel will copy the stack over to the child first so that changes are in the child's space.

This is the optimization. Copy the information over to the child only if the child attempts to write to it. Otherwise leave it there.

vfork

The are only two reasons why a program would ever call fork().

1- To duplicate itself so that one process performs one section of the code
     while the other process executes another section.
2- To call exec(), meaning that the child process becomes another program.

vfork() is intended to be used for the second case. When you fork() and then immediately exec() another program, all that heap and stack copying to the child's process space is huge overhead for nothing because they are never used. vfork() creates the new process without fully copying the address space of the parent into the child. While the child is running however, it is sharing the parent's address space. Meaning that if the child writes to any variable then it is changing (or corrupting) the same variable within the parent.
 
Many Unixes have abandoned vfork because of the Copy On Write optimization of fork() and have made vfork() synonymous to fork(). However some systems have brought vfork() back because it is still much faster than fork() with COW. One such Kernel is NetBSD and here is their explanation verbatim:

fork() with COW:
  • Traverse parent's vm_map, marking the writable portions of the address space COW. This means invoking the pmap, modifying PTEs, and flushing the TLB.

  • Create a vm_map for the child, copy the parent's vm_map entries into the child's vm_map. Optionally, invoke the pmap to copy PTEs from the parent's page tables into the child's page tables.

  • Block parent.

  • Child runs. If PTEs were not copied, take page fault to get a physical mapping for the text page at the current program counter.

  • Child execs, and unmaps the entire address space that was just created, and creates a new one. This implies that the parent's vm_map has to be traversed to mark the COW portions not-COW.

  • Unblock parent.

  • Parent runs, takes page fault when modifying previously R/W data that was marked R/O for COW. No data is copied at this time.

vfork():

  • Take reference to parent's vmspace structure.

  • Block parent.

  • Child runs. No page faults occur because the parent's page tables are being used, and the PTEs are already valid.

  • Child execs, deletes the reference it had to the parent's vmspace structure, and creates a new one.

  • Unblock parent.

  • Parent runs. (No page faults occur because the parent's vm_map was not modified.)


Until we are kernel developers we'll ignore the details of their explanation and only understand that vfork() is faster. Let's show an example:

farhadsa@farhadsaberi.com [~/tmp]# cat vfork.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int glob=6;
static void myexit1(void), myexit2(void);

int main(void){
    int var;
    pid_t pid;
    var = 88;
    printf("before vfork\n");
 
    if(atexit(myexit2) != 0) {
               printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}
    if(atexit(myexit1) != 0) {
               printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}

    pid = vfork();
    if( pid < 0){
          printf("fork failed\n"); exit(EXIT_FAILURE);
     } else if ( pid == 0){
          glob++;
          var++;
          _exit(EXIT_SUCCESS);
    }
    printf("pid = %d, glob = %d, var = %d\n", getpid(), glob, var);
    exit(EXIT_SUCCESS);
}
static void myexit2(void){
    printf("call to myexit2, second exit handler\n");
}
static void myexit1(void){
    printf("call to myexit1, first exit handler\n");
}
farhadsa@farhadsaberi.com [~/tmp]# gcc vfork.c
farhadsa@farhadsaberi.com [~/tmp]# ./a.out
before vfork
pid = 8581, glob = 7, var = 89
call to myexit1, first exit handler
call to myexit2, second exit handler
farhadsa@farhadsaberi.com [~/tmp]#

Here I did not exec and simply _exit'ed the child after modifying the two variables var and global. The parent printed the values of var and global and they were changed to 7 and 89. But it was the child that did the updating!! You can see how vfork() is sharing the parent's stack for the automatic variable var and the global variable global within the heap as the child was able to change them. If you replace the vfork() call with fork() you would see that the parent's values would remain unchanged.

Now i'll take the same above code and change the exit function of the child from _exit to exit. Meaning that the child is now going to call atexit() registered functions. Let's see what happens:

farhadsa@farhadsaberi.com [~/tmp]# cat vfork.c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int glob=6;
static void myexit1(void), myexit2(void);

int main(void){
    int var;
    pid_t pid;
    var = 88;
    printf("begin program\n");
 
    if(atexit(myexit2) != 0) {
                   printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}
    if(atexit(myexit1) != 0) {
                   printf("ERROR can't register myexit1"); exit(EXIT_FAILURE);}

    pid = vfork();
    if( pid < 0){
          printf("fork failed\n"); exit(EXIT_FAILURE);
     } else if ( pid == 0){
          glob++;
          var++;
          exit(EXIT_SUCCESS);
    }
    printf("pid = %d, glob = %d, var = %d\n", getpid(), glob, var);
    exit(EXIT_SUCCESS);
}
static void myexit2(void){
    printf("call to myexit2, second exit handler\n");
}
static void myexit1(void){
    printf("call to myexit1, first exit handler\n");
}
farhadsa@farhadsaberi.com [~/tmp]# gcc vfork.c
farhadsa@farhadsaberi.com [~/tmp]# ./a.out
begin program
call to myexit1, first exit handler
call to myexit2, second exit handler
pid = 21315, glob = 7, var = 89
farhadsa@farhadsaberi.com [~/tmp]#

See the output changed orders?! The two registered functions were called first and then the parent printed its pid and global and var variables. Why? Because vfork() kept the child running with the same atexit() registered functions and exit() caused them to be called by the child this time. They were therefore removed from the registered atexit() stack by the child and when the parent's turn came nothing was registered to be executed. This is why it is important to call _exit() in a vfork child and not exit.

If you change the vfork() in the above program to fork() you will see the atexit() registered functions myexit1() and myexit2() called twice. Once by the child and once by the parent because they would each have a separate copy of the registered atexit functions.

Another very important aspect of vfork() is synchronization. In both examples you saw that the child executed first without me doing anything special. This is by design. vfork() guarantees that the parent waits for the child to either call exit() or exec(). Until then the parent is blocked by the kernel from executing. So if you're going to use vfork() make sure that you exec() as soon as possible and if your exec() fails then call _exit() and not exit(). Using vfork() runs the risk of a blocking the parent forever if the child never exec's or exits.

Perl, fork and Copy On Write

fork() is simulated as much as possible in Perl. The buffering is also different. There is no vfork() in Perl so that makes it tiny bit less efficient and performance degrades fast when you repeatedly call fork and modify variables within the child. That's because COW becomes useless. If you are going to fork() in Perl then design your program carefully so that you first of all take advantage of COW and secondly keep children running rather than forking a new one for each task. Think of an HTTP server. I'll write about this with a full example later on.

But now let's look how copy on right speeds things up using simple examples in Perl and timing them. First let see how much time it takes to assign 130 variables an integer, 1000 times. That's a total of 130000 variable assignments.


#!/usr/bin/perl
use strict;
use warnings;

my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p,$q,$r,$s,$t)=0;
my ($u,$v,$w,$x,$y,$z) =0;
my ($aa,$ab,$ac,$ad,$ae,$af,$ag,$ah,$ai,$aj,$ak,$al,$am,$an,$ao,$ap)=0;
my ($aq,$ar,$as,$at,$au,$av,$aw,$ax,$ay,$az) =0;
my ($ba,$bb,$bc,$bd,$be,$bf,$bg,$bh,$bi,$bj,$bk,$bl,$bm,$bn,$bo,$bp)=0;
my ($bq,$br,$bs,$bt,$bu,$bv,$bw,$bx,$by,$bz) =0;
my ($ca,$cb,$cc,$cd,$ce,$cf,$cg,$ch,$ci,$cj,$ck,$cl,$cm,$cn,$co,$cp)=0;
my ($cq,$cr,$cs,$ct,$cu,$cv,$cw,$cx,$cy,$cz) =0;
my ($da,$db,$dc,$dd,$de,$df,$dg,$dh,$di,$dj,$dk,$dl,$dm,$dn,$do,$dp)=0;
my ($dq,$dr,$ds,$dt,$du,$dv,$dw,$dx,$dy,$dz) =0;

for ( 1 .. 1000 ){
  $a=$_;    
  $b=1;$c=1;$d=1;$e=1;$f=1;$g=1;$h=1;$i=1;$j=1;$k=1;$l=1;$m=1;$n=1;$o=1;
                 $p=1;$q=1;$r=1;$s=1;$t=1;$u=1;$v=1;$w=1;$x=1;$y=1;$z=1;
  $aa=1;$ab=1;$ac=1;$ad=1;$ae=1;$af=1;$ag=1;$ah=1;$ai=1;$aj=1;$ak=1;$al=1;$am=1;    
  $an=1;$ao=1;$ap=1;$aq=1;$ar=1;$as=1;$at=1;$au=1;$av=1;$aw=1;$ax=1;$ay=1;$az=1;
  $ba=1;$bb=1;$bc=1;$bd=1;$be=1;$bf=1;$bg=1;$bh=1;$bi=1;$bj=1;$bk=1;$bl=1;$bm=1;
  $bn=1;$bo=1;$bp=1;$bq=1;$br=1;$bs=1;$bt=1;$bu=1;$bv=1;$bw=1;$bx=1;$by=1;$bz=1;
  $ca=1;$cb=1;$cc=1;$cd=1;$ce=1;$cf=1;$cg=1;$ch=1;$ci=1;$cj=1;$ck=1;$cl=1;$cm=1;
  $cn=1;$co=1;$cp=1;$cq=1;$cr=1;$cs=1;$ct=1;$cu=1;$cv=1;$cw=1;$cx=1;$cy=1;$cz=1;
  $da=1;$db=1;$dc=1;$dd=1;$de=1;$df=1;$dg=1;$dh=1;$di=1;$dj=1;$dk=1;$dl=1;$dm=1;
  $dn=1;$do=1;$dp=1;$dq=1;$dr=1;$ds=1;$dt=1;$du=1;$dv=1;$dw=1;$dx=1;$dy=1;$dz=1;
}
print "a  ${a} \n";
Running the above program consistently takes 22 to 28 milliseconds on my laptop.

$ time ./loop.pl 
a  1000 

real 0m0.023s
user 0m0.016s
sys 0m0.006s


Let's change the above code so that this time we fork 1000 times without making any variable assignment within the child. And then fork 1000 times but this time assigning a value to those 130 variables within the child and see the time differences.

#!/usr/bin/perl
use warnings;
use strict;

print "before fork\n";
my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p,$q,$r,$s,$t)=0;
my ($u,$v,$w,$x,$y,$z) =0;
my ($aa,$ab,$ac,$ad,$ae,$af,$ag,$ah,$ai,$aj,$ak,$al,$am,$an,$ao,$ap)=0;
my ($aq,$ar,$as,$at,$au,$av,$aw,$ax,$ay,$az) =0;
my ($ba,$bb,$bc,$bd,$be,$bf,$bg,$bh,$bi,$bj,$bk,$bl,$bm,$bn,$bo,$bp)=0;
my ($bq,$br,$bs,$bt,$bu,$bv,$bw,$bx,$by,$bz) =0;
my ($ca,$cb,$cc,$cd,$ce,$cf,$cg,$ch,$ci,$cj,$ck,$cl,$cm,$cn,$co,$cp)=0;
my ($cq,$cr,$cs,$ct,$cu,$cv,$cw,$cx,$cy,$cz) =0;
my ($da,$db,$dc,$dd,$de,$df,$dg,$dh,$di,$dj,$dk,$dl,$dm,$dn,$do,$dp)=0;
my ($dq,$dr,$ds,$dt,$du,$dv,$dw,$dx,$dy,$dz) =0;

for ( 1 .. 1000 ){
   my $pid = fork();
   if ($pid < 0){
      print "fork failed $!";
   }elsif($pid == 0){
      exit(0);
   }else{
       wait(); # parent reaps child
   }
} 
Don't worry about the wait() function in the parent. I'll talk about that in the next article. Let's time it. $ time ./fork.pl before fork real 0m1.549s user 0m0.358s sys 0m0.902s Run it many times. It consistently takes around 1500 milliseconds (1.5 seconds) to run. Now let's make an assignment to all those 130 variables and see how long it takes. Remember that the same assignment of 130 x 1000 times without forking took about 25 milliseconds. So we would expect that making 130 variable assignment within the child would add 25 milliseconds to 1500 ms for the fork part, for a total of 1525 ms, or so.
#!/usr/bin/perl
use warnings;
use strict;

print "before fork\n";
my ($a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p,$q,$r,$s,$t)=0;
my ($u,$v,$w,$x,$y,$z) =0;
my ($aa,$ab,$ac,$ad,$ae,$af,$ag,$ah,$ai,$aj,$ak,$al,$am,$an,$ao,$ap)=0;
my ($aq,$ar,$as,$at,$au,$av,$aw,$ax,$ay,$az) =0;
my ($ba,$bb,$bc,$bd,$be,$bf,$bg,$bh,$bi,$bj,$bk,$bl,$bm,$bn,$bo,$bp)=0;
my ($bq,$br,$bs,$bt,$bu,$bv,$bw,$bx,$by,$bz) =0;
my ($ca,$cb,$cc,$cd,$ce,$cf,$cg,$ch,$ci,$cj,$ck,$cl,$cm,$cn,$co,$cp)=0;
my ($cq,$cr,$cs,$ct,$cu,$cv,$cw,$cx,$cy,$cz) =0;
my ($da,$db,$dc,$dd,$de,$df,$dg,$dh,$di,$dj,$dk,$dl,$dm,$dn,$do,$dp)=0;
my ($dq,$dr,$ds,$dt,$du,$dv,$dw,$dx,$dy,$dz) =0;

for ( 1 .. 1000 ){
   my $pid = fork();
   if ($pid < 0){
      print "fork failed $!";
   }elsif($pid == 0){
     $b=1;$c=1;$d=1;$e=1;$f=1;$g=1;$h=1;$i=1;$j=1;$k=1;$l=1;$m=1;$n=1;$o=1;
                    $p=1;$q=1;$r=1;$s=1;$t=1;$u=1;$v=1;$w=1;$x=1;$y=1;$z=1;
     $aa=1;$ab=1;$ac=1;$ad=1;$ae=1;$af=1;$ag=1;$ah=1;$ai=1;$aj=1;$ak=1;$al=1;$am=1;
     $an=1;$ao=1;$ap=1;$aq=1;$ar=1;$as=1;$at=1;$au=1;$av=1;$aw=1;$ax=1;$ay=1;$az=1;

     $ba=1;$bb=1;$bc=1;$bd=1;$be=1;$bf=1;$bg=1;$bh=1;$bi=1;$bj=1;$bk=1;$bl=1;$bm=1;
     $bn=1;$bo=1;$bp=1;$bq=1;$br=1;$bs=1;$bt=1;$bu=1;$bv=1;$bw=1;$bx=1;$by=1;$bz=1;
     $ca=1;$cb=1;$cc=1;$cd=1;$ce=1;$cf=1;$cg=1;$ch=1;$ci=1;$cj=1;$ck=1;$cl=1;$cm=1;
     $cn=1;$co=1;$cp=1;$cq=1;$cr=1;$cs=1;$ct=1;$cu=1;$cv=1;$cw=1;$cx=1;$cy=1;$cz=1;
     $da=1;$db=1;$dc=1;$dd=1;$de=1;$df=1;$dg=1;$dh=1;$di=1;$dj=1;$dk=1;$dl=1;$dm=1;
     $dn=1;$do=1;$dp=1;$dq=1;$dr=1;$ds=1;$dt=1;$du=1;$dv=1;$dw=1;$dx=1;$dy=1;$dz=1;
     exit(0);
   }else{
     wait();
   }
}
Running the above multiple times takes about 1700 milliseconds. But I expected 1525 ms. There's consistently a 175 ms gap. That's because the kernel has to copy over from the parent to the child's stack those $ba, $dc and so on variables before making the assignments. That's a significant performance degradation. Most of the time we can live with it if it is done a few time. But if you fork and copy on a continuous basis then each 100 ms gap will add up fast to a sluggishly performing program.

No TrackBacks

TrackBack URL: http://www.farhadsaberi.com/cgi-bin/mt/mt-tb.cgi/11

Leave a comment

About this Entry

This page contains a single entry by Farhad Saberi published on January 22, 2011 10:09 PM.

Hard Link Soft Symbolic Links was the previous entry in this blog.

build create a one single file rpm package is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.