Let's implement a Perl program that will read an Infix mathematical expression and translates it into a Postfix expression that a calculator could use. As an application's support person you might not find this very useful in your daily work activities. But culturally it is very important to get an insight into what a compiler does when you write an infix expression such as

6 - 2 * 4 + 15 / 3 - 1

How will the compiler know that 2 * 4 must be evaluated before "6 - 2?" In Math + and - have a lower precedence than * and /. And so 2 * 4 and 15 / 3 must be evaluated BEFORE the additions and subtractions. There are other techniques that compilers apply to figure this out. And one of them is to translate the above infix expression to its postfix representation:

6 2 4 * - 15 3 / 1 - +

The real purpose of this article is to write a program that will translate infix format into its postfix with Perl. It's just a fun exercise and the text book I'm reading has left it as an exercise. The hard part was to understand on paper how to do it which I won't go into detail explaining here. Every text book on compilers should have this explained.

Remember that operators are those +,-,*,/ and ** which is the exponentiation notation in perl (double star). And operands are numbers 1,2,3 ... Also, all operators +,-,/ and * are left associative. Meaning that you go from left to right. 2 + 9 - 3 equals 8 because you went from left and did the addition first. 6/2 equals 3 because you went from left to right. The exponentiation however is Right associative. Meaning that 3 ** 2 ** 4 in math is 3 ** ( 2 ** 4 ) which is 3 ** (16) which equals 43046721. We went from right to left and did 2 ** 4 first. That's just how it is in math.

Here are the rules the algorithm obeys as it reads the infix notation one word at a time from left to right:

1- If an operand, print it immediately

2- If a closing parenthesis, pop stack symbols until an open parenthesis is seen.

3- If an operator, pop all stack symbols until we see a symbol on the stack that is of lower precedence. If the symbol on the stack is the same then pop it if it is left associative.

4- End of input, pop all remaining stack symbols.

For rule 3 to work we need to establish for every operator a precedence. Just an integer number that will define who's of higher precedence. Furthermore, an operator must have a precedence value based on whether it is on the stack or whether it is read from the input.

Let me explain. If you read a + from the input and there's a plus on the stack, then you have to pop the stack. Meaning that a + on the stack has a higher precedence than a + from input. However, we have to assign to the exponentiation operator (**) a lower precedence when it is on the stack than when it is from the input. This is to make the second part of rule 3 work. That is, if the symbol is the same, pop it if it is left associative. Since ** is right associative, we need to assign it a lower precedence for when it is on the stack so that a ** from the input does not cause a pop. To map every symbol to its precedence I use two hashes: %stackOpPrec and %inputOpPrec.

Stacks are the only data structures needed. I use two stacks: one for storing the operators (just called @stack) and one (@parenStack) for balanced symbol checker which in our case are parenthesis. Integers don't need to be stored. Rule 1 above tells us to just print it out. I won't go into the detail of the balanced symbol checker technique. It's really simple in that you push onto a stack every opening parenthesis. And every time you see a closing one, you top the stack and there's got to be a matching closing one. At the very end if the stack is empty then you know that your symbols '(' and ')' were properly balanced.

Now, take a look at this state machine and then explain it.

To be able to validate the infix expression I came up with the above state machine that I use in the algorithm. You can start an infix expression with a '(' or an operand. Since we can alway put a '(' in front of any operand we will just start with '(' or assume one is there, legally. The diagram above shows that once you are at '(', the next input must either be another '(', or a ')' or an operand. If your input is an operand, then your next input must either be an operator or a ')'. Then an operator can only be followed by a '(' or an operand. The double circles denote the acceptable end states: you must be at one of those states when you finish. If you finish on a state other than a ')' or an operand then the input was invalid.

So let me explain how the state machine is implemented programatically. I define the follow states:

state 0: operand, (

state 1: operator, )

state 2: operand, ) , (

I start reading the input while being at state 0. Meaning that my input must either be an operand or a '('. And depending on which I see, I will change my state accordingly. This snippet shows the state change rule after each input:

state=0

foreach input from left to right {

if input == '('

state = 2

if input == ')' or input == operand

state = 1

if input == operator

state = 0

}

I error out of the program if at a certain state I encounter an input that is not allowed in that state.

The

*operator parsing algorithm*is presented here:#!/usr/bin/perl

use warnings;

use strict;

my @input = split(/\s+/,$ARGV[0]);

my %inputOpPrec = ( '(' => 99, #highest precedence. Nothing would be poped

'+' => 4,

'-' => 5 ,

'*' => 9 ,#pop everything on the stack higher than 9

'/' => 10,

'**' => 15 );

my %stackOpPrec = ( '(' => 0, #pop nothing and push on stack

'+' => 6,

'-' => 7,

'*' => 11,

'/' => 12,

'**' => 14,#since right associative, stack prec is lower than input prec

);

my @parenStack; # used to perform the parenthesis balance checker

my @stack; # used for storing operators from input

my $state = 0;

for(my $i = 0 ; $i < scalar(@input); $i++){

my $inputSymbol = $input[$i];

if($inputSymbol eq '(' && ($state == 0 || $state == 2) ){

$state = 2;

push @parenStack, '(';

push @stack , '(';

}

elsif ($inputSymbol eq ')' && ($state == 1 || $state == 2)){

$state = 1;

# For every closing parenthesis, the rule of balancing tells us that

# top of parenStack must be an opening paranthesis. If it is then

# pop the matching opening paren, if it is not then error out as it is unbalanced.

unless (defined $parenStack[$#parenStack] && $parenStack[$#parenStack] eq '('){

die "unbalanced parenthesis in infix expression\n";

}else{

delete $parenStack[$#parenStack];

}

for(my $j = $#stack; $j >= 0; $j--){

if ($stack[$j] eq '(') {

delete $stack[$j];

last;

}else{

print delete $stack[$j]; print " ";

}

}

}

elsif (exists $inputOpPrec{$inputSymbol} && $state == 1){ #is it an operator?

elsif (exists $inputOpPrec{$inputSymbol} && $state == 1){ #is it an operator?

$state = 0;

#if stack is empty, push symbol onto stack and continue

if(scalar(@stack) == 0 ){

push @stack, $inputSymbol;

}else{

#from top of stack keep poping until stack symbol has lower precedence

for(my $j = $#stack; $j >= 0; $j--){

my $stackSymbol = $stack[$j]; #Top stack

#print "input symbol: $inputSymbol\n";

if( $stackOpPrec{$stackSymbol} > $inputOpPrec{$inputSymbol} ){

print delete $stack[$j]; print " ";

}else{

last;

}

}

#we're done poping lower precedence symbols on stack

#push onto stack the input symbol and continue with next input symbol

push @stack, $inputSymbol;

}

}

elsif($inputSymbol =~ /(?:^(-|)\d+$)/ && ($state == 0 || $state == 2)){

$state = 1;

print "$inputSymbol "; # it is a operand

}

else{

else{

die "Error in expression at position $i, near symbol '$inputSymbol'\n";

}

}

if (scalar(@parenStack) > 0) {

die "infix expression has unbalanced parantheses\n";

}

#rule 4: End of input. Pop all remaining operators

for(my $j = $#stack; $j >= 0; $j--){

print $stack[$j]." ";

}

print "\n";

If you noticed I did not define a precedence for ')' while on stack (it's missing in %stackOpPrec). That's because it is never pushed onto it. We only push '(' until we see a ')' which ends up poping '('.

I didn't bother elaborating on the input parsing mechanism. I think that I would have to create a state machine for that as well and tokenize the input. But that wasn't the purpose of this study. To run the infix to postfix program, just make sure that all inputs symbols are separated by a space.

$./infixPostfix.pl "3 + 4"

3 4 +

$./infixPostfix.pl "3 - 4 * 5"

3 4 5 * -

$./infixPostfix.pl "3 + -4 -

**-**5"3 -4 +

**-**5 -$./infixPostfix.pl "2 ** 5 - 1"

2 ** 5 - 1

$./infixPostfix.pl "3 * 2 ** 5 - 1"

3 2 5 ** * 1 -

$./infixPostfix.pl "( 5 - 2 ) * 6"

5 2 - 6 *

$./infixPostfix.pl "( 5 + 6 ) * ( 6 - ( 5 + 4 ) )"

5 6 + 6 5 4 + - *

$./infixPostfix.pl "3 + 2 ** 2 ** 2"

3 2 2 2 ** ** +

$./infixPostfix.pl "3 + ( 2 ** 2 ) ** 2"

3 2 2 ** 2 ** +

$./infixPostfix.pl "3 / ( 2 + 6 ) ** 2 - 5 * ( 10 / 5 )"

3 2 6 + 2 ** / 5 10 5 / * -

$./infixPostfix.pl "( ( ( ( 3 ) ) + ( (

**-**4 ) ) ) )"3

**-**4 +$./infixPostfix.pl "( (

**-**3 + 74 ) -**-**4 ) ** (**-**2 + 5 ) / ( ( 2 ) ) -**-**1"**-**3 74 +

**-**4 -

**-**2 5 + ** 2 /

**-**1 -

$./infixPostfix.pl "1 - 2 - 3 * 4 ** 5 * 6 / 7 ** 2 ** 2"

1 2 - 3 4 5 ** * 6 * 7 2 2 ** ** / -

$./infixPostfix.pl "1 - ( ( 2 + 4 ) * 8 ) / ( 2 / 2 ** 5 ) - ( 60 + 5 ) ** 3"

1 2 4 + 8 * 2 2 5 ** / / - 60 5 + 3 ** -

I have not read any compiler source code to figure out how they've done it there. I truly should and hope to do so sometime in 2011. This is my version of converting infix to postfix. If there's a better way of doing it or if there's a compiler developer out there who can teach me a thing or two then I'm all ears.

I have not read any compiler source code to figure out how they've done it there. I truly should and hope to do so sometime in 2011. This is my version of converting infix to postfix. If there's a better way of doing it or if there's a compiler developer out there who can teach me a thing or two then I'm all ears.

## Leave a comment