Hierarchical data representation in SQL Netsted Sets

| 0 Comments | 0 TrackBacks
Storing hierarchical data in a relational database is not obvious. There are a few techniques and I studied one that is called Netsted Sets and popularized by Joe Celko in his book Trees and Hierarchies in SQL for Smarties. I found this technique in an online article a long time ago and studied it and have implemented it at work and it is always useful so short of buying that book, I will spend the time to explain it here. That article did not go into detail as I will here, in fact I don't see anyone else spending so much time detailing this one out. I have not read Joe Celko's book and wonder if there are any discrepancies.

Let's show what we're talking about:

sql_hierarchy_tree.png

Nested Sets representation in SQL is a form of Preorder Tree Traversal where every child of a node is visited first, depth first. Every node is represented by a left and a right number.  Our goal is to show the above tree, which is stored in a mySQL table for example, as this:

sql_hierarchy_side_view.gifThis is always very useful. Examples of websites that can use this are forums or a javascript menu. Forums can use this for parent to child relationships of reply messages.

Ok let's explain what's happening in the above tree with the red arrows. In a Preorder Tree Traversal we visit every child of a node first, and then the sibling of that child. So starting at node 'a', we visit 'b', then the children of 'b' which is 'c' and 'd', then continue depth first so next is 'e' and 'f'. Now that all children of 'b' are seen we can come back up and go to root's next sibling which is 'g'. Once all children of elephant_eye.jpg'g' are visited, 'h', we continue with the next child of root which is 'i'. Going left to right and depth first, nodes 'j' then 'k' are visited and since 'k' has a child, we have to visit it first before going to 'm'. We finally come back up from the right side to root's right.

So again, we go left to right, visit every left of a left node first, going depth first, and then once all left and depth first are visited, we can continue to visit the right node.

Understanding this left and depth first visiting order is crucial in understanding how every node in this technique is stored in a relational database table.

First let's show how a tree is built. This will show the SQL's involved in inserting a new node. Then we'll show the code that draws the tree, that is the depth of the indentation shown above. Then how to delete a node, how to find the path from one node up to its root. And then some basic facts and calculations such as how to determine the number of children of a node. (think how a comments section of a site knows how many replies a comment has received without actually retrieving all children of that node).

When the root node is born it looks like thissql_hierarchy_root_node.jpg
A SQL INSERT statement with left=1 and right=2 will create the root node. Then every child node will require the the updating of the left and right nodes of every node that would appear *after* that node. Let's show this:

sql_hierarchy_addNode_b.jpgYou can see that it required the right value of the root node to be incremented by 2. The four sql statements show what needs to be done and they are the same for every node added. In your HTML page you will obviously know the node a user clicks on to add a child value and that node has a left (parent left or plft) and right (prgt) values. Now let's add a child node 'c' to the parent node 'b':
sql_hierarchy_addNode_c.jpgwhat's a bit more interesting is the simplification of the m value. It is just the parent's right value minus one. The m value, which is really a reference point to know which left and right values need incrementing during a new node addition, can be further simplified. Let's show this by adding a new node 'd' to the parent 'b':
sql_hierarchy_addNode_d.jpgSo far only right values have been updated. Let's skip ahead a few steps and show a new node 'x' being added that affects the updating of left values as well.
sql_hierarchy_addNote_x.png
Now that new node insertion is cleared up, let's show how we would retrieve the nodes from the table and display them in proper order and indent a child node beneath its parent according to its depth. Just like a java script menu or the forum on the reddit website where replies to a message are indented below it. Let's do this for the same eight node tree above.

The piece of code that does it is this. There is a stack involved. In this php code it is the array $right. Also, the same code would apply whether you would display a subtree or the entire tree starting at root. That's why we specify the node's plft or prgt. The returning tree would be its subtree. (I have a php class that handles my SQL statements which is irrelevant here). I'm not an HTML DIV expert that's why the   are used for indentation :-)

$right = array();  // define stack

$ans = $db->executeQuery("select txt,lft,rgt from T where lft between $plft AND $prgt ORDER by lft asc");

      while ($row = mysql_fetch_array($ans)){
          
                if(count($right)>0){
                          while($right[count($right)-1] < $row['rgt']){
                                    array_pop($right);
                          }
                }
                // indent txt according to its depth in the tree
               echo str_repeat('&nbsp;&nbsp;&nbsp;&nbsp;',count($right));
               echo $row['txt'] . "<br />";

               //add this node to the stack
               $right[] = $row['rgt'];
       }


In a few words, you define an empty stack (called $right). Then you loop over the rows retrieved. For each node, if the stack is not empty, keep popping stack until the top of the stack has a value that is smaller than the current node. Then what remains in the stack, that is the size of the stack, will define the depth of the node. For root this will be zero. For root's immediate children, b, d and e, the size of the stack (the size of the array $right) will be 1. So once the depth of the node is determined, you print the node txt and then you add this node's right value on top of the stack. Let's show this loop and its stack graphically:
sql_hierarchy_draw_tree_a.jpgThe first time we enter the loop the stack is empty so nothing's popped and the size of the stack being zero defines that the root node will be indented zero times. 16 is pushed onto the stack $right and now we come back to the top of the while loop. The next node is 'b' (ORDER by lft ASC);sql_hierarchy_draw_tree_b.png Since 'b's right is smaller than the top of the stack, nothing's popped. The size of the stack remains 1 so the indentation of 'b' will be 1. Then 5 is pushed onto stack. Next node is 'c':
sql_hierarchy_draw_tree_c.jpg
Current right value is 4. Nothing's popped from the stack because 4 is smaller than the top of the stack. The resulting size of the stack, 2 (values 16 and 5), is the depth of 'c'. Then 4 is pushed onto stack. The next node d is where we seeing the popping of the stack. Node c is at depth 2 and we see d we'll pop 5 and 4 because they are smaller than d's right value 9. This gives a stack size of 1, (value 16), defining correctly d's depth of 1.
sql_hierarchy_draw_tree_d.jpgThe same process continues with the rest of the nodes x, e, f and g, in that order because the SQL statement was ORDER by lft ASC.

Now that we know how to add to the tree and how to draw it, let's show how to delete a node from it. Remember the property of the PREORDER traversal of this tree. Every child's left value will be higher than that of its parent. And every left child of a parent will have a smaller left value than its right sibling. Look at the tree drawn at the top to see that 'c' has a bigger left value than its parent 'b.' Also that 'b' has a smaller left value than its sibling 'g.' What it all means is that when you delete a node, you have to update every other node that would be seen AFTER the deleted node in a preorder traversal. Or, update every node that would have a left and right value higher than the deleted node.

While showing how to delete a node or an entire sub tree, we'll also learn how to count the number of child nodes of any node. Given lft and rgt of a node, the count of its children is going to be ⌊ ( rgt - lft ) / 2 ⌋

For the tree at the very top of this article, we see that node i has 4 children. Or, ⌊ ( 25 - 16 ) / 2 ⌋ = 4. When we're going to delete a node, obviously all of its children are going to be deleted as well. Every node coming after the group of nodes deleted (because of pre order traversal) will have to have their left and right values updated. By how much is what we have to calculate. The number would have to be relative to how many nodes are going to disappear. Let's show the formula by deleting node "5 d 10" from our original tree at the top.
sql_hierarchy_delete_node.jpgSo 6 is the number that we have to reduce the left and right values of all nodes that come *after* node 'd' in our preorder traversal. You already know your deleting node's left and right values of 5 and 10:

DELETE from T where lft >= 5 and rgt <= 10.
UPDATE T set lft=lft - 6 where lft > 5.
UPDATE T set rgt=rgt - 6 where rgt > 10.

So far we've seen how to add a node, delete a node of a sub tree in case the node is not a leaf node, and how to draw the tree. We also saw how to calculate how many children a node has. Another cool property this technique allows us to find is the ancestry of a leaf node up to the root of the tree:
sql_hierarchy_node_ancestry.jpg
That's the path to a node. This SQL SELECT statement on any node will return all nodes that are between it and the root node. To draw it you would "ORDER by lft ASC" and use the same loop to draw the tree as discussed above.

One other property you can see is the discovery of a leaf node. If right minus left equals 1 then the node is a leaf node.

Finally, there comes the problem of knowing all immediate children of a node. That is, only one level deep. For 'a' it would be b,g and i. For 'i' it would be j,k and m. You can write a script to keep calling select statements starting from left node 'b', then selecting 'g' and then selecting 'i'. That's because it is obvious that the immediate left child of a parent will have its left number be the parent's plus 1. (b's 2 is a's 1 + 1) (j's 17 is i's 16 + 1). And then for every node, the left value of its immediate sibling is going to be that node's right value plus one. (g's 12 is b's 11 + 1) (k's 19 is j's 18 + 1). So knowing the two above rules one could write a select statement in a loop to keep fetching first the immediate left child and then the next sibling of that child until you get a node who's right value is the parent's right minus one. (i's 25 is a's 26 - 1) (m's right is i's 25 - 1). Not knowing how many siblings there are, calling select in a loop can become overwhelming.

The other option to knowing every node's immediate children in one SQL query is to add a depth column to the original table. This is also explained in an article on Wikipedia on Nested Sets model. The new depth column would look like this for the above example. 

txt left right depth
a1260
b2111
c342
d5102
e673
f893
g12151
h13142
...
You know when you have inserted your very first node, the root with lft=1 and rgt=2. You specify its depth to be zero. After that, the node's immediate child is inserted with a depth of "depth plus one." To retrieve all nodes just one level deeper one more constraint can be added:

SELECT txt,lft,rgt FROM T where lft between $plft AND $prgt AND depth=ParentDepth + 1 ORDER by lft ASC;

I hope this was of any use. If anyone sees a mistake please do let me know.

No TrackBacks

TrackBack URL: http://www.farhadsaberi.com/cgi-bin/mt/mt-tb.cgi/21

Leave a comment

About this Entry

This page contains a single entry by Farhad Saberi published on January 12, 2013 1:55 AM.

Follow Symbolic Link Tree pstree was the previous entry in this blog.

Perl detach process daemon is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.