Tree Building Challenge

I worked out a bit of code to build an array tree for the directory structure of a website. The tree is stored in the database on a single table. Here’s my test array, the results of query “SELECT * FROM pages”



Array
(
    [0] => Array
        (
            [recordid] => 1
            [parentid] => 0
            [name] => 
            [title] => Home Page
            [class] => Gazelle\\Page
        )

    [1] => Array
        (
            [recordid] => 2
            [parentid] => 1
            [name] => js
            [title] => Javascript Page
            [class] => Gazelle\\JavascriptPage
        )

    [2] => Array
        (
            [recordid] => 3
            [parentid] => 1
            [name] => technology
            [title] => Tech News
            [class] => 
        )

    [3] => Array
        (
            [recordid] => 4
            [parentid] => 3
            [name] => windows
            [title] => Windows
            [class] => 
        )

    [4] => Array
        (
            [recordid] => 5
            [parentid] => 3
            [name] => mac
            [title] => Mac
            [class] => 
        )

    [5] => Array
        (
            [recordid] => 6
            [parentid] => 3
            [name] => linux
            [title] => Linux
            [class] => 
        )

    [6] => Array
        (
            [recordid] => 7
            [parentid] => 6
            [name] => slackware
            [title] => Slackware
            [class] => 
        )

    [7] => Array
        (
            [recordid] => 8
            [parentid] => 6
            [name] => debian
            [title] => Debian
            [class] => 
        )

)

The resulting tree needs to look like this



Array
(
    [1] => Array
        (
            [attributes] => Array
                (
                    [name] => 
                    [title] => Home Page
                    [class] => Gazelle\\Page
                )

            [children] => Array
                (
                    [2] => Array
                        (
                            [attributes] => Array
                                (
                                    [name] => js
                                    [title] => Javascript Page
                                    [class] => Gazelle\\JavascriptPage
                                )

                            [children] => Array
                                (
                                )

                        )

                    [3] => Array
                        (
                            [attributes] => Array
                                (
                                    [name] => technology
                                    [title] => Tech News
                                    [class] => 
                                )

                            [children] => Array
                                (
                                    [4] => Array
                                        (
                                            [attributes] => Array
                                                (
                                                    [name] => windows
                                                    [title] => Windows
                                                    [class] => 
                                                )

                                            [children] => Array
                                                (
                                                )

                                        )

                                    [5] => Array
                                        (
                                            [attributes] => Array
                                                (
                                                    [name] => mac
                                                    [title] => Mac
                                                    [class] => 
                                                )

                                            [children] => Array
                                                (
                                                )

                                        )

                                    [6] => Array
                                        (
                                            [attributes] => Array
                                                (
                                                    [name] => linux
                                                    [title] => Linux
                                                    [class] => 
                                                )

                                            [children] => Array
                                                (
                                                    [7] => Array
                                                        (
                                                            [attributes] => Array
                                                                (
                                                                    [name] => slackware
                                                                    [title] => Slackware
                                                                    [class] => 
                                                                )

                                                            [children] => Array
                                                                (
                                                                )

                                                        )

                                                    [8] => Array
                                                        (
                                                            [attributes] => Array
                                                                (
                                                                    [name] => debian
                                                                    [title] => Debian
                                                                    [class] => 
                                                                )

                                                            [children] => Array
                                                                (
                                                                )

                                                        )

                                                )

                                        )

                                )

                        )

                )

        )
)

My function to do this process can build the tree taking between 36 and 44 steps (possibly a few more). The reason for the variance is I plugged in a shuffle statement on the test array so that I could be sure the function would work even if the input array order was entirely random, though the order will affect the number of loops.


protected function buildTree( $data, $key, $parent, $seek = 0) {

	$tree = array();
		
	foreach($data as $k => $v) {
		if ($v[$parent] == $seek) {
			$i = $v[$key];
			unset($data[$k], $v[$key], $v[$parent]);
			$tree[$i] = array(
				'attributes' => $v,
				'children' => $this->buildTree( $data, $key, $parent, $i)
			);
		}
	}
		
	return $tree;
		
}

During testing I used this invocation.


$array = $s->fetchAll( \\PDO::FETCH_ASSOC );
shuffle($array); // make sure this works no matter what the order of input.
$this->buildTree( $array, $key, $parent);

Now, I am not that good with data structures, and something tells me the people here that are smarter than I am can come up with something that can build the tree in fewer than 36 steps.

Or if someone knows of an inbuilt function that does this tell me so I can laugh at myself for wasting time (though the educational value of working this out was worth it).

Hard to say. Even on the simple 10 node test there’s a lot of variance in the time exec of both functions, probably due to other threads that the OS happens to be processing in that particular span. Structural depth has a role to play to I’m thinking since on none of the tests does the tree go down more than 3 levels, but I’d expect more levels to further slow my original function. If I were to guess I’d think the average would be around 50 nodes. That said, my function isn’t so much faster on less than 50 nodes that it really matters or could be justified in being kept around. If that where the case a simple branch at the calling level could choose which function to use based on the node count.

Sorta like the fact my compression algorithm doesn’t run for files less than 2K, it’s just not worth the overhead.

BTW. Since your function is in O(n^2) and mine in O(n) the difference will only become bigger with more nodes (the time your function takes increases exponentially in the number of nodes, the time my function takes increases linearly in the number of nodes). Could you try something extreme like 10,000 nodes ? :slight_smile:

I’ve already cleared the test code out of sandbox and moved on, but I did think up this loop to build a test case on both functions to further illustrate the difference. Here’s a loop to build your 10,000 nodes (God help your computer) or any arbritrary number of nodes (total nodes created being maxOuter * maxInner).


$data = array(); 
$maxOuter = 100;
$maxInner = 100;

for ($i = 0; $i <= $maxOuter; $i++) {
        for($j = 0; $j <= $maxInner; $j++) { 
            $data[$j]['recordid'] = $j; 
            $data[$j]['parentid'] = $i * rand(1,100); 
        } 
}
shuffle($data); 

Then run the two functions. For those watching the code is drifting around elsewhere in the thread. It’s very instructive.

Thanks again Scallio.

As the dataset grows the new function wins out. Here’s s the test code


$data = array();
		for($i=0; $i<101; $i++) {
			$data[$i]['recordid'] = $i;
			$data[$i]['parentid'] = 0;
		}
		
		for($i=101; $i<500; $i++) {
			$data[$i]['recordid'] = $i;
			$data[$i]['parentid'] = rand(1,100);
		}
shuffle($data);

While no page structure is ever going to be that fricken big (I hope) I do intend to use the function for other things. With this 500 node array the new function averages 3x faster.

This is puzzling. I installed the functions you gave above ScallioTx with modifications to fit the requirements of the class. For reference, the output


       /**
 	* Turn data into a tree node
 	* @param array $data
	* @return array An array 'name', 'title' and 'class' in the 
	* 	key 'attributes' and an empty array in the key 'children' -- to add children to later
 	*/
	protected function createNode($data, $key, $parent){
		unset($data[$key], $data[$parent]);
		return array(
			'attributes' => $data,
            'children' => array()
        );
	}

	/**
	  * Add children to a node
	  * @param int $i The recordid of the node to add the children to
	  * @param array $tree The initial tree to start growing
	  * @param array $childrenOf Array that holds children
	  * @param <type> $hasChildren Array that indicates which node has children
	  * @return array The tree with the needed children added
	  */

	protected function addChildren($i, $tree, $childrenOf, $hasChildren, $key, $parent) {

		// If the node with recordid $i has children ...
		if (isset($hasChildren[$i])) {
			// .. loop through those children
			foreach($childrenOf[$i] as $child) {
				// and add the children's children
				$tree['children'][ $child[$key] ] = $this->addChildren(
					$child[$key], 
					$this->createNode($child, $key, $parent), 
					$childrenOf, 
					$hasChildren, 
					$key, 
					$parent 
				);
			}
		}

	    return $tree;
	}


	protected function buildTree ( $data, $key, $parent ) {
		
		$hasChildren = array();
		$childrenOf = array();
		$root = array();

		foreach($data as $k => $v) {
			// Determine if this is a child node or the root node
			if ($v[$parent] > 0) {
				// It's a child node
				$childrenOf[$v[$parent]][] = $v;
				$hasChildren[$v[$parent]] = true;
			} else {
				// It's a root node
				$root[$v[$key]] = $this->createNode( $v, $key, $parent );
			}
		}
		
		foreach ( $root as $k => &$r ) {
			$r = $this->addChildren($k, $r, $childrenOf, $hasChildren, $key, $parent );
		}
		
		return $root;
	}

I then tested against the old function just to be sure…


shuffle($data);
		$tbefore = microtime();
		$this->buildTree( $data, $key, $parent );
		$newFuncTime = microtime() - $tbefore;
		
		$tbefore = microtime();
		$this->oldBuildTree($data, $key, $parent);
		$oldFuncTime = microtime() - $tbefore;
		
		stop($newFuncTime, $oldFuncTime);

The new loop is twice as slow on average!? This seems counterintuitive if it iterates less. Hmm… Maybe it’s not… Testing.

EDIT: Well, your function’s for each loop consistently sits at 11 iterations on the test data where mine bounces around 100 depending on the shuffle result. The interesting part is recursion only happens 11 times for my function and 12 times for yours regardless of the shuffle.

It would seem the cost in having PHP do the extra data structures outweighs avoiding what is mainly an empty loop.

I still need to create a fairly large dataset of around 100 nodes and see if the results switch.

Okay, so I did some benchmarking for the two different functions using the following code to generate the data:


// The root node
$data[0]['recordid']=0;
$data[0]['parentid']=-1;

// Level 1
for($i=1; $i<1001; $i++) {
	$data[$i]['recordid'] = $i;
	$data[$i]['parentid'] = 0;
}

// Level 2
for($i=1001; $i<[COLOR="Blue"]2000[/COLOR]; $i++) {
	$data[$i]['recordid'] = $i;
	$data[$i]['parentid'] = rand(1,1000);
}
shuffle($data);

Note that your code didn’t include a root node, which might explain why your results were all over the place (the scripts just took a random node as root node …)
Also, I had to change if ($v['parentid'] > 0) to if ($v['parentid'] >= 0) in my code to work.

Okay so what I did is constantly increase the value indicated in blue in the code above.

Attached are two benchmark graphs which I’ll let speak for themselves :smiley:

The only thing I’d like to clarify that I’d been happy to test my script for even higher values, but when I tried 202,000 PHP ran out of its allocated 128MB of RAM :stuck_out_tongue:
The x-axis is for the number in blue indicated in the code above and the y-axis is the number of seconds it took the script to create the tree.

For my script I added sort($childrenOf[ $i ]); so the nodes in my function would also be sorted as they are in your script, for a good like-for-like comparison.

PS. Note the labels on the x-axis!

Have you by any chance found out the tipping point? i.e. at what number of nodes does my function start to be faster than yours?

BTW. Since your function is in O(n^2) and mine in O(n) the difference will only become bigger with more nodes (the time your function takes increases exponentially in the number of nodes, the time my function takes increases linearly in the number of nodes). Could you try something extreme like 10,000 nodes ? :slight_smile:

That was fun!
Your solution runs in O(n^2), meaning it can visit each of the n data points n times. My solution (below) runs in O(n), meaning it only visits each data point twice (the 2 is not taken into account in terms of the big Oh)


/**
 * Turn data into a tree node
 * @param array $data
 * @return array An array 'name', 'title' and 'class' in the key 'attributes' and an empty array in the key 'children' -- to add children to later
 */
function createNode($data)
{
	return
		array(
			'attributes' => array(
				'name'=>$data['name'],
				'title'=>$data['title'],
				'class'=>$data['class']
			),
			'children'=>array()
		);
}

/**
 * Add children to a node
 * @param int $i The recordid of the node to add the children to
 * @param array $tree The initial tree to start growing
 * @param array $childrenOf Array that holds children
 * @param <type> $hasChildren Array that indicates which node has children
 * @return array The tree with the needed children added
 */
function addChildren($i, $tree, $childrenOf, $hasChildren)
{
	// If the node with recordid $i has children ...
	if (isset($hasChildren[ $i ]))
	{
		// .. loop through those children
		foreach($childrenOf[ $i ] as $child)
		{
			// find the recordid of the children
			$recordid=$child['recordid'];
			// and add the children's children
			$tree['children'][ $child['recordid'] ] = addChildren($recordid, createNode($child), $childrenOf, $hasChildren);
		}
	}
	return $tree;
}

$hasChildren = array();
$childrenOf = array();
foreach($data as $k=>$v)
{
	// Determine if this is a child node or the root node
	if ($v['parentid'] > 0)
	{
		// It's a child node
		$childrenOf[ $v['parentid'] ][] = $v;
		$hasChildren[ $v['parentid'] ] = true;
	}
	else
	{
		// It's the root node
		$root_record_id=$v['recordid'];
		$root=createNode($v);
	}
}

// Create the tree and show it
print_r(array($root_record_id => addChildren($root_record_id, $root, $childrenOf, $hasChildren)))

The only “drawback” is that the children are not necessarily sorted. If the input data is not sorted the script may find that the children of node 1 are 3 and 2, instead of 2 and 3.
This can be easily solved by adding sort($childrenOf[ $i ]); just before // .. loop through those children, but that adds sorting to the equation which makes my script run in O(n log n) – assuming PHP’s sort function uses a O(n log n) sorting algorithm like merge sort, which I suppose it does.
Oh, and my scripts needs more bookkeeping then yours to know what’s what.

Enjoy, and if you have any questions don’t hesitate to ask :slight_smile:

This is great. Thanks a lot.

One note - The array I gave had three elements other than the parentid and recordid, but the intent of the original was a bit more broad but I didn’t go over. The function that this is part of is called queryTree. You give it an sql query string and you may give it a parameters string and further tell it which field is the key and which is the parent, though it has defaults for no parameters, the key is assumed to be recordid and parent defaults to parentid.

SQL will return a flat array and then the second part of the function that this challenge question was about transforms the array into a tree. That said, the exact fields other than the key and the parent are meant to be unknown and have no constraints put on them allowing the function to be used whenever the data of a table needs to be rendered in a tree fashion.

Since its core to a new framework I’m working on and plan to release open source it’s critical that it be efficient.

That said, I think I can finish from here to restore the arbitrary field count. Thanks again.