Nearest neighbor algorithm in PHP

Hi guys,

I am trying to solve the nearest neighbor problem in PHP (see http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm))

I have a list of x,y points representing Cartesian coordinates on a 2d plane. For any given point on the Cartesian plane, I need to find it’s nearest neighbor(s).

This is the first time for me working with the k-nn problem and appreciate any sort of guidance.

I know I will need to start bisecting the plane, and building a tree. But looking for some more detailed steps on how to approach this. What are the steps? What kind of tree should I use? Any help on how to get started much appreciated. Thank you.

Here’s a quick implementation, I think I got the algorithm right, I just glanced over it.

<?php

// Random cartesian coordinates (x, y) and labels
$data = array(
    array(1, 2, 'red'),   // 0 =>
    array(5, 3, 'blue'),  // 1 =>
    array(-1, 2, 'blue'), // 2 =>
    array(2, 5, 'red'),   // 3 =>
    array(3, 3, 'red'),   // 4 =>
    array(-4, 5, 'blue'), // 5 =>
    array(2, 2, 'blue'),  // 6 =>
    array(5, -2, 'red'),  // 7 =>
    array(-1, -2, 'blue'),// 8 =>
);


// Build distance matrix
$distances = $data;
array_walk($distances, 'euclideanDistance', $data);


// Example, target = datapoint 5, getting 3 nearest neighbors
$neighbors = getNearestNeighbors($distances, 5, 3);

echo getLabel($data, $neighbors) . "\
"; // red


/**
 * Calculates eucilean distances for an array dataset
 *
 * @param array $sourceCoords In format array(x, y)
 * @param array $sourceKey Associated array key
 * @param array $data 
 * @return array Of distances to the rest of the data set
 */
function euclideanDistance(&$sourceCoords, $sourceKey, $data)
{   
    $distances = array();
    list ($x1, $y1) = $sourceCoords;
    foreach ($data as $destinationKey => $destinationCoords) {
        // Same point, ignore
        if ($sourceKey == $destinationKey) {
            continue;
        }
        list ($x2, $y2) = $destinationCoords;
        $distances[$destinationKey] = sqrt(pow($x1 - $x2, 2) + pow($y1 - $y2, 2));
    }
    asort($distances);
    $sourceCoords = $distances;
}

/**
 * Returns n-nearest neighbors
 *
 * @param array $distances Distances generated above ^
 * @param mixed $key Array key of source location
 * @param int $num Of neighbors to fetch
 * @return array Of nearest neighbors
 */
function getNearestNeighbors($distances, $key, $num)
{
    return array_slice($distances[$key], 0, $num, true);
}

/**
 * Gets result label from associated data
 *
 * @param array $data 
 * @param array $neighbors Result from getNearestNeighbors()
 * @return string label
 */
function getLabel($data, $neighbors)
{
    $results = array();
    $neighbors = array_keys($neighbors);
    foreach ($neighbors as $neighbor) {
        $results[] = $data[$neighbor][2];
    }
    $values = array_count_values($results);
    $values = array_flip($values);
    ksort($values);
    return array_pop($values);
}

Value of $distances:

Array
(
    [0] => Array
        (
            [6] => 1
            [2] => 2
            [4] => 2.2360679775
            [3] => 3.16227766017
            [1] => 4.12310562562
            [8] => 4.472135955
            [7] => 5.65685424949
            [5] => 5.83095189485
        )

    [1] => Array
        (
            [4] => 2
            [6] => 3.16227766017
            [3] => 3.60555127546
            [0] => 4.12310562562
            [7] => 5
            [2] => 6.0827625303
            [8] => 7.81024967591
            [5] => 9.21954445729
        )

    [2] => Array
        (
            [0] => 2
            [6] => 3
            [8] => 4
            [4] => 4.12310562562
            [5] => 4.24264068712
            [3] => 4.24264068712
            [1] => 6.0827625303
            [7] => 7.21110255093
        )

    [3] => Array
        (
            [4] => 2.2360679775
            [6] => 3
            [0] => 3.16227766017
            [1] => 3.60555127546
            [2] => 4.24264068712
            [5] => 6
            [8] => 7.61577310586
            [7] => 7.61577310586
        )

    [4] => Array
        (
            [6] => 1.41421356237
            [1] => 2
            [0] => 2.2360679775
            [3] => 2.2360679775
            [2] => 4.12310562562
            [7] => 5.38516480713
            [8] => 6.40312423743
            [5] => 7.28010988928
        )

    [5] => Array
        (
            [2] => 4.24264068712
            [0] => 5.83095189485
            [3] => 6
            [6] => 6.7082039325
            [4] => 7.28010988928
            [8] => 7.61577310586
            [1] => 9.21954445729
            [7] => 11.401754251
        )

    [6] => Array
        (
            [0] => 1
            [4] => 1.41421356237
            [3] => 3
            [2] => 3
            [1] => 3.16227766017
            [7] => 5
            [8] => 5
            [5] => 6.7082039325
        )

    [7] => Array
        (
            [1] => 5
            [6] => 5
            [4] => 5.38516480713
            [0] => 5.65685424949
            [8] => 6
            [2] => 7.21110255093
            [3] => 7.61577310586
            [5] => 11.401754251
        )

    [8] => Array
        (
            [2] => 4
            [0] => 4.472135955
            [6] => 5
            [7] => 6
            [4] => 6.40312423743
            [3] => 7.61577310586
            [5] => 7.61577310586
            [1] => 7.81024967591
        )
)

Value of $neighbors of point 5 (-4, 5), 3 nearest:

Array
(
    [2] => 4.24264068712
    [0] => 5.83095189485
    [3] => 6
)

2 is blue, 0 is red, and 3 is red, which makes point 5 red.

In this example we ignore the source point’s existing label and reclassify it, but it’s just an example.

HTH

Edit:

Oh, and it doesn’t do the recursive dive if an equal number of labels exist for a value k, it instead returns one arbitrarily

Hey thanks for the reply! I will take a look at your solution.

I think really the best solution to this is to use a KDTree. But I have no idea how to implement one.

http://en.wikipedia.org/wiki/Kd-tree

Provides pretty detailed steps on the creation of a Kd-tree.

Remember to read the references section of the PHP manual as it’s slightly different than what you probably learned in your C or Java class.

Hey, OK guys, so I followed the algorithm in the wikipedia article and constructed my kdTree with my point set.

I am now trying to implement the nearest neighbor search. The wikipedia example shows how to find the closest neighbor, but I want the nearest three. Could someone breakdown the algo for me?

thanks!

anyone?

The OP has a two-dimensional data set. Why the hell is there a need to model the data in a KDTree before running the nearest neighbor algorithm?