# Nearest neighbor algorithm in PHP

Hi guys,

I am trying to solve the nearest neighbor problem in PHP (see http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm))

I have a list of x,y points representing Cartesian coordinates on a 2d plane. For any given point on the Cartesian plane, I need to find it’s nearest neighbor(s).

This is the first time for me working with the k-nn problem and appreciate any sort of guidance.

I know I will need to start bisecting the plane, and building a tree. But looking for some more detailed steps on how to approach this. What are the steps? What kind of tree should I use? Any help on how to get started much appreciated. Thank you.

Here’s a quick implementation, I think I got the algorithm right, I just glanced over it.

``````<?php

// Random cartesian coordinates (x, y) and labels
\$data = array(
array(1, 2, 'red'),   // 0 =>
array(5, 3, 'blue'),  // 1 =>
array(-1, 2, 'blue'), // 2 =>
array(2, 5, 'red'),   // 3 =>
array(3, 3, 'red'),   // 4 =>
array(-4, 5, 'blue'), // 5 =>
array(2, 2, 'blue'),  // 6 =>
array(5, -2, 'red'),  // 7 =>
array(-1, -2, 'blue'),// 8 =>
);

// Build distance matrix
\$distances = \$data;
array_walk(\$distances, 'euclideanDistance', \$data);

// Example, target = datapoint 5, getting 3 nearest neighbors
\$neighbors = getNearestNeighbors(\$distances, 5, 3);

echo getLabel(\$data, \$neighbors) . "\
"; // red

/**
* Calculates eucilean distances for an array dataset
*
* @param array \$sourceCoords In format array(x, y)
* @param array \$sourceKey Associated array key
* @param array \$data
* @return array Of distances to the rest of the data set
*/
function euclideanDistance(&\$sourceCoords, \$sourceKey, \$data)
{
\$distances = array();
list (\$x1, \$y1) = \$sourceCoords;
foreach (\$data as \$destinationKey => \$destinationCoords) {
// Same point, ignore
if (\$sourceKey == \$destinationKey) {
continue;
}
list (\$x2, \$y2) = \$destinationCoords;
\$distances[\$destinationKey] = sqrt(pow(\$x1 - \$x2, 2) + pow(\$y1 - \$y2, 2));
}
asort(\$distances);
\$sourceCoords = \$distances;
}

/**
* Returns n-nearest neighbors
*
* @param array \$distances Distances generated above ^
* @param mixed \$key Array key of source location
* @param int \$num Of neighbors to fetch
* @return array Of nearest neighbors
*/
function getNearestNeighbors(\$distances, \$key, \$num)
{
return array_slice(\$distances[\$key], 0, \$num, true);
}

/**
* Gets result label from associated data
*
* @param array \$data
* @param array \$neighbors Result from getNearestNeighbors()
* @return string label
*/
function getLabel(\$data, \$neighbors)
{
\$results = array();
\$neighbors = array_keys(\$neighbors);
foreach (\$neighbors as \$neighbor) {
\$results[] = \$data[\$neighbor][2];
}
\$values = array_count_values(\$results);
\$values = array_flip(\$values);
ksort(\$values);
return array_pop(\$values);
}

``````

Value of `\$distances`:

``````Array
(
[0] => Array
(
[6] => 1
[2] => 2
[4] => 2.2360679775
[3] => 3.16227766017
[1] => 4.12310562562
[8] => 4.472135955
[7] => 5.65685424949
[5] => 5.83095189485
)

[1] => Array
(
[4] => 2
[6] => 3.16227766017
[3] => 3.60555127546
[0] => 4.12310562562
[7] => 5
[2] => 6.0827625303
[8] => 7.81024967591
[5] => 9.21954445729
)

[2] => Array
(
[0] => 2
[6] => 3
[8] => 4
[4] => 4.12310562562
[5] => 4.24264068712
[3] => 4.24264068712
[1] => 6.0827625303
[7] => 7.21110255093
)

[3] => Array
(
[4] => 2.2360679775
[6] => 3
[0] => 3.16227766017
[1] => 3.60555127546
[2] => 4.24264068712
[5] => 6
[8] => 7.61577310586
[7] => 7.61577310586
)

[4] => Array
(
[6] => 1.41421356237
[1] => 2
[0] => 2.2360679775
[3] => 2.2360679775
[2] => 4.12310562562
[7] => 5.38516480713
[8] => 6.40312423743
[5] => 7.28010988928
)

[5] => Array
(
[2] => 4.24264068712
[0] => 5.83095189485
[3] => 6
[6] => 6.7082039325
[4] => 7.28010988928
[8] => 7.61577310586
[1] => 9.21954445729
[7] => 11.401754251
)

[6] => Array
(
[0] => 1
[4] => 1.41421356237
[3] => 3
[2] => 3
[1] => 3.16227766017
[7] => 5
[8] => 5
[5] => 6.7082039325
)

[7] => Array
(
[1] => 5
[6] => 5
[4] => 5.38516480713
[0] => 5.65685424949
[8] => 6
[2] => 7.21110255093
[3] => 7.61577310586
[5] => 11.401754251
)

[8] => Array
(
[2] => 4
[0] => 4.472135955
[6] => 5
[7] => 6
[4] => 6.40312423743
[3] => 7.61577310586
[5] => 7.61577310586
[1] => 7.81024967591
)
)

``````

Value of `\$neighbors` of point 5 (-4, 5), 3 nearest:

``````Array
(
[2] => 4.24264068712
[0] => 5.83095189485
[3] => 6
)

``````

2 is blue, 0 is red, and 3 is red, which makes point 5 red.

In this example we ignore the source point’s existing label and reclassify it, but it’s just an example.

HTH

Edit:

Oh, and it doesn’t do the recursive dive if an equal number of labels exist for a value k, it instead returns one arbitrarily

Hey thanks for the reply! I will take a look at your solution.

I think really the best solution to this is to use a KDTree. But I have no idea how to implement one.

Provides pretty detailed steps on the creation of a Kd-tree.

Remember to read the references section of the PHP manual as it’s slightly different than what you probably learned in your C or Java class.

Hey, OK guys, so I followed the algorithm in the wikipedia article and constructed my kdTree with my point set.

I am now trying to implement the nearest neighbor search. The wikipedia example shows how to find the closest neighbor, but I want the nearest three. Could someone breakdown the algo for me?

thanks!

anyone?

The OP has a two-dimensional data set. Why the hell is there a need to model the data in a KDTree before running the nearest neighbor algorithm?