The Seven Lines of Code of which I’m Most Proud
I took a machine learning course in graduate school. As part of the coursework, we were taught about the fundamental algorithms of machine learning and were asked to implement some of them. My implementation of the k-nearest-neighbors classification algorithm (k-NN) remains some of my code that I’m most proud of: we were given that a certain number of test cases that had to run in less than two minutes. My program consisted of seven lines of Python and took less than two seconds to execute. In this post, I will walk through the algorithm and how I used vectorized calculations from NumPy to make my implementation quite performant.
The goal of the k-NN classification is to decide to which class a new piece of data belongs. In this example, I was only dealing with two classes: positive and negative. Each piece of data is modeled as a multi-dimensional vector; it doesn’t matter in particular how many components there are, only that all the data have the same dimensionality.
In order to classify a new piece of data, we need some preclassified data, known as training data. So, these are our first two inputs:
positive: a 2D array of positive examples, in row-major order
negative: a 2D array of negative examples, in row-major order and with the same number of columns as
Together, these two arrays make up the training set.
There are two other parameters to k-NN. The first is the eponymous
k, which is the number of neighbors to be considered for classification. At this point, it’s probably important to define what “nearest” means. k-NN doesn’t itself specify a distance metric, but one of the simplest is Euclidean distance, which is what I will be using here.
The other parameter used is
p, which is the threshold ratio of positive to negative neighbors required for a positive classification. The simplest version of k-NN uses a majority vote: half-plus-one positive neighbors puts the data point in the positive category, which would correspond to a value of
p = 0.5. It is reasonable to imagine a case where a stricter definition of positivity is required, so making
p tunable is useful.
Thus, these are our next two inputs:
k: the number of neighbors to consider
p: the threshold ratio for determining positivity
Finally, we need the new data point that we want to classify. This implementation takes a 2D array of unclassified data and classifies all of them with respect to the same training data at once, and this is our last input:
unclassified: a 2D array of unclassified examples, in row-major order and with the same number of columns as
This is a lot simpler: we want an array of Boolean values with the same length as there are rows in
unclassified. A value of
True implies membership in the positive set, and
False in the negative set.
Here is how I would break down the tasks the program needs to do:
- find the
knearest neighbors of each row in
- determine if more than
pof the neighbors are from the positive set
Those tasks are large though, so I did another pass of breaking them down:
- find the Euclidean distance from each member of the unclassified set to each member of the training set
- determine the
kneighbors with the smallest Euclidean distance to each member of the unclassified set
- determine the number of those
kneighbors that are from the positive set
- determine if more than the number of positive neighbors divided by
kis greater than or equal to
(Notice that the determination in step 4 here is equivalent to that in step 2 above. This formulation will be more convenient later.)
These aren’t the smallest possible tasks, but this is probably a good starting point to write some code.
The only dependency we need to import is NumPy.
import numpy as np
Going forward, we’ll assume the input variables we need are defined already.
1. Find the Euclidean Distance to All Training Data
First, I’ll combine the positive and negative sets into one training set. Later we can decide which set a row came from by checking if its index is less than or equal to the size of the positive set.
training = np.append(positive, negative, axis=0)
The addition of
axis=0 is necessary, otherwise NumPy will flatten the 2D arrays to linear arrays before appending. The zeroth axis is the “rows” axis.
Next, I use
expand_dims to add a new axis to the 2D array of the unclassified data.
uu = np.expand_dims(unclassified, axis=1)
I’ll explain why this is necessary for the next operation, but first let’s make sure we understand what this does. Imagine that the data we are working with is three-dimensional and that we have 10 data points to classify in the unclassified set. That means that the unclassified array is 10×3. The argument
axis=1 directs NumPy to add the new axis as the first axis. The new array stored in
uu thus is 10×1×3. Technically nothing has changed—no new elements have been added, but this greatly affects the next operation, because of NumPy’s concept of array broadcasting.
differences = training - uu
Normally, to do arithmetic in NumPy, the two arguments must have the same shape, and the arithmetic is performed element-by-element. However, when the two arguments have different but compatible shapes, the smaller array can be broadcast or “stretched” to match the size of the larger one. The linked page from the NumPy docs has a great explanation of this with diagrams, but I’ll also give a concrete example of our case here.
Using the same imaginary situation as above, imagine that the complete training set has 100 data points, making it 100×3. When NumPy performs this subtraction, it stretches the training array to be 10×100×3, with a copy of the training set for each data point in the unclassified set. (Actually, not a copy of the data, but a view, which is an optimization to reduce the number of expensive copy operations performed. In the interests of clarity, I have used terms like copy throughout this post where technically there are views.) Simultaneously, it also stretches the
uu array to the same dimensions, but in a different axis. Finally, the subtraction is performed element-wise and a 10×100×3 array is returned. Each row corresponds to one of the members of the unclassified set and contains a 100×3 array, where each row corresponds to one of the members of the training set.
This is the most important but also the most mind-bending transformation of this post. Take some time to read the NumPy docs on broadcasting, to run some code, and to visualize it in your head before moving on.
Next, we need to turn these differences into Euclidean distances.
distances = np.sqrt((differences ** 2).sum(axis=2))
This is the classic “square root of sum of squares” formula. Working from inside out,
differences ** 2 squares all the elements in the differences array. Then we take the sum along the second axis, which reduces the dimensions down to 10×100, with each row corresponding to a member of the unclassified set and containing an element corresponding to each member of the training set. Finally, the square root is taken, which leaves the dimensions unchanged and gives us the distances we were seeking.
2. Find the
k Nearest Neighbors
The obvious way to find the
k smallest distances is to sort them from lowest to highest and take the first
k. Unfortunately, since we’re relying on the order of data points to keep track of to which set each one belongs, we can’t modify the order. NumPy has a solution to us in the
argsort function, which, instead of sorting, returns the indices of the elements in sorted order.
A moment more of thought also brings the realization that we don’t actually need a fully-sorted list of indices, just the lowest
k, which is a perfect use for the
partition function, or its cousin
neighbors = np.argpartition(distances, k)[:, :k]
From the NumPy docs on
The k-th element will be in its final sorted position and all smaller elements will be moved before it and all larger elements behind it.
Then since all we need is the first
k, we use NumPy’s powerful slicing syntax to get just the indices we care about. In our example situation, this results in a 10×
k array of neighbors, with each row corresponding to a member of the unclassified set and each element being an index of one of its
k nearest neighbors in the training set.
3. Determine How Many Neighbors are from the Positive Set
We’re in the home stretch now!
num_pos_neighbors = (neighbors < positive.shape).sum(axis=1)
If the index of a neighbor in the training set is less than the number of rows in the positive set, it is a member of the positive set: this is what
neighbors < positive.shape computes, in our example, resulting in a 10×
k array of Boolean values. Summing along the first axis gives a 1D array of the number of neighbors from the positive set for each member of the unclassified set.
4. Determine if the Number of Positive Neighbors is Enough
This is the last step of our task.
classifications = num_positive_neighbors / k >= p
This divides all the counts of positive neighbors by the maximum possible number of positive neighbors and checks if that is greater than or equal to the ratio
classifications is a 1D array of Boolean values with the same number of elements as there were rows in the unclassified set (10, in our example). A value of
True indicates membership in the positive set.
Here’s all the code together, wrapped in a function with type annotations, a docstring, and comments:
import numpy as np import numpy.typing as npt def knn_classify( positive: npt.ArrayLike, negative: npt.ArrayLike, unclassified: npt.ArrayLike, k: int, p: float, ) -> np.ndarray: """Classify vectors based on their nearest neighbors. :param positive: a 2D array of positive examples, each row representing an observation and each column representing a condition :param negative: a 2D array of negative examples, with the same number of columns as `positive` :param unclassified: a 2D array of unclassified examples, with the same number of columns as `positive` :param k: the number of neighbors to consider for classification :param p: the threshold ratio of positive to negative neighbors required for a positive classification :returns: a 1D array of booleans, where each element corresponds to the observation in the `unclassified` array and a value of `True` represents classification into the positive group """ # combine the positive and negative sets into the training set training = np.append(positive, negative, axis=0) # add a dimension to the unclassified vectors so we can use NumPy # array broadcasting to compute Euclidean distances for all vectors # at once uu = np.expand_dims(unclassified, axis=1) # compute the Euclidean distances between each member of the # unclassified set and each member of the training set differences = training - uu distances = np.sqrt((differences ** 2).sum(axis=2)) # use argpartition to find the k nearest neighbors of each member # of the unclassified set neighbors = np.argpartition(distances, k)[:, :k] # count up the positive neighbors of each unclassified observation num_pos_neighbors = (neighbors < positive.shape).sum(axis=1) # determine if there are enough positive neighbors to warrant # classification as a positive example return num_positive_neighbors / k >= p
I hope you can see why I am proud of this code. It made use of some advanced NumPy features to completely avoid performance-sapping in-Python
for loops and boiled down the entire k-NN algorithm to very few lines of code.
In writing this years ago and in going through it again today to write this post, I also (re-)learned some valuable techniques. I found it important to keep track of the dimensions of the data at each step. I was lucky that at most the data was three-dimensional, which allowed me to visualize the broadcasting process in my mind. I also went back to my perennial technique of using a concrete example: using numbers like 3, 10, and 100 is always much easier for me than trying to keep a large number of abstract variables in mind at once.
- : Fix spelling of “Euclidean” in comments in the final code block.