Vectorizing the Nearest Neighbours Algorithm
A simple problem statement to demonstrate the computational speed-up due to vectorization and broadcasting in NumPy
- Situation
- Problem
- Task
- Part 1: Find the index of the nearest Redi from each Homo BITSian
- Solutions begin from here
- Test Your Algorithm
- Extra Resources
This problem is similar to Stanford's kNN classifier assignment, which implements this algorithm on the CIFAR-10 dataset.
Situation
In a hypothetical $n$-dimensional universe, there exists a population of $p$ humans of a particular species, Homo BITSians. This species likes to hang out in specialized eateries, called Redis. In this universe, there are $q$ Redis which serve delicious snacks and beverages at nominal prices. Our task is to find the nearest Redi for each of the Homo BITSians so that they spend less time commuting.
Problem
Matrices $X \in \mathbb{R}^{p \times n}$ and $Y \in \mathbb{R}^{q \times n}$, which hold the coordinates of the $p$ Homo BITSians and the $q$ Redis respectively in the $n$-dimensional universe, are given. The $i^{th}$ row of the matrix $X$ corresponds to the $i^{th}$ Homo BITSian. Similarly, the $i^{th}$ row of the matrix $Y$ corresponds to the $i^{th}$ Redi.
Note: Here, row numbering (indexing) starts from $0$.
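For instance, with $p = 3$, $q = 2$ and $n = 2$ (a toy instance borrowed from base test case 1 below):

import numpy as np

# 3 Homo BITSians in a 2-D universe, one per row: X has shape (p, n) = (3, 2)
X = np.array([[-3.,  4.],
              [ 4., -2.],
              [-1.,  0.]])

# 2 Redis in the same universe: Y has shape (q, n) = (2, 2)
Y = np.array([[-3.,  0.],
              [-3., -3.]])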
Task
Given $X$ and $Y$, find a vector $V$ of length $p$ such that the $i^{th}$ element of $V$ contains the index of the Redi nearest to the $i^{th}$ Homo BITSian.
The distance metric is the usual $l_2$-norm. In an $n$-dimensional space with points $x = (x_0, x_1, \ldots, x_{n-1})$ and $y = (y_0, y_1, \ldots, y_{n-1})$, the squared distance can be calculated as:

$$D_{xy}^2 = (x_0 - y_0)^2 + (x_1 - y_1)^2 + \ldots + (x_{n-1} - y_{n-1})^2$$

For example, in two dimensions the distance between $x = (0, 0)$ and $y = (3, 4)$ is $\sqrt{3^2 + 4^2} = 5$.

# Base Distance Function to be completed by the student
import numpy as np
def distances(X, Y):
    """
    Given matrices X and Y, the function returns a distance matrix.
    The (i,j)th element of the matrix contains the distance of the jth Redi
    from the ith Homo BITSian.

    Parameters: X, Y
    Returns: D
    """
    ### BEGIN SOLUTION

    ### END SOLUTION
    return D
def nearest_redi(X, Y):
    """
    Given matrices X and Y, the function returns a nearest redi vector.
    The ith element of the vector contains the index of the nearest Redi
    from the ith Homo BITSian.

    Parameters: X, Y
    Returns: V
    """
    D = distances(X, Y)

    ### BEGIN SOLUTION

    ### END SOLUTION
    return V
Solutions begin from here
The way to understand the `axis` argument in NumPy functions is that it collapses the specified axis. So when we specify axis 1 (the columns), the function is applied across the columns of each row, collapsing them into a single column of results. For more intuition, check out this post by Aerin Kim. In fact, I would recommend reading all of her posts.
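As a quick illustration (the toy array below is assumed purely for demonstration):

import numpy as np

A = np.array([[3, 1, 2],
              [0, 5, 4]])          # shape (2, 3)

print(A.sum(axis=1))         # axis 1 collapsed: [6, 9], one value per row
print(np.argmin(A, axis=1))  # index of each row's minimum: [1, 0]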
def nearest_redi(X, Y):
    D = distances(X, Y)

    # Collapse axis 1: for each Homo BITSian, take the index of the closest Redi
    V = np.argmin(D, axis=1)
    return V
Let's evaluate the time taken by three different approaches.
# Benchmark data: 100 Homo BITSians and 1000 Redis in a 1024-dimensional universe
X = np.random.randn(100, 1024)
Y = np.random.randn(1000, 1024)
def distances(X, Y):
    # One Python loop over the Homo BITSians; each iteration broadcasts one
    # row of X against all of Y. Squared distances suffice, since argmin is
    # unchanged by the monotonic square root.
    dist = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        dist[i] = ((x - Y)**2).sum(axis=1)
    return dist
%time test1 = nearest_redi(X, Y)
No loops: broadcasting forms a gigantic 3-D array of pairwise differences, an outer product of sorts. Neat, but still inefficient, since the intermediate array has shape $(p, q, n)$.

def distances(X, Y):
    # Reshape X to (p, 1, n) so it broadcasts against Y's (q, n) to give (p, q, n)
    x = X.reshape(X.shape[0], 1, X.shape[1])
    dist = ((x - Y)**2).sum(axis=2)
    return dist
%time test2 = nearest_redi(X, Y)
The fastest approach uses no loops and no 3-D intermediate array: expand the squared $l_2$-norm so that the cross term becomes a single matrix multiplication.

$$\left\| x - y \right\|^2 = \left\| x \right\|^2 + \left\| y \right\|^2 - 2\, x \cdot y$$
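This identity follows from expanding the coordinate-wise squares in the distance formula above:

$$D_{xy}^2 = \sum_{k=0}^{n-1} (x_k - y_k)^2 = \sum_{k=0}^{n-1} x_k^2 + \sum_{k=0}^{n-1} y_k^2 - 2 \sum_{k=0}^{n-1} x_k y_k = \left\| x \right\|^2 + \left\| y \right\|^2 - 2\, x \cdot y$$

Computed for all pairs at once, the first term is a column of squared row-norms of $X$, the second a row of squared row-norms of $Y$, and the third the matrix product $XY^T$.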
def distances(X, Y):
    # Row-wise squared norms broadcast from (p, 1) and (q,) to (p, q); the
    # cross term is a single (p, n) x (n, q) matrix product. Squared
    # distances are fine for argmin.
    dist = (X**2).sum(axis=1)[:, np.newaxis] + (Y**2).sum(axis=1) - 2 * X.dot(Y.T)
    return dist
%time test3 = nearest_redi(X, Y)
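As a sanity check (this cell is an addition, assuming the three timing cells above were run in order), all three implementations should pick the same nearest Redis, barring floating-point effects on near-ties:

assert np.array_equal(test1, test2)
assert np.array_equal(test2, test3)
print("All three approaches agree!")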
Test Your Algorithm
#collapse-hide
print("Running base test case 1...")
X_test1 = np.array([[-3.,  4.],
                    [ 4., -2.],
                    [-1.,  0.]])

Y_test1 = np.array([[-3.,  0.],
                    [-3., -3.]])
V_test1 = nearest_redi(X_test1, Y_test1)
V_ans_test1 = np.array([0, 1, 0])
assert np.array_equal(V_test1, V_ans_test1)
print("Base test case 1 successful!!\n")
print("Running base test case 2...")
X_test2 = np.array([[ 0.08170274, -4.8955951 , -4.0473417 ],
                    [-1.13259313,  4.38171415, -3.22068891]])

Y_test2 = np.array([[ 3.79010736,  1.70042849, -3.06603884],
                    [ 3.8921235 , -1.85207272,  2.33340715],
                    [ 1.67360485,  2.11437547,  0.87529999]])
V_test2 = nearest_redi(X_test2, Y_test2)
V_ans_test2 = np.array([0, 2])
assert np.array_equal(V_test2, V_ans_test2)
print("Base test case 2 successful!!\n")
#collapse-hide
# Running hidden test case for Part 1. Don't edit the cell. *** 5 marks ***
### BEGIN HIDDEN TESTS
X = np.array([[ 0.27170746,  0.89441607,  0.64849028],
              [ 0.42296173,  0.54342876,  0.47889235],
              [ 0.48688657,  0.11082849,  0.10691689],
              [ 0.04419385,  0.68777309,  0.49437059],
              [ 0.70143641,  0.09964604,  0.20949214],
              [ 0.01725016,  0.37424641,  0.94070338]])

Y = np.array([[ 0.24232741,  0.08413896,  0.014919  ],
              [ 0.15801316,  0.31713579,  0.0416702 ],
              [ 0.15784176,  0.50998073,  0.45405793],
              [ 0.44382259,  0.44515729,  0.49186482],
              [ 0.00695024,  0.23603969,  0.77601819]])

V = nearest_redi(X, Y)
V_ans = np.array([2, 3, 0, 2, 0, 4])
assert np.array_equal(V, V_ans)
### END HIDDEN TESTS
Extra Resources
- https://www.labri.fr/perso/nrougier/from-python-to-numpy/ - Read the sections on Code Vectorization and Problem Vectorization
- https://medium.com/@mikeliao/numpy-vectorization-d4adea4fc2a
- https://mlxai.github.io/2017/01/03/finding-distances-between-data-points-with-numpy.html