Vectorizing the Nearest Neighbours Algorithm
A simple problem statement to demonstrate the computational speed-up due to vectorization and broadcasting in NumPy
- Situation
- Problem
- Task
- Part 1: Find the index of the nearest Redi from each Homo BITSian
- Solutions begin from here
- Test Your Algorithm
- Extra Resources
This problem is similar to Stanford's kNN classifier assignment, which implements this algorithm on the CIFAR-10 dataset.
Situation
In a hypothetical $n$-dimensional universe, there exists a population of $p$ humans of a particular species, Homo BITSians. This species likes to hang out in specialized eateries, called Redis. In this universe, there are $q$ Redis which serve delicious snacks and beverages at nominal prices. Our task is to find the nearest Redi for each of the Homo BITSians so that they spend less time commuting.
Problem
Matrices $X \in \mathbb{R}^{p \times n}$ and $Y \in \mathbb{R}^{q \times n}$, which hold the coordinates of the $p$ Homo BITSians and the $q$ Redis respectively in the $n$-dimensional universe, are given. The $i^{th}$ row of the matrix $X$ corresponds to the $i^{th}$ Homo BITSian. Similarly, the $i^{th}$ row of the matrix $Y$ corresponds to the $i^{th}$ Redi.
Note: Here, row numbering (indexing) starts from $0$.
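For instance, with $p = 3$, $q = 2$ and $n = 2$ (a toy instance borrowed from base test case 1 below):

import numpy as np

# 3 Homo BITSians in a 2-D universe, one per row: X has shape (p, n) = (3, 2)
X = np.array([[-3.,  4.],
              [ 4., -2.],
              [-1.,  0.]])

# 2 Redis in the same universe: Y has shape (q, n) = (2, 2)
Y = np.array([[-3.,  0.],
              [-3., -3.]])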
Task
Given $X$ and $Y$, find a vector $V$ of length $p$ such that the $i^{th}$ element of $V$ contains the index of the Redi nearest to the $i^{th}$ Homo BITSian.
The distance metric is the usual $l_2$-norm. In an $n$-dimensional space with points $x = (x_0, x_1, \ldots, x_{n-1})$ and $y = (y_0, y_1, \ldots, y_{n-1})$, the squared distance can be calculated as:

$$D_{xy}^2 = (x_0 - y_0)^2 + (x_1 - y_1)^2 + \ldots + (x_{n-1} - y_{n-1})^2$$

For example, in two dimensions the distance between $x = (0, 0)$ and $y = (3, 4)$ is $\sqrt{3^2 + 4^2} = 5$.

# Base Distance Function to be completed by the student
import numpy as np
def distances(X, Y):
    """
    Given matrices X and Y, the function returns a distance matrix.
    The (i,j)th element of the matrix contains the distance of the jth Redi
    from the ith Homo BITSian.

    Parameters: X, Y
    Returns: D
    """
    ### BEGIN SOLUTION

    ### END SOLUTION
    return D
def nearest_redi(X, Y):
    """
    Given matrices X and Y, the function returns a nearest redi vector.
    The ith element of the vector contains the index of the nearest Redi
    from the ith Homo BITSian.

    Parameters: X, Y
    Returns: V
    """
    D = distances(X, Y)

    ### BEGIN SOLUTION

    ### END SOLUTION
    return V
Solutions begin from here
The way to understand the `axis` argument in NumPy functions is that it collapses the specified axis. So when we specify axis 1 (the columns), the function is applied across the columns of each row, collapsing them into a single column of results. For more intuition, check out this post by Aerin Kim. In fact, I would recommend reading all of her posts.
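As a quick illustration (the toy array below is assumed purely for demonstration):

import numpy as np

A = np.array([[3, 1, 2],
              [0, 5, 4]])          # shape (2, 3)

print(A.sum(axis=1))         # axis 1 collapsed: [6, 9], one value per row
print(np.argmin(A, axis=1))  # index of each row's minimum: [1, 0]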
def nearest_redi(X, Y):
    D = distances(X, Y)

    # Collapse axis 1: for each Homo BITSian, take the index of the closest Redi
    V = np.argmin(D, axis=1)
    return V
Let's evaluate the time taken by three different approaches.
# Benchmark data: 100 Homo BITSians and 1000 Redis in a 1024-dimensional universe
X = np.random.randn(100, 1024)
Y = np.random.randn(1000, 1024)
def distances(X, Y):
    # One Python loop over the Homo BITSians; each iteration broadcasts one
    # row of X against all of Y. Squared distances suffice, since argmin is
    # unchanged by the monotonic square root.
    dist = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        dist[i] = ((x - Y)**2).sum(axis=1)
    return dist
%time test1 = nearest_redi(X, Y)
No loops: broadcasting forms a gigantic 3-D array of pairwise differences, an outer product of sorts. Neat, but still inefficient, since the intermediate array has shape $(p, q, n)$.

def distances(X, Y):
    # Reshape X to (p, 1, n) so it broadcasts against Y's (q, n) to give (p, q, n)
    x = X.reshape(X.shape[0], 1, X.shape[1])
    dist = ((x - Y)**2).sum(axis=2)
    return dist
%time test2 = nearest_redi(X, Y)
The fastest approach uses no loops and no 3-D intermediate array: expand the squared $l_2$-norm so that the cross term becomes a single matrix multiplication.

$$\left\| x - y \right\|^2 = \left\| x \right\|^2 + \left\| y \right\|^2 - 2\, x \cdot y$$
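This identity follows from expanding the coordinate-wise squares in the distance formula above:

$$D_{xy}^2 = \sum_{k=0}^{n-1} (x_k - y_k)^2 = \sum_{k=0}^{n-1} x_k^2 + \sum_{k=0}^{n-1} y_k^2 - 2 \sum_{k=0}^{n-1} x_k y_k = \left\| x \right\|^2 + \left\| y \right\|^2 - 2\, x \cdot y$$

Computed for all pairs at once, the first term is a column of squared row-norms of $X$, the second a row of squared row-norms of $Y$, and the third the matrix product $XY^T$.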
def distances(X, Y):
    # Row-wise squared norms broadcast from (p, 1) and (q,) to (p, q); the
    # cross term is a single (p, n) x (n, q) matrix product. Squared
    # distances are fine for argmin.
    dist = (X**2).sum(axis=1)[:, np.newaxis] + (Y**2).sum(axis=1) - 2 * X.dot(Y.T)
    return dist
%time test3 = nearest_redi(X, Y)
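As a sanity check (this cell is an addition, assuming the three timing cells above were run in order), all three implementations should pick the same nearest Redis, barring floating-point effects on near-ties:

assert np.array_equal(test1, test2)
assert np.array_equal(test2, test3)
print("All three approaches agree!")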
Test Your Algorithm
#collapse-hide
print("Running base test case 1...")
X_test1 = np.array([[-3.,  4.],
                    [ 4., -2.],
                    [-1.,  0.]])

Y_test1 = np.array([[-3.,  0.],
                    [-3., -3.]])
V_test1 = nearest_redi(X_test1, Y_test1)
V_ans_test1 = np.array([0, 1, 0])
assert np.array_equal(V_test1, V_ans_test1)
print("Base test case 1 successful!!\n")
print("Running base test case 2...")
X_test2 = np.array([[ 0.08170274, -4.8955951 , -4.0473417 ],
                    [-1.13259313,  4.38171415, -3.22068891]])

Y_test2 = np.array([[ 3.79010736,  1.70042849, -3.06603884],
                    [ 3.8921235 , -1.85207272,  2.33340715],
                    [ 1.67360485,  2.11437547,  0.87529999]])
V_test2 = nearest_redi(X_test2, Y_test2)
V_ans_test2 = np.array([0, 2])
assert np.array_equal(V_test2, V_ans_test2)
print("Base test case 2 successful!!\n")
#collapse-hide
# Running hidden test case for Part 1. Don't edit the cell. *** 5 marks ***
### BEGIN HIDDEN TESTS
X = np.array([[ 0.27170746,  0.89441607,  0.64849028],
              [ 0.42296173,  0.54342876,  0.47889235],
              [ 0.48688657,  0.11082849,  0.10691689],
              [ 0.04419385,  0.68777309,  0.49437059],
              [ 0.70143641,  0.09964604,  0.20949214],
              [ 0.01725016,  0.37424641,  0.94070338]])

Y = np.array([[ 0.24232741,  0.08413896,  0.014919  ],
              [ 0.15801316,  0.31713579,  0.0416702 ],
              [ 0.15784176,  0.50998073,  0.45405793],
              [ 0.44382259,  0.44515729,  0.49186482],
              [ 0.00695024,  0.23603969,  0.77601819]])

V = nearest_redi(X, Y)
V_ans = np.array([2, 3, 0, 2, 0, 4])
assert np.array_equal(V, V_ans)
### END HIDDEN TESTS
Extra Resources
- https://www.labri.fr/perso/nrougier/from-python-to-numpy/ - Read the sections on Code Vectorization and Problem Vectorization
- https://medium.com/@mikeliao/numpy-vectorization-d4adea4fc2a
- https://mlxai.github.io/2017/01/03/finding-distances-between-data-points-with-numpy.html