
I am writing a program that uses NumPy to calculate the classification accuracy between testing and training points, but I am not sure how to use vectorized functions instead of the for loops I have used in my code.

Here is my code. (Is there a way to simplify it so that I do not need any loops?)

# Import the NumPy package
import numpy as np


iris_train=np.genfromtxt("iris-train-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)
iris_test=np.genfromtxt("iris-test-data.csv",delimiter=',',usecols=(0,1,2,3),dtype=float)


train_cat=np.genfromtxt("iris-training-data.csv",delimiter=',',usecols=(4),dtype=str)
test_cat=np.genfromtxt("iris-testing-data.csv",delimiter=',',usecols=(4),dtype=str)


correct = 0

for i in range(len(iris_test)):
    n = 0
    old_distance = float('inf')
    
    
    while n < len(iris_train):
        #finding the difference between test and train point
        iris_diff = (abs(iris_test[i] - iris_train[n])**2)
        #summing up the calculated differences
        iris_sum = sum(iris_diff)
        new_distance = float(np.sqrt(iris_sum))
        
        #if statement to update distance
        if new_distance < old_distance:
            index = n
            old_distance = new_distance
        n += 1
        
    
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
        

accuracy = correct / len(iris_test) * 100
print(f"Accuracy:{accuracy: .2f}%")


  • Is this just supposed to be a 1-nearest-neighbor classifier? Commented Jan 30, 2023 at 0:52
  • Yes, and I am trying to create four separate arrays to fit my data. Commented Jan 30, 2023 at 0:54

2 Answers


The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples) since we summed along the last axis in the call to numpy.sum.

Then you can use numpy.argmin to find the index of the closest training sample for every testing sample. index has shape (num_test_samples, ) since we did the reduction operation along the last axis of distance.

Finally, you can use index to select, for each testing sample, the class of the closest training sample. We can construct a boolean array that represents the equality between the testing classification and the closest training classification using the == operator. Since True is cast to 1 and False is cast to 0, we can simply sum this boolean array to get the number of correct classifications.

# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))

# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)

# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
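Here is a minimal, self-contained sketch of the same computation on made-up data (the array names mirror the question's, but the values and labels are hypothetical), cross-checked against the naive double loop:

```python
import numpy as np

# Hypothetical stand-ins for the iris arrays: 6 training and 3 testing
# samples, 4 features each
rng = np.random.default_rng(0)
iris_train = rng.random((6, 4))
iris_test = rng.random((3, 4))
train_cat = np.array(["a", "b", "a", "c", "b", "a"])

# Vectorized distance matrix: shape (num_test_samples, num_train_samples)
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))

# Cross-check every entry against the explicit double loop
for i in range(len(iris_test)):
    for n in range(len(iris_train)):
        d = np.sqrt(np.sum((iris_test[i] - iris_train[n]) ** 2))
        assert np.isclose(distance[i, n], d)

# Closest training sample for each testing sample
index = np.argmin(distance, axis=-1)
print(distance.shape, index.shape, train_cat[index])
```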




If I understand correctly what you are doing (though I don't really need to, to answer the question): for each vector of iris_test, you are searching for the closest one in iris_train, closest here in the sense of Euclidean distance.

So you have three nested loops (pseudo-Python):

for u in iris_test:
    for v in iris_train:
        s = 0
        for i in range(dimensionOfVectors):
            s += (u[i] - v[i])**2
        dist = sqrt(s)

You are right to try to get rid of the Python loops, and the most important one to eliminate is the innermost one. You already got rid of it, since the inner loop of my pseudo-code is implicit in:

iris_diff = (abs(iris_test[i] - iris_train[n])**2)

and

iris_sum = sum(iris_diff)

Both of those lines iterate through all dimensions of your vectors, but they do so not in Python but in NumPy's internal code, so they are fast.

One may object that you don't really need abs after a **2, and that you could have called the np.linalg.norm function, which does all those operations in one call:

new_distance = np.linalg.norm(iris_test[i]-iris_train[n])

which is faster than your code. But at least, in your code, the loop over all components of the vectors is already vectorized.
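To see that np.linalg.norm computes exactly the same value as the manual square/sum/sqrt, here is a quick check on two hypothetical 4-feature samples:

```python
import numpy as np

u = np.array([5.1, 3.5, 1.4, 0.2])  # hypothetical test sample
v = np.array([4.9, 3.0, 1.4, 0.2])  # hypothetical train sample

# Manual version, as in the question's loop body
manual = float(np.sqrt(np.sum((u - v) ** 2)))

# One-call version
vectorized = np.linalg.norm(u - v)

assert np.isclose(manual, vectorized)
```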

The next stage is to vectorize the middle loop.

That can also be accomplished. Instead of computing, one at a time,

new_distance = np.linalg.norm(iris_test[i]-iris_train[n])

you can compute in one call all len(iris_train) distances between iris_test[i] and every iris_train[n]:

new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)

The trick here lies in NumPy broadcasting and the axis parameter:

  • Broadcasting means that you can compute the difference between a 1D vector of length W and a 2D n×W array (iris_test[0] is a 1D vector, and iris_train is a 2D array whose number of columns matches the length of iris_test[0]). In such a case, NumPy broadcasts the first operand and returns a 2D n×W array whose row k is iris_test[0] - iris_train[k].
  • Calling np.linalg.norm on that n×W 2D matrix would return a single float (the norm of the whole matrix), unless you restrict the norm to the second axis (axis=1), in which case it returns n floats, each of them being the norm of one row.

In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
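The two points above can be checked on a tiny made-up example (the numbers below are arbitrary, chosen only so the shapes are easy to follow):

```python
import numpy as np

# Hypothetical training set: 3 samples, 4 features
iris_train = np.arange(12.0).reshape(3, 4)
# One hypothetical test sample
sample = np.array([1.0, 0.0, 2.0, 1.0])

# Broadcasting: (4,) minus (3, 4) gives a (3, 4) array of differences
diff = sample - iris_train
assert diff.shape == (3, 4)

# axis=1 gives one norm per row: one distance per training sample
dists = np.linalg.norm(diff, axis=1)
assert dists.shape == (3,)

# Row 0 of iris_train is [0, 1, 2, 3], so the first distance is
# sqrt(1^2 + 1^2 + 0^2 + 2^2) = sqrt(6)
assert np.isclose(dists[0], np.sqrt(6))
```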

Once that is done, you can easily find the k for which this distance is the smallest, using np.argmin: np.argmin(new_distances) is the index of the smallest of the distances. So, all together, your code could be rewritten as:

correct = 0

for i in range(len(iris_test)):
    new_distances = np.linalg.norm(iris_test[i]-iris_train, axis=1)
    index = np.argmin(new_distances)
        
    #printing out classifications
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1

