Biomedical data set built by Dr. Henrique da Mota during a medical residence period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. Each patient is represented by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, and grade of spondylolisthesis. The following class labels are used: Disk Hernia (DH), Spondylolisthesis (SL), Normal (NO), and Abnormal (AB). In this exercise we focus only on the binary classification task with NO = 0 and AB = 1.
In this blog post, I will walk through how to predict whether a patient is Normal or Abnormal by training a model on the six biomechanical features. The main method here is k-nearest neighbor prediction. Two questions are addressed:
How many neighbors should we choose?
How large should the training set be for a good prediction?
Pre-Processing and Exploratory data analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Load data and separate each column by whitespace
df=pd.read_csv("./vertebral_column_data/column_2C.dat",sep=r"\s+")
#Assign column names
df.columns=['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle', 'sacral_slope', 'pelvic_radius','degree_spondylolisthesis','Class']
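Before going further, a quick sanity check helps confirm that the file was parsed correctly. The snippet below is a minimal sketch; according to the UCI description, the two-class data set should show roughly 100 NO and 210 AB records.
# Quick sanity check on the parsed data frame
print(df.shape)                    # rows x the 7 columns named above
print(df['Class'].value_counts())  # balance of NO vs. AB labels
print(df.isna().sum().sum())       # total number of missing values, ideally 0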
To check the correlations of the 6 features, we make scatterplots of the independent variables in the dataset, encoding class Normal as 0 and Abnormal as 1 and using color to distinguish the two classes. The trick to blank out the diagonal graphs in pairplot is to set linewidth=0 and shade=False.
df_NO=df.loc[df['Class'].str.match('NO')].copy()  # Normal patients
df_NO.loc[:, 'Class'] = 0
df_AB=df.loc[df['Class'].str.match('AB')].copy()  # Abnormal patients
df_AB.loc[:, 'Class'] = 1
df_binary=pd.merge(df_NO, df_AB,how='outer')
g=sns.pairplot(df_binary, vars=['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle', 'sacral_slope', 'pelvic_radius','degree_spondylolisthesis'],hue='Class', markers=["o", "*"],diag_kind="kde",diag_kws={"linewidth": 0, "shade": False})
From the pair plot, we can see that the correlations among the features are much stronger for the abnormal class than for the normal class.
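One quick way to check this impression numerically is to compare the correlation matrices of the two classes; the snippet below is a small sketch built on the df_binary frame from above, not part of the original workflow.
# Compare feature correlations within each class (assumes df_binary from above)
features = ['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle',
            'sacral_slope', 'pelvic_radius', 'degree_spondylolisthesis']
print(df_binary[df_binary['Class'] == 0][features].corr().round(2))  # Normal class
print(df_binary[df_binary['Class'] == 1][features].corr().round(2))  # Abnormal class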
K-nearest neighbors analysis
Here we select the first 70 rows of Class 0 and the first 140 rows of Class 1 as the training set, and the rest of the data as the test set. The 6 features form the input variable X, and the predicted variable y is the class.
training_set=pd.merge(df_NO.iloc[0:70,:],df_AB.iloc[0:140,:],how='outer') # select training set
Test_set=pd.merge(df_NO.iloc[70:,:],df_AB.iloc[140:,:],how='outer') # select test set
X_T = training_set.iloc[:, :-1].values
y_T = training_set.iloc[:,-1].values
X_test=Test_set.iloc[:,:-1]
Y_test=Test_set.iloc[:,-1]
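A quick look at the resulting shapes (just a sketch) confirms the split:
# Check the split sizes
print(X_T.shape, y_T.shape)        # training features and labels
print(X_test.shape, Y_test.shape)  # test features and labels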
Now let's start to write some code for a k-nearest neighbor classifier with the Euclidean metric.
from math import sqrt

def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1): # -1 because the last column is the class label, which is ignored
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)
#Get the k nearest neighbors
def k_neighbors(data, next_row, num_neighbors):
    distances = list()
    for row in data:
        dist = euclidean_distance(next_row, row)
        distances.append((row, dist)) # store (row, distance) tuples, since list.append takes one item at a time
    distances.sort(key=lambda tup: tup[1]) # sort by distance, ascending
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0]) # keep only the row of each (row, distance) pair
    return neighbors
#Prediction
def predict_classification(data, next_row, num_neighbors):
    neighbors = k_neighbors(data, next_row, num_neighbors)
    result = [row[-1] for row in neighbors]
    prediction = max(set(result), key=result.count) # majority vote among the neighbors' labels
    return prediction
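As a quick illustration, the hand-written classifier can be applied to a single test record. This is just a sketch; it assumes the rows passed in still carry the class label in the last column, as in the training_set and Test_set frames above.
# Sketch: classify one test record with the hand-written KNN
train_rows = training_set.values   # each row: 6 features followed by the class label
sample_row = Test_set.values[0]    # one test record with the same layout
print("Predicted class:", predict_classification(train_rows, sample_row, 4))
print("Actual class:", sample_row[-1])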
On the other hand, scikit-learn has a well-established implementation ready to use:
from sklearn.neighbors import KNeighborsClassifier
classifier=KNeighborsClassifier(n_neighbors=208,metric='euclidean')
Then, let's classify all the records in the test set with k-nearest neighbors, taking decisions by majority vote.
Plot the train and test errors as a function of k for k ∈ {208, 205, ..., 7, 4, 1} (i.e., in decreasing order) and find the best k*. Then calculate the confusion matrix, true positive rate, true negative rate, precision, and F1-score when k = k*.
Train_error=[]
alternative_k = np.arange(208, 0, -3)
for i in alternative_k:
    knn=KNeighborsClassifier(n_neighbors=i,metric='euclidean')
    knn.fit(X_T, y_T)
    pred_i = knn.predict(X_T)
    Train_error.append(np.mean(pred_i != y_T))
print("Minimum train error:",min(Train_error),"at K =",208-3*Train_error.index(min(Train_error)))
Minimum train error: 0.0 at K = 1
Test_error = []
for i in alternative_k:
    knn=KNeighborsClassifier(n_neighbors=i,metric='euclidean')
    knn.fit(X_T, y_T)
    pred_i = knn.predict(X_test)
    Test_error.append(np.mean(pred_i != Y_test))
plt.figure(figsize=(15, 6))
plt.plot(alternative_k, Test_error, color='blue', linestyle='dashed',markerfacecolor='blue', markersize=10,label='Test error rate')
plt.plot(alternative_k, Train_error, color='red', linestyle='dashed',markerfacecolor='blue', markersize=10,label='Train error rate')
plt.title('Train and Test Error Rate vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.grid(True)
plt.minorticks_on()
plt.legend(prop={'size': 12})
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
print("Minimum test error:",min(Test_error),"at K =",208-3*Test_error.index(min(Test_error)))
Minimum test error: 0.06060606060606061 at K = 4
We can conclude that the test error is smallest when the number of nearest neighbors k is 4.
To get the confusion matrix and the TP and TN rates:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
classifier=KNeighborsClassifier(n_neighbors=4,metric='euclidean')
classifier.fit(X_T, y_T)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print("**Confusion matrix is:**")
print(confusion_matrix(Y_test, y_pred))
print('**Classification report is:**')
print(classification_report(Y_test, y_pred))
print("F-1 score at k=4",f1_score(Y_test, y_pred))
The result is:
**Confusion matrix is:**
[[20 10]
[ 3 66]]
**Classification report is:**
precision recall f1-score support
0 0.87 0.67 0.75 30
1 0.87 0.96 0.91 69
accuracy 0.87 99
macro avg 0.87 0.81 0.83 99
weighted avg 0.87 0.87 0.86 99
F-1 score at k=4 0.9103448275862069
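The precision values in the report can also be read directly off the confusion matrix; the small worked check below is just a sketch using the matrix printed above.
# Worked check: per-class precision from the confusion matrix above
print(20 / (20 + 3))    # ≈ 0.87, precision for class 0 (20 correct out of 23 predicted as class 0)
print(66 / (10 + 66))   # ≈ 0.87, precision for class 1 (66 correct out of 76 predicted as class 1)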
cm=confusion_matrix(Y_test, y_pred)
TP = cm[0][0]  # true class 0 (Normal) predicted as class 0
FN = cm[0][1]  # true class 0 predicted as class 1
FP = cm[1][0]  # true class 1 (Abnormal) predicted as class 0
TN = cm[1][1]  # true class 1 predicted as class 1
print("true_positive_rate(TPR) = ", TP/ (TP + FN))
print("true_negative_rate(TNR) = ",TN / (FP + TN))
true_positive_rate(TPR) = 0.6666666666666666
true_negative_rate(TNR) = 0.9565217391304348
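These rates can be cross-checked with scikit-learn's recall_score; the sketch below treats class 0 (Normal) as the positive class, matching the cells used above.
# Cross-check: per-class recall equals the TPR/TNR computed above
from sklearn.metrics import recall_score
print("TPR:", recall_score(Y_test, y_pred, pos_label=0))  # recall of class 0 (Normal)
print("TNR:", recall_score(Y_test, y_pred, pos_label=1))  # recall of class 1 (Abnormal)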
To study the effect of the training set size, we can also plot the best test error rate, obtained over the candidate values of k, against the size of the training set for N ∈ {10, 20, 30, ..., 210}. This is called the "learning curve".
For each N, we build the training set from the first N/3 rows of Class 0 and the first N − N/3 rows of Class 1, and use the remaining rows as the test set.
#Select training set size N
alternative_N = np.arange(10, 211, 10)  # N = 10, 20, ..., 210
Best_test_error=[]
for N in alternative_N:
    training_N_set=pd.merge(df_NO.iloc[0:N//3,:],df_AB.iloc[0:N-N//3,:],how='outer') # select training set
    Test_N_set= pd.merge(df_NO.iloc[N//3:,:],df_AB.iloc[N-N//3:,:],how='outer') # remaining rows as test set
    alternative_N_k=np.arange(1,N,5)  # candidate values of k for this N
    X_N_Train = training_N_set.iloc[:, :-1].values
    Y_N_Train = training_N_set.iloc[:, -1].values
    X_N_Test = Test_N_set.iloc[:, :-1].values
    Y_N_Test = Test_N_set.iloc[:, -1].values
    Test_N_error=[]
    for k in alternative_N_k:
        classifier=KNeighborsClassifier(n_neighbors=k,metric='euclidean')
        classifier.fit(X_N_Train, Y_N_Train)
        pred_i=classifier.predict(X_N_Test)
        Test_N_error.append(np.mean(pred_i != Y_N_Test))
    Best_test_error.append(min(Test_N_error))  # best test error over k for this N
plt.figure(figsize=(15, 6))
plt.plot(alternative_N, Best_test_error, color='red', linestyle='dashed',markerfacecolor='blue', markersize=10)
plt.title('Learning curve')
plt.xlabel('Training Size N')
plt.ylabel('Mean Error')
plt.grid(True)
plt.minorticks_on()
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
The learning curve makes it clear that including more training data in the model leads to better predictions!