What is a Confusion Matrix in Machine Learning?
The confusion matrix is an important tool for understanding the output of classification algorithms.
In the case of regression, we can assess correctness using the score() method and the coefficient of determination (R²).
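As a quick illustration of the regression case, a regressor's score() method returns the coefficient of determination (R²), which we can also compute with r2_score. The snippet below is a minimal sketch; the data values are made up purely for illustration.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Toy data, made up purely for illustration
X = [[1], [2], [3], [4]]
y = [2.0, 4.1, 5.9, 8.2]
reg = LinearRegression().fit(X, y)
# score() returns the coefficient of determination (R^2)
print("R^2 via score() :", reg.score(X, y))
print("R^2 via r2_score() :", r2_score(y, reg.predict(X)))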
Here we will explore various ways to evaluate correctness for classification algorithms.
Finding Accuracy:
Accuracy gives us the fraction of correct predictions made. If the accuracy is 1, then every predicted label matches the corresponding true label in the testing sample.
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
# accuracy_score returns the fraction of predictions that match the true labels
print("Score :", accuracy_score(y_true, y_pred))
Here only 2 of the 4 values are predicted correctly, so the accuracy is 2/4 = 0.5.
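As a small aside, accuracy_score can also report the raw count of correct predictions instead of the fraction by passing normalize=False. The sketch below reuses the same toy labels as above.
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
# normalize=False returns the number of correct predictions (here 2) instead of the fraction
print("Correct predictions :", accuracy_score(y_true, y_pred, normalize=False))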
Why is accuracy not always the best option?
Suppose we have an imbalanced dataset consisting of 95 '0's and 5 '1's, and the algorithm predicts 0 for every input.
The model then achieves 95% accuracy, yet it is wrong every single time it needs to detect a '1'.
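A minimal sketch of this scenario (the 95/5 class counts and the constant predictor below are assumptions made just to mirror the numbers in the example):
from sklearn.metrics import accuracy_score
# 95 samples of class 0 and 5 samples of class 1
y_true = [0] * 95 + [1] * 5
# A "model" that predicts 0 for every input
y_pred = [0] * 100
# Accuracy looks great (0.95) even though every '1' is missed
print("Accuracy :", accuracy_score(y_true, y_pred))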
Confusion Matrix
A confusion matrix is useful because it shows not only how many predictions the model got right, but also exactly how it got the others wrong, which helps us understand where the model struggles and how it might be improved.
There are some terms that one must know regarding confusion matrices.
- True Positives: the number of samples predicted positive that were actually positive.
- True Negatives: the number of samples predicted negative that were actually negative.
- False Positives: the number of samples predicted positive that were actually negative.
- False Negatives: the number of samples predicted negative that were actually positive.
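For a binary problem, these four counts can be read directly from sklearn's confusion_matrix. The sketch below uses made-up labels just to show the layout (rows are true classes, columns are predicted classes).
from sklearn.metrics import confusion_matrix
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)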
In the case of multi-class classification, however, the confusion matrix shows the number of samples predicted correctly and incorrectly for each class, rather than a single set of true positives and so on.
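For example, with three classes the matrix is 3x3, and entry (i, j) counts the samples of true class i that were predicted as class j. The labels below are made up just to illustrate this.
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
# Diagonal entries are the correct predictions for each class
print(confusion_matrix(y_true, y_pred))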
Let's check the confusion matrix for the iris dataset.
Importing the required modules and loading the dataset.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
Dividing the dataset into training and testing data and using predict() to obtain the predicted labels.
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=1)
clf = LogisticRegression(solver='liblinear')
clf.fit(x_train, y_train)
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)
Finally, computing the confusion matrix for both the training and the testing predictions.
print(confusion_matrix(y_train, y_train_pred))
print(confusion_matrix(y_test, y_test_pred))
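As a follow-up, the diagonal of the confusion matrix holds the correctly classified samples, so accuracy can be recovered from the matrix directly. This is a minimal sketch assuming the y_test and y_test_pred variables from the code above.
import numpy as np
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_test_pred)
# The diagonal entries are correct predictions; their sum divided by the
# total number of samples reproduces the accuracy score
print("Test accuracy from the matrix :", np.trace(cm_test) / cm_test.sum())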
The jupyter notebook for this blog is here.
Follow me to learn machine learning easily.