Beginner's Guide to the Random Forest Classifier using Scikit-Learn
Random forests combine multiple decision trees to get a more accurate and stable prediction.
Disadvantages of using decision trees:
Decision trees are prone to overfitting. Pruning can reduce it drastically, but because a tree keeps splitting nodes until it reaches pure leaves, the risk of overfitting grows with the number of nodes.
Understand the decision tree here.
Each decision tree is built using only some of the features, so no single tree sees them all.
We therefore create more decision trees so that, across the forest, all the features are included.
The randomness in selecting the features and data points helps reduce overfitting. The final answer is the majority vote of the answers from the decision trees of the forest.
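As a rough illustration (a minimal sketch with made-up values, not scikit-learn internals), majority voting over the predictions of individual trees might look like this:
import numpy as np

# Hypothetical predictions for one sample from five individual trees:
# three trees vote for class 1, two vote for class 0
tree_predictions = np.array([1, 0, 1, 1, 0])

# The forest's answer is the most frequent (majority) class
values, counts = np.unique(tree_predictions, return_counts=True)
majority_class = values[np.argmax(counts)]
print(majority_class)  # 1
In practice, scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than taking a hard vote, but the intuition is the same.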
There are many ways to arrive at the final answer. One of them relies on bagging.
Bagging
It is a very powerful ensemble method.
An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.
Our goal is to reduce the variance of the decision trees. Combining multiple trees to make a prediction improves accuracy and precision.
Procedure:
We create smaller datasets in which data points can repeat and only some features are randomly selected. Bagging, strictly speaking, applies to the data points, not the features.
These smaller datasets are obtained by choosing the data points and the features in the following manner (a small sketch follows the list):
- Features are selected at random without repetition
- Data-points are selected at random with repetition (which is actually bagging)
Ideally no data point should be left out entirely, so we increase the number of trees to make it likely that every data point appears in at least one of them.
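Here is a minimal sketch of this sampling step, using a toy NumPy array and made-up names. Scikit-learn handles this sampling internally (its bootstrap and max_features parameters); note that it actually picks a random feature subset at each split rather than once per tree, but the idea is the same.
import numpy as np

rng = np.random.default_rng(0)

# A toy data matrix standing in for the training set: 6 samples, 4 features
X_toy = np.arange(24).reshape(6, 4)
n_samples, n_features = X_toy.shape

# Data points are drawn at random WITH repetition (the bagging / bootstrap step)
sample_idx = rng.integers(0, n_samples, size=n_samples)

# Features are drawn at random WITHOUT repetition
feature_idx = rng.choice(n_features, size=2, replace=False)

# The smaller dataset that one tree of the forest would be trained on
X_subset = X_toy[np.ix_(sample_idx, feature_idx)]
print(X_subset.shape)  # (6, 2)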
Selecting different features in a data set helps us know the relative importance of each feature.
Feature selection
Feature selection means keeping only the useful features, those that make a major contribution to the output.
The advantages of feature selection are as follows:
1. Reduces Overfitting
2. Improves accuracy of the model
3. Reduces training time
Implementation using sklearn.
We will be using the iris dataset to see how feature selection works.
Let's import all the necessary libraries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
import pandas as pd
Load the iris dataset and initialize ‘X’ to features and ‘Y’ to labels.
iris = datasets.load_iris()
features = iris.feature_names
X = iris.data
Y = iris.target
Splitting the dataset into training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 14)
We use RandomForestClassifier to train the model, setting the number of decision trees (n_estimators) to 10000.
clf = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state=14)

# Train the classifier
clf.fit(X_train, Y_train)
clf.score(X_test, Y_test)
Implementing feature selection
feature_importances = pd.DataFrame(clf.feature_importances_,
                                   index=features,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances
This shows that petal length and petal width are more important than the other two features, i.e. sepal length and sepal width.
# Make a classifier that picks only the important features,
# i.e. those with an importance value greater than 0.15
sfm = SelectFromModel(clf, threshold=0.15)
sfm.fit(X_train, Y_train)

# Create a data subset containing only the important features
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
Now we create a RandomForestClassifier again, this time using only the important features.
# New random forest classifier with only important features
clf_important = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state=14)
We fit it on the subset we created from the important features.
clf_important.fit(X_important_train, Y_train)
We look at the score, which tells us how the model is performing (the mean accuracy on the test set).
clf_important.score(X_important_test, Y_test)
We can see the improved score after feature selection.
The jupyter notebook for this blog is available here.
Follow me to learn machine learning from scratch.