Beginner's Guide to Decision Trees Using Scikit-Learn

Anantha Kattani
3 min read · May 27, 2022


We will use the iris dataset from sklearn to implement DecisionTreeClassifier. Let's first import everything necessary for running the model.

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

Splitting the data into training and test sets using train_test_split().

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 1)

Creating the decision tree classifier, fitting it on the training data, and predicting on both the training and test sets.

clf = DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(x_train, y_train)
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)

Analyzing the results using confusion_matrix

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train, y_train_pred)
(output: confusion matrix for the training predictions)
confusion_matrix(y_test, y_test_pred)
(output: confusion matrix for the test predictions)
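
As a quick sanity check, a minimal sketch (using scikit-learn's accuracy_score, in addition to what is shown above) is to summarize the same predictions as overall accuracies on the training and test sets:

from sklearn.metrics import accuracy_score

# A large gap between these two numbers is an early warning of over-fitting,
# which is discussed further below.
print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy:", accuracy_score(y_test, y_test_pred))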

Important Points to remember in Decision Trees

Entropy

Entropy is the measure that decides how a decision tree should split its data.

A decision tree is built top-down: starting from the root node, the data is repeatedly divided into subsets until each subset contains elements with similar values (a pure node).

Entropy quantifies the homogeneity of a sample, i.e. how pure the node is. If the node is completely pure, its entropy is zero.
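
To make this concrete, here is a rough sketch (an illustration, not scikit-learn's internal code) of how the entropy of a node's labels can be computed from the class proportions:

import numpy as np

def entropy(labels):
    # Proportion of samples belonging to each class in this node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Entropy = -sum(p * log2(p)); it is zero for a pure node.
    return -np.sum(p * np.log2(p))

entropy([0, 0, 0, 0])   # 0.0 -> completely pure node
entropy([0, 0, 1, 1])   # 1.0 -> maximally impure mix of two classes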

Information gain

This value measures how much the entropy decreases after the dataset is split on a particular attribute.

For each split, we choose the attribute or condition that gives the maximum possible information gain.
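
As a rough sketch (reusing the entropy helper above; this is an illustration rather than the library's implementation), information gain is the parent node's entropy minus the weighted entropy of the child nodes produced by the split:

def information_gain(parent_labels, child_label_groups):
    # Weighted average entropy of the children after the split.
    n = len(parent_labels)
    children_entropy = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - children_entropy

# Splitting [0, 0, 1, 1] into two pure children gives the maximum possible gain of 1.0.
information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])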

Split Number

It reflects how many branches a split produces at a node: the more splits that happen, the larger the split number (see the sketch after the gain-ratio section below).

Gain ratio

It is the ratio of the information gain to the split number. The gain ratio should be as high as possible.
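
Here is a small sketch tying the two ideas together (reusing entropy and information_gain from above; these helpers are illustrative assumptions, not scikit-learn code):

def split_info(child_label_groups):
    # The "split number" (split information): entropy of the split itself,
    # based only on how many samples go to each branch.
    n = sum(len(g) for g in child_label_groups)
    p = np.array([len(g) / n for g in child_label_groups])
    return -np.sum(p * np.log2(p))

def gain_ratio(parent_labels, child_label_groups):
    # Gain ratio = information gain / split number.
    return information_gain(parent_labels, child_label_groups) / split_info(child_label_groups)

gain_ratio([0, 0, 1, 1], [[0, 0], [1, 1]])   # 1.0 for this balanced, perfectly pure split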

Gini Index

It indicates the probability that an element in the node would be classified incorrectly, given the feature the node is divided on. If every element is classified correctly (the node is pure), its value is zero. The Gini index should be as low as possible.

The Gini index varies between 0 and 1, where 0 expresses purity of classification, i.e. all the elements belong to a single class, and 1 indicates that the elements are randomly distributed across various classes. A Gini index of 0.5 indicates an equal distribution of elements over two classes.
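
For illustration, here is a minimal sketch of the Gini impurity of a node, computed as 1 minus the sum of squared class proportions (assuming the numpy import from the entropy example above):

def gini(labels):
    # Gini impurity = 1 - sum(p^2) over the class proportions p.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

gini([0, 0, 0, 0])   # 0.0  -> pure node
gini([0, 0, 1, 1])   # 0.5  -> equal mix of two classes
gini([0, 1, 2, 3])   # 0.75 -> spread over four classes, approaching 1

Note that Gini is the default criterion for DecisionTreeClassifier in scikit-learn; we chose criterion='entropy' explicitly in the example above.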

Difference between using Gini Index and Information Gain

The Gini index operates on categorical target variables in terms of "success" or "failure" and performs only binary splits. Information gain, by contrast, computes the difference between the entropy before and after the split, indicating the impurity of the classes of elements.

Growing the tree until every node is pure makes the algorithm reach zero error on the training data, but at the cost of reduced accuracy on unseen (test) data. This condition is called over-fitting.

To avoid overfitting, we have to add more stopping conditions to the tree. Right now, the only two stopping criteria we have are reaching a pure node or having the number of features exceed a certain limit. With extra conditions we stop early and avoid building the entire tree before it perfectly classifies the training set.

This can be done in two simple ways:

Defining max depth (k):

We decide the depth up to which the tree is allowed to grow its nodes; once that depth is reached, it stops splitting.

Choosing the optimal value for k is important.

Another problem with this method is that, in some cases, we need unbalanced trees for our classification, but fixing a value of k restricts us to building only balanced decision trees, which wastes computation time and leads to unwanted classification errors.
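
One simple way to explore this trade-off (a sketch reusing the iris split from above; the depth range 1-7 is an arbitrary choice for illustration) is to compare training and test accuracy across a few depths and pick the smallest k beyond which test accuracy stops improving:

from sklearn.metrics import accuracy_score

for k in range(1, 8):
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=k, random_state=1)
    tree.fit(x_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(x_train))
    test_acc = accuracy_score(y_test, tree.predict(x_test))
    # A widening gap between train_acc and test_acc signals over-fitting.
    print(k, round(train_acc, 3), round(test_acc, 3))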

The Jupyter notebook for this blog is here.

In the next blog, we will discuss how to visualize the decision tree.
