Trees for Making Decisions in ML

Varad Kajarekar
5 min read · May 31, 2021


For students and professionals in the data science field, "tree" is a valuable word, as it refers to a famous and widely used data structure. It helps data scientists, data analysts, and data enthusiasts in general in their work.

Trees: Unlike Arrays, Linked Lists, Stacks, and Queues, which are linear data structures, trees are hierarchical data structures.
Tree Vocabulary: The topmost node is called the root of the tree. The elements directly under an element are called its children. The element directly above a node is called its parent. For example, ‘a’ is a child of ‘f’, and ‘f’ is the parent of ‘a’. Finally, elements with no children are called leaves.

tree

          j        ← root
        /   \
       f     k
      / \     \
     a   h     z   ← leaves
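The vocabulary above can be made concrete with a few lines of Python. This is a minimal sketch (the `Node` class and its method names are illustrative, not from the article), building the exact tree in the diagram:

```python
class Node:
    """A minimal tree node: a value plus a list of child nodes."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

    def is_leaf(self):
        # An element with no children is a leaf.
        return not self.children

# Build the tree from the diagram: j is the root; a, h, z are leaves.
tree = Node("j", [
    Node("f", [Node("a"), Node("h")]),
    Node("k", [Node("z")]),
])

print(tree.value)                              # the root: j
print([c.value for c in tree.children])        # children of j: f and k
print(tree.children[0].children[0].is_leaf())  # a has no children, so True
```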

Why Trees?
1. One reason to use trees might be because you want to store information that naturally forms a hierarchy. For example, the file system on a computer:

file system
-----------
     /    <-- root
  /      \
...       home
      /          \
   ugrad        course
    /       /      |     \
  ...      cs101  cs112  cs113

2. Trees (with some ordering, e.g., a BST) provide moderate access/search time (quicker than a Linked List, slower than an Array).
3. Trees provide moderate insertion/deletion time (quicker than an Array, slower than an Unordered Linked List).
4. Like Linked Lists and unlike Arrays, Trees have no upper limit on the number of nodes, as nodes are linked using pointers.
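Point 2 can be illustrated with a binary search tree. This is a minimal sketch (the class and function names are illustrative): each key goes left if smaller than the current node and right if larger, so a search only walks one root-to-leaf path instead of scanning every element.

```python
class BSTNode:
    """A binary search tree node: key plus left/right subtrees."""
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Walk down the tree and attach the new key where it belongs.
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    # Follow one branch per level; no need to visit every node.
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)

print(search(root, 6))   # True
print(search(root, 7))   # False
```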

“Decision trees can be used to perform one of two tasks: Classification and Regression.”

Decision tree algorithm:

  • A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, though it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
  • In a decision tree, there are two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
  • It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using the Decision Tree:

  • Decision trees usually mimic human thinking while making a decision, so they are easy to understand.
  • The logic behind a decision tree can be easily understood because it shows a tree-like structure.

How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.

At the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further down. It continues this process until it reaches a leaf node. The complete process can be better understood with the algorithm below:

  • Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
  • Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
  • Step-3: Divide S into subsets that contain the possible values of the best attribute.
  • Step-4: Generate the decision tree node that contains the best attribute.
  • Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes.
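The five steps above can be sketched in plain Python as a tiny ID3-style tree builder. This is a minimal illustration, not a production implementation: the toy dataset and function names are made up for the example, and entropy-based information gain is used as the Attribute Selection Measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain from splitting on column index `attr` (the ASM)."""
    gain = entropy(labels)                     # parent entropy
    n = len(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)  # weighted child entropy
    return gain

def build(rows, labels, attrs):
    # Step-5 stop condition: node is pure, or no attributes left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))  # Step-2
    node = {"attr": best, "children": {}}            # Step-4
    for value in set(r[best] for r in rows):         # Step-3: split on values
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build(             # Step-5: recurse
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best])
    return node

# Toy data: columns = (outlook, windy); label = play
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "no", "yes", "no"]
tree = build(rows, labels, [0, 1])
print(tree["attr"])   # column 1 ("windy") separates the classes perfectly
```

Here "windy" splits the data into two pure subsets, so it has the highest information gain and becomes the root; both children are immediately leaves.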

Key Measures Involved in Decision Making

Before starting, a decision tree usually considers the entire dataset as the root. Then, based on particular conditions, it starts splitting by means of branches (internal nodes) and makes decisions until it produces an outcome as a leaf. The one important thing to know is that, while building the tree, it reduces the impurity present in the attributes and simultaneously gains information, so as to achieve the proper outcomes.

(Figure: a bank example of a decision tree.)

1. Entropy

It is defined as a measure of the impurity present in the data.

A lower entropy makes a model better at prediction, as it segregates the classes better. Entropy is calculated with the following formula, where pᵢ is the proportion of observations belonging to class i at the node:

Entropy = − Σᵢ pᵢ · log₂(pᵢ)

2. Information Gain

It is a measure of how much a split reduces the impurity (entropy) in a dataset. The higher the information gain, the lower the resulting entropy.

Information Gain = Entropy(parent) − Σ (Weighted % × Entropy(child))

Weighted % = number of observations in a particular child / sum of observations in all child nodes
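The two formulas above can be checked with a small worked example in Python. The numbers here are hypothetical, chosen only to make the arithmetic easy to follow:

```python
import math

def entropy(p):
    """Binary entropy for a node whose positive-class proportion is p."""
    if p in (0, 1):
        return 0.0  # a pure node has zero impurity
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical parent node: 10 observations, 5 positive -> maximum impurity.
parent = entropy(5 / 10)                  # 1.0

# Split into two children: 4 observations (all positive), 6 (1 positive).
left, right = entropy(4 / 4), entropy(1 / 6)

# Weighted % = observations in the child / observations in all children.
gain = parent - (4 / 10) * left - (6 / 10) * right
print(round(parent, 3), round(gain, 3))   # 1.0 0.61
```

The split isolates one perfectly pure child, so it recovers about 0.61 of the parent's 1.0 bit of entropy as information gain.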

3. Reduction in Variance

Reduction in variance is used when the decision tree performs regression and the output is continuous in nature. The algorithm splits the population using the variance formula, choosing the split that most reduces the weighted variance of the child nodes.
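Reduction in variance works just like information gain, with variance in place of entropy. A minimal sketch with hypothetical numbers:

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical continuous targets at a node, split into two children
# that cleanly separate the low values from the high ones.
parent = [10, 12, 30, 32]
left, right = [10, 12], [30, 32]

n = len(parent)
weighted = sum(len(c) / n * variance(c) for c in (left, right))
reduction = variance(parent) - weighted
print(variance(parent), weighted, reduction)  # 101.0 1.0 100.0
```

A good regression split leaves each child with targets close to its own mean, so the weighted child variance collapses and the reduction is large.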

Advantages:

1. Compared to other algorithms, decision trees require less effort for data preparation during pre-processing.

2. A decision tree does not require normalization of data.

3. A decision tree does not require scaling of data as well.

4. Missing values in the data also do NOT affect the process of building a decision tree to any considerable extent.

5. A Decision tree model is very intuitive and easy to explain to technical teams as well as stakeholders.

Disadvantages:

1. A small change in the data can cause a large change in the structure of the decision tree, causing instability.

2. For a decision tree, calculations can sometimes become far more complex than for other algorithms.

3. Decision trees often take more time to train.

4. Decision tree training is relatively expensive, as the complexity and time taken are greater.

5. A plain decision tree is often inadequate for regression and for predicting continuous values compared with dedicated regression techniques.

By now, I hope you have got an introduction to the decision tree, one of the best machine learning algorithms for solving classification problems.

As a fresher, I'd advise you to learn these techniques, understand their implementation, and later use them in your models.

Some other applications of trees:

1. XML parsers use tree algorithms.

2. Decision-based algorithms used in machine learning work on tree algorithms.

3. Databases also use tree data structures for indexing.

4. The Domain Name System (DNS) also uses tree structures.

5. File explorers ("My Computer") on mobile phones and computers present the file system as a tree.

6. BSTs are used in computer graphics.

7. On Q&A websites like Quora, comments are children of the question they respond to, forming a tree.
