
A Complete Explanation of the Random Forest Algorithm

By Sheikh Aman | Updated: Aug 10




What you will learn in this post:

  • What is the Random Forest Algorithm?

  • Why do we use the Random Forest Algorithm?

  • How does the Random Forest Algorithm work?

  • Difference Between Decision Trees and Random Forest.

  • Real-life example of the Random Forest Algorithm.

  • Applications of Random Forest Algorithm.

  • Feature importance in the Random Forest algorithm.

  • Important hyperparameters of the Random Forest algorithm.

  • Random Forest Algorithm advantages.

  • Random Forest Algorithm disadvantages.

  • Summary.



What is the Random Forest Algorithm?


The Random Forest algorithm is a supervised learning algorithm based on ensemble learning. Ensemble learning is a technique that combines multiple algorithms, of the same or different types, to form a more powerful classification or regression model. The random forest algorithm combines multiple decision trees into a single model. Because of its versatility and simplicity, it is one of the most widely used algorithms, and it can be applied to both classification and regression problems.
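To make this concrete, here is a minimal sketch of training a random forest classifier with scikit-learn. The iris dataset and the parameter values are illustrative choices for this post, not requirements of the algorithm.

```python
# A minimal sketch: a random forest classifier in scikit-learn,
# trained on the built-in iris dataset (illustrative choices).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a bootstrap sample of the data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```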


Why do we use the Random Forest Algorithm?


The reasons we use the random forest algorithm are:

  • It can be used for both classification and regression.

  • Depending on the implementation, it can handle missing values in the data.

  • The chance of overfitting the model is low.

  • It can model categorical features as well as numerical ones.

How does the Random Forest Algorithm work?


The following are the essential steps involved in performing the random forest algorithm:

  • Pick N random records from the dataset.

  • Build a decision tree based on these N records.

  • Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

  • For a regression problem, each tree in the forest predicts a value for Y (the output) for a new record; the final value is calculated by averaging the values predicted by all the trees in the forest. For a classification problem, each tree in the forest predicts the class to which the new record belongs, and the record is assigned to the class that wins the majority vote. A minimal sketch of these steps appears after this list.
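Here is a rough sketch of those steps for classification, using scikit-learn's DecisionTreeClassifier as the base tree. The function names are illustrative, not from the original post; X and y are assumed to be NumPy arrays with integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_forest(X, y, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: pick N random records (a bootstrap sample, with replacement)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: build a decision tree on those records
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    # Step 3: repeat until we have the chosen number of trees
    return trees

def predict_simple_forest(trees, X):
    # Step 4 (classification): each tree votes and the majority class wins;
    # for regression we would average the predictions instead.
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```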



Difference between Decision Trees and Random Forest.


A random forest is essentially a collection of decision trees, but there are some important differences between the two.


If you input a training dataset with features and labels into a decision tree, it will formulate a set of rules, which it then uses to make predictions.


For example, to predict whether a person will click on an online advertisement, you might collect the ads the person clicked on in the past, along with some features that describe the person's decision. If you feed the features and labels into a decision tree, it will generate rules that help predict whether the advertisement will be clicked or not. In comparison, the random forest algorithm randomly selects observations and features to build several decision trees and then averages the results.


Another major difference is that "deep" decision trees can suffer from overfitting. Most of the time, random forest prevents this by creating random subsets of the features and building smaller trees with those subsets; afterwards, it combines the subtrees. It is important to note that this does not work every time, and it also makes the computation slower, depending on how many trees the random forest builds. An illustrative comparison follows.
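As an illustration (not from the original post), here is one way to compare a single unpruned decision tree against a random forest on the same synthetic data; the deep tree typically fits the training set perfectly but generalizes worse.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The single deep tree usually scores ~1.0 on the training data but lower
# on the test set; the forest usually generalizes better.
print("tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```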

Real-life example of the Random Forest Algorithm.


Before we go into the technical details of the random forest algorithm, let's look at a real-life example to understand it in layman's terms.



Suppose Mady somehow got two weeks of leave from his office. He wants to spend those two weeks travelling, and he wants to visit a place he will like.


So he decided to ask his close friend to suggest places he might like. His friend started by asking about his past trips, with questions like: "You visited place X; did you like it?"


Based on the answers Mady gave, his close friend started recommending places Mady might like. In effect, his friend formed a decision tree from Mady's answers.


But a single close friend's recommendation will be biased by the closeness of their friendship. So Mady decided to ask a few more friends to recommend the best and most exciting places he might like.


His friends then asked him some random questions, and each one recommended a place. Mady chose the place with the most votes from his friends as the final place to visit.


In the above example of Mady's trip planning, two interesting algorithms were used: the decision tree algorithm and the random forest algorithm. I hope you have spotted them already; anyhow, I would like to highlight them again.


Decision Tree Algorithm:

To recommend the best place to Mady, his close friend asked him some questions. Based on the answers given by Mady, the friend recommended a place. This is the decision tree algorithm approach. Let me explain why.


Mady's friend used the answers given by Mady to form rules, and later used those rules to recommend the best place, one Mady would like. These rules might be: Mady likes a place with many trees, or with waterfalls, etc.


In the above approach, Mady's close friend is the decision tree. The vote (the recommended place) is the leaf of the friend's tree (the target class). The target is finalized by a single person; in technical terms, by a single decision tree.


Random Forest Algorithm:

In the other case, Mady asked several friends to recommend the best place to visit. Each friend asked him different questions and came up with a recommendation. Later, Mady considered all the recommendations and counted the votes; the votes select the most popular place among all the recommendations from his friends.


Mady considers each recommended place, and if the same place is recommended by another friend, he increases its count. In the end, Mady goes to the place with the highest count.


In this case, the recommended place (the target prediction) is decided by many friends. Each friend is a tree, and all the friends combined form the forest. This forest is the random forest, since each friend asked random questions before recommending the best place to visit.


Applications of Random Forest Algorithm.


There are many applications, more than you can imagine, but here we will discuss three basic fields in which it is applied:

  • Banking

  • Medicine

  • Stock market

1. Banking:

In the banking sector, the random forest algorithm is widely used in two main applications: finding loyal customers and finding fraudulent customers.



A loyal customer is not just a customer who pays well, but also one who can take a large loan and pay the loan interest back to the bank properly, because the growth of the bank depends largely on loyal customers. The bank's customer data is analyzed in depth to find the patterns that identify loyal customers from their details.


In the same way, the bank needs to identify customers who are not profitable, such as those who take a loan but do not pay the interest properly, or who are otherwise outliers. If the bank can identify these kinds of customers before granting a loan, it gets the chance to decline the loan for them. In this case too, the random forest algorithm is used to identify customers who are not profitable for the bank.


2. Medicine:

In the field of medicine, the random forest algorithm is used to identify the right combination of components to validate a drug. It is also helpful for identifying a disease by analyzing a patient's medical records.


3. Stock Market:

In the stock market, the random forest algorithm is used to identify a stock's behaviour, as well as the expected loss or profit from purchasing a particular stock.


Feature importance in the Random Forest algorithm.


Another great quality of the random forest algorithm is that it makes it easy to calculate the relative importance of each feature for the prediction. Sklearn provides a great tool for this: it measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances equals one.
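As a short sketch (illustrative, continuing the iris example from earlier), the scores are available as the feature_importances_ attribute after fitting:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(data.data, data.target)

# feature_importances_ is scaled so that the scores sum to 1.0
for name, score in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```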


If you don't know how a decision tree works, or what a leaf or node is, here is a good description from Wikipedia: "In a decision tree, each internal node represents a 'test' on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). A node that has no children is a leaf."


By looking at the feature importances, you can decide which features to drop because they don't contribute enough (or sometimes anything at all) to the prediction. This matters because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa.


Important hyperparameters of the Random Forest algorithm.


The hyperparameters of the random forest are used either to increase the predictive power of the model or to make the model faster. Let's look at the hyperparameters of sklearn's built-in random forest function.


1. Increasing the predictive power


Firstly, there is the n_estimators hyperparameter, which is simply the number of trees the algorithm builds before taking the maximum vote or averaging the predictions. In general, a higher number of trees increases performance and makes the predictions more stable, but it also slows down the computation.


Another important hyperparameter is max_features, which is the maximum number of features the random forest considers when splitting a node. Sklearn provides several options, all described in the documentation.


The last important hyperparameter is min_samples_leaf. This determines the minimum number of samples required to be at a leaf node. Illustrative settings for these three hyperparameters are shown in the sketch below.
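A minimal sketch of these three settings; the values are examples, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,     # number of trees to build
    max_features="sqrt",  # how many features to consider at each split
    min_samples_leaf=5,   # minimum samples required at a leaf node
)
```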


2. Increasing the model's speed


The n_jobs hyperparameter tells the engine how many processors it is allowed to use. A value of 1 means it can only use one processor, while a value of -1 means there is no limit.


The random_state hyperparameter makes the model's output replicable. The model will always produce the same result for a particular value of random_state, given the same hyperparameters and the same training data.


Lastly, there is oob_score (also called OOB sampling), a random forest cross-validation method. In this sampling, about one-third of the data is not used to train the model and can instead be used to evaluate its performance; these samples are called the out-of-bag samples. It is very similar to the leave-one-out cross-validation method, but it comes with almost no additional computational burden. A short sketch using these options follows.
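A short sketch putting these options together (the iris dataset is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,        # use all available processors
    random_state=42,  # make the result reproducible
    oob_score=True,   # evaluate on the out-of-bag samples
).fit(X, y)
print("OOB score:", model.oob_score_)
```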



Random Forest Algorithm advantages.


  • The random forest algorithm is not strongly biased, since there are multiple trees and each tree is trained on a subset of the data. Basically, the random forest algorithm relies on the power of "the crowd", so the overall bias of the algorithm is reduced.

  • This algorithm is very stable. Even if a new data point is introduced into the dataset, the overall algorithm is not affected much, since the new data may impact one tree, but it is very hard for it to impact all the trees.

  • The random forest algorithm works well when you have both categorical and numerical features.

  • The random forest algorithm also works well when the data has missing values or has not been scaled well.

Random Forest Algorithm disadvantages.


  • A major disadvantage of random forests lies in their complexity. They require far more computational resources, owing to the large number of decision trees joined together.

  • Due to their complexity, they require much more time to train than other comparable algorithms.

Summary (All-In-One).


Random forest is a great algorithm to train early in the model development process, to see how it performs. Its simplicity makes building a "bad" random forest a tough proposition.


The algorithm is also an excellent choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features.


Random forests are also very hard to beat performance-wise. Of course, you can probably always find a model that performs better, like a neural network for instance, but such models usually take longer to develop, whereas random forests can handle lots of different feature types: binary, categorical and numerical.


Overall, random forest is a (mostly) fast, simple and flexible tool, but not without some limitations.


In my next post, I will explain classification and regression with the Random Forest algorithm, with code in Python.

Till then, stay tuned, and you can follow our Instagram and Facebook pages; you will find the links in the header or footer.

Thanks for giving your valuable time to this post.
