• Sheikh Aman

Regression and classification in the Random Forest algorithm with Python.

Updated: Aug 10, 2020

Things you will learn now.

  • Introduction to Random Forest Algorithm.

  • How Does the Random Forest Algorithm work?

  • Bagging Technique.

  • Boosting Technique.

  • Regression using Random Forest Algorithm.

  • Classification using the Random Forest Algorithm.

  • Summary.

Introduction to Random Forest Algorithm.

Let’s understand the algorithm in layman’s terms. Suppose you want to go on a trip and would like to visit a place that you will enjoy.

So how do you find a place you will like? You can search online, read reviews on travel blogs and portals, or ask your friends.

Let’s suppose you have decided to ask your friends and talked with them about their past travel experiences to various places. You will get some recommendations from every friend. Now you have to make a list of those recommended places. Then you ask them to vote for the best place for the trip from the list of recommended places you made. The place with the highest number of votes will be your final choice for the trip.

In the above process, there are two parts. First, you ask your friends about their individual travel experiences and get one recommendation out of the multiple places each of them has visited. This part can be considered as the decision tree algorithm. Here, each friend makes a selection from the places he or she has visited so far.

The second part, after collecting all the recommendations, is the voting procedure for choosing the best place in the list of recommendations. This whole process of getting recommendations from friends and voting on them to find the best place is known as the random forest algorithm.

It is an ensemble method (basically a divide-and-conquer approach) of decision trees, each generated on a randomly selected subset of the dataset. This collection of decision tree classifiers is also referred to as the forest. The individual decision trees are generated using an attribute selection measure such as information gain, gain ratio, or Gini index for every attribute. Each tree depends on an independent random sample. In a classification problem, each tree votes and the most popular class is chosen as the final result. In the case of regression, the average of all the tree outputs is taken as the final output. It is simpler and often more powerful than many other non-linear classification algorithms.

How Does the Random Forest Algorithm work?

It works in four steps:

  • Select random samples from a given dataset.

  • Construct a decision tree for every sample and obtain a prediction result from each decision tree.

  • Perform a vote for every predicted result.

  • Select the prediction result with the most votes as the final prediction.
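The four steps above can be sketched by hand. This is only an illustration, not how scikit-learn implements its forest internally: it assumes the iris data bundled with scikit-learn as a stand-in dataset, and the number of trees (25) and the use of max_features="sqrt" are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for "a given dataset".
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 25  # arbitrary choice for this sketch
trees = []
for _ in range(n_trees):
    # Step 1: select a random sample (with replacement) from the dataset.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: construct a decision tree for this sample.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: every tree votes on every row; shape is (n_trees, n_samples).
all_votes = np.array([tree.predict(X) for tree in trees]).astype(int)

# Step 4: for each row, the class with the most votes is the final prediction.
forest_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_votes.T]
)

train_accuracy = (forest_pred == y).mean()
print("training accuracy of the hand-made forest:", train_accuracy)
```

In practice you would use RandomForestClassifier, which does the sampling and the combining of tree predictions for you.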

Bagging Technique.

Bagging (short for Bootstrap Aggregating) is a parallel ensemble method that decreases the variance of a prediction model by generating additional training sets in the training stage. These sets are produced by randomly sampling with replacement from the original set. Because the sampling is done with replacement, some observations may be repeated in each new training set. In bagging, every element has the same probability of appearing in a new dataset. Resampling adds no new information, so enlarging the training set in this way cannot by itself improve the model’s predictive power. What it does is decrease the variance and tune the prediction more narrowly to the expected outcome.

These multisets of data are used to train multiple models. As a result, we end up with an ensemble of different models, and their predictions are combined. This is more robust than relying on a single model. In the case of regression, the prediction is the average of the predictions given by the different models. In the case of classification, the majority vote is taken.
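Sampling with replacement is easy to see on a toy example. The sketch below uses only the Python standard library; the ten numbered observations and the seed are made up for illustration.

```python
import random

random.seed(42)
data = list(range(10))  # ten toy observations, labelled 0..9

# A bootstrap sample has the same size as the original set, but each draw
# is made with replacement, so some observations repeat and others are missed.
bootstrap = random.choices(data, k=len(data))
left_out = sorted(set(data) - set(bootstrap))

print("bootstrap sample:", bootstrap)
print("observations never drawn:", left_out)
```

The observations that a given sample misses are often called out-of-bag observations. For large datasets, each bootstrap sample contains roughly 63% of the distinct original observations.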

For example, decision tree models tend to have a high variance. Hence, we apply bagging to them. Usually, the Random Forest model is used for this purpose. It is an extension of bagging: it takes a random selection of features instead of using all features to grow the trees. When you have many such random trees, it is called a Random Forest.
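To compare a single tree, plain bagging, and a random forest side by side, something like the following could be used. It is a rough sketch: the breast cancer data bundled with scikit-learn, the 100 estimators, and the 5-fold cross-validation are all assumptions made for the example, and the scores will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: a bundled dataset is used purely for illustration.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each on a bootstrap sample, all features available.
    "bagged trees": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=100, random_state=0
    ),
    # Random forest: bagging plus a random subset of features at each split.
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0
    ),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean cross-validated accuracy = {scores[name]:.3f}")
```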

Boosting Technique.

Boosting is a sequential ensemble method that generally decreases the bias error and builds strong predictive models. The term ‘Boosting’ refers to a family of algorithms that convert weak learners into a strong learner.

Boosting trains multiple learners one after another. The data samples are weighted, so some of them may take part in the new training sets more often than others.

In each iteration, the data points that were mispredicted are identified and their weights are increased so that the next learner pays extra attention to getting them right.

During training, the algorithm also allocates a weight to every resulting model. A learner with good prediction results on the training data is assigned a higher weight than a poor one. So when evaluating a new learner, boosting also has to keep track of that learner’s errors.

Some boosting techniques include an extra condition to keep or discard a learner. For instance, in AdaBoost an error rate of less than 50% is required to keep the model; otherwise, the iteration is repeated until a learner better than a random guess is obtained.
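One round of this re-weighting can be written out in a few lines. The sketch below follows the classic binary AdaBoost update, which is not spelled out in this post, so treat the formulas as an assumption about which variant is meant. The list of right and wrong predictions is invented for the example.

```python
import math

# Hypothetical outcome of one weak learner: True means the point was predicted correctly.
correct = [True, True, True, False, True, True, False, True, True, True]
n = len(correct)
weights = [1.0 / n] * n  # every data point starts with the same weight

# Weighted error of the learner: total weight of the mispredicted points.
error = sum(w for w, ok in zip(weights, correct) if not ok)
if error >= 0.5:
    raise ValueError("learner is no better than a random guess; discard it")

# Weight of the learner itself: the lower the error, the larger its say.
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weights of mispredicted points, decrease the rest, then normalise.
updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
total = sum(updated)
weights = [w / total for w in updated]

print("learner weight (alpha):", round(alpha, 3))
print("new data point weights:", [round(w, 3) for w in weights])
```

After the update, the mispredicted points together carry half of the total weight, which is what forces the next learner to focus on them.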

Regression using Random Forest Algorithm.

Code snippet with output

Problem definition

Here we will predict the price of a house in the USA, a typical real estate use case. The dataset for this problem is available here.


We will solve this problem using the random forest algorithm with the help of scikit-learn in Python, in the following steps:

1. Importing libraries and Dataset.

Here we import all the necessary libraries and load the dataset.

2. Quick look

In this section, we take a quick look at the dataset before running any further code. It is a healthy practice.

3. Preparing and dividing.

Here we separate the features (predictors) from the target value and then divide them into train and test sets.
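A minimal sketch of this step is shown below. Because the housing dataset itself is not reproduced in this post, the example assumes synthetic data from make_regression as a stand-in; the 500 rows, 5 features, and the 80/20 split are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the housing data (X = features, y = price).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Hold out 20% of the rows for testing; fix the seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
```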

4. Feature Scaling

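We will use the scikit-learn StandardScaler class to scale our data, as the features are on very different scales. Note that tree-based models such as random forests are largely insensitive to feature scaling, so this step is optional here; it matters much more for distance-based or gradient-based algorithms.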
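A sketch of the scaling step, again on assumed synthetic data (the column scales are invented to mimic features measured in very different units). The important detail is that the scaler is fitted on the training set only and then applied to the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with columns on very different scales.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean and std

print("column means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("column stds after scaling:", X_train_scaled.std(axis=0).round(3))
```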

5. Training of algorithm

We have scaled our data; now it is time to train our model. The RandomForestRegressor class from sklearn.ensemble is used for regression problems. We will set n_estimators=500 because of our large dataset. You can experiment with this value to get a better result.
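Training could look roughly like this. The data is again an assumed synthetic stand-in; only n_estimators=500 comes from the text above, and random_state is added so the run is repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the housing data.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 500 trees, as in the text; each tree is trained on a bootstrap sample.
regressor = RandomForestRegressor(n_estimators=500, random_state=0)
regressor.fit(X_train, y_train)

# Each prediction is the average of the 500 individual tree outputs.
y_pred = regressor.predict(X_test)
print("first five predictions:", y_pred[:5].round(2))
```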

6. Evaluating algorithm.

This is the final step: after training your algorithm, evaluate its performance. Metrics commonly used for regression problems are mean absolute error, mean squared error, and root mean squared error.
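The three metrics can be computed as in the sketch below. It assumes the same kind of synthetic stand-in data as before and a smaller forest (100 trees) to keep the example quick; the root mean squared error is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the housing data.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)  # average size of the errors
mse = mean_squared_error(y_test, y_pred)   # penalises large errors more heavily
rmse = np.sqrt(mse)                        # same units as the target

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```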

7. Visualizing the graph.

This helps to check the accuracy of our prediction.

Regression code for Random Forest algorithm in Scikit-learn, Python. | Get code here.

Run this code by yourself.

Classification using the Random Forest Algorithm.

Code snippet with output


In this classification problem, we will predict the type of iris flower with the help of sepal_length, sepal_width, petal_length, and petal_width. You will get the dataset here.


Here we will use a random forest classifier with scikit-learn in Python.

The solution is processed in the following steps.

1. Importing all the necessary libraries and importing the dataset.

2. Taking a quick look at the dataset, which helps to check whether our data has been properly imported or not. This is a good practice.

3. Preparing the target and predictor variables and dividing them into train and test sets.

4. Training the algorithm with the RandomForestClassifier class.

5. Checking accuracy and predicting with a demo value.
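The steps above can be sketched as follows. This is not the post's original code: it assumes the copy of the iris data that ships with scikit-learn (load_iris) rather than the linked file, and the split ratio, number of trees, and demo measurements are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1 and 2: load the data and take a quick look.
iris = load_iris()
X, y = iris.data, iris.target  # sepal length, sepal width, petal length, petal width
print("shape:", X.shape, "classes:", list(iris.target_names))

# Step 3: divide into train and test sets (70/30, stratified by class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 4: train the classifier.
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)

# Step 5: check accuracy and predict for one demo flower (values in cm).
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("test accuracy:", accuracy)

demo = [[5.1, 3.5, 1.4, 0.2]]
demo_species = iris.target_names[classifier.predict(demo)[0]]
print("predicted species for demo value:", demo_species)
```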

Classification code for Random Forest algorithm in Scikit-learn, Python. | Get code here.

Run this code by yourself


Summary.

In this session, you have learned about the random forest algorithm and how it works. There is a code snippet with the output of every line of code. You can access the code from the public Git post, or you can run the code yourself in this post without going anywhere or downloading any software. I have provided a link to more theoretical background on the random forest algorithm; just click on the heading "Introduction to Random Forest Algorithm" mentioned above.

I hope this helps you. If you face any problem, this post is always here, so come back and use it to the fullest. You can also follow me on Twitter (the link is in the footer) and drop me a message there.

Thanks and stay tuned for my upcoming posts.

Copyright © 2020 MR. Machine. All Rights Reserved