How to use Matplotlib in Python? | Data Visualization Techniques and Facts.
Updated: Aug 7, 2020
Topics you will learn:
What is data visualization?
Why data visualization?
Techniques of data visualization.
The language used for data visualization.
Plotting data with the help of python.
How to use Matplotlib in Python?
Introduction to Matplotlib.
Architecture of Matplotlib.
Matplotlib Important Terms.
Different types of analysis.
Different types of plots.
What is data visualization?
Data visualization is a very important skill in applied statistics and machine learning.
Statistics indeed concentrates on quantitative descriptions and estimations of data. Data visualization provides a complementary suite of tools for gaining a qualitative understanding.
This can be helpful when exploring and getting to know a dataset, and it may help with identifying patterns, corrupt data, outliers, and much more.
With a bit of domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts that are more visceral to you and your stakeholders than measures of association or significance.
Data visualization and exploratory data analysis are whole fields in themselves, and I recommend a deeper dive into dedicated books on the topic.
Why should we use data visualization?
Suppose you are handed a table or database file with, say, one million data points and asked to draw inferences just by looking at the raw numbers. Is that feasible? Unless you are superhuman, it is practically impossible.
This is where data visualization comes in: the data is transformed into some sort of plot and analyzed from there. As humans, we are far better at grasping a lot of information from a diagrammatic representation than from raw tables.
We need data visualization because a visual summary of data makes it easier to spot patterns and trends than poring over thousands of rows in a spreadsheet. That is simply the way the human brain works.
Since the aim of data analysis is to gain insights, data is far more valuable when it is visualized. Even though a data analyst can pull insights from data without visualization, communicating the meaning is much harder without it. Charts and graphs make communicating data findings easier, even if you can identify the patterns without them.
In undergraduate business schools, students are often taught the importance of presenting data findings with visualization. Without a visual representation of the insights, it can be hard for the audience to grasp the true meaning of the findings.
For instance, rattling off numbers to your boss won't tell them why they should care about the data, but showing them a graph of how much money the insights could save or make them is certain to get their attention.
Techniques of Data Visualization.
Here I have mentioned the most common Techniques.
Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from bar and line charts, which show the relationship between elements over time, to pie charts, which demonstrate the components or proportions of a whole.
Plots
Plots distribute two or more data sets over a 2D, or even 3D, space to show the relationship between the sets and the parameters on the plot. Scatter and bubble plots are the most traditional; when it comes to big data, analysts often use box plots, which make it possible to see relationships within large volumes of data.
Maps
Maps are widely used in several industries. They allow positioning elements on relevant objects and areas: geographical maps, building plans, website layouts, etc. Among the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links, and can include various sorts of data in a single visualization. They can be hierarchical, multidimensional, or tree-like.
The matrix is a big data visualization technique that makes it possible to reflect the correlations between multiple constantly updating (streaming) data sets.
Which language should I use?
There may be several languages in which we can perform data visualization, but the ones most widely used in the field of data science are Python and R. So your next question may be: which one should I learn, and which one has the better scope? The answer is simple: it's purely your choice. ;)
'R' is the more statistical language and has several great packages for data science applications, whereas Python, on the other hand, is widely used for general-purpose programming as well as for data science and ML-related applications.
I'm pretty comfortable with Python, so I will continue the rest of the blog with Python code. Python also has several good packages such as scikit-learn, Matplotlib, and seaborn, which help us a lot; special thanks to the developers who made our work this simple.
Plotting data with the help of Python.
As mentioned above, Python has several good packages to plot data with, and among them Matplotlib is the most useful one. Seaborn is also a great package that offers a lot of more appealing plots, and it even uses Matplotlib as its base layer.
There are also many similar sorts of plots available in Pandas when the whole data is stored in a pandas DataFrame. In this blog, we are going to discuss the various sorts of plots under these two packages and explore them thoroughly.
Introduction to Matplotlib.
There are many excellent plotting libraries in Python, and I recommend exploring them in order to create presentable graphics.
For quick and dirty plots intended for your own use, I recommend using the Matplotlib library. It is the foundation for several other plotting libraries and for the plotting support in higher-level libraries such as Pandas.
Matplotlib provides a context in which one or more plots can be drawn before the image is shown or saved to file. The context can be accessed via functions on pyplot and can be imported as follows:
from matplotlib import pyplot
There is a convention to import this context and name it plt; for example:
import matplotlib.pyplot as plt
We will not use that convention in this crash course; instead, we will stick with the standard Python import convention shown above.
Plots and charts are made by calling functions on this context; for example:
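As a minimal sketch (the sample numbers here are made up purely for illustration), a line plot is drawn by calling plot() on the context:

```python
from matplotlib import pyplot

# made-up sample data for illustration
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# draw a line plot on the pyplot context
pyplot.plot(x, y)
```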
Elements such as axes, labels, and legends can be accessed and configured on this context via separate function calls.
The drawings on the context can be shown in a new window by calling the show() function:
# display the plot
pyplot.show()
Alternatively, the drawings on the context can be saved to a file, such as a PNG formatted image file. The savefig() function can be used to save images.
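For example (the filename here is just an illustration):

```python
from matplotlib import pyplot

pyplot.plot([1, 2, 3], [2, 4, 6])
# save the current figure to a PNG file instead of showing it
pyplot.savefig("my_plot.png")
```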
This is the most basic crash course for using the Matplotlib library.
Architecture of Matplotlib.
There are, overall, three different layers in the architecture of Matplotlib, as follows.
Backend Layer
This is the bottom-most layer of a figure; it contains the implementations of the several functions that are required for plotting.
There are three main classes in the backend layer: FigureCanvas (the surface on which the figure will be drawn), Renderer (the class that takes care of drawing on the surface), and Event (which handles mouse and keyboard events).
We don't usually work directly with the backend layer.
Artist Layer
This is the second, middle layer of the architecture. It does most of the work of plotting the various functions; for example, the axis coordinates how the renderer is used on the figure canvas. To put it simply, think of paper as the figure canvas and a sketch pen as the renderer.
Then the hand of the painter is the Artist layer, which knows how to sketch to get the exact figure. There are several classes available in the artist layer, and a couple of important ones are Figure, Axes, and Axis.
The hierarchy between these classes is as follows: Figure is the topmost one, and a figure can contain multiple Axes, upon which the plots are drawn; under each Axes, we can add multiple plots.
This layer provides plenty of additional functionality to enhance the plots, and it is where most of the work happens.
Scripting Layer
This is the topmost layer, on which the bulk of our code will run. For day-to-day exploratory work, we rely almost entirely on this scripting layer of Matplotlib. Pyplot is the scripting layer, and it provides almost similar functionality to that of MATLAB, in Python.
The methods in the scripting layer almost automatically take care of the other layers, and all we need to care about is the current state (figure and subplot). Hence it is also called a stateful interface.
Matplotlib Important Terms.
Here, let's have a brief glance at a number of the commonly used terms in data visualization using Matplotlib.
Wherever plt.<some function> is used, it means I imported matplotlib.pyplot as plt in my program, and sns means I imported seaborn, purely for coding convenience.
import matplotlib.pyplot as plt
import seaborn as sns
If you don't already have the packages installed on your system, please install Python 3 and run the commands below at your command prompt.
pip3 install matplotlib
pip3 install seaborn
Axes
Axes is the whole area of a single plot in the figure. It is the class that contains the several attributes needed to draw a plot, such as adding a title, giving labels, and selecting bin values for different sorts of plots.
We can have multiple axes in a single figure, through which we can combine multiple plots into one figure. For example, if we want both PDF and CDF curves in the same figure, we can create two axes, draw each curve on its own axes, and thereby combine them into a single figure.
Grid
When the grid is enabled in a plot, a set of horizontal and vertical lines is added to the background of the plot. It can be enabled using plt.grid(), and it is useful for a rough estimation of the value at a particular coordinate, just by observing the plot.
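A minimal sketch of turning the grid on (the data values are made up):

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.grid()  # add horizontal and vertical reference lines behind the plot
```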
Legends
Legends are labelled representations of the different plots present in a figure. That is, when there are multiple plots in a single image (e.g., the Iris dataset), legends help us identify the correct name of each of the differently coloured plots.
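A small sketch of attaching a legend (the labels are arbitrary):

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 2, 3], label='line A')
plt.plot([1, 2, 3], [3, 2, 1], label='line B')
plt.legend()  # the legend box maps each line's colour to its label
```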
Subplot
When we want two or more plots in a single image, we can make use of subplots in Matplotlib: plt.subplot(xyz), where xyz is a three-digit integer with x = number of rows, y = number of columns, and z = index number of that plot.
This is one of the most useful features when we need to compare two or more plots side by side rather than having them in separate images.
plt.figure(1, figsize=(30,8))
plt.subplot(131)
# Code for fig1
plt.subplot(132)
# Code for fig2
plt.subplot(133)
# Code for fig3
plt.show()
In addition to subplot, try using GridSpec, which can help us split the plots into subplots more effectively and easily.
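A small sketch of GridSpec (the 2 x 2 layout here is just an illustration):

```python
import matplotlib.pyplot as plt
from matplotlib import gridspec

fig = plt.figure(figsize=(8, 4))
gs = gridspec.GridSpec(2, 2)     # a 2 x 2 grid of cells
ax1 = fig.add_subplot(gs[0, :])  # top row spans both columns
ax2 = fig.add_subplot(gs[1, 0])  # bottom-left cell
ax3 = fig.add_subplot(gs[1, 1])  # bottom-right cell
```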
Title and axis labels
We can set a title for a plot using plt.title().
plt.xlabel() is the command to set the label for the x-axis.
plt.ylabel() is the command to set the label for the y-axis.
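Putting the three together on made-up data:

```python
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.title('A sample plot')  # title shown above the axes
plt.xlabel('x values')      # label below the x-axis
plt.ylabel('y values')      # label beside the y-axis
```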
Different types of analysis.
There are different kinds of analysis, as mentioned below.
Univariate: in univariate analysis, we use a single feature to analyze almost all of its properties.
Bivariate: when we compare the data between exactly two features, it is called bivariate analysis.
Multivariate: comparing more than two variables is termed multivariate analysis.
Different types of plots:
Dataset used: the Iris dataset.
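The snippets below assume an iris DataFrame with columns like sepal_length, sepal_width, petal_length, petal_width, and species. The blog does not show the loading step; one way to obtain such a DataFrame (an assumption on my part) is seaborn's bundled copy of the dataset:

```python
import seaborn as sns

# load seaborn's bundled copy of the Iris dataset as a pandas DataFrame
iris = sns.load_dataset('iris')
```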
Scatter plot
As far as machine learning and data science are concerned, one of the most commonly used plots for simple data visualization is the scatter plot.
This plot gives us a representation of where each point in the entire dataset lies with respect to any 2 (or 3) features (columns). Scatter plots are available in 2D as well as 3D. The 2D scatter plot is the common one, where we primarily look for patterns, clusters, and the separability of the data. The code snippet for using a scatter plot is shown below.
plt.scatter(iris['sepal_length'], iris['sepal_width'])
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Scatter plot on Iris dataset')
scatter plot using seaborn
# using seaborn ('height' was called 'size' in older seaborn versions)
sns.set_style("whitegrid")
sns.FacetGrid(iris, hue="species", height=6) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()
plt.show()
Pair plot
We can use Matplotlib for 2D scatter plots, and even for 3D we can use Plotly. But what do we do when we have 4 or more dimensions? This is when the pair plot from the seaborn package comes into play.
Let's say we have n features in our data: a pair plot creates an (n x n) grid of plots, where the diagonal plots are histograms of the feature corresponding to that row, and the rest of the plots combine the feature of each row on the y-axis with the feature of each column on the x-axis.
The code snippet for a pair plot on the Iris dataset is provided below.
sns.set_style("whitegrid")
sns.pairplot(iris, hue="species", height=2)  # 'height' was called 'size' in older seaborn versions
plt.show()
By getting a high-level overview of the plots from a pair plot, we can see which two features best explain or separate the data, and then use a scatter plot between those two features to explore further. From the above plot, we can conclude that petal length and petal width are the two features which separate the data very well.
Since we get n x n plots for n features, a pair plot may become unwieldy when we have a larger number of features, say 10 or more. In such cases, the best bet is to use a dimensionality reduction technique to map the data onto a 2D plane and visualize it using a 2D scatter plot.
Box plot
This is the sort of plot that can be used to obtain more statistical detail about the data. The straight lines at the maximum and minimum are called whiskers, and points outside the whiskers are inferred as outliers. The box plot gives us a representation of the 25th, 50th, and 75th percentiles, so we can also read off the interquartile range (IQR), which contains the middle 50% of the data. It also gives us a clear overview of the outlier points in the data.
Code for box plot:
plt.figure(figsize=(20,5))
sns.boxplot(x='sepal_length', y='sepal_width', data=iris, color='orange')
Violin plot
A violin plot can be read as a combination of a box plot at the centre and distribution plots (kernel density estimation) mirrored on both sides. This gives us details of the distribution, such as whether it is multimodal, its skewness, etc., along with useful information such as the 95% confidence interval.
Code for violin plot:
plt.figure(figsize=(20,5))
sns.violinplot(x='sepal_length', y='sepal_width', data=iris, color='orange')
Distribution plot
This is one of the simplest univariate plots for understanding the distribution of data. When analyzing the effect on the dependent variable (output) with respect to a single feature (input), we use distribution plots a lot. They are readily available in the seaborn package. This plot gives us a combination of a PDF and a histogram in a single figure.
The sharp block-like structures are the histogram, and the smoothed curve is called the probability density function (PDF). The PDF of a curve can help us identify the underlying distribution of that feature, which is one major takeaway from data visualization/EDA.
Code for distribution plot:
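The snippet itself is missing above; here is a minimal sketch (assuming the iris DataFrame and aliases from earlier) using seaborn's histplot with kde=True, which draws the histogram with the PDF curve overlaid. Older seaborn versions used sns.distplot for the same figure.

```python
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset('iris')  # assumed Iris DataFrame, as in the rest of the blog
# histogram of sepal_length with the KDE (PDF) curve overlaid
ax = sns.histplot(iris['sepal_length'], kde=True)
plt.show()
```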
Joint plot
The great thing about joint plots is that in a single figure we can do both univariate and bivariate analysis. The main plot gives us a bivariate analysis, while on the top and right sides we get univariate plots of each of the two variables considered. There is a range of options you can choose from, tuned using the kind parameter of seaborn's jointplot function. The one shown below is of kind KDE (kernel density estimation) and is represented as a contour diagram: all points on the same contour have the same density, and the colour at a point depends on the number of data points there, i.e., the colour is lighter where only a few points have that value and darker where there are more points. This is why it is darker at the middle and paler at the edges for this dataset.
sns.jointplot(x='sepal_length', y='sepal_width', data=iris, kind="kde", color="red")
The two most important plots we use are the scatter plot for bivariate analysis and the distribution plot for univariate analysis, and since we get both in a single joint plot as shown below, it makes our work much easier.
sns.jointplot(x='sepal_length', y='sepal_width', data=iris, color="red")
Bar chart
This is one of the most widely used plots, seen not just in data analysis but wherever there is analysis, across many fields. Though it may seem simple, it is powerful for analyzing data such as weekly sales figures, revenue from a product, the number of visitors to a site on each day of a week, etc.
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
Students = ['John', 'Sara', 'Carlo', 'Andy', 'Victor']
Maths_marks = [23, 17, 35, 29, 12]
ax.bar(Students, Maths_marks)
plt.show()
Bar chart with a comparison:
import numpy as np
import matplotlib.pyplot as plt
data = [[30, 25, 50, 20],
        [40, 23, 51, 17],
        [35, 22, 45, 19]]
X = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
# plot one row of data per series, shifting each series by the bar width
ax.bar(X + 0.00, data[0], color='b', width=0.25)
ax.bar(X + 0.25, data[1], color='g', width=0.25)
ax.bar(X + 0.50, data[2], color='r', width=0.25)
ax.legend(labels=['CS', 'IT', 'ECE'])
plt.show()
Line plot
This is the plot that you can see in every nook and corner of any kind of analysis between two variables. A line plot is nothing but a series of data points connected with straight lines. The plot may seem very simple, but it is used in a large number of applications, not only in machine learning but in many other areas.
Code for a line plot:
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(1, 11)
y = np.random.random(10)
plt.plot(x, y)
plt.show()
Different types of line plots:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12,4))
x = np.arange(1, 11)
y1 = np.random.random(10)
y2 = np.random.random(10)
y3 = np.random.random(10)
ax1.plot(x, y1)
ax1.set_title('Plain Line plot')
ax2.plot(x, y2, marker='o')
ax2.set_title('Line plot with markers')
ax3.plot(x, y3, marker='o', linestyle=':')
ax3.set_title('Line plot with markers & linestyle')
plt.show()

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12,4))
x = np.arange(1, 11)
y1 = np.random.random(10)
y2 = np.random.random(10)
y3 = np.random.random(10)
ax1.plot(x, y1, marker='*')
ax1.plot(x, y2, marker='*')
ax1.set_title('Double Line plot')
ax2.plot(x, y1, marker='*', linestyle=':')
ax2.plot(x, y2, marker='*', linestyle=':')
ax2.plot(x, y3, marker='*', linestyle=':')
ax2.set_title('2 or more Lines')
ax3.plot(x, y1, marker='*', label='line1')
ax3.plot(x, y2, marker='*', label='line2')
ax3.plot(x, y3, marker='*', linestyle=':', label='line3')
ax3.set_title('Multiple lines with legends')
plt.legend()
plt.show()
Histogram
A histogram represents the frequency of occurrence of values that fall within specific ranges, arranged in consecutive, fixed intervals.
Code for the histogram:
# example of a histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = randn(1000)
# create histogram plot
pyplot.hist(x)
# show the plot
pyplot.show()
Heatmap
The heatmap is a good visualization technique used to compare any two variables/features with respect to their values. The heatmap from the seaborn library creates a grid-like plot along with an optional colour bar. We provide a 2D input matrix with a value in each element, and the heatmap reproduces an output plot of the same shape as the input matrix, with every tile coloured based on the value of its corresponding matrix element.
Code for the heatmap:
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
data = np.random.rand(10, 12)  # random demo matrix (renamed from 'iris' to avoid shadowing the dataset)
plt.figure(figsize=(10,5))
ax = sns.heatmap(data)
I think this much is enough for today. :)
Here I have covered the essential topics for beginners in DATA VISUALIZATION.
There are many more topics, and they will be here soon. Follow me on Twitter via the link in the footer to stay updated. I am looking forward to your views in the comments section. And thanks for giving your time and reading this.
If you liked it, show your love by sharing and giving a heart to this article.