Data Visualization With Seaborn



Seaborn is one of the most popular and commonly used libraries for data visualisation.

So, let's kick-off with a quick question.

Do you think it is possible to visualize the relationship between all the variables in the entire data set with just one line of code?

Actually, YES! With the single line of code used below, we can visualize the relationships between all the different variables in the dataset.



So, how did it do that?

Well, Seaborn is a high-level Python data visualization library used for making attractive and informative statistical plots. It acts as a wrapper over matplotlib, and it's used in conjunction with NumPy and pandas data structures.

Matplotlib is another data visualization library used when greater control over the graphs are required. However, Seaborn should not be thought as a replacement for matplotlib, rather as a compliment. Even though matplotlib is more powerful, it has a steeper learning curve compared to Seaborn. When using seaborn, it is likely that you will often invoke matplotlib functions directly to draw simpler plots. Another way of putting it is "matplotlib tries to make easy things easier and hard things possible, seaborn attempts to make a well-defined set of hard things easy".

Importing Libraries

The data analysis libraries, i.e. NumPy and Pandas are an integral part of Data Visualisation and are often required for data preprocessing. However, I will not be going into the preprocessing of data in this blog.

import numpy as np
import pandas as pd
import seaborn as sns

Now, we use the command below to view all Seaborn plots within Jupyter Notebook.

%matplotlib inline

Overview of the Dataset

Now that we have imported the required libraries, I will be using the famous Iris dataset for this blog. If you are interested, all the other in-built Seaborn datasets are listed here.

iris = sns.load_dataset('iris')

So, here is a quick view of the Iris dataset!

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa

Image:- Different species of Iris Plant

Given below is a brief description of all the variables:

  1. sepal_length-> Sepal length in cm

  2. sepal_width -> Sepal width in cm

  3. petal_length -> Petal length in cm

  4. petal_width -> Petal width in cm

  5. species -> Class of the Iris Plant

    * Iris Setosa

    * Iris Versicolour

    * Iris Virginica

To know more about the Iris Dataset, you can refer UC Irvine Machine Learning Repository.


Distribution Plot

Dist Plot

Dist Plot, also known as Histogram is used to plot a univariate distribution of observation. It also displays a KDE(Kernal Density Estimation) plot over the histogram.

In the plot below, we are plotting a histogram of the petal_width.

sns.distplot(a=iris['petal_width'], bins=40, color='m')


Categorical Plots

Count Plot

Count Plot displays the number of observations for a categorical variable using bars.

In the plot below, we are counting the number of observations of each species of Iris Plant, and we can see that each species has 50 samples in the dataset.

sns.countplot(x='species',data=iris, palette="OrRd")


Box Plot

A box plot shows the distribution of quantitative data across a categorical variable.

In the plot below, the quantitative data that we have used is the sepal_width, and we are comparing it to the categorical variable species.

sns.boxplot(x='species',y='sepal_width',data=iris ,palette='YlGnBu')


Violin Plot

A violin plot is a variation of box plot. It shows the distribution of quantitative data across several levels of categorical variables. The violin plot also features a kde of the underlying distribution.

In the plot below, the petal_length is compared against species.

sns.violinplot(x='species', y='petal_length', data=iris, palette='YlGnBu')


Matrix Plots

Heat Maps

A Heat Map is a graphical representation of the correlation between all the numerical variables in a dataset. The input provided is a correlation matrix generated using pandas.

sns.heatmap(iris.corr(),cmap="YlGnBu", linecolor='white', linewidths=1)


Cluster Map

Cluster Map is a plot of matrix dataset as a hierarchically clustered heatmap.

Note Cluster map can only take in quantitative variables.

g = sns.clustermap(iris, cmap="magma")


Regression Plots

Reg Plot

Regression Plot is used to map all the given data and plot a linear regression model fit for it.

In the example below, we have plotted the petal_width against the petal_length.

sns.regplot(x='petal_width', y='petal_length', data=iris)


Axis Grids

Axis Grids are graphs that combine various plots using FacetGrid to realize complex visualizations.

LM Plots

LM plots are intended as a convenient interface to fit regression models (regression line) across conditional subsets of a dataset. It is a variation of regplot which uses FacetGrid to plot multiple subsets of data and plot the regression line for each of them.

In the example below, we have plotted sepalwidth vs. sepallength. The data set is divided into three subsets based on their species, and regression lines are mapped for each of them.

sns.lmplot(x = 'sepal_width', y = 'sepal_length', data = iris, col = 'species', hue = 'species', palette = 'YlGnBu')


Joint Plot

A joint plot is used to draw a plot of two variables with bivariate and univariate graphs.

In the example below, the kdeplots (univariate graphs) at the top and right are kde's of sepal_width and petal_length respectively, and the central graph (bivariate graph) plots the relationship between the two variables.

sns.jointplot(x="sepal_width", y="petal_length", data=iris, kind="kde", color="m")


Pair Plot

Pair Plot is used to view the pairwise relationship between all the variables in a dataset and the diagonal axes show the univariate distribution of the variable.

The example takes the entire dataset as input and distinguishes data on species with varying colors.

sns.pairplot(iris, hue='species', palette="OrRd")


What Next?

Now that you are comfortable with the basics of Seaborn you can learn more about data visualization at