BUILDING DATA VISUALIZATION WITH PYTHON: A BEGINNER'S GUIDE TECHNIQUES

Introduction:
In data science, uncovering patterns, trends, and insights from complex data is essential for decision-making. However, raw data alone isn’t always sufficient to communicate these insights effectively, and that's where data visualization comes in.

Visualizing data allows us to represent data graphically, making it easier to understand patterns, identify trends, and convey insights to others.

Effective visualizations are crucial for communicating findings clearly. They enable stakeholders to grasp complex concepts quickly and make informed, data-driven decisions.

Python is one of the best languages for data visualization due to its versatile and powerful libraries, including Matplotlib and Seaborn. This tutorial will introduce you to these two essential libraries and walk you through how to use them to create a variety of visualizations.

Data visualization is therefore the process of transforming raw data into graphical representations, such as charts, graphs, and maps, to make it easier to understand and interpret.

It helps in simplifying complex datasets, allowing users to see patterns, trends, correlations, and outliers more clearly than through raw numbers or text alone. By using visuals, data can be communicated more effectively, enabling quicker insights and better decision-making.

Common forms of data visualization include:
• Bar charts
• Line graphs
• Scatter plots
• Heatmaps
• Pie charts
• Maps

Although visualizations offer diverse ways to present data, creating effective designs can be challenging. Python provides a range of libraries to create custom charts and plots, while tools like Datylon allow you to design visually compelling reports.
Adhering to key principles of visualization is essential for creating impactful data visualizations.

The key principles of visualization include:
• Balance
• Unity
• Contrast
• Emphasis
• Repetition
• Pattern
• Rhythm
• Movement
• Proportion
• Harmony
• Variety

https://www.youtube.com/watch?v=a9UrKTVEeZA

Main Content:

Introduction to Matplotlib and Seaborn:

❖ Matplotlib: Matplotlib is a low-level, flexible library in Python that provides a wide range of functionality for creating static, animated, and interactive plots. It is the foundation of many other plotting libraries in Python, offering the flexibility to customize almost any aspect of a plot. This makes it a go-to tool for scientific and academic data visualization.

https://matplotlib.org/

The source code for Matplotlib is available in this GitHub repository: https://github.com/matplotlib/matplotlib

Advantages of Matplotlib

❖ Highly customizable with complete control over plot aesthetics.

❖ Ability to create a wide range of plots, from basic bar charts to intricate multi-faceted graphs.

❖ Often used in conjunction with libraries like Pandas to plot data from DataFrames.

Common Use Cases:

❖ Generating simple visualizations like bar charts, histograms, and line plots.

❖ Custom visualizations with unique axes, colors, and layouts for scientific reporting.

Introduction to Seaborn

❖ Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. It simplifies the process of creating attractive and informative statistical graphics. Seaborn offers high-level interfaces for drawing beautiful and informative plots, allowing you to visualize complex datasets with just a few lines of code.

It also integrates seamlessly with Pandas DataFrames, making it an excellent tool for statistical data analysis.

Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

Its dataset-oriented, declarative API lets you focus on what
the different elements of your plots mean, rather than on the details of how to draw them.

Advantages:

❖ Simpler syntax for creating complex plots like heatmaps, pair plots, and violin plots.

❖ Better default settings for aesthetics and color palettes.

❖ Ideal for statistical visualizations, where patterns and relationships in data need to be highlighted.

Common Use Cases:

❖ Creating statistical visualizations such as correlation heatmaps, regression plots, and distribution plots.

❖ Quickly exploring and visualizing relationships between variables in data.
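As a small taste of the first use case, here is a minimal sketch of a correlation heatmap. The data is synthetic and the column names x, y, and z are made up for illustration; they are not part of any dataset used in this guide.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration): y depends on x, z is independent noise.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),
    "z": rng.normal(size=100),
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# Draw the correlation matrix as a heatmap with the value written in each cell.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap (synthetic data)")
```

Fixing vmin and vmax to -1 and 1 keeps the color scale symmetric, so strong positive and strong negative correlations stand out equally.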

Setting Up the Environment: To start building visualizations in Python, you first need to set up your environment. If you’re working on a local machine, you can install the required libraries using pip.

However, for this guide, we’ll use
Google Colab, a cloud-based notebook platform that allows you to write and execute Python code online
without any installation.

Here’s how to get started:

Go to Google Colab.
Create a new notebook by clicking on "File" > "New Notebook".
In the first cell, install the necessary libraries using the following command:

!pip install matplotlib seaborn

Import the libraries:

import matplotlib.pyplot as plt
import seaborn as sns

This setup allows you to run Python code directly in your browser, and it includes all the necessary tools for creating visualizations.
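A quick way to confirm that the setup worked is to print the installed versions. This is only a sanity check; the version numbers you see will depend on your environment.

```python
import matplotlib
import seaborn as sns

# If both imports succeed, the libraries are installed and ready to use.
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
```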

Creating Basic Plots with Matplotlib:

Let's start with some fundamental plots using Matplotlib. Understanding the basics of plotting will
give you the foundation to explore more advanced topics later on.

• Bar Chart: A bar chart is a great way to compare quantities across different categories.

import matplotlib.pyplot as plt
categories = ['Apples', 'Bananas', 'Cherries', 'Dates']
values = [50, 25, 75, 100]
plt.bar(categories, values, color='skyblue')
plt.title('Fruit Sales')
plt.xlabel('Fruit Type')
plt.ylabel('Quantity Sold')
plt.show()

Explanation:
❖ The plt.bar() function takes the categories (x-axis) and the corresponding values (y-axis) as its first two arguments, and the color argument sets the bar color.

❖ We use plt.title(), plt.xlabel(), and plt.ylabel() to add informative labels and titles.
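The same chart can also be built with Matplotlib's object-oriented interface, which makes it easy to add value labels above each bar. The sketch below reuses the fruit data from above and assumes Matplotlib 3.4 or newer, since that is when ax.bar_label() was introduced.

```python
import matplotlib.pyplot as plt

categories = ['Apples', 'Bananas', 'Cherries', 'Dates']
values = [50, 25, 75, 100]

# plt.subplots() returns a Figure and an Axes; all drawing happens on the Axes.
fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(categories, values, color='skyblue')

# Write each bar's value above it (requires Matplotlib >= 3.4).
ax.bar_label(bars)

ax.set_title('Fruit Sales')
ax.set_xlabel('Fruit Type')
ax.set_ylabel('Quantity Sold')
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```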

• Line Plot: Line plots are ideal for visualizing changes over time or continuous data.

months = ['January', 'February', 'March', 'April', 'May']
sales = [200, 150, 250, 300, 270]
plt.plot(months, sales, marker='o', color='green')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales ($)')
plt.show()

Explanation:

➢ The plt.plot() function creates a line plot with data points connected by a line.

➢ We use the marker='o' option to highlight each data point with circles.
• Scatter Plot: Scatter plots are excellent for visualizing the relationship between two variables.

height = [150, 160, 165, 170, 175, 180]
weight = [50, 55, 60, 68, 72, 78]
plt.scatter(height, weight, color='purple')
plt.title('Height vs Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.show()

Explanation:
o plt.scatter() is used to create a scatter plot, which is particularly useful for visualizing the correlation between variables (e.g., height and weight).

o The scatter plot does not connect points with a line, as the focus is on the distribution of data points.
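To put a number on the relationship the scatter plot suggests, you can compute the Pearson correlation coefficient with NumPy. This is a small sketch using the same height and weight lists; a value close to 1 indicates a strong positive linear relationship.

```python
import numpy as np

height = [150, 160, 165, 170, 175, 180]
weight = [50, 55, 60, 68, 72, 78]

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(height, weight)[0, 1]
print(f"Pearson r between height and weight: {r:.2f}")
```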

Advanced Visualizations with Seaborn: Moving on to Seaborn, we can generate more sophisticated and aesthetically pleasing visualizations. Let’s explore what Seaborn can do!

▪ Bivariate Plots
Bivariate plots involve the visualization and analysis of the relationship between two variables simultaneously. They are used to explore how two variables are related or correlated. Common ones in Seaborn are sns.scatterplot(x=..., y=..., data=...) and sns.lineplot(x=..., y=..., data=...) for scatter and line plots.
We will look at some less common plots here.

▪ Regression Plot

A Regression Plot focuses on the relationship between two numerical variables: the independent variable
(often on the x-axis) and the dependent variable (on the y-axis).

Individual data points are displayed as dots, and the central element of a Regression Plot is the regression line or curve, which represents the best-fitting mathematical model that describes the relationship between the variables.

Use sns.regplot(x=..., y=..., data=...) to create a regression plot.

# Regression Plot

# Load Seaborn's built-in "tips" dataset, used in the examples that follow
tips = sns.load_dataset("tips")

plt.figure(figsize=(8, 5))
sns.regplot(x="total_bill", y="tip", data=tips, scatter_kws={"color": "blue"}, line_kws={"color": "red"})
plt.title("Regression Plot of Total Bill vs. Tip")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.show()
https://levelup.gitconnected.com/advanced-seaborn-demystifying-the-complex-plots-537582977c8c
Regression of Bill (independent variable) and tip (dependent variable).

• The regression line represents the best-fitting linear model for predicting tips based on total bill amounts.
• The scatter points show individual data points, and you can observe how they cluster around the regression line.
• This plot is useful for understanding the linear relationship between these two variables.
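To make "best-fitting linear model" concrete, the sketch below fits a straight line with NumPy's polyfit. Note that sns.regplot returns the Matplotlib Axes rather than the fitted slope and intercept, so a separate fit like this is one way to get the numbers. The bill and tip values here are invented for illustration and are not taken from the tips dataset.

```python
import numpy as np

# Hypothetical bill and tip amounts, invented for illustration only.
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([2.0, 3.0, 5.0, 6.0, 8.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns the coefficients from highest degree to lowest: slope, then intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Use the fitted line to predict the tip for a new bill amount.
new_bill = 25.0
predicted_tip = slope * new_bill + intercept

print(f"tip = {slope:.3f} * total_bill + {intercept:.3f}")
print(f"Predicted tip for a ${new_bill:.2f} bill: ${predicted_tip:.2f}")
```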

▪ Joint Plot
A joint plot combines scatter plots, histograms, and density plots to visualize the relationship between
two numerical variables.

The central element of a Joint Plot is a Scatter Plot that displays the data points of the two variables against each other. Along the x-axis and y-axis of the Scatter Plot, there are histograms or Kernel Density Estimation (KDE) plots for each individual variable. These marginal plots show the distribution of each variable separately.

Use sns.jointplot(x=..., y=..., data=dataframe, kind=...), where kind can be one of 'scatter', 'hist', 'hex', 'kde', 'reg', or 'resid'.

# Joint Plot

sns.jointplot(x="total_bill", y="tip", data=tips, kind="scatter")
plt.show()
The joint plot of the total bill and tip.
As we can see, this shows the relation between the two variables through a scatter plot, while the
marginal histograms show the distribution of each variable separately.

▪ Multivariate Plots
These plots give us a lot of flexibility to explore the relationships and patterns among three or more variables simultaneously. That is, multivariate plots extend the analysis to more than two variables, which is often needed in data analysis.

Using Parameters to add dimensions

• Using the hue parameter: The hue parameter adds color to the plot based on the provided categorical variable, assigning a unique color to each category. This parameter can be used in almost all of the plots, such as .scatterplot(), .boxplot(), .violinplot(), .lineplot(), etc. Let’s see a few examples.
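As a first small sketch, here is hue applied to a scatter plot. The tiny DataFrame below is invented for illustration; its column names mirror the tips dataset, but the values are not taken from it.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "total_bill": [12.5, 20.0, 31.2, 18.4, 25.1, 40.3],
    "tip": [2.0, 3.5, 5.0, 2.5, 4.0, 6.5],
    "time": ["Lunch", "Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
})

fig, ax = plt.subplots(figsize=(6, 4))
# hue="time" gives each category its own color and adds a legend automatically.
sns.scatterplot(data=df, x="total_bill", y="tip", hue="time", ax=ax)
ax.set_title("Tip vs. Total Bill, colored by time (hypothetical data)")
```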

# Violin Plot with Hue

plt.figure(figsize=(10, 6))
sns.violinplot(
    x="day",          # x-axis: Days of the week (categorical)
    y="total_bill",   # y-axis: Total bill amount (numerical)
    data=tips,
    hue="sex",        # Color by gender (categorical)
    palette="Set1",   # Color palette
    split=True        # Split violins by hue categories
)
plt.show()

Example 1: Facet Grid
A Facet Grid is a feature in Seaborn that allows you to create a grid of subplots, each representing a different subset of your data. In this way, Facet Grids are used to compare patterns or relationships with multiple variables within different categories.

Use sns.FacetGrid(data, col=..., row=...) to create a facet grid, which returns the grid object. After creating the grid object, you need to map a plotting function of your choice onto it.

# Create a Facet Grid of histograms for different days

g = sns.FacetGrid(tips, col="day", height=4, aspect=1.2)
g.map(sns.histplot, "total_bill", kde=True)
g.set_axis_labels("Total Bill ($)", "Frequency")
g.set_titles(col_template="{col_name} Day")
plt.savefig('facet_grid_hist_plot.png')
plt.show()
Facet Grid of Total Bills Frequency distribution within each day.

• Similarly, you can map any other plot to create different types of FacetGrids.

Example 2: Pair Plot
A Pair Plot provides a grid of scatterplots and histograms, where each plot shows the relationship between two variables, which is why it is also called a Pairwise Plot or Scatterplot Matrix.

The diagonal cells typically display histograms or kernel density plots for individual variables, showing
their distributions. The off-diagonal cells in the grid often display scatterplots, showing how two variables are related.

Pair Plots are particularly useful for understanding patterns, correlations, and
distributions across multiple dimensions in your data.

Use sns.pairplot(data) to create a pairplot. You can customize the appearance of Pair Plots, such as changing the type of plots (scatter, KDE, etc.), colors, markers, and more. If you want to change the diagonal plots, you can do so by using the diag_kind parameter.

# Load the "iris" dataset

iris = sns.load_dataset("iris")

# Pair Plot

sns.set(style="ticks")
sns.pairplot(iris, hue="species", markers=["o", "s", "D"])
plt.show()
Pair plot of iris datasets’ numeric variables, and color dimension based on species category column.

Example 3: Pair Grid
By using a pair grid you can customize the lower, upper, and diagonal plots individually.

# Load the "iris" dataset

iris = sns.load_dataset("iris")

# Create a Pair Grid with different plot types in each region

g = sns.PairGrid(iris, hue="species")
g.map_upper(sns.scatterplot)   # Upper triangle: scatter plots
g.map_diag(sns.histplot)       # Diagonal: histograms
g.map_lower(sns.kdeplot)       # Lower triangle: KDE contours
g.add_legend()
plt.show()
Using Pair Grid to customize the lower, upper, and diagonal plots.

Customizing Visualizations: Customizations allow you to tailor visualizations to better suit your needs, making them more informative and visually appealing.

• Adding Titles and Labels: A good title and proper labels for the axes enhance the readability and usefulness of a plot.

plt.title('Custom Plot Title')
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')

• Changing Color Schemes: Both Matplotlib and Seaborn offer various color palettes, allowing you to convey information more clearly through colors.

sns.set_palette("muted") # Setting a soft color palette

• Adding Legends: Legends help differentiate between multiple data series in a plot.

plt.legend(['Dataset 1', 'Dataset 2'])

Explanation: Legends are necessary when visualizing multiple datasets on the same plot, making it easier to interpret the chart.
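Putting these customizations together, here is a minimal sketch that plots two series on the same axes. The second sales series is hypothetical and added only so that there is something for the legend to distinguish.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set the palette before creating the axes so the new color cycle is picked up.
sns.set_palette("muted")

months = ['January', 'February', 'March', 'April', 'May']
sales_a = [200, 150, 250, 300, 270]
sales_b = [220, 180, 240, 310, 330]  # hypothetical second series

fig, ax = plt.subplots(figsize=(7, 4))
# Passing label= lets ax.legend() pick up the names automatically.
ax.plot(months, sales_a, marker='o', label='Dataset 1')
ax.plot(months, sales_b, marker='s', label='Dataset 2')

ax.set_title('Custom Plot Title')
ax.set_xlabel('X-axis Label')
ax.set_ylabel('Y-axis Label')
ax.legend()
```

Attaching a label to each plotted series is usually less error-prone than passing a list to legend(), because the names stay correct even if the plotting order changes.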

Saving Visualizations: Once you’ve created and customized your visualization, you may want to save it for use in presentations or reports.

• Saving the Plot: You can save the plot as an image file (e.g., PNG, JPG, or PDF) using Matplotlib.

plt.savefig('my_plot.png')

Explanation: The plt.savefig() function saves the current figure to your desired file path. You can specify the format by changing the file extension (e.g., .png, .jpg, or .pdf).
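A slightly fuller sketch of saving is shown below. The dpi and bbox_inches arguments are optional extras not covered above, and the file is written to the system's temporary directory here only to keep the example self-contained; in practice you would pick your own path.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3], [2, 4, 8], marker='o')
ax.set_title('Example Plot')

# Build an output path in the temp directory (illustrative choice of location).
out_path = os.path.join(tempfile.gettempdir(), 'my_plot.png')

# dpi controls the resolution; bbox_inches='tight' trims extra whitespace around the plot.
fig.savefig(out_path, dpi=300, bbox_inches='tight')
print("Saved to", out_path)
```

If you also call plt.show(), saving first is the safer order, since in some environments the figure is cleared once it has been shown.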

Conclusion:
Data visualization is a key skill for anyone in data science or data analytics. Python’s Matplotlib and Seaborn libraries offer powerful tools for turning raw data into informative and insightful visualizations. Whether you're creating simple line plots or complex heatmaps, mastering these libraries will enable you to communicate your data more effectively and make better-informed decisions.

References:
Seaborn Official Website

https://seaborn.pydata.org/tutorial.html

https://levelup.gitconnected.com/advanced-seaborn-demystifying-the-complex-plots-537582977c8c

Written by: Aniekpeno Thompson
A passionate Data Science enthusiast. Let's explore the future of data science together!

https://www.linkedin.com/in/aniekpeno-thompson-80370a262