Exploratory Data Analysis using SQL

allan-pg - Aug 13 - - Dev Community

INTRODUCTION

SQL stands for Structured Query Language. Structured query language (SQL) is a programming language for storing and processing information in a relational database. A data analyst uses SQL to manipulate and gain insights from the data. In this project, we will analyze data and for this purpose we will use sakila sample database.

For this project, I will be using MySQL Workbench. You can download it from here. MySQL is an open-source relational database management system. You can use any other RDBMS software. Once your software is set up we shall import Data into our database.

Lets Dive in

IMPORTING DATA

After downloading sakila database extract the files from the zip file. Create database on workbench and name it Sakila.
create database in MYSQL

Open MYSQL workbench and on file click on open SQL script
Files on workbench

Open folder you extracted the data to and click on schema then open and run script
import database schema

Repeat the same process to import data and run the SQL script
Import Data to my sql

Data Analysis

  • Find all films with PG-13 films with rental rate of 2.99 or lower
SELECT *
FROM (
    SELECT *
        FROM film 
                ) x
WHERE x.rating = 'PG-13' AND x.rental_rate <= 2.99;
Enter fullscreen mode Exit fullscreen mode

In my filter condition I used and since both of the conditions must be satisfied. The sub-query helped optimize our SQL query run time to save on time and memory.

  • All films that have deleted scenes
SELECT *
FROM film f 
WHERE f.special_features like '%Deleted Scenes%';
Enter fullscreen mode Exit fullscreen mode

special_features field contains description about the film and that is where it specifies whether a film has deleted scene. LIKE operator is used to retrieve the data in a column of a table, based on a specified pattern.

  • All active customers
SELECT *
FROM customer c 
WHERE c.active = 1;
Enter fullscreen mode Exit fullscreen mode

Active field in customers table has a boolean data type of 1 or 0 where 1 is True and 0 is a False. We filtered customers where active is 1 or True.

  • Distinct names of customers who rented a movie on 26th July 2005
SELECT DISTINCT(concat(c.first_name, ' ', c.last_name)) as full_name
FROM customer c
inner join rental r USING(customer_id)
WHERE r.rental_date like '2005-07-26%';
Enter fullscreen mode Exit fullscreen mode

I decided to concatenate first and last name just incase two or more customers have the same first name but different last name are not left out

  • How many rentals we do on each day?
select  date(rental_date),
    count(rental_id) as num_of_rentals
from rental
group by date(rental_date)
order by num_of_rentals desc;
Enter fullscreen mode Exit fullscreen mode

rental_date column is a datetime data type therefore I extracted date from rental date so that we had to group by date only.

  • What are the three top earning days so far?
select  date(payment_date) as date,
    count(rental_id) as num_of_rentals,
    sum(amount) as money
FROM  payment
GROUP BY date(payment_date)
ORDER BY sum(amount) desc
limit 3;
Enter fullscreen mode Exit fullscreen mode

After grouping our data we need to sort the data in a descending manner by use of order by and to choose the top 3 best days so far we use limit clause.

With this analysis of the data we are now aware of which type of movies to keep in stock, active customers to keep and can also further create a predictive model to predict the sales of film but for that, we need some more data.

Thanks for reading! For full project check it out on Github

If you want to get in touch with me, feel free to reach me at carsonallan71@gmail.com or my LinkedIn Profile.

. . . . . . . . . . .
Terabox Video Player