Performing Basic Data Analysis with Python and SQL
In today's data-driven world, being able to extract insights from data is a valuable skill. In this article, we'll explore how to perform basic data analysis using Python and SQL. We'll connect to a SQLite database, execute SQL queries, and visualize the results using Python libraries. π
Database Schema:
The database contains information about sales transactions made by a company. Here's what each field represents:
- TransactionID (INTEGER, PRIMARY KEY):
- This field uniquely identifies each transaction. It's an integer value and serves as the primary key for the table.
- ProductID (INTEGER):
- This field represents the unique identifier for each product sold in the transaction. It's also an integer value.
- ProductName (TEXT):
- This field stores the name or description of the product sold. It's a text field allowing for variable-length strings.
- Quantity (INTEGER):
- This field indicates the quantity of the product sold in the transaction. It's an integer value.
- PricePerUnit (REAL):
- This field represents the price per unit of the product sold. It's a real number (floating-point value) to accommodate decimal prices.
- SalesDate (TEXT):
- This field stores the date of the sales transaction. It's represented as text and typically follows a specific date format.
Explanation:
The provided schema simulates a basic sales database, commonly used in retail or e-commerce settings. Each transaction (identified by TransactionID) involves the sale of one or more products. For each product sold, the database records the ProductID, ProductName, Quantity sold, PricePerUnit, and the date of the transaction (SalesDate).
Purpose:
This database schema and the accompanying data were created for the purpose of practicing database management and performing data analysis tasks,it is not real-world data.But it provides a simplified representation of sales transactions, making it suitable for learning SQL queries, data manipulation, and analysis.Note:
It's important to note that in a real-world scenario, databases and data structures may be more complex, and additional considerations such as data integrity, normalization, and indexing would be taken into account. However, for learning and practice purposes, this schema serves as a good starting point.
Setting Up the Environment
First, let's ensure we have the necessary tools installed. We'll need Python, SQLite, and the matplotlib library for data visualization. π»
pip install matplotlib
pip install sqlite3
Connecting to the Database
We'll start by creating a SQLite database and populating it with some sample data. We'll then connect to this database using Python's sqlite3
library. π
import sqlite3
import matplotlib.pyplot as plt
# Connect to the database (in-memory database for this example)
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
Creating the Database Schema and Inserting Data
We'll define a schema for our database containing information about sales transactions. The schema includes fields such as TransactionID
, ProductID
, ProductName
, Quantity
, PricePerUnit
, and SalesDate
. πΌ
# Creating a table
cursor.execute('''CREATE TABLE IF NOT EXISTS sales (
TransactionID INTEGER PRIMARY KEY,
ProductID INTEGER,
ProductName TEXT,
Quantity INTEGER,
PricePerUnit REAL,
SalesDate TEXT
)''')
# Inserting sample data
sales_data = [
(1, 101, 'Product A', 10, 20.0, '2024-01-15'),
(2, 102, 'Product B', 15, 25.0, '2024-01-20'),
(3, 101, 'Product A', 8, 20.0, '2024-02-05'),
(4, 103, 'Product C', 12, 30.0, '2024-02-10'),
(5, 102, 'Product B', 20, 25.0, '2024-03-02'),
(6, 104, 'Product D', 5, 35.0, '2024-03-10')
]
cursor.executemany('INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)', sales_data)
Performing Data Analysis Tasks
Now that our database is set up, let's perform some basic data analysis tasks using SQL queries. π
Task 1: Total Sales Revenue for Each Product
We'll calculate the total sales revenue for each product by multiplying the Quantity
with PricePerUnit
and summing it up. π°
query1 = '''SELECT ProductID, ProductName, SUM(Quantity * PricePerUnit) AS TotalRevenue
FROM sales
GROUP BY ProductID'''
cursor.execute(query1)
result = cursor.fetchall()
Visualizing the Results
We can visualize the results using Python libraries such as matplotlib
. π
# Plotting pie chart for Task 1
labels = [row[1] for row in result]
revenues = [row[2] for row in result]
plt.pie(revenues, labels=labels, autopct='%1.1f%%')
plt.title('βTask 1: Total sales revenue distribution')
plt.show()
Task 2: Top 3 Best-Selling Products
We'll find the top 3 best-selling products based on the total quantity sold. π
query2 = '''SELECT ProductID, ProductName, SUM(Quantity) AS TotalQuantity
FROM sales
GROUP BY ProductID
ORDER BY TotalQuantity DESC
LIMIT 3'''
cursor.execute(query2)
result = cursor.fetchall()
output
| ProductID | ProductName | TotalQuantity |
|-----------|-------------|---------------|
| 102 | Product B | 35 |
| 101 | Product A | 18 |
| 103 | Product C | 12 |
Task 3: Average Price Per Unit for All Products
We'll calculate the average price per unit for all products. π
query3 = '''SELECT AVG(PricePerUnit) AS AveragePricePerUnit
FROM sales'''
cursor.execute(query3)
result = cursor.fetchall()
output
| AveragePricePerUnit |
|----------------------|
| 26.666666666666668 |
Conclusion
In this article, we've demonstrated how to connect to a SQLite database using Python, execute SQL queries, and perform basic data analysis tasks. By leveraging Python libraries such as sqlite3
and matplotlib
, we can efficiently analyze data and gain valuable insights. π