Today we start the second deliverable of our data science training with pandas. In this article we summarize several pandas functions for indexing, selection and filtering.
Creating, Getting Info, Selecting and Util Functions
Before we start, let's look at the small example dataset that the examples in this article use. It is a well-known dataset that contains information about wines.
Indexing, Selecting and Filtering
In this chapter we explain some key functions that are useful for obtaining a high-level view and a statistical overview of our dataset.
We will start with the info() function, which offers insights into the number of columns, the name of every column, the number of non-null elements in each column and the data type of every column. This function therefore gives us a detailed overview of the data columns.
df.info()
Wines Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
-------- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
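As a minimal sketch of what info() reports, the snippet below builds a tiny DataFrame and captures the printed summary. The frame name sample and all of its values are made up for illustration and are not taken from the wines dataset.

```python
import io

import pandas as pd

# Small made-up sample; the values are illustrative only.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.0],   # one missing value on purpose
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 6],
})

# info() prints to stdout by default; passing buf lets us capture the text.
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# The same facts are also available programmatically.
n_rows, n_cols = sample.shape          # number of rows and columns
non_null_counts = sample.count()       # non-null values per column
```

The non-null count is the part of info() that hints at missing data: here the alcohol column has one value fewer than the number of rows.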
Dtypes
The dtypes attribute shows the data type associated with each column.
df.dtypes
Wines Dataset:
fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
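The following sketch shows how dtypes can be used in practice: inspecting the types, keeping only numeric columns and converting a column to another type. The frame sample, the label column and the values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Made-up sample with two numeric columns and one text column.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0],
    "quality": [5, 5, 6],
    "label": ["red", "red", "red"],
})

# dtypes is an attribute (no parentheses) that returns one type per column.
column_types = sample.dtypes
print(column_types)

# Keep only the numeric columns, based on their dtypes.
numeric_only = sample.select_dtypes(include="number")

# Convert a column to a different type; the original frame is left unchanged.
quality_as_float = sample["quality"].astype("float64")
```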
The describe() function gives a statistical overview of the distribution of each column. For numeric columns it reports count, mean, std, min, the 25%, 50% and 75% percentiles, and max. The mean and std summarize the average value and the spread around it, and the 50% row is the median: half of the values in the column lie at or below it. Note that describe() does not report the mode (the most frequent value) for numeric columns; it can be obtained separately with mode().
However, these metrics should be read alongside the percentage of NaNs (missing values) in each column, because describe() computes them over the non-missing values only. This will be analysed in upcoming articles.
df.describe()
Wines Dataset:
       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates      alcohol      quality
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000           1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922             46.467792     0.996747     3.311113     0.658149    10.422983     5.636023
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157             32.895324     0.001887     0.154386     0.169507     1.065668     0.807569
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000              6.000000     0.990070     2.740000     0.330000     8.400000     3.000000
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000             22.000000     0.995600     3.210000     0.550000     9.500000     5.000000
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000             38.000000     0.996750     3.310000     0.620000    10.200000     6.000000
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000             62.000000     0.997835     3.400000     0.730000    11.100000     6.000000
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000            289.000000     1.003690     4.010000     2.000000    14.900000     8.000000
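To make the relationship between these metrics concrete, here is a small sketch on made-up data (the frame sample and its values are assumptions, not wine measurements). It reads the median from the 50% row, computes the mode separately, since describe() on numeric columns does not include it, and measures the share of missing values per column.

```python
import pandas as pd

# Made-up sample with one missing alcohol value.
sample = pd.DataFrame({
    "alcohol": [9.0, 10.0, None, 11.0, 10.0],
    "quality": [5, 6, 5, 7, 5],
})

# Summary statistics; missing values are skipped, so count can differ per column.
summary = sample.describe()

# The 50% row of describe() is the median.
median_quality = sample["quality"].median()

# The mode is computed separately; mode() can return several values, take the first.
mode_quality = sample["quality"].mode().iloc[0]

# Fraction of missing values per column (multiply by 100 for a percentage).
missing_share = sample.isna().mean()
```

In this sample the count for alcohol is 4 rather than 5, which is a reminder that the mean and std of that column describe only the rows where a value is present.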
Indexing and Selection
Here we take a deep dive into the two main pandas indexers for indexing and selection: ‘iloc’ and ‘loc’.
.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:
– A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
– A list or array of labels [‘a’, ‘b’, ‘c’].
– A slice object with labels ‘a’:‘f’ (note that, contrary to usual Python slices, both the start and the stop are included when present in the index).
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:
– An integer e.g. 5.
– A list or array of integers [4, 3, 0].
– A slice object with ints 1:7.
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
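The difference between labels and positions is easiest to see when the index is not 0, 1, 2, and so on. The sketch below uses a hypothetical frame sample whose index labels are 10, 20 and 30 (made-up values) to show label lookup, position lookup, the two error types and a callable indexer.

```python
import pandas as pd

# Made-up frame whose index labels differ from the row positions.
sample = pd.DataFrame({"quality": [5, 6, 7]}, index=[10, 20, 30])

by_label = sample.loc[10, "quality"]   # label 10 is the first row
by_position = sample.iloc[0, 0]        # position 0 is the same row

# .loc looks for the label 0, which does not exist in this index.
try:
    sample.loc[0]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc with a single out-of-bounds position raises IndexError.
try:
    sample.iloc[5]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

# Slices are allowed to run past the end, as with Python lists.
out_of_bounds_slice = sample.iloc[1:10]

# A callable receives the frame and returns a valid indexer (here a boolean mask).
high_quality = sample.loc[lambda frame: frame["quality"] > 5]
```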
Let's dive into how these indexers work through some examples:
iloc examples
- Get First Row
df.iloc[0]
- Get first column
df.iloc[:, 0]
- Get the first column of the first row (returned as a one-element Series; df.iloc[0, 0] returns the value itself)
df.iloc[0:1, 0]
- Get the rows at positions 3 and 4 (the end position 5 is excluded)
df.iloc[3:5]
- Get rows 3, 7, 10
df.iloc[[3, 7, 10]]
- Get last five rows
df.iloc[-5:]
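The iloc calls above can be tried on a small frame. This is a sketch with made-up values (the name sample is an assumption); it highlights that position slices exclude the end and that a slice and a single integer return different kinds of objects.

```python
import pandas as pd

# Made-up frame with six rows and the default 0..5 index.
sample = pd.DataFrame({
    "quality": [3, 4, 5, 6, 7, 8],
    "alcohol": [9.0, 9.5, 10.0, 10.5, 11.0, 11.5],
})

first_row = sample.iloc[0]              # Series holding the first row
first_column = sample.iloc[:, 0]        # Series holding the first column

rows_3_and_4 = sample.iloc[3:5]         # positions 3 and 4; position 5 is excluded

one_cell_series = sample.iloc[0:1, 0]   # slice on rows: a Series of length 1
one_cell_value = sample.iloc[0, 0]      # two integers: the value itself

picked_rows = sample.iloc[[1, 3, 5]]    # arbitrary positions, in the given order
last_two = sample.iloc[-2:]             # negative positions count from the end
```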
loc examples
- Get the value of column ‘quality’ in the row labelled 0 (the first row here)
df.loc[0, 'quality']
- Get all rows from columns ‘quality’, ‘sulphates’, ‘alcohol’
df.loc[:, ['quality', 'sulphates', 'alcohol']]
- Get the rows from the one labelled ‘litres’ onward, for columns ‘quality’ through ‘alcohol’ (df1 here stands for a DataFrame whose index contains the label ‘litres’)
df1.loc['litres':, 'quality':'alcohol']
- Get the rows labelled 3 to 5, both endpoints included (unlike iloc, which excludes the end)
df.loc[3:5]
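The following sketch contrasts label slicing with position slicing on the same data. The frames sample and labelled, the letters used as row labels and all values are hypothetical.

```python
import pandas as pd

# Made-up frame with the default 0..5 index.
sample = pd.DataFrame({
    "sulphates": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "alcohol": [9.0, 9.5, 10.0, 10.5, 11.0, 11.5],
    "quality": [3, 4, 5, 6, 7, 8],
})

loc_rows = sample.loc[3:5]      # labels 3, 4 and 5: the end label is included
iloc_rows = sample.iloc[3:5]    # positions 3 and 4: the end position is excluded

# Same data with text labels on the rows (hypothetical labels).
labelled = sample.copy()
labelled.index = ["a", "b", "c", "d", "e", "f"]

# From row 'd' to the end, columns 'alcohol' through 'quality' (both included).
from_d = labelled.loc["d":, "alcohol":"quality"]
```

Column slices follow the order of the columns in the frame, so the start label has to appear before the stop label for the slice to return anything.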
Conditional Selection
One of the most useful vectorized operations in data science is filtering data by given conditions. To that end, loc accepts a boolean condition and returns only the rows that satisfy it. In these examples the DataFrame is named wines.
- Get all wines whose quality is greater than 6
wines.loc[wines.quality > 6]
- Get all wines whose quality is greater than 5 and less than 8
wines.loc[(wines.quality > 5) & (wines.quality < 8)]
- Get all wines whose quality is equal to 5 or equal to 7
wines.loc[(wines.quality == 5) | (wines.quality == 7)]
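Each condition above produces a boolean Series (a mask) with one True or False per row, and loc keeps the rows marked True. The sketch below makes the mask explicit on made-up data and shows two equivalent shortcuts, between() and isin(); the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the wines frame.
wines = pd.DataFrame({
    "quality": [4, 5, 6, 7, 8],
    "alcohol": [9.0, 9.5, 10.5, 11.0, 12.5],
})

# The comparison is evaluated for every row at once and yields a boolean mask.
is_good = wines["quality"] > 6
good = wines.loc[is_good]

# For integer quality, "greater than 5 and less than 8" equals "between 6 and 7".
middle = wines.loc[wines["quality"].between(6, 7)]

# "Equal to 5 or equal to 7" can be written with isin().
five_or_seven = wines.loc[wines["quality"].isin([5, 7])]

# Combining masks needs & or | and parentheses around each comparison,
# because & and | bind more tightly than > and ==.
good_and_strong = wines.loc[(wines["quality"] > 6) & (wines["alcohol"] > 12)]
```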
Training your abilities
If you want to take your Data Science skills further, we have created a course that you can download for free here.
In the next chapter, we will take a deep dive into the functions we use to extract information from our data and start transforming it to prepare the input for our prediction models.