We will start today with the second deliverable of our data science training with Pandas. In this article we summarize the pandas functions used for indexing, selection and filtering.

Creating, Getting Info, Selecting and Util Functions

Before we start, let us look at the small example dataset that the examples in this article follow. It is a well-known dataset that contains wine information.

Indexing, Selecting and Filtering

In this chapter we will explain some key functions that give a high-level view and a statistical overview of our dataset.

We will start with the info() function, which reports the number of columns, the name of every column, the number of non-null elements in each column and the data type of every column. In short, this function gives us a detailed overview of the data columns.

df.info()

Wines Dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 # Column Non-Null Count Dtype  
-------- ------ -------------- -----  
 0 fixed acidity 1599 non-null float64
 1 volatile acidity 1599 non-null float64
 2 citric acid 1599 non-null float64
 3 residual sugar 1599 non-null float64
 4 chlorides 1599 non-null float64
 5 free sulfur dioxide 1599 non-null float64
 6 total sulfur dioxide 1599 non-null float64
 7 density 1599 non-null float64
 8 pH 1599 non-null float64
 9 sulphates 1599 non-null float64
 10 alcohol 1599 non-null float64
 11 quality 1599 non-null int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
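
The same report can be reproduced on a tiny frame. This is a minimal sketch, assuming a hand-made DataFrame named sample whose values are illustrative and not taken from the real dataset. It also shows that info() only prints its report, so the underlying numbers are better read from shape and count().

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, None],  # None becomes NaN (a missing value)
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 6],
})

# info() prints to stdout by default; passing buf captures the report as text.
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are available programmatically.
n_rows, n_cols = sample.shape
non_null_counts = sample.count()  # non-null values per column
print(non_null_counts)
```

Note that the non-null count of a column is lower than the number of rows as soon as it contains missing values, which is what happens to the hypothetical "fixed acidity" column here.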

Dtypes

The dtypes attribute shows us the data type of each column.

df.dtypes

Wines Dataset: 

fixed acidity float64
volatile acidity float64
citric acid float64
residual sugar float64
chlorides float64
free sulfur dioxide float64
total sulfur dioxide float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
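
As a small illustration of what dtypes returns and how it can be used, here is a sketch on a hypothetical two-column frame named sample (the values are made up). It reads the dtype of one column, selects columns by type and converts a column with astype.

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only.
sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})

# dtypes is a Series that maps each column name to its data type.
dtypes = sample.dtypes
print(dtypes)

# select_dtypes keeps only the columns of the requested types.
float_cols = sample.select_dtypes(include=["float64"]).columns.tolist()

# astype returns a converted copy; the original column is left unchanged.
quality_as_float = sample["quality"].astype("float64")
```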

The describe() function gives us a statistical overview of each numeric column through a few main metrics that offer the first useful insights: count is the number of non-null values, mean and std give the average value and the spread around it, min and max bound the range, and the 25%, 50% and 75% rows are the quartiles. The 50% row is the median, the value that half of the column lies at or below. The mode (the most prevalent value) is not part of this output for numeric columns.

However, it is important to note that these metrics are computed on non-missing values only, so we should also check the percentage of NaNs (missing data) in each column to understand what the metrics really mean. This will be analysed in upcoming articles.

df.describe()

Wines Dataset: 

fixed acidity   volatile acidity    citric acid residual sugar  chlorides   free sulfur dioxide total sulfur dioxide    density pH  sulphates   alcohol quality
count   1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean    8.319637    0.527821    0.270976    2.538806    0.087467    15.874922   46.467792   0.996747    3.311113    0.658149    10.422983   5.636023
std 1.741096    0.179060    0.194801    1.409928    0.047065    10.460157   32.895324   0.001887    0.154386    0.169507    1.065668    0.807569
min 4.600000    0.120000    0.000000    0.900000    0.012000    1.000000    6.000000    0.990070    2.740000    0.330000    8.400000    3.000000
25% 7.100000    0.390000    0.090000    1.900000    0.070000    7.000000    22.000000   0.995600    3.210000    0.550000    9.500000    5.000000
50% 7.900000    0.520000    0.260000    2.200000    0.079000    14.000000   38.000000   0.996750    3.310000    0.620000    10.200000   6.000000
75% 9.200000    0.640000    0.420000    2.600000    0.090000    21.000000   62.000000   0.997835    3.400000    0.730000    11.100000   6.000000
max 15.900000   1.580000    1.000000    15.500000   0.611000    72.000000   289.000000  1.003690    4.010000    2.000000    14.900000   8.000000
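
To make the link between describe() and missing data concrete, here is a minimal sketch on a hypothetical four-row frame named sample (made-up values). It computes the summary, the percentage of NaNs per column, and the median and mode separately.

```python
import pandas as pd

# Hypothetical sample with one missing value; the values are illustrative only.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, None, 10.0],
    "quality": [5, 5, 6, 7],
})

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max.
summary = sample.describe()
print(summary)

# Percentage of missing values per column.
missing_pct = sample.isna().mean() * 100
print(missing_pct)

# The median equals the 50% row; the mode has to be computed separately.
median_quality = sample["quality"].median()
mode_quality = sample["quality"].mode()  # a Series, since ties are possible
```

The count row of the summary only includes non-missing values, so the hypothetical "alcohol" column reports 3 even though the frame has 4 rows.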

Indexing and Selection

Here we are going to take a deep dive into the two main pandas indexers for indexing and selection: ‘iloc’ and ‘loc’.

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

– A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).

– A list or array of labels [‘a’, ‘b’, ‘c’].

– A slice object with labels ‘a’:‘f’ (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index!)

– A boolean array (any NA values will be treated as False).

– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

– An integer e.g. 5.

– A list or array of integers [4, 3, 0].

– A slice object with ints 1:7.

– A boolean array (any NA values will be treated as False).

– A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
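
The difference between labels and positions is easiest to see when the index is not 0, 1, 2, and so on. The sketch below assumes a hypothetical frame named sample with the labels 10, 20, 30 and 40. It shows label versus position lookup, the two error types described above, an out-of-bounds slice and a callable indexer.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
sample = pd.DataFrame({"quality": [5, 6, 7, 8]}, index=[10, 20, 30, 40])

by_label = sample.loc[20, "quality"]  # label 20 is the second row
by_position = sample.iloc[1, 0]       # position 1 is also the second row

# .loc raises KeyError for a label that is not in the index.
try:
    sample.loc[1]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# .iloc raises IndexError for an out-of-bounds integer position.
try:
    sample.iloc[10]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"

# An out-of-bounds slice is allowed and simply clipped.
clipped = sample.iloc[2:100]

# A callable receives the frame and returns a valid indexer (here a boolean Series).
high = sample.loc[lambda frame: frame["quality"] > 6]
```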

Let's see how these indexers work through some examples:

`iloc` examples

Get First Row

df.iloc[0]

Get first column

df.iloc[:, 0]

Get first column of the first row (returned as a one-element Series; df.iloc[0, 0] returns the value itself)

df.iloc[0:1, 0]

Get rows at positions 3 and 4 (the stop position 5 is excluded)

df.iloc[3:5]

Get rows 3, 7, 10

df.iloc[[3, 7, 10]]

Get last five rows

df.iloc[-5:]
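
The calls above can be tried on a small frame to check what each one returns. This is a sketch on a hypothetical six-row frame named sample (made-up values), not on the real wines dataset.

```python
import pandas as pd

# Hypothetical six-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4],
    "quality": [5, 5, 5, 6, 5, 5],
})

first_row = sample.iloc[0]             # Series indexed by column name
first_column = sample.iloc[:, 0]       # Series with one value per row
one_cell_series = sample.iloc[0:1, 0]  # one-element Series
one_cell_scalar = sample.iloc[0, 0]    # the value itself
rows_3_and_4 = sample.iloc[3:5]        # positions 3 and 4; stop is excluded
picked = sample.iloc[[0, 2, 5]]        # an explicit list of positions
last_two = sample.iloc[-2:]            # negative positions count from the end
```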

`loc` examples

Get first row of column ‘quality’

df.loc[0, 'quality']

Get all rows from columns ‘quality’, ‘sulphates’, ‘alcohol’

df.loc[:, ['quality', 'sulphates', 'alcohol']]

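Get rows from the one labelled ‘litres’ onward, for columns ‘quality’ to ‘alcohol’ (this assumes a DataFrame df1 whose index contains the label ‘litres’ and in which ‘quality’ comes before ‘alcohol’)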

df1.loc['litres':, 'quality':'alcohol']

Get rows labelled 3 to 5, both included (different from iloc)

df.loc[3:5]
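
The inclusive behaviour of label slices can be checked side by side with iloc. The sketch below assumes a hypothetical frame named sample with a default integer index, plus a copy named labelled that uses the made-up string labels ‘a’, ‘b’ and ‘c’.

```python
import pandas as pd

# Hypothetical six-row sample; the values are illustrative only.
sample = pd.DataFrame({
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.56, 0.56],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 9.4],
    "quality": [5, 5, 5, 6, 5, 5],
})

loc_rows = sample.loc[3:5]    # labels 3, 4 and 5: the stop label is included
iloc_rows = sample.iloc[3:5]  # positions 3 and 4: the stop position is excluded

# The same frame with string labels on its first three rows.
labelled = sample.head(3).copy()
labelled.index = ["a", "b", "c"]

# Label slices work on both axes and include both endpoints.
subset = labelled.loc["b":, "sulphates":"alcohol"]
```

Here subset keeps the rows ‘b’ and ‘c’ and the columns ‘sulphates’ and ‘alcohol’, because the column slice follows the order in which the columns appear in the frame.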

Conditional Selection

One of the most useful techniques in data science is filtering data by given conditions. To that end, loc accepts a boolean condition and returns only the rows for which it holds. In the examples below the DataFrame is named wines.

Get all wines whose quality is greater than 6

wines.loc[wines.quality > 6]

Get all wines whose quality is greater than 5 and less than 8

wines.loc[(wines.quality > 5) & (wines.quality < 8)]

Get all wines whose quality is equal to 5 or equal to 7

wines.loc[(wines.quality == 5) | (wines.quality == 7)]
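
Each condition is simply a boolean Series, which is why the parentheses around every comparison are needed when combining them with & or |. The sketch below assumes a hypothetical five-row frame named wines (made-up values). It also shows between and isin as possible shorter spellings of the last two filters.

```python
import pandas as pd

# Hypothetical five-row sample; the values are illustrative only.
wines = pd.DataFrame({
    "alcohol": [9.0, 9.4, 10.5, 11.2, 12.8],
    "quality": [3, 5, 6, 7, 8],
})

# A comparison produces a boolean Series with one True/False per row.
mask = wines["quality"] > 6
high = wines.loc[mask]

# Combine conditions with & (and) or | (or); each one needs its own parentheses.
middle = wines.loc[(wines["quality"] > 5) & (wines["quality"] < 8)]
five_or_seven = wines.loc[(wines["quality"] == 5) | (wines["quality"] == 7)]

# Equivalent spellings for integer quality scores.
middle_between = wines.loc[wines["quality"].between(6, 7)]  # both ends included
five_or_seven_isin = wines.loc[wines["quality"].isin([5, 7])]
```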

Training your abilities

If you want to take your Data Science skills further, we have created a course that you can download for free here.

In the next chapter, we will take a deep dive into the functions we use to extract information from our data and start transforming it to prepare the input of our prediction models.

Getting Started with Pandas – Lesson 2