Introduction
This is the third post in our data science training series on Pandas. In this article we summarize the Pandas functions used for iteration, mapping, grouping and sorting. These functions let us transform the data and extract useful information and insights from it.
Iteration, Maps, Grouping and Sorting
The ‘Wine Quality Dataset’, created in 2009 by Cortez et al. and available at the UCI Machine Learning Repository, is a well-known dataset containing wine quality information. It includes the physicochemical properties of red and white wines together with a quality score.
Before we start, we use the Pandas head function to display the first rows of the dataset that we will follow throughout the examples.
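As a minimal sketch of that first look, the snippet below builds a tiny stand-in for the dataset from the first two rows printed later in this article and calls head() on it. The variable name wines matches most of the examples; the CSV file name and separator mentioned in the comment are assumptions about how the full dataset would be loaded.

```python
import pandas as pd

# Stand-in for the real dataset: only the first two rows, copied from the
# iterrows() output shown later in this article. With the full file you
# would typically load it instead, e.g.
#   wines = pd.read_csv("winequality-red.csv", sep=";")
# (file name and separator are assumptions).
wines = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8],
        "volatile acidity": [0.70, 0.88],
        "citric acid": [0.0, 0.0],
        "residual sugar": [1.9, 2.6],
        "chlorides": [0.076, 0.098],
        "free sulfur dioxide": [11.0, 25.0],
        "total sulfur dioxide": [34.0, 67.0],
        "density": [0.9978, 0.9968],
        "pH": [3.51, 3.20],
        "sulphates": [0.56, 0.68],
        "alcohol": [9.4, 9.8],
        "quality": [5, 5],
    }
)

# head(n) returns the first n rows; without an argument it returns 5
first_rows = wines.head(1)
print(first_rows)
```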
Iteration
We start with the functions for iterating over a dataset. They are useful when we need to process the data row by row or column by column.
The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dict-like convention of iterating over the keys of the object, which are its column labels.
If we iterate over a DataFrame, we get the column names.
for element in wines:
    print(element)
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol
quality
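To make the difference between the two kinds of iteration concrete, here is a small sketch on a two-row, two-column stand-in for the dataset (the values come from the first two rows of the wine data; the reduced frame itself is just for illustration).

```python
import pandas as pd

# Two-column stand-in for the wine data
wines = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 5]})

# Iterating over a Series yields its values
series_items = [value for value in wines["pH"]]

# Iterating over a DataFrame yields its column labels
frame_items = [label for label in wines]

print(series_items)
print(frame_items)
```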
To iterate over the columns or rows of the DataFrame, we can use the following functions:
Items
Consistent with the dict-like interface, items() iterates through key-value pairs (the older alias iteritems() was deprecated in Pandas 1.5 and removed in 2.0):
Series: (index, scalar value) pairs
DataFrame: (column, Series) pairs
for key, value in wines.items():
    print(key)
    print(value)
fixed acidity
0 7.4
1 7.8
2 7.8
3 11.2
4 7.4
...
1594 6.2
1595 5.9
1596 6.3
1597 5.9
1598 6.0
Name: fixed acidity, Length: 1599, dtype: float64
volatile acidity
0 0.700
1 0.880
2 0.760
3 0.280
4 0.700
...
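The two kinds of pairs listed above can be checked side by side with a short sketch. The reduced DataFrame is a stand-in built from the first two rows of the wine data.

```python
import pandas as pd

# Small stand-in for the wine data
wines = pd.DataFrame({"pH": [3.51, 3.20], "alcohol": [9.4, 9.8]})

# On a Series, items() yields (index, scalar value) pairs
series_pairs = list(wines["pH"].items())

# On a DataFrame, items() yields (column label, Series) pairs
column_names = [name for name, column in wines.items()]
column_types = [type(column).__name__ for name, column in wines.items()]

print(series_pairs)
print(column_names, column_types)
```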
Iterrows
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in that row:
for row_index, row in wines.iterrows():
    print(row_index, row, sep="\n")
0
fixed acidity 7.4000
volatile acidity 0.7000
citric acid 0.0000
residual sugar 1.9000
chlorides 0.0760
free sulfur dioxide 11.0000
total sulfur dioxide 34.0000
density 0.9978
pH 3.5100
sulphates 0.5600
alcohol 9.4000
quality 5.0000
Name: 0, dtype: float64
1
fixed acidity 7.8000
volatile acidity 0.8800
citric acid 0.0000
residual sugar 2.6000
chlorides 0.0980
free sulfur dioxide 25.0000
total sulfur dioxide 67.0000
density 0.9968
pH 3.2000
sulphates 0.6800
alcohol 9.8000
quality 5.0000
Name: 1, dtype: float64
2
fixed acidity 7.800
volatile acidity 0.760
citric acid 0.040
residual sugar 2.300
...
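One detail is visible in the output above: quality is printed as 5.0000 even though it is an integer score. Because each row is returned as a single Series, mixed column types are upcast to a common dtype. The sketch below reproduces this on a two-column stand-in for the dataset.

```python
import pandas as pd

# Stand-in with one float column and one integer column
wines = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# In the DataFrame, quality keeps its integer dtype
print(wines["quality"].dtype)

# Each row produced by iterrows() is one Series with one dtype,
# so the integer quality value is upcast to float
_, first_row = next(wines.iterrows())
print(first_row.dtype, first_row["quality"])
```

If the original types matter, itertuples() is usually the safer choice, since it does not pack each row into a Series.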
Itertuples
The itertuples() method returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row’s index value, while the remaining elements are the row values. Column names that are not valid Python identifiers, such as those containing spaces, are replaced with positional names like _1.
for row in wines.itertuples():
    print(row)
Pandas(Index=0, _1=7.4, _2=0.7, _3=0.0, _4=1.9, chlorides=0.076, _6=11.0, _7=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5)
Pandas(Index=1, _1=7.8, _2=0.88, _3=0.0, _4=2.6, chlorides=0.098, _6=25.0, _7=67.0, density=0.9968, pH=3.2, sulphates=0.68, alcohol=9.8, quality=5)
...
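The fields of each namedtuple can be read by attribute or by position. The following sketch uses the index and name parameters of itertuples() on a small stand-in frame; the tuple name "Wine" is an arbitrary choice for illustration.

```python
import pandas as pd

# Small stand-in; "fixed acidity" contains a space, so it cannot be
# used as a namedtuple field name and gets a positional name instead
wines = pd.DataFrame(
    {"fixed acidity": [7.4, 7.8], "pH": [3.51, 3.20], "quality": [5, 5]}
)

# index=False drops the index field, name sets the namedtuple's type name
rows = list(wines.itertuples(index=False, name="Wine"))
first = rows[0]

print(first)
print(first.pH)       # access by attribute
print(first[0])       # access by position (the renamed "fixed acidity")
```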
Conclusion
The Pandas library provides three functions that make iterating over a dataset easier:
items() (formerly iteritems()): iterates over the DataFrame column by column, yielding each column label together with its Series. It is useful when we want to process one column at a time instead of whole rows.
iterrows(): iterates over the DataFrame row by row, yielding each index value together with the row as a Series. It is useful when we need to inspect complete rows, for example to look for a specific value in each row.
itertuples(): iterates over the DataFrame row by row, yielding each row as a namedtuple. It is useful when we need to process complete rows and want the output in tuple format.
Maps
We continue with the two most important functions for mapping a Series or DataFrame.
Map
The Pandas map() function maps each value of a Series to another value using a dictionary, a function or another Series. It is a convenient way to transform the values of a Series from one domain to another, since it lets us apply an operation to every row of a given column in a dataset.
For example, we can transform the Series obtained from the density column by applying a function that multiplies each of its values by 100.
wines['density'].map(lambda x: x * 100)
0 99.780
1 99.680
2 99.700
3 99.800
4 99.780
...
1594 99.490
1595 99.512
1596 99.574
1597 99.547
1598 99.549
Name: density, Length: 1599, dtype: float64
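Besides a function, map() also accepts a dictionary. The sketch below maps quality scores to text labels; both the short Series and the labels are hypothetical and only serve to show the mechanics. Values that have no key in the dictionary become NaN.

```python
import pandas as pd

# Hypothetical short Series of quality scores
quality = pd.Series([5, 5, 6, 7], name="quality")

# Hypothetical labels; note that 7 is deliberately missing
labels = {5: "average", 6: "good"}

mapped = quality.map(labels)
print(mapped)
```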
Apply
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:
For example, we can reverse the previous transformation of the density column by applying a function that divides each of its values by 100, without having to extract the Series from the DataFrame, since apply() works on the whole DataFrame.
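def divide_by_100(row):
    # With axis='columns' the function receives one row (a Series) at a time
    row = row.copy()  # work on a copy so the original data is not mutated
    row["density"] = row["density"] / 100
    return row

# apply() returns a new DataFrame; wines itself is left unchanged
wines.apply(divide_by_100, axis='columns')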
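The effect of the axis argument is easiest to see with two small reductions on a stand-in frame: with the default axis the function receives each column, and with axis="columns" it receives each row. The lambdas are illustrative choices, not part of the original example.

```python
import pandas as pd

# Small stand-in for the wine data
wines = pd.DataFrame({"density": [0.9978, 0.9968], "pH": [3.51, 3.20]})

# Default axis=0: the function is called once per column
col_range = wines.apply(lambda col: col.max() - col.min())

# axis="columns" (same as axis=1): the function is called once per row
row_sum = wines.apply(lambda row: row["density"] + row["pH"], axis="columns")

print(col_range)
print(row_sum)
```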
Grouping
The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object grouped by quality and count the rows in each group, you can do the following:
wines.groupby(["quality"]).quality.count()
quality
3 10
4 53
5 681
6 638
7 199
8 18
Name: quality, dtype: int64
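The "mapping of labels to group names" can be inspected directly through the groups attribute of a GroupBy object. The sketch below uses a hypothetical five-row frame, not the real dataset, so the counts differ from the output above.

```python
import pandas as pd

# Hypothetical five-row stand-in
wines = pd.DataFrame(
    {"quality": [5, 5, 6, 7, 6], "alcohol": [9.4, 9.8, 9.8, 10.0, 9.5]}
)

grouped = wines.groupby("quality")

# groups maps each group label to the index labels of its rows
groups = {label: list(index) for label, index in grouped.groups.items()}

# count() then reduces each group to its number of non-null values
counts = grouped["quality"].count()

print(groups)
print(counts)
```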
You can also create the GroupBy object and apply a custom function. In this case we group by quality and alcohol and obtain the row with the highest density from each group:
wines.groupby(['quality', 'alcohol']).apply(lambda df: df.loc[df.density.idxmax()])
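As an alternative sketch, and not the approach used above, the same rows can be selected without apply() by computing idxmax() per group and passing the resulting index labels to loc. The four-row frame is hypothetical. Unlike the apply() version, the result keeps the original row index rather than being indexed by the group keys.

```python
import pandas as pd

# Hypothetical four-row stand-in with two (quality, alcohol) groups
wines = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6],
        "alcohol": [9.4, 9.4, 9.8, 9.8],
        "density": [0.9978, 0.9968, 0.9970, 0.9980],
    }
)

# Index label of the densest wine within each (quality, alcohol) group
idx = wines.groupby(["quality", "alcohol"])["density"].idxmax()

# Select those rows from the original frame
densest = wines.loc[idx]
print(densest)
```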
Finally, within the grouping section, one of the most useful functions in data analysis is the aggregation function. In this case we group by quality and obtain the minimum and maximum value of alcohol for each group.
wines.groupby(['quality']).alcohol.agg(['min', 'max'])
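agg() also supports named aggregation, where each keyword becomes the name of an output column. The sketch below runs it on a hypothetical five-row frame; the column names lowest and highest are arbitrary.

```python
import pandas as pd

# Hypothetical five-row stand-in
wines = pd.DataFrame(
    {"quality": [5, 5, 6, 6, 7], "alcohol": [9.4, 9.8, 9.8, 10.5, 10.0]}
)

# Each keyword names an output column; the value names the aggregation
summary = wines.groupby("quality")["alcohol"].agg(lowest="min", highest="max")
print(summary)
```

Passing the aggregations as strings such as "min" and "max" avoids the warning that recent Pandas versions raise when the Python builtins are passed instead.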
Sorting
In this case we use a different dataset to explain the sorting functionality in Pandas clearly. We first look at the small example dataset that we are going to manipulate, which we will call unsorted_df:
- Sort by index
unsorted_df.sort_index()
- Sort by index descending order
unsorted_df.sort_index(ascending=False)
- Sort by columns
unsorted_df.sort_index(axis=1)
- Sort by values
unsorted_df.sort_values(by="two")
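The four calls above can be tried end to end with the sketch below. The contents of unsorted_df here are hypothetical, chosen only so that every sort produces a visibly different order; the only thing taken from the examples is a column called two. The last line adds a sort on several columns with mixed directions.

```python
import pandas as pd

# Hypothetical unsorted frame: shuffled index, columns out of order
unsorted_df = pd.DataFrame(
    {"two": [0.5, 0.9, -1.2, 0.1], "one": [3, 1, 2, 1]},
    index=[2, 0, 3, 1],
)

by_index = unsorted_df.sort_index()                      # rows by index label
by_index_desc = unsorted_df.sort_index(ascending=False)  # descending index
by_columns = unsorted_df.sort_index(axis=1)              # columns by name
by_values = unsorted_df.sort_values(by="two")            # rows by column "two"

# Several keys: ascending on "one", ties broken by descending "two"
by_two_keys = unsorted_df.sort_values(by=["one", "two"], ascending=[True, False])

print(by_values)
print(by_two_keys)
```

All of these methods return a new DataFrame and leave unsorted_df unchanged.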
Training your abilities
If you want to take your Data Science skills further, we have created a course that you can download for free here.
In the next chapter, we will take a deep dive into the functions used to handle missing data.