Getting Started with Pandas – Lesson 3

Apiumhub - Dec 2 '21 - Dev Community

Introduction

This is the third post in our data science training series with Pandas. In this article we summarize the Pandas functions used for iteration, mapping, grouping and sorting. These functions let us transform the data and extract useful information and insights.

Iteration, Maps, Grouping and Sorting

The Wine Quality dataset, created in 2009 by Cortez et al. and available at the UCI Machine Learning Repository, is a well-known dataset that contains wine quality information. It includes the physicochemical properties of red and white wines together with a quality score.

Before we start, we use the pandas head() function to display the first rows of the dataset that the examples in this article follow.
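The dataset itself is not reproduced here, so the following is only a minimal sketch: it rebuilds the first two rows that appear in this article's row-wise output under the assumed variable name `wines`. With the real file you would normally load the data with `pd.read_csv` instead.

```python
import pandas as pd

# Two sample rows copied from the row-wise output shown in this article;
# the full dataset has 1599 rows.
wines = pd.DataFrame(
    {
        "fixed acidity": [7.4, 7.8],
        "volatile acidity": [0.70, 0.88],
        "citric acid": [0.0, 0.0],
        "residual sugar": [1.9, 2.6],
        "chlorides": [0.076, 0.098],
        "free sulfur dioxide": [11.0, 25.0],
        "total sulfur dioxide": [34.0, 67.0],
        "density": [0.9978, 0.9968],
        "pH": [3.51, 3.20],
        "sulphates": [0.56, 0.68],
        "alcohol": [9.4, 9.8],
        "quality": [5, 5],
    }
)

# head(n) returns the first n rows (5 by default) as a new DataFrame
first_rows = wines.head(1)
print(first_rows)
```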


Iteration

We start with the functions for iterating over a dataset, which are useful when we need to go through the data one column or one row at a time.

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dict-like convention of iterating over the keys of the object, which are its column labels. (The older Panel structure, which behaved the same way, has since been removed from pandas.)
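As a small illustration of this difference, here is a sketch on a hypothetical two-column sample (the name `sample` and its values are assumptions, not the full wine dataset):

```python
import pandas as pd

# Hypothetical two-column sample standing in for the wine data
sample = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 5]})

# Iterating over a Series yields its values
series_values = [value for value in sample["pH"]]

# Iterating over a DataFrame yields its column labels, like a dict's keys
frame_labels = [label for label in sample]

print(series_values)
print(frame_labels)  # ['pH', 'quality']
```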

If we iterate over a DataFrame, we get the column names.

for column_name in wines:
    print(column_name)

fixed acidity           
volatile acidity       
citric acid              
residual sugar       
chlorides               
free sulfur dioxide     
total sulfur dioxide    
density    
pH           
sulphates
alcohol
quality

To iterate over the columns or rows of a DataFrame, we can use the following functions:

Items

Consistent with the dict-like interface, items() iterates through key-value pairs (iteritems() is an older alias that was removed in pandas 2.0):

  • Series: (index, scalar value) pairs

  • DataFrame: (column, Series) pairs

for key, value in wines.items():
    print(key)
    print(value)


fixed acidity
0 7.4
1 7.8
2 7.8
3 11.2
4 7.4
        ... 
1594 6.2
1595 5.9
1596 6.3
1597 5.9
1598 6.0
Name: fixed acidity, Length: 1599, dtype: float64
volatile acidity
0 0.700
1 0.880
2 0.760
3 0.280
4 0.700
        ...
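To make the two kinds of pairs concrete, the following sketch calls items() on both a DataFrame and a Series. The `sample` frame is a hypothetical stand-in with only two rows and two columns.

```python
import pandas as pd

# Hypothetical sample: the first two rows of two columns
sample = pd.DataFrame({"pH": [3.51, 3.20], "quality": [5, 5]})

# DataFrame.items() yields (column label, Series) pairs
column_pairs = [(label, column.tolist()) for label, column in sample.items()]

# Series.items() yields (index label, scalar value) pairs
value_pairs = list(sample["pH"].items())

print(column_pairs)
print(value_pairs)
```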

Iterrows

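iterrows() lets you iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding each index value along with a Series containing the data in that row. Because each row is returned as a single Series, its values are converted to a common dtype, which is why the integer quality column appears as 5.0000 in the output: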

for row_index, row in wines.iterrows():
    print(row_index, row, sep="\n")

0
fixed acidity 7.4000
volatile acidity 0.7000
citric acid 0.0000
residual sugar 1.9000
chlorides 0.0760
free sulfur dioxide 11.0000
total sulfur dioxide 34.0000
density 0.9978
pH 3.5100
sulphates 0.5600
alcohol 9.4000
quality 5.0000
Name: 0, dtype: float64
1
fixed acidity 7.8000
volatile acidity 0.8800
citric acid 0.0000
residual sugar 2.6000
chlorides 0.0980
free sulfur dioxide 25.0000
total sulfur dioxide 67.0000
density 0.9968
pH 3.2000
sulphates 0.6800
alcohol 9.8000
quality 5.0000
Name: 1, dtype: float64
2
fixed acidity 7.800
volatile acidity 0.760
citric acid 0.040
residual sugar 2.300
...
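The dtype conversion that iterrows() performs can be checked on a small scale. This is a sketch on a hypothetical two-column `sample`, mixing one float column and one integer column:

```python
import pandas as pd

sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

row_dtypes = []
for row_index, row in sample.iterrows():
    # Each row is a Series with a single dtype, so the integer
    # quality value is upcast to float alongside alcohol
    row_dtypes.append(str(row.dtype))

print(sample.dtypes)  # per-column dtypes of the original frame
print(row_dtypes)     # dtype of each row Series
```

If the original column dtypes matter, itertuples() is usually the safer choice because it does not build a Series per row.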

Itertuples

The itertuples() method returns an iterator yielding a namedtuple for each row in the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row values. Column names that are not valid Python identifiers, such as those containing spaces, are replaced with positional names like _1.

for row in wines.itertuples():
    print(row)    

Pandas(Index=0, _1=7.4, _2=0.7, _3=0.0, _4=1.9, chlorides=0.076, _6=11.0, _7=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5)
Pandas(Index=1, _1=7.8, _2=0.88, _3=0.0, _4=2.6, chlorides=0.098, _6=25.0, _7=67.0, density=0.9968, pH=3.2, sulphates=0.68, alcohol=9.8, quality=5)
...
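A short sketch of how the namedtuple fields can be accessed, using a hypothetical three-column `sample`. The `index` and `name` parameters used at the end are part of the itertuples() signature.

```python
import pandas as pd

sample = pd.DataFrame(
    {"fixed acidity": [7.4, 7.8], "pH": [3.51, 3.20], "quality": [5, 5]}
)

for row in sample.itertuples():
    # "fixed acidity" is not a valid Python identifier, so it is
    # exposed positionally as _1; pH and quality keep their names
    print(row.Index, row._1, row.pH, row.quality)

# index=False drops the index field; name=None yields plain tuples
plain_rows = list(sample.itertuples(index=False, name=None))
print(plain_rows)
```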

Conclusion

Pandas provides three methods that make iterating over a dataset easier:

items() (named iteritems() before pandas 2.0): iterates over the DataFrame column by column, yielding each column name together with its values as a Series. It is useful when we want to process one whole column at a time.

iterrows(): iterates over the DataFrame row by row, yielding each index value together with the row as a Series. It is useful when we need all the values of a row at once, for example to stop at the first row that meets a condition.

itertuples(): iterates over the DataFrame row by row, yielding each row as a namedtuple. It is useful when we need full rows in tuple form, and it is generally faster than iterrows().

Maps

We continue with the two most important functions for mapping a Series or a DataFrame.

Map

The Pandas map() function maps each value of a Series to another value using a dictionary, a function or another Series. It is a convenient way to move the values of a Series from one domain to another, as it lets us transform every row of a given column in a dataset.

For example, we can transform the Series obtained from the density column by applying a function that multiplies each of its values by 100.

wines['density'].map(lambda x: x * 100)

0 99.780
1 99.680
2 99.700
3 99.800
4 99.780
         ...  
1594 99.490
1595 99.512
1596 99.574
1597 99.547
1598 99.549
Name: density, Length: 1599, dtype: float64
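Besides a function, map() also accepts a dictionary. The sketch below shows both forms on a hypothetical `quality` Series; the labels "low" and "medium" are invented for illustration.

```python
import pandas as pd

quality = pd.Series([3, 5, 8], name="quality")

# Mapping with a function: called once per value
doubled = quality.map(lambda score: score * 2)

# Mapping with a dictionary: values missing from the dictionary become NaN
labels = quality.map({3: "low", 5: "medium"})

print(doubled.tolist())  # [6, 10, 16]
print(labels.tolist())
```

Note that map() returns a new Series; the original is only changed if the result is assigned back.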

Apply

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument.

For example, we can divide each value of the density column by 100 (the inverse of the map() example above) without having to extract the Series from the DataFrame, since apply() works on DataFrames directly. With axis='columns', the function receives one row at a time.

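def divide_by_100(row):
    # With axis='columns', apply() passes one row at a time as a Series.
    # Work on a copy so the row passed in is not mutated.
    row = row.copy()
    row['density'] = row['density'] / 100
    return row

# apply() returns a new DataFrame; assign the result to keep it
wines.apply(divide_by_100, axis='columns')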
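To see what the axis argument changes, here is a sketch that applies the same function along both axes of a hypothetical `sample` frame. The helper name `value_range` and the values are assumptions chosen to keep the arithmetic exact.

```python
import pandas as pd

sample = pd.DataFrame({"alcohol": [9.0, 10.0, 12.0], "pH": [3.0, 3.5, 4.0]})

def value_range(values):
    # values is one column (axis=0) or one row (axis='columns') as a Series
    return values.max() - values.min()

# Default axis=0: the function is called once per column
per_column = sample.apply(value_range)

# axis='columns': the function is called once per row
per_row = sample.apply(value_range, axis="columns")

print(per_column)
print(per_row)
```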

Grouping

Grouping splits the rows of a DataFrame into groups that share the same value of one or more keys; more abstractly, it maps each row label to a group name. To group the wines by quality and count the rows in each group, you can do the following:

wines.groupby(["quality"]).quality.count()

quality
3 10
4 53
5 681
6 638
7 199
8 18
Name: quality, dtype: int64
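The GroupBy object can also be inspected before any aggregation is computed. The following sketch uses a hypothetical six-row `sample` and assumes grouping by a single column name, in which case each group key is a plain scalar.

```python
import pandas as pd

sample = pd.DataFrame(
    {"quality": [5, 5, 6, 7, 6, 5], "alcohol": [9.4, 9.8, 9.8, 10.0, 11.2, 9.4]}
)

# groupby() only records how to split the rows; nothing is aggregated yet
grouped = sample.groupby("quality")

# Iterating over the GroupBy object yields (group key, sub-DataFrame) pairs
group_sizes = {key: len(group) for key, group in grouped}

# count() returns the number of non-null values per group
counts = grouped["quality"].count()

print(group_sizes)
print(counts)
```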

You can also apply a custom function to each group of a GroupBy object. In this case we group by quality and alcohol and, for each group, select the row with the highest density:

wines.groupby(['quality', 'alcohol']).apply(lambda df: df.loc[df.density.idxmax()])
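The same "best row per group" idea can also be written without apply(). This is an alternative sketch, not the approach used above: it runs on a hypothetical four-row `sample` and groups by quality only, to keep the result easy to check by eye.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6],
        "alcohol": [9.4, 9.8, 9.8, 11.2],
        "density": [0.9978, 0.9968, 0.9970, 0.9980],
    }
)

# idxmax() returns, per group, the index label of the row with the highest density
densest_labels = sample.groupby("quality")["density"].idxmax()

# loc then fetches those complete rows from the original frame
densest_rows = sample.loc[densest_labels]

print(densest_rows)
```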

Finally, within the grouping section, one of the most useful functions in data analysis is the aggregation function agg(), which computes one or more summary statistics per group in a single call.

In this case we group by quality and obtain the minimum and maximum alcohol value for each group.

wines.groupby(['quality']).alcohol.agg(['min', 'max'])
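As a sketch of what agg() returns, the example below runs the aggregation on a hypothetical four-row `sample` and also shows named aggregation, which lets you pick the output column names. The names `lowest` and `highest` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame(
    {"quality": [5, 5, 6, 6], "alcohol": [9.4, 9.8, 9.8, 11.2]}
)

# A list of function names gives one output column per function
alcohol_range = sample.groupby("quality")["alcohol"].agg(["min", "max"])

# Named aggregation: keyword = output column name, value = function name
named = sample.groupby("quality")["alcohol"].agg(lowest="min", highest="max")

print(alcohol_range)
print(named)
```

Passing the function names as strings rather than the Python built-ins lets pandas use its own optimized implementations.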

Sorting

In this case we use a different dataset to explain the sorting functionality in Pandas more clearly. It is a small example DataFrame, which we will call unsorted_df:


  • Sort by index
unsorted_df.sort_index()

  • Sort by index descending order
unsorted_df.sort_index(ascending=False)

  • Sort by columns
unsorted_df.sort_index(axis=1)

  • Sort by values
unsorted_df.sort_values(by="two")
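The four sorting calls can be tried together on a stand-in frame. The contents of the original unsorted_df are not available here, so the values, the index labels and the column "one" are assumptions; only the column name "two" comes from the call above.

```python
import pandas as pd

# Hypothetical stand-in for unsorted_df: shuffled row labels and
# columns that are not in alphabetical order
unsorted_df = pd.DataFrame(
    {"two": [0.4, -1.2, 0.9, 0.1], "one": [1.5, 0.3, -0.7, 2.0]},
    index=[3, 0, 2, 1],
)

by_index = unsorted_df.sort_index()                      # row labels ascending
by_index_desc = unsorted_df.sort_index(ascending=False)  # row labels descending
by_columns = unsorted_df.sort_index(axis=1)              # column labels ascending
by_values = unsorted_df.sort_values(by="two")            # rows ordered by column "two"

print(by_values)
```

Each call returns a new DataFrame, so unsorted_df keeps its original order unless the result is assigned back.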

Training your abilities

If you want to take your Data Science skills further, we have created a course that you can download for free here.

In the next chapter, we will take a deep dive into the functions used to handle missing data.
