The Power of Indexing: Boosting Data Wrangling Efficiency with Pandas

Introduction

Pandas is the most widely used Python library for data manipulation, and it allows us to access and manipulate data efficiently.

By understanding and utilizing indexing techniques effectively in Pandas, we can significantly improve the speed and efficiency of our data-wrangling tasks.

In this article, we’ll explore various indexing techniques in Pandas, and we’ll see how to leverage them for faster data wrangling.

Introducing Indexing in Pandas

The Pandas library provides two primary objects: Series and DataFrames.

A Pandas Series is a one-dimensional labeled array, capable of holding any kind of data type.

A Pandas DataFrame is a table, similar to a spreadsheet, capable of storing any kind of data and is built with rows and columns.

To be more precise, a Pandas DataFrame can also be seen as an ordered collection of Pandas Series.

So, both Series and DataFrames have an index, which provides a way to uniquely identify and access every single element.

In this article, we’ll demonstrate some indexing techniques in Pandas to enhance your daily data manipulation tasks.

Coding Indexing Techniques in Pandas

Now, let’s explore some indexing techniques using actual Python code.

Integer-Based Indexing

We’ll begin with the integer-based method that enables us to select rows and columns in a data frame.

But first, let’s understand how we can create a data frame in Pandas:

import pandas as pd data = { 'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15]
} df = pd.DataFrame(data) print(df)

This will produce:

As we can see, the data for a Pandas data frame are created in the same manner we create a dictionary in Python. In fact, the names of the columns are the keys and the numbers in the lists are the values. Column names and values are separated by a colon, exactly like keys and values in dictionaries. Lastly, they’re housed within curly brackets.

The integer-based methodology uses the method iloc[] for indexing a data frame. For example, if we want to index two rows, we can type the following:


sliced_rows = df.iloc[1:3] print(sliced_rows)

And we get:

 A B C
1 2 7 12
2 3 8 13

Note: Remember that in Python we start counting from 0, iloc[1:3] selects the second and the third row.

Now, iloc[] can also select columns like so:


sliced_cols = df.iloc[:, 0:2] print(sliced_cols)

And we get:

So, in this case, the colon inside the square brackets means that we want to take all the values in the rows. Then, after the comma, we specify which columns we want to get (remembering that we start counting from 0).

Another way to slice indexes with integers is by using the loc[] method. For example, like so:


sliced_rows = df.loc[1:3] print(sliced_rows)

And we get:

Note: Taking a deep look at both loc[] and iloc[] methods, we can see that in .loc[], the start and end labels are both inclusive, while iloc[] includes the start index and excludes the end index.

Also, we want to add that the loc[] method gives us the possibility to slice a Pandas DataFrame with renamed indexes. Let’s see what we mean with an example:

import pandas as pd data = { 'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15]
} df = pd.DataFrame(data, index=['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']) sliced_rows = df.loc['Row_2':'Row_4'] print(sliced_rows)

And we get:

 A B C
Row_2 2 7 12
Row_3 3 8 13
Row_4 4 9 14

So, as we can see, now the indexes are no longer integers: they are strings and the loc[] method can be used to slice the data frame as we did.

Boolean Indexing

Boolean indexing involves selecting rows or columns based on a condition expressed as a boolean. The data frame (or the series) will be filtered to include only the rows or columns that satisfy the given condition.

For example, suppose we have a data frame with all numeric values. We want to filter the data frame by indexing a column so that it shows us only the values greater than two. We can do it like so:

import pandas as pd data = { 'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15]
} df = pd.DataFrame(data) condition = df['A'] > 2 filtered_rows = df[condition] print(filtered_rows)

And we get:

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

So, with condition = df['A'] > 2, we’ve created a Pandas series that gets the values greater than two in column A. Then, with filtered_rows = df[condition], we’ve created the filtered dataframe that shows only the rows that match the condition we imposed on column A.

Of course, we can index a dataframe so that it matches different conditions, even for different columns. For example, say we want to add a condition on column A and on column B. We can do it like so:


condition = (df['A'] > 2) & (df['B'] < 10) filtered_rows = df[condition] print(filtered_rows)

And we get:

 A B C
2 3 8 13
3 4 9 14

So, to add multiple conditions, we use the operator &.

Also, we can even slice an entire data frame. For example, say that we just want to see the columns that have values greater than eight. We can do it like so:


condition = (df > 8).all() filtered_cols = df.loc[:, condition] print(filtered_cols)

And we get:

And so, only column C matches the imposed condition.
So, with the method all(), we’re imposing a condition on the entire data frame.

Setting New Indexes and Resetting to Old Ones

There are situations in which we may take a column of a Pandas data frame and use it as an index for the entire data frame. For example, in cases where this kind of manipulation may result in faster slicing of the indexes.

For example, consider we have a data frame that stores data related to countries, cities, and their respective populations. We may want to set the city column as the index of the data frame. We can do it like so:

import pandas as pd data = { 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'], 'Country': ['USA', 'USA', 'USA', 'USA'], 'Population': [8623000, 4000000, 2716000, 2302000]
} df = pd.DataFrame(data) df.set_index(['City'], inplace=True) print(df)

And we have:

 Country Population
City New York USA 8623000
Los Angeles USA 4000000
Chicago USA 2716000
Houston USA 2302000

Note that we used a similar method before, specifically at the end of the paragraph “Integer-Based Indexing”. That method was used to rename the indexes: we had numbers in the beginning and we renamed them as strings.

In this last case, a column has become the index of the data frame. This means that we can filter it using loc[] as we did before:


sliced_rows = df.loc['New York':'Chicago'] print(sliced_rows)

And the result is:

 Country Population
City New York USA 8623000
Los Angeles USA 4000000
Chicago USA 2716000

Note: When we index a column as we did, the column name “drops down,” meaning it’s no longer at the same level as the names of the other columns, as we can see. In these cases, the indexed column (“City”, in this case) can’t be accessed as we do with columns in Pandas anymore, until we restore it as a column.

So, if we want to restore the classical indexing method, restoring the indexed column(s) as column(s), we can type the following:


df_reset = df.reset_index() print(df_reset)

And we get:

 City Country Population
0 New York USA 8623000
1 Los Angeles USA 4000000
2 Chicago USA 2716000
3 Houston USA 2302000

So, in this case, we’ve created a new DataFrame called df_reset with the method reset_index(), which has restored the indexes, as we can see.

Sorting Indexes

Pandas also gives us the possibility to sort indexes in descending order (the ascending order is the standard one) by using the sort_index() method like so:

import pandas as pd data = { 'B': [6, 7, 8, 9, 10], 'A': [1, 2, 3, 4, 5], 'C': [11, 12, 13, 14, 15]
} df = pd.DataFrame(data) df_sorted = df.sort_index(ascending=False) print(df_sorted)

And this results in:

This methodology can even be used when we rename indexes or when we index a column. For example, say we want to rename the indexes and sort them in descending order:

import pandas as pd data = { 'B': [6, 7, 8, 9, 10], 'A': [1, 2, 3, 4, 5], 'C': [11, 12, 13, 14, 15]
} df = pd.DataFrame(data, index=["row 1", "row 2", "row 3", "row 4", "row 5"]) df_sorted = df.sort_index(ascending=False) print(df_sorted)

And we have:

 B A C
row 5 10 5 15
row 4 9 4 14
row 3 8 3 13
row 2 7 2 12
row 1 6 1 11

So, to achieve this result, we use the sort_index() and pass the ascending=False parameter to it.

Conclusions

In this article, we’ve shown different methodologies to index Pandas data frames.

Some methodologies yield results similar to others, so the choice has to be made keeping in mind the exact result we want to achieve when we’re manipulating our data.

Source: https://stackabuse.com/the-power-of-indexing-boosting-data-wrangling-efficiency-with-pandas/

The Power of Indexing: Boosting Data Wrangling Efficiency with Pandas

Introduction

Introducing Indexing in Pandas

Coding Indexing Techniques in Pandas

Integer-Based Indexing

Boolean Indexing

Setting New Indexes and Resetting to Old Ones

Sorting Indexes

Conclusions

Welcome to

Accessibility Dashboard