Method Chaining With Pandas

This article is going to be slightly shorter than what I usually tend to post, but I hope you will enjoy it nonetheless!

I like a good podcast

I was recently looking for some podcasts I could listen to while running, and stumbled upon a series of interesting interviews with Matt Harrison. The name sounded familiar, and I realised that I had purchased a couple of his books and thoroughly enjoyed them.

I particularly recommend this one:

[Image: cover of the recommended book]

If you haven’t heard of Matt Harrison before, he has a website called MetaSnake through which he teaches Python and a bunch of other stuff.

Though I don’t know him personally, Matt seems to be a pretty nice chap, and he is a very active member of the Pandas community.

Anyway, Matt released a new book in 2021 entitled Effective Pandas, which I unfortunately haven’t had a chance to read yet. But you can listen to him discussing some of its chapters in the following videos:

One of the key takeaways from these conversations is the importance of writing “good Pandas code”, and more specifically of chaining the calls when applying a series of methods to an object. Though the concept of method chaining isn’t new, I must say that I had never seen it used in the context of Pandas.

So what is method chaining?

At a very high level, we’re going to make good use of the fact that Python ignores line breaks and indentation in any expression that is written between parentheses.
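As a minimal sketch of that rule, using only a plain Python string (the variable names are made up for illustration), the two expressions below should give the same result:

```python
# Hypothetical example: the same chain of string methods written two ways.
text = "  Method Chaining With Pandas  "

# Everything on one line.
one_line = text.strip().lower().replace(" ", "_")

# Same chain inside parentheses, one method per line.
# Python joins these lines implicitly, so the layout is free-form.
multi_line = (
    text
    .strip()
    .lower()
    .replace(" ", "_")
)

print(one_line == multi_line)  # True
print(multi_line)              # method_chaining_with_pandas
```

Without the surrounding parentheses, the second version would raise a syntax error, because each line starting with a dot would be read as a separate statement.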

To illustrate the above statement, we’re first loading a very simple HTML table from Wikipedia into a Pandas dataframe. Here’s what the original dataset looks like:

[Image: the original table as it appears on Wikipedia]

The fun part

We can now open our favorite IDE and import pandas.

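import pandas as pd

def getDataframe(url_table, ind):
    # read_html returns a list with one dataframe per table found on the page
    df = pd.read_html(url_table)[ind]
    return df

# Index 1 picks the second table on the page
df = getDataframe("https://en.wikipedia.org/wiki/Historical_population_of_Ireland", 1)
df.sample(5)  # show five random rows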

We should get something like:

[Image: five random rows of the dataframe]

So, what if we want to do a bit of cleaning first? Say we want to drop the df["Rank"] column, keep only the rows whose df["Province"] value is Leinster, and then sort the dataframe by df["Density (/ km²)"] in descending order. I guess what most of us would do is:

df = df.drop(columns=["Rank"])
df = df.query("Province == 'Leinster'")
df.sort_values("Density (/ km²)", ascending=False)

Now, this is more or less what the code would look like using the method chaining approach:

(
df
.drop(columns=["Rank"])
.query("Province == 'Leinster'")
.sort_values("Density (/ km²)", ascending=False)
)

By the way, the way I aligned the lines of code to the left is absolutely arbitrary. In fact, we could rewrite this code snippet as follows and it would still work fine:

(
             df
.drop(columns=["Rank"])                              .query("Province == 'Leinster'")
          .sort_values("Density (/ km²)", ascending =
False)
)

In other words, and as mentioned above, Python simply ignores line breaks and extra whitespace when they are placed between parentheses. The pros of following Matt’s advice aren’t limited to purely syntax-related tastes though. Beyond the improved readability, method chaining avoids creating a string of intermediate dataframe variables and helps keep the original df unaltered.
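To check the "original df unaltered" point without hitting Wikipedia, here is a small sketch on a made-up dataframe. The name toy and all of its values are placeholders for illustration, not real census figures; only the column names mirror the table used above.

```python
import pandas as pd

# Made-up stand-in for the Wikipedia table (values are NOT real census data).
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "County": ["A", "B", "C", "D"],
        "Province": ["Leinster", "Munster", "Leinster", "Ulster"],
        "Density (/ km²)": [10.0, 20.0, 30.0, 40.0],
    }
)

# Same chain as above; each method returns a new dataframe,
# and only the final result is bound to a name.
cleaned = (
    toy
    .drop(columns=["Rank"])
    .query("Province == 'Leinster'")
    .sort_values("Density (/ km²)", ascending=False)
)

print("Rank" in toy.columns)       # True: the original still has its column
print(len(toy))                    # 4: no rows were removed from the original
print(cleaned["County"].tolist())  # ['C', 'A']
```

Note that the step-by-step version earlier reassigns df on every line, so the freshly loaded table is gone after the first statement, whereas here toy can be reused for a different chain.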

We can also use method chaining inside a function, and chain any method that Pandas supports onto the objects we create, including .plot().

def getPlot(x,y,title):
    result = (
        df
        .drop(columns=["Rank"])
        .rename(columns=lambda col: col.lower())  # "col" avoids shadowing the x argument
        .query("province == 'Leinster'")
        .sort_values("density (/ km²)", ascending=True)
        .plot(
            figsize=(12,5)
            , kind="barh"
            , x=x
            , y=y 
            , cmap="Blues_r"
            , grid=True
            , title=title
        )
    )
    return result

getPlot("county","density (/ km²)","2016 Census overview for Leinster")

[Image: horizontal bar chart titled "2016 Census overview for Leinster"]
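One possible variation, offered as a sketch rather than as anything Matt prescribes: getPlot reads df from the global scope, so it only works with that one dataframe. Passing the dataframe and the province in as arguments makes the same chain reusable. The helper name clean_province and the toy data below are hypothetical, and the plotting step is left out to keep the example short.

```python
import pandas as pd

def clean_province(frame, province):
    """Hypothetical helper: the chain from getPlot, minus the .plot() call."""
    return (
        frame
        .drop(columns=["Rank"])
        .rename(columns=str.lower)        # lowercase every column name
        .query("province == @province")   # @ refers to the local variable
        .sort_values("density (/ km²)", ascending=True)
    )

# Made-up data with the same column names as the Wikipedia table.
toy = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "County": ["A", "B", "C", "D"],
        "Province": ["Leinster", "Munster", "Leinster", "Ulster"],
        "Density (/ km²)": [30.0, 20.0, 10.0, 40.0],
    }
)

result = clean_province(toy, "Leinster")
print(result["county"].tolist())  # ['C', 'A']
```

Since the function returns a dataframe, the caller can keep chaining on the result, for example by adding .plot(...) with the same arguments as in getPlot.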

That’s it for today! If you want to learn more, I highly recommend following Matt on Twitter.