Pandas DataFrame Rank: A Comprehensive Guide

When working with data, understanding the relative position of a value within a distribution can be as crucial as knowing the value itself. The Pandas library in Python provides a method called rank(), which assigns each value its rank within a Series or along an axis of a DataFrame, without reordering the data. In this blog post, we will look closely at the rank() method, explore its parameters, and work through examples that illustrate its uses.

What is Ranking in Pandas?

Ranking in Pandas assigns each element of a Series or DataFrame a rank: its 1-based position in the sorted order of the data. With the default settings, the lowest value gets rank 1, the second lowest gets rank 2, and so on. Identical values each receive the average of the positions they occupy in the sorted order.
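To make the idea of "position in sorted order" concrete, here is a minimal sketch that computes ranks by hand and compares them with rank(). The variable names are illustrative only, and the manual approach assumes all values are unique; it is not how Pandas computes ranks internally.

```python
import pandas as pd

s = pd.Series([90, 85, 92, 88, 95])

# Sort the values, then look up each value's 1-based position in that order.
# This shortcut only matches rank() when there are no ties.
sorted_values = sorted(s)
manual_rank = s.map(lambda v: sorted_values.index(v) + 1).astype(float)

print(manual_rank.tolist())  # [3.0, 1.0, 4.0, 2.0, 5.0]
print(s.rank().tolist())     # [3.0, 1.0, 4.0, 2.0, 5.0]
```

Note that the original order of the Series is untouched: rank() returns a new Series aligned to the same index.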

How to Use the rank() Function

The rank() function can be used on a Pandas DataFrame to rank the values in each column or row. The syntax of the function is as follows:

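DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)
  • axis : {0 or 'index', 1 or 'columns'}, default 0
    • The axis along which to compute the ranks: 0 ranks the values within each column, 1 ranks them within each row.
  • method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
    • How to rank a group of tied values: 'average' gives each the mean rank of the group, 'min' the lowest rank in the group, 'max' the highest rank in the group, 'first' assigns ranks in the order the values appear, and 'dense' works like 'min' but the rank always increases by 1 between groups.
  • numeric_only : bool, default False (None in older Pandas releases)
    • For DataFrames, whether to rank only the numeric (float, int, boolean) columns.
  • na_option : {'keep', 'top', 'bottom'}, default 'keep'
    • How to rank NaN values.
  • ascending : bool, default True
    • Whether or not the elements should be ranked in ascending order.
  • pct : bool, default False
    • Whether or not to return the rankings in percentile form, that is, as fractions between 0 and 1.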
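The axis and pct parameters are easier to grasp with a small example. The sketch below uses a made-up DataFrame (the column and index names are purely illustrative) to rank within columns, within rows, and in percentile form.

```python
import pandas as pd

# Hypothetical exam scores; the names are invented for illustration.
df = pd.DataFrame(
    {"math": [90, 70, 80], "physics": [60, 85, 75]},
    index=["ann", "bob", "cy"],
)

col_ranks = df.rank()          # axis=0 (default): rank within each column
row_ranks = df.rank(axis=1)    # axis=1: rank within each row
pct_ranks = df.rank(pct=True)  # each rank divided by the number of ranked values

print(col_ranks)
print(row_ranks)
print(pct_ranks)
```

With axis=0, ann's math score of 90 is ranked against the other math scores and gets rank 3. With axis=1, the same 90 is ranked only against ann's physics score of 60 and gets rank 2.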

Examples of Using rank()

Basic Usage

Let’s start with a simple example:

import pandas as pd 
    
data = {'Scores': [90, 85, 92, 88, 95]} 
df = pd.DataFrame(data) 

df['Rank'] = df['Scores'].rank() 
print(df) 

Output:

   Scores  Rank
0      90   3.0
1      85   1.0
2      92   4.0
3      88   2.0
4      95   5.0

Handling Ties

When values are tied, Pandas by default assigns each of them the average of the ranks they would have received had there been no ties.

data = {'Scores': [90, 85, 92, 88, 88]} 
df = pd.DataFrame(data) 
df['Rank'] = df['Scores'].rank() 
print(df) 

Output:

   Scores  Rank
0      90   4.0
1      85   1.0
2      92   5.0
3      88   2.5
4      88   2.5

Notice that both scores of 88 receive rank 2.5, which is the average of positions 2 and 3.

Using Different Ranking Methods

You can choose different methods for ranking ties. For example, the 'min' method assigns every tied value the lowest rank in its group.

df['Rank'] = df['Scores'].rank(method='min') 
print(df) 

Output:

   Scores  Rank
0      90   4.0
1      85   1.0
2      92   5.0
3      88   2.0
4      88   2.0
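To see how the five methods differ, it can help to put them side by side. The following sketch reuses the tied scores from above and builds one column per method; the comparison DataFrame is just a convenience for display.

```python
import pandas as pd

scores = pd.Series([90, 85, 92, 88, 88], name="Scores")
methods = ["average", "min", "max", "first", "dense"]

# One column of ranks per tie-breaking method.
comparison = pd.DataFrame({m: scores.rank(method=m) for m in methods})
comparison.insert(0, "Scores", scores)

print(comparison)
```

Only the two tied scores of 88 change between 'average', 'min', 'max', and 'first': they get 2.5 and 2.5, 2 and 2, 3 and 3, and 2 and 3 respectively. With 'dense', the ties share rank 2 and the next value (90) gets rank 3 instead of 4, so no rank is skipped.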

Ranking in Descending Order

To rank in descending order, so that the highest value gets rank 1, set ascending=False.

df['Rank'] = df['Scores'].rank(ascending=False) 
print(df) 

Output:

   Scores  Rank
0      90   2.0
1      85   5.0
2      92   1.0
3      88   3.5
4      88   3.5
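Descending ranks combine naturally with a tie method. As one possible use, the sketch below builds a leaderboard-style placing where tied scores share the better place; the 'Place' column name is our own choice, not something Pandas provides.

```python
import pandas as pd

df = pd.DataFrame({"Scores": [90, 85, 92, 88, 88]})

# Highest score gets place 1; tied scores share the lowest rank in their group.
# Converting to int is safe here because there are no NaN values.
df["Place"] = df["Scores"].rank(ascending=False, method="min").astype(int)

# A stable sort keeps tied rows in their original relative order.
leaderboard = df.sort_values("Place", kind="stable")
print(leaderboard)
```

Here both scores of 88 take place 3, and the score of 85 takes place 5, since place 4 is skipped.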

Handling Missing Data

The rank() function also allows you to decide how to treat missing data through the na_option parameter.

  • na_option='keep' : Leave NaN values unranked, so their rank is also NaN.
  • na_option='top' : Assign the lowest ranks to NaN values, so they rank first.
  • na_option='bottom' : Assign the highest ranks to NaN values, so they rank last.
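The three options above can be compared on a small Series with one missing value. This is a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([90, np.nan, 85, 92])

keep = s.rank(na_option="keep")      # NaN stays NaN: [2.0, nan, 1.0, 3.0]
top = s.rank(na_option="top")        # NaN ranked first: [3.0, 1.0, 2.0, 4.0]
bottom = s.rank(na_option="bottom")  # NaN ranked last: [2.0, 4.0, 1.0, 3.0]

print(pd.DataFrame({"value": s, "keep": keep, "top": top, "bottom": bottom}))
```

With 'top', the ranks of the real values all shift up by one because the missing value occupies rank 1.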

Conclusion

Pandas' rank() function is a powerful tool for data analysis, helping you understand the relative position of data points within a distribution. Whether you are dealing with ties, handling missing data, or ranking in descending order, rank() provides the flexibility needed to handle a wide variety of ranking scenarios.