A Comprehensive Guide to Sorting DataFrames in Pandas using sort_values()

Pandas is a core part of the Python data science stack, largely because of its powerful and flexible DataFrame objects. One of the most common operations on a DataFrame is sorting its rows by the values in one or more columns. This tutorial takes a close look at the sort_values() method, which does exactly that.

Understanding sort_values()

The sort_values() function is used for sorting a DataFrame by one or more columns. The basic syntax of the function is:

DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) 
  • by : Single or list of labels to sort by.
  • axis : {0 or ‘index’, 1 or ‘columns’}, default 0. The axis along which to sort.
  • ascending : Boolean or list of booleans, default True. Sort ascending vs. descending.
  • inplace : Boolean, default False. If True, perform operation in-place.
  • kind : {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’. Choice of sorting algorithm.
  • na_position : {‘first’, ‘last’}, default ‘last’. If ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
  • ignore_index : Boolean, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
  • key : Callable, optional. If not None, apply the key function to the values before sorting.
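
The key and ignore_index parameters are not exercised in the sections that follow, so here is a minimal sketch of both. The people DataFrame and its mixed-case names are made up for illustration, and key assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample data with mixed-case names
people = pd.DataFrame({
    'Name': ['john', 'Anna', 'peter', 'Linda'],
    'Age': [28, 24, 34, 29],
})

# A plain sort compares the raw strings, so uppercase letters come before lowercase
default_sorted = people.sort_values(by='Name')

# key receives the whole column as a Series and must return a Series of the same length
case_insensitive = people.sort_values(by='Name', key=lambda col: col.str.lower())

# ignore_index=True relabels the sorted rows 0..n-1 instead of keeping the original labels
relabeled = people.sort_values(by='Age', ignore_index=True)

print(default_sorted)
print(case_insensitive)
print(relabeled)
```

Note that key is applied to the column as a whole rather than to each element, so it should be a vectorized function.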

Sorting by a Single Column

To sort a DataFrame based on a single column, you can pass the column name to the by parameter:

import pandas as pd 
    
# Sample DataFrame 

df = pd.DataFrame({ 
    'Name': ['John', 'Anna', 'Peter', 'Linda'], 
    'Age': [28, 24, 34, 29], 
    'Salary': [70000, 80000, 120000, 110000] 
}) 

# Sorting by Age 
sorted_df = df.sort_values(by='Age') 
print(sorted_df) 

Sorting by Multiple Columns

You can sort by multiple columns by passing a list of column names to the by parameter. The first column in the list is the primary sorting criterion; each later column is used only to break ties among rows that are equal on all the columns before it.

# Sorting by Age and then by Salary 
sorted_df = df.sort_values(by=['Age', 'Salary']) 
print(sorted_df) 
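
In the sample DataFrame every Age is unique, so the Salary column never gets a chance to break a tie. The sketch below uses a hypothetical staff DataFrame with repeated ages to make the secondary key visible.

```python
import pandas as pd

# Hypothetical data in which two pairs of people share the same age
staff = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 28, 24],
    'Salary': [70000, 80000, 60000, 50000],
})

# Rows are ordered by Age first; Salary decides the order within each age
by_age_then_salary = staff.sort_values(by=['Age', 'Salary'])
print(by_age_then_salary)
```

Within each age group, the row with the lower salary comes first.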

Descending Sort

To sort in descending order, set the ascending parameter to False. If you are sorting by multiple columns, you can instead pass a list of booleans to ascending, one per column in by, to specify the sort order for each column. The list must be the same length as by.

# Sorting by Age in descending order 
sorted_df = df.sort_values(by='Age', ascending=False) 
print(sorted_df) 
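
To illustrate the per-column form of ascending, here is a small sketch that sorts one column ascending and another descending. The teams DataFrame is hypothetical and includes repeated ages so that the second column matters.

```python
import pandas as pd

# Hypothetical data with repeated ages
teams = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 28, 24],
    'Salary': [70000, 80000, 60000, 50000],
})

# Age ascending, then Salary descending within each age
mixed_order = teams.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(mixed_order)
```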

Handling Missing Data

The na_position parameter allows you to control the position of NaN values in the sorted DataFrame. By default, NaNs are placed at the end.

# Sorting with NaN values 
df_with_nan = pd.DataFrame({ 
    'Name': ['John', 'Anna', None, 'Linda'], 
    'Age': [28, None, 34, 29], 
}) 

sorted_df = df_with_nan.sort_values(by='Age', na_position='first') 
print(sorted_df) 
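
One detail worth checking for yourself is that na_position is independent of the sort direction: with the default 'last', NaNs stay at the end even when ascending=False. A short sketch, using a hypothetical ages DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
ages = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, np.nan, 34, 29],
})

# Descending sort: the NaN row still comes last by default
desc_nan_last = ages.sort_values(by='Age', ascending=False)

# Same sort, but with the NaN row moved to the top
desc_nan_first = ages.sort_values(by='Age', ascending=False, na_position='first')

print(desc_nan_last)
print(desc_nan_first)
```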

Sorting Algorithms

Pandas supports different sorting algorithms, selected with the kind parameter. The default is quicksort; the other options are mergesort, heapsort, and stable. Of these, only mergesort and stable are stable sorts, meaning that rows with equal sort values keep their original relative order. For DataFrames, kind is applied only when sorting on a single column or label.
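
As a minimal sketch of what a stable sort guarantees, the example below sorts a hypothetical orders DataFrame that contains repeated Priority values. With kind='stable', rows that tie on Priority stay in the order they had before the sort.

```python
import pandas as pd

# Hypothetical data: customers listed in arrival order, with repeated priorities
orders = pd.DataFrame({
    'Customer': ['A', 'B', 'C', 'D', 'E'],
    'Priority': [2, 1, 2, 1, 2],
})

# A stable sort keeps tied rows in their original relative order
stable_sorted = orders.sort_values(by='Priority', kind='stable')
print(stable_sorted)
```

Here B stays ahead of D, and A, C, E keep their arrival order. The default quicksort makes no such promise about ties, even if it happens to preserve them on a small input.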

In-Place Sorting

By default, sort_values() returns a new DataFrame and leaves the original unchanged. To sort in place instead, set the inplace parameter to True. In that case the method modifies the DataFrame directly and returns None, so do not assign the result back to a variable.

# In-place sorting 
df.sort_values(by='Age', inplace=True) 
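
A common slip is to assign the result of an in-place sort. The sketch below, which rebuilds a small version of the sample DataFrame so that it runs on its own, shows what each style returns.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 34, 29],
})

# With inplace=True the call returns None and df itself is reordered
result = df.sort_values(by='Age', inplace=True)
print(result)
print(df)

# Without inplace, the call returns a new DataFrame and leaves df as it was
oldest_first = df.sort_values(by='Age', ascending=False)
print(oldest_first)
```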

Conclusion

Sorting is a fundamental operation in data analysis and manipulation. With sort_values() you can order a DataFrame by one or more columns, choose the direction for each, control where missing values land, pick a sorting algorithm, and decide whether to sort in place or get a new DataFrame back. Experiment with these parameters to find the combination that suits your use case. Happy coding!