Unraveling Pandas DataFrame duplicated(): A Comprehensive Guide

Pandas is a widely used Python library for data manipulation and analysis. Among its many methods, duplicated() identifies duplicate rows within a DataFrame. This guide explains how duplicated() works, what its parameters do, and how to apply it in your data cleaning workflows.

Introduction to Pandas DataFrame duplicated()

The duplicated() method returns a Boolean Series, indexed like the DataFrame, in which True marks each row that is considered a duplicate. The signature of the method is as follows:

DataFrame.duplicated(subset=None, keep='first') 
  • subset : A single label or list of labels to consider for identifying duplicates. By default, all columns are used.
  • keep : Determines which duplicates (if any) to mark.
    • 'first' (default): Mark duplicates as True except for the first occurrence.
    • 'last' : Mark duplicates as True except for the last occurrence.
    • False : Mark all duplicates as True .

How Does duplicated() Work?

Identifying Duplicate Rows

You can use duplicated() to identify and flag duplicate rows based on all columns or a subset of columns.

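import pandas as pd

data = {
    'A': [1, 2, 2, 3, 3],
    'B': [5, 6, 6, 8, 8],
    'C': [9, 10, 10, 12, 12]
}

df = pd.DataFrame(data)

# Rows 2 and 4 repeat rows 1 and 3, so only they are flagged
print(df.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool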
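
Because the result is a Boolean Series aligned with the DataFrame's index, it can be used directly as a filter. The following is a minimal sketch that reuses the sample data above; the names mask, duplicate_rows, and n_duplicates are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3],
    'B': [5, 6, 6, 8, 8],
    'C': [9, 10, 10, 12, 12],
})

mask = df.duplicated()          # Boolean Series aligned with df's index
duplicate_rows = df[mask]       # only the rows flagged as duplicates
n_duplicates = int(mask.sum())  # True counts as 1, so the sum is the count

print(duplicate_rows)
print(n_duplicates)
```

Counting the flagged rows before removing anything is a cheap way to check how much data a later cleaning step would discard.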

Specifying Columns for Identifying Duplicates

If you want to check for duplicates based on specific columns, you can pass the column names to the subset parameter.

# Only columns A and B are compared; for this sample data the result matches df.duplicated()
print(df.duplicated(subset=['A', 'B']))
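
The effect of subset is easier to see when rows agree on some columns but not on others. The sketch below uses a small hypothetical DataFrame, df2 (not part of the sample data above), in which two rows share the same A and B values but differ in C.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 agree on A and B but differ in C
df2 = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': [5, 6, 6, 8],
    'C': [9, 10, 11, 12],
})

all_columns = df2.duplicated()                 # compares A, B and C
subset_ab = df2.duplicated(subset=['A', 'B'])  # compares A and B only

print(all_columns.tolist())
print(subset_ab.tolist())
```

With all columns compared, no row is flagged because C differs; restricting the comparison to A and B flags row 2.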

Handling First and Last Occurrences

By default, duplicated() leaves the first occurrence of each repeated row unmarked (False) and marks every subsequent occurrence as a duplicate (True). You can alter this behavior using the keep parameter.

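# keep='last': flag every occurrence except the last one
print(df.duplicated(keep='last'))   # False, True, False, True, False
# keep=False: flag every row that has a duplicate anywhere in the DataFrame
print(df.duplicated(keep=False))    # False, True, True, True, True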
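
To compare the three keep settings row by row, the resulting Series can be placed side by side. This is only an illustrative sketch on the sample data; the comparison DataFrame and its column names are introduced here and are not part of the pandas API.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3],
    'B': [5, 6, 6, 8, 8],
    'C': [9, 10, 10, 12, 12],
})

# One column per keep setting, one row per row of df
comparison = pd.DataFrame({
    'first': df.duplicated(keep='first'),
    'last': df.duplicated(keep='last'),
    'all': df.duplicated(keep=False),
})

print(comparison)
```

Row 0 is unique, so it is False in every column, while each row of a repeated pair is True under keep=False.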

Using duplicated() in Data Cleaning

Removing Duplicate Rows

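Identifying duplicate rows is often the first step in cleaning your data. Once you’ve identified the duplicates, you can use the drop_duplicates() method to remove them. By default it keeps the first occurrence of each row and returns a new DataFrame, leaving the original unchanged.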

cleaned_df = df.drop_duplicates() 
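
The relationship between the two methods can be checked directly: with default arguments, dropping duplicates should give the same rows as filtering with the inverted duplicated() mask. The sketch below assumes the sample data used earlier; via_mask and renumbered are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3],
    'B': [5, 6, 6, 8, 8],
    'C': [9, 10, 10, 12, 12],
})

cleaned_df = df.drop_duplicates()
via_mask = df[~df.duplicated()]   # keep the rows that are NOT flagged

print(cleaned_df.equals(via_mask))

# The surviving rows keep their original index labels (0, 1, 3);
# reset_index(drop=True) renumbers them from 0 if that is preferred.
renumbered = cleaned_df.reset_index(drop=True)
print(renumbered.index.tolist())
```

Keeping the original index labels makes it possible to trace cleaned rows back to the source data, so renumbering is optional.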

Customizing Duplicate Removal

Just like with duplicated(), you can customize how duplicates are removed with drop_duplicates() using the subset and keep parameters.

cleaned_df = df.drop_duplicates(subset=['A', 'B'], keep='last') 
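
The choice of keep matters most when the rows that match on the subset differ in other columns, because it decides which version survives. The sketch below uses a hypothetical DataFrame, df3, in which two rows share A and B but carry different C values.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 match on A and B but have different C values
df3 = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': [5, 6, 6, 8],
    'C': [9, 10, 11, 12],
})

kept_first = df3.drop_duplicates(subset=['A', 'B'], keep='first')
kept_last = df3.drop_duplicates(subset=['A', 'B'], keep='last')
dropped_all = df3.drop_duplicates(subset=['A', 'B'], keep=False)

print(kept_first['C'].tolist())   # C value taken from the first matching row
print(kept_last['C'].tolist())    # C value taken from the last matching row
print(dropped_all['C'].tolist())  # every matching row removed
```

If the rows are ordered so that later rows are more recent, keep='last' retains the latest record for each (A, B) pair; that ordering is an assumption about the data, not something pandas checks.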

Conclusion

The duplicated() method flags duplicate rows in a DataFrame and is a natural first step in data cleaning. The subset parameter controls which columns are compared, and the keep parameter controls which occurrences are marked. Combined with drop_duplicates(), which accepts the same two parameters, it lets you inspect duplicates before deciding which rows to remove, so that later analysis does not count the same record more than once.