Mastering Data Deduplication in Pandas: Using drop_duplicates()

Data duplication is a common issue faced by data analysts and scientists. Duplicates can skew your analysis, leading to inaccurate results. Thankfully, Pandas provides a handy function, drop_duplicates(), to help you identify and remove duplicate rows from your DataFrame, ensuring the accuracy of your data analysis.

Understanding drop_duplicates()

The drop_duplicates() function returns a DataFrame with duplicate rows removed. By default, all columns are considered when identifying duplicates.

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False) 
  • subset : column label or sequence of labels, optional. Only considers certain columns for identifying duplicates. By default, all columns are used.
  • keep : {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep.
    • ‘first’: Drop duplicates except for the first occurrence.
    • ‘last’: Drop duplicates except for the last occurrence.
    • False: Drop all duplicates.
  • inplace : bool, default False. Whether to drop duplicates in place or to return a copy.
  • ignore_index : bool, default False. If True, the resulting axis will be labeled 0, 1, …, n - 1.
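
Before removing anything, it often helps to preview which rows Pandas would flag. Here is a minimal sketch using the companion DataFrame.duplicated() method, which accepts the same subset and keep parameters and returns a boolean mask of the rows drop_duplicates() would drop:

import pandas as pd

# Tiny illustrative frame: the first two rows are identical
df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})

# True marks the rows that drop_duplicates() would remove
print(df.duplicated())

# Count the duplicate rows without dropping anything
print(df.duplicated().sum())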

Removing All Duplicate Rows

To remove all duplicate rows from your DataFrame, simply call the drop_duplicates() method.

import pandas as pd 

# Sample DataFrame 
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'John'], 
    'Age': [28, 24, 34, 29, 28], 
    'Salary': [70000, 80000, 120000, 110000, 70000]} 
    
df = pd.DataFrame(data) 

# Removing all duplicate rows 
df_unique = df.drop_duplicates() 
print(df_unique) 
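
With the sample data above, the John row at index 4 is an exact duplicate of the row at index 0, so only the first occurrence survives:

    Name  Age  Salary
0   John   28   70000
1   Anna   24   80000
2  Peter   34  120000
3  Linda   29  110000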

Keeping the Last Occurrence

If you want to keep the last occurrence of each duplicate row and remove the rest, you can pass keep='last'.

# Keeping the last occurrence of the duplicate rows 
df_unique = df.drop_duplicates(keep='last') 
print(df_unique) 
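
This time the John row at index 0 is dropped and the one at index 4 is kept, as the surviving index labels show:

    Name  Age  Salary
1   Anna   24   80000
2  Peter   34  120000
3  Linda   29  110000
4   John   28   70000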

Removing All Occurrences of Duplicates

If you want to remove every occurrence of duplicated rows, keeping none of them, use keep=False.

# Removing all occurrences of duplicate rows 
df_unique = df.drop_duplicates(keep=False) 
print(df_unique) 
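
Both John rows disappear, leaving only the rows that were never duplicated:

    Name  Age  Salary
1   Anna   24   80000
2  Peter   34  120000
3  Linda   29  110000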

Specifying Columns to Identify Duplicates

You might want to identify duplicates based on specific columns. You can achieve this by passing the column names to the subset parameter.

# Removing duplicates based on specific columns 
df_unique = df.drop_duplicates(subset=['Name', 'Age']) 
print(df_unique) 
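
In the sample data the two John rows happen to be identical in every column, so subset does not change the outcome there. A hypothetical variation, with the same Name and Age but different salaries, shows why the parameter matters:

import pandas as pd

# Hypothetical data: John appears twice with the same Name and Age
# but different Salary values
df2 = pd.DataFrame({'Name': ['John', 'John'],
                    'Age': [28, 28],
                    'Salary': [70000, 75000]})

# All columns considered: the rows differ in Salary, so both are kept
print(df2.drop_duplicates())

# Only Name and Age considered: the second row is dropped as a duplicate
print(df2.drop_duplicates(subset=['Name', 'Age']))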

In-Place Removal

If you wish to remove the duplicates directly in the original DataFrame, you can set the inplace parameter to True.

# Removing duplicates in-place 
df.drop_duplicates(inplace=True) 
print(df) 
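
One caveat: with inplace=True the method modifies the DataFrame and returns None, so assigning the result back to a variable is a common mistake.

# Correct: modifies df directly; no assignment needed
df.drop_duplicates(inplace=True)

# Incorrect: inplace=True returns None, so this would overwrite df with None
# df = df.drop_duplicates(inplace=True)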

Resetting Index After Dropping Duplicates

When you drop duplicates, the surviving rows keep their original index labels, leaving gaps in the index. You can reset it using the reset_index() method with drop=True; alternatively, pass ignore_index=True to drop_duplicates() directly, as shown below.

# Resetting index after dropping duplicates 
df_unique = df.drop_duplicates().reset_index(drop=True) 
print(df_unique) 
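
The ignore_index parameter achieves the same thing in a single call, relabeling the result 0, 1, …, n - 1:

# One-step alternative: drop duplicates and relabel the index together
df_unique = df.drop_duplicates(ignore_index=True)
print(df_unique)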

Conclusion

Removing duplicate data is a crucial step in the data cleaning process. With Pandas’ drop_duplicates() function, you can easily identify and remove duplicate rows from your DataFrame, ensuring that your data analysis is based on accurate and reliable data. By mastering the use of this function, you can significantly enhance the quality of your data and the reliability of your analysis results. Happy data cleaning!