LeetCode 609: Find Duplicate File in System Solution in Python – A Step-by-Step Guide
Imagine you’re a digital detective, rummaging through a messy file system where folders hold files with names and contents—like "root/a 1.txt(abcd)"—and your mission is to spot the duplicates, files with the same content hiding under different names or paths. That’s the challenge of LeetCode 609: Find Duplicate File in System, a medium-level problem that’s all about grouping files by their content. Using Python, we’ll explore two solutions: the Best Solution, a hash map approach that groups files by content for speed, and an Alternative Solution, a sorting-based method that’s more visual but slower. With detailed examples, beginner-friendly breakdowns—especially for the hash map—and clear code, this guide will help you uncover those duplicates, whether you’re new to coding or leveling up. Let’s dive into the file system and start sleuthing!
What Is LeetCode 609: Find Duplicate File in System?
In LeetCode 609: Find Duplicate File in System, you’re given a list of strings representing file paths, each containing a directory followed by space-separated files in the format filename(content). Your task is to find all groups of duplicate files—those with identical content—and return their full paths. For example, given ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"], files 1.txt and 3.txt share "abcd," so you return their paths. This problem tests your ability to parse strings, group data, and handle file system logic efficiently.
Problem Statement
- Input: A list of strings, e.g., ["root/a 1.txt(abcd) 2.txt(efgh)"].
- Output: A list of lists, each containing full paths of files with identical content (only groups with 2+ files).
- Rules:
  - Path format: directory file1(content1) file2(content2) ....
  - Full path: directory/filename.
  - Return only groups with duplicates.
Constraints
- 1 <= paths.length <= 2 * 10⁴.
- 1 <= paths[i].length <= 3000.
- Total characters across all paths: ≤ 5 * 10⁵.
- File names and contents: lowercase letters, digits, spaces, dots.
- No two files or directories within the same directory share the same name.
Examples
- Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Output: [["root/a/1.txt", "root/b/3.txt"]]
- Input: ["root/a 1.txt(abcd) 2.txt(efgh) 3.txt(efgh)"]
- Output: [["root/a/2.txt", "root/a/3.txt"]]
- Input: ["root/a 1.txt(abcd)", "root/b 2.txt(abcd)", "root/c 3.txt(abcd)"]
- Output: [["root/a/1.txt", "root/b/2.txt", "root/c/3.txt"]]
These examples show the grouping—let’s solve it!
Understanding the Problem: Finding Duplicate Files
To solve LeetCode 609: Find Duplicate File in System in Python, we need to parse each path string, extract file names and contents, group files by content, and return paths of duplicates. A naive approach might compare every file pair, but that’s O(n²)—way too slow. Instead, we’ll use:
- Best Solution (Hash Map with Content Grouping): O(n) time, O(n) space—fast and scalable.
- Alternative Solution (Sorting-Based Approach): O(n log n) time, O(n) space—visual but less efficient.
Let’s start with the hash map solution, breaking it down for beginners.
Best Solution: Hash Map with Content Grouping
Why This Is the Best Solution
The hash map approach is the top choice for LeetCode 609 because it’s efficient—O(n) time to process all files—and uses a dictionary to group files by content in one pass, making it both fast and memory-friendly. It’s perfect for quick lookups and scales well with large inputs. It’s like sorting files into content-labeled folders as you go!
How It Works
Think of this as organizing a file cabinet:
- Step 1: Initialize Hash Map:
  - Key: File content.
  - Value: List of full paths.
- Step 2: Parse Paths (see the parsing sketch after this list):
  - Split each path into directory and files.
  - For each file, extract name and content (e.g., "1.txt(abcd)" → "1.txt", "abcd").
  - Build full path: directory/filename.
- Step 3: Group by Content:
  - Add each full path to the map under its content key.
- Step 4: Filter Duplicates:
  - Return lists with 2+ paths.
It’s like a content-based filing system—quick and neat!
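To make Step 2 concrete, here's a minimal parsing sketch for a single path string (the variable names are illustrative, not part of the final solution):

path = "root/a 1.txt(abcd) 2.txt(efgh)"
parts = path.split()            # ["root/a", "1.txt(abcd)", "2.txt(efgh)"]
directory = parts[0]            # "root/a"
for file_info in parts[1:]:
    name_end = file_info.index("(")        # position of the opening paren
    filename = file_info[:name_end]        # e.g., "1.txt"
    content = file_info[name_end + 1:-1]   # e.g., "abcd"
    print(f"{directory}/{filename} -> {content}")

Running this prints root/a/1.txt -> abcd and root/a/2.txt -> efgh, exactly the pairs we'll feed into the map.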
Step-by-Step Example
Example: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Init: content_map = {}.
- Path 1: "root/a 1.txt(abcd) 2.txt(efgh)":
  - Dir: "root/a".
  - File 1: "1.txt(abcd)":
    - Path: "root/a/1.txt".
    - Content: "abcd".
    - content_map["abcd"] = ["root/a/1.txt"].
  - File 2: "2.txt(efgh)":
    - Path: "root/a/2.txt".
    - Content: "efgh".
    - content_map["efgh"] = ["root/a/2.txt"].
- Path 2: "root/b 3.txt(abcd)":
  - Dir: "root/b".
  - File: "3.txt(abcd)":
    - Path: "root/b/3.txt".
    - Content: "abcd".
    - content_map["abcd"] = ["root/a/1.txt", "root/b/3.txt"].
- Filter:
  - "abcd": 2 paths, include.
  - "efgh": 1 path, skip.
- Output: [["root/a/1.txt", "root/b/3.txt"]].
Code with Detailed Line-by-Line Explanation
Here’s the Python code, explained clearly:
from typing import List
from collections import defaultdict

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Step 1: Initialize hash map from content to full paths
        content_map = defaultdict(list)
        # Step 2: Process each path
        for path in paths:
            # Split into directory and files
            parts = path.split()
            directory = parts[0]
            # Process each file
            for file_info in parts[1:]:
                # Extract name and content
                name_end = file_info.index("(")
                filename = file_info[:name_end]
                content = file_info[name_end + 1:-1]  # Remove parentheses
                # Build full path and add to map
                full_path = f"{directory}/{filename}"
                content_map[content].append(full_path)
        # Step 3: Filter groups with duplicates
        result = [group for group in content_map.values() if len(group) > 1]
        # Step 4: Return result
        return result
- Imports: defaultdict auto-creates an empty list the first time a content key is seen; List comes from typing for the type hints.
- content_map: Maps each content string to the full paths that contain it.
- Parsing loop:
  - Split each path into its directory and file tokens.
  - For each token, split at ( to separate the filename from the content.
  - Build the full path and append it under the content key.
- Filtering: Keep only the groups with 2+ paths.
- Time Complexity: O(n)—n is total characters across paths.
- Space Complexity: O(n)—store all paths.
This is like a file sorter—fast and tidy!
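To see it in action, a quick sanity check might look like this (assuming the Solution class above is defined in the same file):

solution = Solution()
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
print(solution.findDuplicate(paths))
# Prints [['root/a/1.txt', 'root/b/3.txt']]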
Alternative Solution: Sorting-Based Approach
Why an Alternative Approach?
The sorting-based approach collects all files with their contents, sorts by content, and groups duplicates—O(n log n) time, O(n) space. It’s more visual, letting you see duplicates line up, but it’s slower due to sorting. It’s like alphabetizing files to spot matches!
How It Works
Picture this as a file lineup:
- Step 1: Parse into list of [content, path].
- Step 2: Sort by content.
- Step 3: Group consecutive duplicates.
- Step 4: Filter groups with 2+ paths.
It’s like sorting a file stack to find twins!
Step-by-Step Example
Example: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Parse:
  - ["abcd", "root/a/1.txt"].
  - ["efgh", "root/a/2.txt"].
  - ["abcd", "root/b/3.txt"].
- Sort:
  - ["abcd", "root/a/1.txt"].
  - ["abcd", "root/b/3.txt"].
  - ["efgh", "root/a/2.txt"].
- Group:
  - "abcd": ["root/a/1.txt", "root/b/3.txt"].
  - "efgh": ["root/a/2.txt"].
- Filter: [["root/a/1.txt", "root/b/3.txt"]].
Code for Sorting Approach
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Step 1: Parse every file into a [content, full_path] pair
        file_list = []
        for path in paths:
            parts = path.split()
            directory = parts[0]
            for file_info in parts[1:]:
                name_end = file_info.index("(")
                filename = file_info[:name_end]
                content = file_info[name_end + 1:-1]
                full_path = f"{directory}/{filename}"
                file_list.append([content, full_path])
        # Step 2: Sort by content (lists compare element-wise)
        file_list.sort()
        # Step 3: Group consecutive entries sharing the same content
        result = []
        i = 0
        while i < len(file_list):
            current_content = file_list[i][0]
            group = []
            while i < len(file_list) and file_list[i][0] == current_content:
                group.append(file_list[i][1])
                i += 1
            if len(group) > 1:
                result.append(group)
        return result
- Parsing: Builds [content, full_path] pairs, exactly as in the hash map version.
- Sorting: Lists compare element-wise, so file_list.sort() orders the pairs by content.
- Grouping: The inner while loop collects consecutive pairs with equal content into one group.
- Time Complexity: O(n log n)—sorting dominates.
- Space Complexity: O(n)—store all files.
It’s a sorted file scan—clear but slower!
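If you'd rather lean on the standard library for the grouping step, itertools.groupby can replace the manual index loop. Here's a sketch of Step 3 rewritten that way, assuming file_list has already been parsed and sorted as above:

from itertools import groupby
from operator import itemgetter

# groupby only merges consecutive equal keys, so the earlier
# file_list.sort() is what makes this grouping correct.
result = []
for _, pairs in groupby(file_list, key=itemgetter(0)):
    group = [path for _, path in pairs]
    if len(group) > 1:
        result.append(group)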
Comparing the Two Solutions
- Hash Map (Best):
  - Pros: O(n) time, O(n) space, fast and scalable.
  - Cons: Less visual.
- Sorting (Alternative):
  - Pros: Intuitive, easy-to-trace grouping.
  - Cons: O(n log n) time; the sort dominates.
Hash map wins for efficiency.
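If you want to check that claim on your own machine, a rough timing harness might look like this (a sketch only; HashMapSolution and SortingSolution are hypothetical names, assuming you've renamed the two classes above so they can coexist in one module):

import timeit

# Hypothetical setup: the two Solution classes above renamed to
# HashMapSolution and SortingSolution so both can live in one module.
paths = [f"root/d{i} a.txt(abcd) b.txt(x{i})" for i in range(2000)]

for cls in (HashMapSolution, SortingSolution):
    elapsed = timeit.timeit(lambda: cls().findDuplicate(paths), number=10)
    print(f"{cls.__name__}: {elapsed:.3f}s")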
Additional Examples and Edge Cases
- Input: ["root/a 1.txt(x)"]
- Output: [] (no duplicates).
- Input: ["root/a 1.txt(x) 2.txt(x)"]
- Output: [["root/a/1.txt", "root/a/2.txt"]].
Both handle these well.
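You can confirm both quickly with the hash map Solution class from earlier:

solution = Solution()
assert solution.findDuplicate(["root/a 1.txt(x)"]) == []
assert solution.findDuplicate(["root/a 1.txt(x) 2.txt(x)"]) == [["root/a/1.txt", "root/a/2.txt"]]
print("edge cases pass")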
Complexity Breakdown
- Hash Map: Time O(n), Space O(n).
- Sorting: Time O(n log n), Space O(n).
Hash map is optimal.
Key Takeaways
- Hash Map: Quick grouping—smart!
- Sorting: Visual duplicates—clear!
- Strings: Parsing is fun.
- Python Tip: Dicts rock—see Python Basics.
Final Thoughts: Uncover Those Duplicates
LeetCode 609: Find Duplicate File in System in Python is a cool file-sleuthing challenge. The hash map offers speed and elegance, while sorting gives a visual twist. Want more? Try LeetCode 49: Group Anagrams or LeetCode 205: Isomorphic Strings. Ready to detect? Head to Solve LeetCode 609 on LeetCode and find those duplicates today!