LeetCode 609: Find Duplicate File in System Solution in Python – A Step-by-Step Guide
Imagine you’re a digital detective, rummaging through a messy file system where folders hold files with names and contents—like "root/a 1.txt(abcd)"—and your mission is to spot the duplicates, files with the same content hiding under different names or paths. That’s the challenge of LeetCode 609: Find Duplicate File in System, a medium-level problem that’s all about grouping files by their content. Using Python, we’ll explore two solutions: the Best Solution, a hash map approach that groups files by content for speed, and an Alternative Solution, a sorting-based method that’s more visual but slower. With detailed examples, beginner-friendly breakdowns—especially for the hash map—and clear code, this guide will help you uncover those duplicates, whether you’re new to coding or leveling up. Let’s dive into the file system and start sleuthing!
What Is LeetCode 609: Find Duplicate File in System?
In LeetCode 609: Find Duplicate File in System, you’re given a list of strings representing file paths, each containing a directory followed by space-separated files in the format filename(content). Your task is to find all groups of duplicate files—those with identical content—and return their full paths. For example, given ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"], files 1.txt and 3.txt share "abcd," so you return their paths. This problem tests your ability to parse strings, group data, and handle file system logic efficiently.
Problem Statement
- Input: A list of strings, e.g., ["root/a 1.txt(abcd) 2.txt(efgh)"].
- Output: A list of lists, each containing full paths of files with identical content (only groups with 2+ files).
- Rules:
  - Path format: directory file1(content1) file2(content2) ....
  - Full path: directory/filename.
  - Return only groups with duplicates.
Constraints
- 1 <= paths.length <= 2 * 10⁴.
- 1 <= paths[i].length <= 3000.
- Total characters across all paths: ≤ 5 * 10⁵.
- File names and contents: lowercase letters, digits, spaces, dots.
- No two files or directories within the same directory share the same name.
Examples
- Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Output: [["root/a/1.txt", "root/b/3.txt"]]
- Input: ["root/a 1.txt(abcd) 2.txt(efgh) 3.txt(efgh)"]
- Output: [["root/a/2.txt", "root/a/3.txt"]]
- Input: ["root/a 1.txt(abcd)", "root/b 2.txt(abcd)", "root/c 3.txt(abcd)"]
- Output: [["root/a/1.txt", "root/b/2.txt", "root/c/3.txt"]]
These examples show the grouping—let’s solve it!
Understanding the Problem: Finding Duplicate Files
To solve LeetCode 609: Find Duplicate File in System in Python, we need to parse each path string, extract file names and contents, group files by content, and return paths of duplicates. A naive approach might compare every file pair, but that’s O(n²)—way too slow. Instead, we’ll use:
- Best Solution (Hash Map with Content Grouping): O(n) time, O(n) space—fast and scalable.
- Alternative Solution (Sorting-Based Approach): O(n log n) time, O(n) space—visual but less efficient.
Let’s start with the hash map solution, breaking it down for beginners.
Best Solution: Hash Map with Content Grouping
Why This Is the Best Solution
The hash map approach is the top choice for LeetCode 609 because it’s efficient—O(n) time to process all files—and uses a dictionary to group files by content in one pass, making it both fast and memory-friendly. It’s perfect for quick lookups and scales well with large inputs. It’s like sorting files into content-labeled folders as you go!
How It Works
Think of this as organizing a file cabinet:
- Step 1: Initialize Hash Map:
  - Key: File content.
  - Value: List of full paths.
- Step 2: Parse Paths (see the parsing sketch after this list):
  - Split each path into directory and files.
  - For each file, extract name and content (e.g., "1.txt(abcd)" → "1.txt", "abcd").
  - Build full path: directory/filename.
- Step 3: Group by Content:
  - Add each full path to the map under its content key.
- Step 4: Filter Duplicates:
  - Return lists with 2+ paths.
It’s like a content-based filing system—quick and neat!
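To make Step 2 concrete, here's a minimal parsing sketch for a single path string (the variable names are illustrative, not part of the final solution):

path = "root/a 1.txt(abcd) 2.txt(efgh)"
parts = path.split()            # ["root/a", "1.txt(abcd)", "2.txt(efgh)"]
directory = parts[0]            # "root/a"
for file_info in parts[1:]:
    name_end = file_info.index("(")        # position of the opening paren
    filename = file_info[:name_end]        # e.g., "1.txt"
    content = file_info[name_end + 1:-1]   # e.g., "abcd"
    print(f"{directory}/{filename} -> {content}")

Running this prints root/a/1.txt -> abcd and root/a/2.txt -> efgh, exactly the pairs we'll feed into the map.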
Step-by-Step Example
Example: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Init: content_map = {}.
- Path 1: "root/a 1.txt(abcd) 2.txt(efgh)":
  - Dir: "root/a".
  - File 1: "1.txt(abcd)":
    - Path: "root/a/1.txt".
    - Content: "abcd".
    - content_map["abcd"] = ["root/a/1.txt"].
  - File 2: "2.txt(efgh)":
    - Path: "root/a/2.txt".
    - Content: "efgh".
    - content_map["efgh"] = ["root/a/2.txt"].
- Path 2: "root/b 3.txt(abcd)":
  - Dir: "root/b".
  - File: "3.txt(abcd)":
    - Path: "root/b/3.txt".
    - Content: "abcd".
    - content_map["abcd"] = ["root/a/1.txt", "root/b/3.txt"].
- Filter:
  - "abcd": 2 paths, include.
  - "efgh": 1 path, skip.
- Output: [["root/a/1.txt", "root/b/3.txt"]].
Code with Detailed Line-by-Line Explanation
Here’s the Python code, explained clearly:
from typing import List
from collections import defaultdict

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Step 1: Initialize hash map from content to full paths
        content_map = defaultdict(list)
        # Step 2: Process each path
        for path in paths:
            # Split into directory and files
            parts = path.split()
            directory = parts[0]
            # Process each file
            for file_info in parts[1:]:
                # Extract name and content
                name_end = file_info.index("(")
                filename = file_info[:name_end]
                content = file_info[name_end + 1:-1]  # Remove parentheses
                # Build full path and add to map
                full_path = f"{directory}/{filename}"
                content_map[content].append(full_path)
        # Step 3: Filter groups with duplicates
        result = [group for group in content_map.values() if len(group) > 1]
        # Step 4: Return result
        return result
- Imports: defaultdict auto-creates an empty list the first time a content key is seen; List comes from typing for the type hints.
- content_map: Maps each content string to the full paths that contain it.
- Parsing loop:
  - Split each path into its directory and file tokens.
  - For each token, split at ( to separate the filename from the content.
  - Build the full path and append it under the content key.
- Filtering: Keep only the groups with 2+ paths.
- Time Complexity: O(n)—n is total characters across paths.
- Space Complexity: O(n)—store all paths.
This is like a file sorter—fast and tidy!
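To see it in action, a quick sanity check might look like this (assuming the Solution class above is defined in the same file):

solution = Solution()
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
print(solution.findDuplicate(paths))
# Prints [['root/a/1.txt', 'root/b/3.txt']]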
Alternative Solution: Sorting-Based Approach
Why an Alternative Approach?
The sorting-based approach collects all files with their contents, sorts by content, and groups duplicates—O(n log n) time, O(n) space. It’s more visual, letting you see duplicates line up, but it’s slower due to sorting. It’s like alphabetizing files to spot matches!
How It Works
Picture this as a file lineup:
- Step 1: Parse into list of [content, path].
- Step 2: Sort by content.
- Step 3: Group consecutive duplicates.
- Step 4: Filter groups with 2+ paths.
It’s like sorting a file stack to find twins!
Step-by-Step Example
Example: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/b 3.txt(abcd)"]
- Parse:
  - ["abcd", "root/a/1.txt"].
  - ["efgh", "root/a/2.txt"].
  - ["abcd", "root/b/3.txt"].
- Sort:
  - ["abcd", "root/a/1.txt"].
  - ["abcd", "root/b/3.txt"].
  - ["efgh", "root/a/2.txt"].
- Group:
  - "abcd": ["root/a/1.txt", "root/b/3.txt"].
  - "efgh": ["root/a/2.txt"].
- Filter: [["root/a/1.txt", "root/b/3.txt"]].
Code for Sorting Approach
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Step 1: Parse every file into a [content, full_path] pair
        file_list = []
        for path in paths:
            parts = path.split()
            directory = parts[0]
            for file_info in parts[1:]:
                name_end = file_info.index("(")
                filename = file_info[:name_end]
                content = file_info[name_end + 1:-1]
                full_path = f"{directory}/{filename}"
                file_list.append([content, full_path])
        # Step 2: Sort by content (lists compare element-wise)
        file_list.sort()
        # Step 3: Group consecutive entries sharing the same content
        result = []
        i = 0
        while i < len(file_list):
            current_content = file_list[i][0]
            group = []
            while i < len(file_list) and file_list[i][0] == current_content:
                group.append(file_list[i][1])
                i += 1
            if len(group) > 1:
                result.append(group)
        return result
- Parsing: Builds [content, full_path] pairs, exactly as in the hash map version.
- Sorting: Lists compare element-wise, so file_list.sort() orders the pairs by content.
- Grouping: The inner while loop collects consecutive pairs with equal content into one group.
- Time Complexity: O(n log n)—sorting dominates.
- Space Complexity: O(n)—store all files.
It’s a sorted file scan—clear but slower!
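If you'd rather lean on the standard library for the grouping step, itertools.groupby can replace the manual index loop. Here's a sketch of Step 3 rewritten that way, assuming file_list has already been parsed and sorted as above:

from itertools import groupby
from operator import itemgetter

# groupby only merges consecutive equal keys, so the earlier
# file_list.sort() is what makes this grouping correct.
result = []
for _, pairs in groupby(file_list, key=itemgetter(0)):
    group = [path for _, path in pairs]
    if len(group) > 1:
        result.append(group)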
Comparing the Two Solutions
- Hash Map (Best):
  - Pros: O(n) time, O(n) space, fast and scalable.
  - Cons: Less visual.
- Sorting (Alternative):
  - Pros: Intuitive, easy-to-trace grouping.
  - Cons: O(n log n) time; the sort dominates.
Hash map wins for efficiency.
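If you want to check that claim on your own machine, a rough timing harness might look like this (a sketch only; HashMapSolution and SortingSolution are hypothetical names, assuming you've renamed the two classes above so they can coexist in one module):

import timeit

# Hypothetical setup: the two Solution classes above renamed to
# HashMapSolution and SortingSolution so both can live in one module.
paths = [f"root/d{i} a.txt(abcd) b.txt(x{i})" for i in range(2000)]

for cls in (HashMapSolution, SortingSolution):
    elapsed = timeit.timeit(lambda: cls().findDuplicate(paths), number=10)
    print(f"{cls.__name__}: {elapsed:.3f}s")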
Additional Examples and Edge Cases
- Input: ["root/a 1.txt(x)"]
- Output: [] (no duplicates).
- Input: ["root/a 1.txt(x) 2.txt(x)"]
- Output: [["root/a/1.txt", "root/a/2.txt"]].
Both handle these well.
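You can confirm both quickly with the hash map Solution class from earlier:

solution = Solution()
assert solution.findDuplicate(["root/a 1.txt(x)"]) == []
assert solution.findDuplicate(["root/a 1.txt(x) 2.txt(x)"]) == [["root/a/1.txt", "root/a/2.txt"]]
print("edge cases pass")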
Complexity Breakdown
- Hash Map: Time O(n), Space O(n).
- Sorting: Time O(n log n), Space O(n).
Hash map is optimal.
Key Takeaways
- Hash Map: Quick grouping—smart!
- Sorting: Visual duplicates—clear!
- Strings: Parsing is fun.
- Python Tip: Dicts rock—see Python Basics.
Final Thoughts: Uncover Those Duplicates
LeetCode 609: Find Duplicate File in System in Python is a cool file-sleuthing challenge. The hash map offers speed and elegance, while sorting gives a visual twist. Want more? Try LeetCode 49: Group Anagrams or LeetCode 205: Isomorphic Strings. Ready to detect? Head to Solve LeetCode 609 on LeetCode and find those duplicates today!