LeetCode 187: Repeated DNA Sequences Solution in Python Explained

Finding repeated DNA sequences in a string might feel like uncovering hidden patterns in a genetic code, and LeetCode 187: Repeated DNA Sequences is a medium-level challenge that makes it intriguing! Given a string s representing a DNA sequence (using characters A, C, G, T), you need to return all 10-letter-long subsequences that appear more than once, in a list of strings, ordered arbitrarily. In this blog, we’ll solve it with Python, exploring two solutions—Hash Map with Sliding Window (our best solution) and Brute Force Substring Comparison (a practical alternative). With step-by-step examples, detailed code breakdowns, and tips, you’ll master this problem. Let’s decode those repeats!

Problem Statement

Section link icon

In LeetCode 187, you’re given a string s representing a DNA sequence, composed of characters A, C, G, and T. Your task is to find all 10-letter-long subsequences (contiguous) that occur more than once in s, returning them as a list of strings, ordered arbitrarily. This differs from in-place string manipulation like LeetCode 186: Reverse Words in a String II, focusing on substring frequency detection.

Constraints

  • 1 <= s.length <= 10^5
  • s[i] is one of A, C, G, or T.
  • Subsequences are 10 letters long.

Example

Let’s see some cases:

Input: s = "AAAAACCCCCAAAAACCCCCC"
Output: ["AAAAACCCCC"]
Explanation: "AAAAACCCCC" appears at indices 0 and 10.
Input: s = "AAAAAAAAAA"
Output: []
Explanation: No 10-letter sequence possible (length < 10).
Input: s = "AAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation: "AAAAAAAAAA" appears at indices 0 and 1.

These examples show we’re finding 10-letter repeats.

Understanding the Problem

Section link icon

How do you solve LeetCode 187: Repeated DNA Sequences in Python? Take s = "AAAAACCCCCAAAAACCCCCC". We need to identify all 10-letter substrings (e.g., "AAAAACCCCC") that appear more than once. For a 22-character string, there are 13 possible 10-letter substrings (0-9, 1-10, ..., 12-21), and we must find those with duplicates efficiently. Brute-forcing all pairs is O(n²), so we need a faster approach, likely using a hash map to track counts. This isn’t an SQL task like LeetCode 185: Department Top Three Salaries; it’s about string pattern detection. We’ll use: 1. Hash Map with Sliding Window: O(n) time, O(n) space—our best solution. 2. Brute Force Substring Comparison: O(n²) time, O(n) space—an alternative.

Let’s dive into the best solution.

Best Solution: Hash Map with Sliding Window Approach

Section link icon

Explanation

Hash Map with Sliding Window finds repeated 10-letter sequences by:

  • Using a sliding window of size 10 to extract substrings.
  • Storing each substring in a hash map with its count.
  • Collecting substrings with count > 1 in a result list.

This achieves O(n) time (single pass with fixed-size window), O(n) space (hash map stores up to n-9 substrings), and efficiency by avoiding pairwise comparisons.

For "AAAAACCCCCAAAAACCCCCC":

  • Window slides: "AAAAACCCCC" (count=1), "AAAACCCCCA" (1), ..., "AAAAACCCCC" (2).
  • Result: ["AAAAACCCCC"] (count=2).

Step-by-Step Example

Example 1: s = "AAAAACCCCCAAAAACCCCCC"

Goal: Return ["AAAAACCCCC"].

  • Step 1: Initialize hash map and result.
    • seen = {{, result = [].
  • Step 2: Slide window (size 10).
    • 0-9: "AAAAACCCCC" → seen["AAAAACCCCC"] = 1.
    • 1-10: "AAAACCCCCA" → seen["AAAACCCCCA"] = 1.
    • 2-11: "AAACCCCCAA" → seen["AAACCCCCAA"] = 1.
    • 3-12: "AACCCCCAAA" → seen["AACCCCCAAA"] = 1.
    • 4-13: "ACCCCCAAAA" → seen["ACCCCCAAAA"] = 1.
    • 5-14: "CCCCCAAAAA" → seen["CCCCCAAAAA"] = 1.
    • 6-15: "CCCCAAAAAC" → seen["CCCCAAAAAC"] = 1.
    • 7-16: "CCCAAAAACC" → seen["CCCAAAAACC"] = 1.
    • 8-17: "CCAAAAACCC" → seen["CCAAAAACCC"] = 1.
    • 9-18: "CAAAAACCCC" → seen["CAAAAACCCC"] = 1.
    • 10-19: "AAAAACCCCC" → seen["AAAAACCCCC"] = 2, add to result.
    • 11-20: "AAAACCCCCC" → seen["AAAACCCCCC"] = 1.
    • 12-21: "AAACCCCCCC" → seen["AAACCCCCCC"] = 1.
  • Step 3: Collect repeats.
    • result = ["AAAAACCCCC"] (count > 1).
  • Finish: Return ["AAAAACCCCC"].

Example 2: s = "AAAAAAAAAA"

Goal: Return [].

  • Step 1: Length = 10 < 11, no 10-letter repeats possible.
  • Step 2: No sliding (length < 10+1).
  • Finish: Return [].

How the Code Works (Hash Map with Sliding Window) – Detailed Line-by-Line Explanation

Here’s the Python code with a thorough breakdown:

def findRepeatedDnaSequences(s: str) -> list[str]:
    # Line 1: Check length
    if len(s) < 11:
        # Too short for 10-letter repeats (e.g., "AAAAAAAAAA")
        return []

    # Line 2: Initialize hash map and result
    seen = {{
    # Map for substring counts (e.g., {{)
    result = []
    # List for repeats (e.g., [])

    # Line 3: Slide window of size 10
    for i in range(len(s) - 9):
        # Iterate windows (e.g., 0 to 12 for 22 chars)
        seq = s[i:i+10]
        # Extract 10-letter substring (e.g., "AAAAACCCCC")

        # Line 4: Update hash map
        seen[seq] = seen.get(seq, 0) + 1
        # Increment count (e.g., "AAAAACCCCC" → 1, then 2)

        # Line 5: Add to result on second occurrence
        if seen[seq] == 2:
            # First repeat (e.g., "AAAAACCCCC" count=2)
            result.append(seq)
            # Add to result (e.g., ["AAAAACCCCC"])

    # Line 6: Return repeated sequences
    return result
    # e.g., ["AAAAACCCCC"]

This detailed breakdown clarifies how the hash map with sliding window efficiently finds repeated DNA sequences.

Alternative: Brute Force Substring Comparison Approach

Section link icon

Explanation

Brute Force Substring Comparison checks all pairs of 10-letter substrings:

  • Generate all 10-letter substrings.
  • Compare each pair for equality, tracking repeats.
  • Collect distinct repeats in a set, convert to list.

It’s a practical alternative, O(n²) time (pairwise comparisons), O(n) space (substring storage), but inefficient for large n, included for contrast.

For "AAAAACCCCCAAAAACCCCCC":

  • Compare all pairs: "AAAAACCCCC" matches at 0 and 10.
  • Result: ["AAAAACCCCC"].

Step-by-Step Example (Alternative)

For the same example:

  • Step 1: Generate substrings (0-9, 1-10, ..., 12-21).
    • 13 substrings: "AAAAACCCCC", "AAAACCCCCA", ..., "AAACCCCCCC".
  • Step 2: Compare pairs.
    • (0,10): "AAAAACCCCC" = "AAAAACCCCC" → repeat.
    • Others unique.
  • Step 3: Collect distinct repeats.
    • result = ["AAAAACCCCC"].
  • Finish: Return ["AAAAACCCCC"].

How the Code Works (Brute Force)

def findRepeatedDnaSequencesBrute(s: str) -> list[str]:
    if len(s) < 11:
        return []
    seen = set()
    result = set()
    for i in range(len(s) - 9):
        curr = s[i:i+10]
        if curr in seen:
            result.add(curr)
        seen.add(curr)
    return list(result)

Complexity

  • Hash Map with Sliding Window:
    • Time: O(n) – single pass with window.
    • Space: O(n) – hash map storage.
  • Brute Force Substring Comparison:
    • Time: O(n²) – pairwise comparisons.
    • Space: O(n) – substring storage.

Efficiency Notes

Hash Map with Sliding Window is the best solution with O(n) time and O(n) space, offering efficiency and scalability—Brute Force Substring Comparison uses O(n²) time, making it impractical for large n, though it still uses O(n) space, included for conceptual clarity.

Key Insights

  • Sliding Window: Fixed-size efficiency.
  • Hash Map: Tracks frequency.
  • 10-Letter: Problem-specific length.

Additional Example

Section link icon

For s = "AAAAAACAAAAAAC":

  • Goal: ["AAAAAACAAA"].
  • Window: 0-9 and 5-14 match → ["AAAAAACAAA"].

Edge Cases

Section link icon
  • Short String: < 10[].
  • No Repeats: Unique 10-letters → [].
  • All Same: "A"*20["AAAAAAAAAA"].

Both solutions handle these well.

Final Thoughts

Section link icon

LeetCode 187: Repeated DNA Sequences in Python is a fascinating pattern-matching challenge. The Hash Map with Sliding Window solution excels with its linear efficiency and clarity, while Brute Force Substring Comparison offers a basic approach. Want more? Try LeetCode 3: Longest Substring Without Repeating Characters for substring skills or LeetCode 94: Binary Tree Inorder Traversal for tree skills. Ready to practice? Solve LeetCode 187 on LeetCode with "AAAAACCCCCAAAAACCCCCC", aiming for ["AAAAACCCCC"]—test your skills now!