LeetCode 393: UTF-8 Validation Solution in Python – A Step-by-Step Guide

Imagine you’re given a list of integers—like [197, 130, 1]—representing bytes, and you need to determine if they form a valid UTF-8 encoded sequence, following strict rules about byte patterns. That’s the challenge of LeetCode 393: UTF-8 Validation, a medium-level problem that’s all about bit-level validation and encoding rules. Using Python, we’ll explore two solutions: the Best Solution—bit manipulation for O(n) efficiency—and an Alternative Solution—state machine at O(n). With examples, clear code breakdowns, and a friendly vibe, this guide will help you validate that UTF-8 sequence. Let’s start decoding!

What Is LeetCode 393: UTF-8 Validation?

Section link icon

LeetCode 393: UTF-8 Validation provides an array of integers, each representing a byte (0-255), and your task is to check if they form a valid UTF-8 encoded sequence per UTF-8 rules: single-byte characters start with 0, multi-byte characters follow specific patterns (e.g., 110xxxxx for 2 bytes). It’s a problem that tests your understanding of bit patterns and validation logic!

Problem Statement

  • Input:
    • data: List of integers (0 to 255), representing bytes.
  • Output: Boolean - True if valid UTF-8 encoding, False otherwise.
  • Rules:
    • UTF-8 encoding:
      • 1 byte: 0xxxxxxx.
      • 2 bytes: 110xxxxx 10xxxxxx.
      • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
      • 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    • Each integer is an 8-bit byte.
    • Sequence must follow UTF-8 pattern exactly.

Constraints

  • 1 <= data.length <= 2 * 10⁴
  • 0 <= data[i] <= 255

Examples

  • Input: data = [197, 130, 1]
    • Output: True
      • 197 (11000101), 130 (10000010) → 2-byte char; 1 (00000001) → 1-byte char.
  • Input: data = [235, 140, 4]
    • Output: False
      • 235 (11101011) → 3-byte start, 140 (10001100) OK, 4 (00000100) invalid (needs 10xxxxxx).
  • Input: data = [255]
    • Output: False
      • 255 (11111111) → Invalid UTF-8 start.

These show the UTF-8 puzzle—let’s solve it!

Understanding the Problem: Validating UTF-8

Section link icon

To tackle LeetCode 393 in Python, we need to: 1. Validate each byte’s role in a UTF-8 sequence (start or continuation). 2. Ensure multi-byte sequences have correct start and continuation bytes. 3. Check the entire sequence efficiently.

A naive approach might use string conversion, but bit manipulation is better. We’ll use:

  • Best Solution: Bit manipulation—O(n) time, O(1) space—fast and optimal.
  • Alternative Solution: State machine—O(n) time, O(1) space—intuitive but more verbose.

Let’s validate with the best solution!

Best Solution: Bit Manipulation

Section link icon

Why This Is the Best Solution

The bit manipulation approach runs in O(n) time (n is array length) with O(1) space by examining each byte’s leading bits directly, using bitwise operations to check UTF-8 patterns. It’s the most efficient—concise, fast, and avoids extra state tracking, making it ideal for this validation task!

How It Works

Think of it like a byte inspector:

  • Patterns:
    • 1-byte: 0xxxxxxx (0-127).
    • 2-byte: 110xxxxx (192-223), 10xxxxxx (128-191).
    • 3-byte: 1110xxxx (224-239), 10xxxxxx.
    • 4-byte: 11110xxx (240-247), 10xxxxxx.
  • Process:
    • Check first byte for valid start (0, 110, 1110, 11110).
    • Count continuation bytes (10xxxxxx) based on start.
    • Validate sequence length and pattern.
  • Result: True if all sequences valid, False if any fail.

It’s like decoding byte headers with bits!

Step-by-Step Example

For data = [197, 130, 1]:

  • Byte 197 (11000101):
    • 110 → 2-byte start (192-223), needs 1 continuation.
    • j = 1, check 130.
  • Byte 130 (10000010):
    • 10 → Continuation (128-191), valid, j = 2.
  • Byte 1 (00000001):
    • 0 → 1-byte (0-127), valid, j = 3.
  • End: j = len(data), return True.

For data = [235, 140, 4]:

  • 235 (11101011):
    • 1110 → 3-byte start (224-239), needs 2 continuations.
    • j = 1, check 140.
  • 140 (10001100):
    • 10 → Continuation, j = 2, check 4.
  • 4 (00000100):
    • 0 → Not 10xxxxxx, invalid, return False.

For data = [255]:

  • 255 (11111111):
    • 11111 → Invalid start, return False.

Code with Explanation

Here’s the Python code using bit manipulation:

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        # Index to traverse data
        j = 0

        while j < len(data):
            # Get first byte
            byte = data[j]

            # Count continuation bytes needed
            if byte < 128:  # 0xxxxxxx
                num_bytes = 0
            elif 192 <= byte < 224:  # 110xxxxx
                num_bytes = 1
            elif 224 <= byte < 240:  # 1110xxxx
                num_bytes = 2
            elif 240 <= byte < 248:  # 11110xxx
                num_bytes = 3
            else:  # Invalid start
                return False

            # Check if enough bytes remain
            if j + num_bytes >= len(data):
                return False

            # Validate continuation bytes (10xxxxxx)
            for k in range(j + 1, j + num_bytes + 1):
                if not (128 <= data[k] < 192):
                    return False

            # Move to next sequence
            j += num_bytes + 1

        return True

Explanation

  • Line 4: j: Pointer for current byte—O(1).
  • Lines 6-19: Check start byte:
    • < 128: 1-byte—O(1).
    • 192-223: 2-byte—O(1).
    • 224-239: 3-byte—O(1).
    • 240-247: 4-byte—O(1).
    • Else: Invalid—O(1).
  • Lines 21-22: Ensure enough bytes—O(1).
  • Lines 24-26: Check continuations (128-191)—O(1) per byte.
  • Line 29: Move to next sequence—O(1).
  • Line 31: Return True if all valid—O(1).
  • Time: O(n)—single pass, O(1) per byte.
  • Space: O(1)—few variables.

This is like a bit checker—swift and sharp!

Alternative Solution: State Machine

Section link icon

Why an Alternative Approach?

The state machine solution also runs in O(n) time with O(1) space by tracking the expected number of continuation bytes as a state, updating it with each byte. It’s more verbose—great for understanding a step-by-step validation process, though less concise than bit manipulation!

How It Works

  • State: remaining = number of continuation bytes expected.
  • Process:
    • If remaining > 0, expect 10xxxxxx, decrement.
    • If remaining = 0, check start byte, set new remaining.
  • Result: True if remaining = 0 at end, False if invalid.

Step-by-Step Example

For data = [197, 130, 1]:

  • Init: remaining = 0.
  • 197: 11000101, remaining = 0 → 1, valid start.
  • 130: 10000010, remaining = 1, valid continuation, remaining = 0.
  • 1: 00000001, remaining = 0, valid 1-byte, remaining = 0.
  • End: remaining = 0, return True.

For data = [235, 140, 4]:

  • 235: 11101011, remaining = 2.
  • 140: 10001100, remaining = 1, valid.
  • 4: 00000100, remaining = 1, invalid (not 10xxxxxx), return False.

Code with Explanation

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        # State: number of continuation bytes remaining
        remaining = 0

        # Process each byte
        for byte in data:
            if remaining == 0:
                # Check start byte
                if byte < 128:  # 1-byte
                    continue
                elif 192 <= byte < 224:  # 2-byte
                    remaining = 1
                elif 224 <= byte < 240:  # 3-byte
                    remaining = 2
                elif 240 <= byte < 248:  # 4-byte
                    remaining = 3
                else:
                    return False
            else:
                # Check continuation byte
                if not (128 <= byte < 192):
                    return False
                remaining -= 1

        # Valid if no continuation bytes remain
        return remaining == 0

Explanation

  • Line 4: remaining: State for continuation bytes—O(1).
  • Lines 6-22: Process bytes:
    • If remaining = 0, check start—O(1).
    • If remaining > 0, check continuation, decrement—O(1).
  • Line 24: True if remaining = 0—O(1).
  • Time: O(n)—single pass.
  • Space: O(1)—one variable.

It’s a state tracker—clear but wordier!

Comparing the Two Solutions

Section link icon
  • Bit Manipulation (Best):
    • Time: O(n)—linear.
    • Space: O(1)—minimal.
    • Pros: Fast, concise, elegant.
    • Cons: Bit logic less intuitive.
  • State Machine (Alternative):
    • Time: O(n)—linear.
    • Space: O(1)—minimal.
    • Pros: Step-by-step clarity.
    • Cons: More verbose.

Bit manipulation wins for conciseness and speed!

Additional Examples and Edge Cases

Section link icon
  • [1]: Both return True.
  • [145]: Both return False (10xxxxxx alone).
  • Large n: Both scale linearly.

Bit manipulation is optimal, state machine works too.

Complexity Breakdown

Section link icon
  • Bit Manipulation:
    • Time: O(n).
    • Space: O(1).
  • State Machine:
    • Time: O(n).
    • Space: O(1).

Bit manipulation excels for simplicity.

Key Takeaways

Section link icon
  • Bit Manipulation: Decode with bits—fast and sleek!
  • State Machine: Track with states—clear but longer!
  • UTF-8: Patterns rule validation.
  • Python Tip: Bits rock—see [Python Basics](/python/basics).

Final Thoughts: Validate That UTF-8

Section link icon

LeetCode 393: UTF-8 Validation in Python is a bit-level challenge. Bit manipulation is the efficiency champ, while state machine offers a structured alternative. Want more? Try LeetCode 136: Single Number or LeetCode 191: Number of 1 Bits. Ready to decode? Visit Solve LeetCode 393 on LeetCode and check that UTF-8 today!