LeetCode 393: UTF-8 Validation Solution in Python – A Step-by-Step Guide
Imagine you’re given a list of integers—like [197, 130, 1]—representing bytes, and you need to determine if they form a valid UTF-8 encoded sequence, following strict rules about byte patterns. That’s the challenge of LeetCode 393: UTF-8 Validation, a medium-level problem that’s all about bit-level validation and encoding rules. Using Python, we’ll explore two solutions: the Best Solution—bit manipulation for O(n) efficiency—and an Alternative Solution—state machine at O(n). With examples, clear code breakdowns, and a friendly vibe, this guide will help you validate that UTF-8 sequence. Let’s start decoding!
What Is LeetCode 393: UTF-8 Validation?
LeetCode 393: UTF-8 Validation provides an array of integers, each representing a byte (0-255), and your task is to check if they form a valid UTF-8 encoded sequence per UTF-8 rules: single-byte characters start with 0, multi-byte characters follow specific patterns (e.g., 110xxxxx for 2 bytes). It’s a problem that tests your understanding of bit patterns and validation logic!
Problem Statement
- Input:
- data: List of integers (0 to 255), representing bytes.
- Output: Boolean - True if valid UTF-8 encoding, False otherwise.
- Rules:
- UTF-8 encoding:
- 1 byte: 0xxxxxxx.
- 2 bytes: 110xxxxx 10xxxxxx.
- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
- Each integer is an 8-bit byte.
- Sequence must follow UTF-8 pattern exactly.
Constraints
- 1 <= data.length <= 2 * 10⁴
- 0 <= data[i] <= 255
Examples
- Input: data = [197, 130, 1]
- Output: True
- 197 (11000101), 130 (10000010) → 2-byte char; 1 (00000001) → 1-byte char.
- Input: data = [235, 140, 4]
- Output: False
- 235 (11101011) → 3-byte start, 140 (10001100) OK, 4 (00000100) invalid (needs 10xxxxxx).
- Input: data = [255]
- Output: False
- 255 (11111111) → Invalid UTF-8 start.
These show the UTF-8 puzzle—let’s solve it!
Understanding the Problem: Validating UTF-8
To tackle LeetCode 393 in Python, we need to: 1. Validate each byte’s role in a UTF-8 sequence (start or continuation). 2. Ensure multi-byte sequences have correct start and continuation bytes. 3. Check the entire sequence efficiently.
A naive approach might use string conversion, but bit manipulation is better. We’ll use:
- Best Solution: Bit manipulation—O(n) time, O(1) space—fast and optimal.
- Alternative Solution: State machine—O(n) time, O(1) space—intuitive but more verbose.
Let’s validate with the best solution!
Best Solution: Bit Manipulation
Why This Is the Best Solution
The bit manipulation approach runs in O(n) time (n is array length) with O(1) space by examining each byte’s leading bits directly, using bitwise operations to check UTF-8 patterns. It’s the most efficient—concise, fast, and avoids extra state tracking, making it ideal for this validation task!
How It Works
Think of it like a byte inspector:
- Patterns:
- 1-byte: 0xxxxxxx (0-127).
- 2-byte: 110xxxxx (192-223), 10xxxxxx (128-191).
- 3-byte: 1110xxxx (224-239), 10xxxxxx.
- 4-byte: 11110xxx (240-247), 10xxxxxx.
- Process:
- Check first byte for valid start (0, 110, 1110, 11110).
- Count continuation bytes (10xxxxxx) based on start.
- Validate sequence length and pattern.
- Result: True if all sequences valid, False if any fail.
It’s like decoding byte headers with bits!
Step-by-Step Example
For data = [197, 130, 1]
:
- Byte 197 (11000101):
- 110 → 2-byte start (192-223), needs 1 continuation.
- j = 1, check 130.
- Byte 130 (10000010):
- 10 → Continuation (128-191), valid, j = 2.
- Byte 1 (00000001):
- 0 → 1-byte (0-127), valid, j = 3.
- End: j = len(data), return True.
For data = [235, 140, 4]
:
- 235 (11101011):
- 1110 → 3-byte start (224-239), needs 2 continuations.
- j = 1, check 140.
- 140 (10001100):
- 10 → Continuation, j = 2, check 4.
- 4 (00000100):
- 0 → Not 10xxxxxx, invalid, return False.
For data = [255]
:
- 255 (11111111):
- 11111 → Invalid start, return False.
Code with Explanation
Here’s the Python code using bit manipulation:
class Solution:
def validUtf8(self, data: List[int]) -> bool:
# Index to traverse data
j = 0
while j < len(data):
# Get first byte
byte = data[j]
# Count continuation bytes needed
if byte < 128: # 0xxxxxxx
num_bytes = 0
elif 192 <= byte < 224: # 110xxxxx
num_bytes = 1
elif 224 <= byte < 240: # 1110xxxx
num_bytes = 2
elif 240 <= byte < 248: # 11110xxx
num_bytes = 3
else: # Invalid start
return False
# Check if enough bytes remain
if j + num_bytes >= len(data):
return False
# Validate continuation bytes (10xxxxxx)
for k in range(j + 1, j + num_bytes + 1):
if not (128 <= data[k] < 192):
return False
# Move to next sequence
j += num_bytes + 1
return True
Explanation
- Line 4: j: Pointer for current byte—O(1).
- Lines 6-19: Check start byte:
- < 128: 1-byte—O(1).
- 192-223: 2-byte—O(1).
- 224-239: 3-byte—O(1).
- 240-247: 4-byte—O(1).
- Else: Invalid—O(1).
- Lines 21-22: Ensure enough bytes—O(1).
- Lines 24-26: Check continuations (128-191)—O(1) per byte.
- Line 29: Move to next sequence—O(1).
- Line 31: Return True if all valid—O(1).
- Time: O(n)—single pass, O(1) per byte.
- Space: O(1)—few variables.
This is like a bit checker—swift and sharp!
Alternative Solution: State Machine
Why an Alternative Approach?
The state machine solution also runs in O(n) time with O(1) space by tracking the expected number of continuation bytes as a state, updating it with each byte. It’s more verbose—great for understanding a step-by-step validation process, though less concise than bit manipulation!
How It Works
- State: remaining = number of continuation bytes expected.
- Process:
- If remaining > 0, expect 10xxxxxx, decrement.
- If remaining = 0, check start byte, set new remaining.
- Result: True if remaining = 0 at end, False if invalid.
Step-by-Step Example
For data = [197, 130, 1]
:
- Init: remaining = 0.
- 197: 11000101, remaining = 0 → 1, valid start.
- 130: 10000010, remaining = 1, valid continuation, remaining = 0.
- 1: 00000001, remaining = 0, valid 1-byte, remaining = 0.
- End: remaining = 0, return True.
For data = [235, 140, 4]
:
- 235: 11101011, remaining = 2.
- 140: 10001100, remaining = 1, valid.
- 4: 00000100, remaining = 1, invalid (not 10xxxxxx), return False.
Code with Explanation
class Solution:
def validUtf8(self, data: List[int]) -> bool:
# State: number of continuation bytes remaining
remaining = 0
# Process each byte
for byte in data:
if remaining == 0:
# Check start byte
if byte < 128: # 1-byte
continue
elif 192 <= byte < 224: # 2-byte
remaining = 1
elif 224 <= byte < 240: # 3-byte
remaining = 2
elif 240 <= byte < 248: # 4-byte
remaining = 3
else:
return False
else:
# Check continuation byte
if not (128 <= byte < 192):
return False
remaining -= 1
# Valid if no continuation bytes remain
return remaining == 0
Explanation
- Line 4: remaining: State for continuation bytes—O(1).
- Lines 6-22: Process bytes:
- If remaining = 0, check start—O(1).
- If remaining > 0, check continuation, decrement—O(1).
- Line 24: True if remaining = 0—O(1).
- Time: O(n)—single pass.
- Space: O(1)—one variable.
It’s a state tracker—clear but wordier!
Comparing the Two Solutions
- Bit Manipulation (Best):
- Time: O(n)—linear.
- Space: O(1)—minimal.
- Pros: Fast, concise, elegant.
- Cons: Bit logic less intuitive.
- State Machine (Alternative):
- Time: O(n)—linear.
- Space: O(1)—minimal.
- Pros: Step-by-step clarity.
- Cons: More verbose.
Bit manipulation wins for conciseness and speed!
Additional Examples and Edge Cases
- [1]: Both return True.
- [145]: Both return False (10xxxxxx alone).
- Large n: Both scale linearly.
Bit manipulation is optimal, state machine works too.
Complexity Breakdown
- Bit Manipulation:
- Time: O(n).
- Space: O(1).
- State Machine:
- Time: O(n).
- Space: O(1).
Bit manipulation excels for simplicity.
Key Takeaways
- Bit Manipulation: Decode with bits—fast and sleek!
- State Machine: Track with states—clear but longer!
- UTF-8: Patterns rule validation.
- Python Tip: Bits rock—see [Python Basics](/python/basics).
Final Thoughts: Validate That UTF-8
LeetCode 393: UTF-8 Validation in Python is a bit-level challenge. Bit manipulation is the efficiency champ, while state machine offers a structured alternative. Want more? Try LeetCode 136: Single Number or LeetCode 191: Number of 1 Bits. Ready to decode? Visit Solve LeetCode 393 on LeetCode and check that UTF-8 today!