How to use Unicode in Python
Your guide to using Unicode in Python. Learn different methods, get tips, see real-world applications, and find out how to debug errors.

Unicode is crucial for Python applications that need to process global text. Python's native support for diverse languages and symbols makes it a powerful tool for international software and data analysis.
In this article, you'll explore essential techniques and tips for effective Unicode management. You will also see real-world applications and get practical advice to debug common encoding errors.
Creating and displaying Unicode strings
```python
text = "Hello, 世界!"  # Contains English and Chinese characters
print(text)
```

Output:

```
Hello, 世界!
```
In modern Python, all strings are Unicode by default. This means you can create a variable like text = "Hello, 世界!" and Python automatically understands it contains characters from different languages. There's no need for special syntax or libraries for basic string creation, which simplifies handling international text.
When you use the print() function, it correctly interprets the Unicode string and displays the characters. It communicates with your terminal's settings to ensure the output renders properly, making it straightforward to work with and display multilingual content directly.
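Because a Python string is a sequence of Unicode code points, built-ins like len() and indexing count characters, not bytes. A small sketch of the difference (byte counts assume UTF-8):

```python
text = "Hello, 世界!"

# len() counts code points, not bytes
print(len(text))                  # 10 characters
print(len(text.encode('utf-8')))  # 14 bytes: each CJK character takes 3 bytes in UTF-8

# Indexing and slicing also operate on whole characters
print(text[7])    # 世
print(text[7:9])  # 世界
```

This distinction matters whenever you measure or slice multilingual text: the character count and the stored byte count rarely agree.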
Essential Unicode operations
Beyond creating and displaying text, you'll need to manage Unicode more precisely, from using escape sequences to converting strings with encode() and decode().
Using Unicode escape sequences with \u and \U
```python
greek_pi = "\u03C0"  # Unicode for π (pi)
arabic_text = "\u0645\u0631\u062D\u0628\u0627"  # مرحبا (Hello in Arabic)
print(f"Greek pi symbol: {greek_pi}")
print(f"Arabic greeting: {arabic_text}")
```

Output:

```
Greek pi symbol: π
Arabic greeting: مرحبا
```
When you can't type a character directly, you can use its Unicode escape sequence. This method lets you embed any character into a string using its unique hexadecimal code point, which is great for maintaining script portability across different systems. Python offers two primary escape sequences:
- `\uXXXX`: Use this for characters represented by four hexadecimal digits. For example, `\u03C0` becomes the pi symbol (π).
- `\UXXXXXXXX`: This is for characters needing eight hexadecimal digits, often used for less common symbols or emojis.
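Characters beyond U+FFFF, such as most emoji, need the eight-digit form. A quick illustration:

```python
# Four hex digits cover characters in the Basic Multilingual Plane
pi = "\u03C0"         # π

# Eight hex digits cover characters beyond U+FFFF, such as emoji
snake = "\U0001F40D"  # 🐍 SNAKE

print(pi, snake)

# An escape sequence and the literal character produce identical strings
print("\u03C0" == "π", "\U0001F40D" == "🐍")
```

Both forms are just alternative spellings in source code; at runtime the resulting strings are indistinguishable from ones typed directly.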
Converting between characters and code points with ord() and chr()
```python
char = 'é'
code_point = ord(char)
print(f"The character '{char}' has code point: {code_point} (hex: {hex(code_point)})")
print(f"Converting back: {chr(code_point)}")
```

Output:

```
The character 'é' has code point: 233 (hex: 0xe9)
Converting back: é
```
Python provides built-in functions to convert between characters and their numerical Unicode values, called code points. This is useful when you need to programmatically work with a character's specific numeric identity.
- The `ord()` function takes a single character and returns its integer code point.
- The `chr()` function does the reverse, taking an integer and returning the corresponding character.
These two functions are essential for any task that involves manipulating text at a lower level, beyond just displaying it.
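As one sketch of that lower-level manipulation, you can build a run of characters directly from consecutive code points. This example generates the uppercase Greek alphabet (Alpha starts at U+0391; U+03A2 is unassigned, so it's skipped):

```python
# Build the 24 uppercase Greek letters from consecutive code points
start = ord('Α')  # GREEK CAPITAL LETTER ALPHA, U+0391
greek = ''.join(chr(cp) for cp in range(start, start + 25) if cp != 0x3A2)
print(greek)      # ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
```

The same `ord()`/`chr()` round trip underlies tasks like simple cipher shifts or generating test data across scripts.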
Encoding and decoding with encode() and decode()
```python
original = "こんにちは"  # Japanese "Hello"
encoded = original.encode('utf-8')
print(f"UTF-8 encoded bytes: {encoded}")
decoded = encoded.decode('utf-8')
print(f"Decoded back to string: {decoded}")
```

Output:

```
UTF-8 encoded bytes: b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'
Decoded back to string: こんにちは
```
Encoding converts a string into a sequence of bytes, which is necessary for storage or network transmission. Decoding reverses the process, turning bytes back into a readable string.
- The `encode()` method takes a string and returns a `bytes` object. You must specify an encoding format, like `'utf-8'`.
- The `decode()` method works on a `bytes` object and returns a string. You must use the same encoding to avoid garbled text or errors.
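Different encodings produce different byte sequences for the same string, which is why the two sides must agree. A short comparison (the byte counts below apply to this exact string):

```python
text = "héllo"

for encoding in ('utf-8', 'utf-16', 'latin-1'):
    data = text.encode(encoding)
    print(f"{encoding}: {len(data)} bytes -> {data}")

# A round trip only works when both sides agree on the encoding
assert text.encode('utf-8').decode('utf-8') == text
```

UTF-8 uses one byte for ASCII and two for 'é' (6 bytes total); UTF-16 spends two bytes per character plus a byte-order mark (12 bytes); latin-1 fits every character here in one byte (5 bytes) but can't represent most of Unicode.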
Advanced Unicode techniques
While encoding and decoding are fundamental, Python's unicodedata module unlocks advanced control for normalizing strings, inspecting properties, and looking up characters by name.
Normalizing Unicode with unicodedata.normalize()
```python
import unicodedata

composed = "café"          # Single 'é' character
decomposed = "cafe\u0301"  # 'e' followed by combining acute accent
print(f"Composed form: {composed}, length: {len(composed)}")
print(f"Decomposed form: {decomposed}, length: {len(decomposed)}")
print(f"After NFC normalization equal? {unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed)}")
```

Output:

```
Composed form: café, length: 4
Decomposed form: café, length: 5
After NFC normalization equal? True
```
Some Unicode characters can be represented in multiple ways. For example, "é" can be a single character or an "e" followed by a combining accent. While they look identical, they're different strings to Python, which can cause bugs in string comparisons. The unicodedata.normalize() function solves this by converting strings to a standard form so that different representations are treated as equal.
- NFC (Normalization Form C) combines characters into their pre-composed form. It's the most common choice for consistency.
- NFD (Normalization Form D) decomposes characters into base characters and combining marks.
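Two further forms, NFKC and NFKD, additionally fold "compatibility" characters, such as ligatures, into their plain equivalents. A quick sketch using the 'ﬁ' ligature (U+FB01):

```python
import unicodedata

ligature = "ﬁle"  # Starts with the single 'ﬁ' ligature character (U+FB01)
print(len(ligature))                            # 3 characters, not 4
print(unicodedata.normalize('NFC', ligature))   # NFC keeps the ligature: ﬁle
print(unicodedata.normalize('NFKC', ligature))  # NFKC expands it: file
```

NFKC is handy for search and matching, but it's lossy (the ligature distinction is gone), so avoid it when you need to preserve the original text.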
Inspecting character properties with unicodedata
```python
import unicodedata

text = "café"
for char in text:
    category = unicodedata.category(char)
    name = unicodedata.name(char, 'Unknown')
    print(f"{char!r}: {category} - {name}")
```

Output:

```
'c': Ll - LATIN SMALL LETTER C
'a': Ll - LATIN SMALL LETTER A
'f': Ll - LATIN SMALL LETTER F
'é': Ll - LATIN SMALL LETTER E WITH ACUTE
```
The unicodedata module lets you look under the hood to see the properties of any character. This is incredibly useful for text processing tasks that require more than just displaying strings. You can programmatically identify and handle characters based on their specific attributes.
- The `unicodedata.category()` function classifies a character, telling you if it's a lowercase letter (Ll), a number, or punctuation.
- With `unicodedata.name()`, you can retrieve the character's official Unicode name, which is helpful for debugging or documentation.
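Categories make it easy to filter out a whole class of character. As a sketch, here is a hypothetical `strip_punctuation()` helper that drops anything in a punctuation category (all of which start with 'P'):

```python
import unicodedata

def strip_punctuation(text):
    """Remove characters whose Unicode category starts with 'P' (punctuation)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

print(strip_punctuation("Hello, world! «Bonjour»"))  # Hello world Bonjour
```

Because this works on categories rather than a hardcoded list, it handles punctuation from any script, including guillemets and other non-ASCII marks.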
Using Unicode character name lookup with \N{}
```python
lambda_char = "\N{GREEK SMALL LETTER LAMBDA}"
snowman = "\N{SNOWMAN}"
check = "\N{CHECK MARK}"
print(f"Lambda: {lambda_char}, Snowman: {snowman}, Check: {check}")
print(f"Character lookup by code point: {chr(0x1F4BB)} (Laptop)")
```

Output:

```
Lambda: λ, Snowman: ☃, Check: ✓
Character lookup by code point: 💻 (Laptop)
```
For better readability, Python lets you insert Unicode characters by their official names using the \N{...} escape sequence. This is often clearer than memorizing hexadecimal codes, especially for symbols or characters you don't use frequently. It makes your code more self-documenting.
- Simply place the character's official Unicode name inside the curly braces.
- For example, `\N{SNOWMAN}` is much more intuitive than looking up its code point.
This method complements using chr(), giving you a more descriptive way to handle specific characters.
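The `unicodedata.lookup()` function is the runtime counterpart of `\N{...}`: it resolves a character from a name that isn't known until the program runs.

```python
import unicodedata

# Resolve a character from its official name at runtime
snowman = unicodedata.lookup('SNOWMAN')
print(snowman)                   # ☃
print(snowman == "\N{SNOWMAN}")  # True

# name() and lookup() are inverses of each other
print(unicodedata.name(snowman))  # SNOWMAN
```

Use `\N{...}` when the name is fixed in your source code, and `lookup()` when the name comes from data, such as user input or a configuration file.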
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the Unicode techniques we've explored, Replit Agent can turn them into production-ready tools:
- Build a text normalization utility that cleans user-submitted text by converting strings to a consistent form with `unicodedata.normalize()`.
- Create a Unicode character inspector that reveals the name, category, and code point of any character using functions like `ord()` and `unicodedata.name()`.
- Deploy a file encoding converter that translates text files between formats like UTF-8 and UTF-16, preventing errors from mismatched `encode()` and `decode()` operations.
Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser. Try Replit Agent to bring your next project to life.
Common errors and challenges
Navigating Unicode can introduce tricky errors, but understanding their causes makes them much easier to solve when they appear.
- Handling `UnicodeEncodeError` when printing to terminals: This error usually happens when your code tries to print a character that the terminal or console doesn't support. For instance, a terminal configured for Latin characters might fail when trying to display an emoji. The fix often involves configuring the terminal to use UTF-8 or handling the error within your Python script to replace or omit unsupported characters.
- Fixing `UnicodeDecodeError` when working with bytes and `decode()`: You'll hit this error if you try to decode a byte sequence using the wrong encoding. It's like trying to unlock a door with the wrong key. If a string was encoded with UTF-8, you must use `decode('utf-8')` to turn it back into a string; using another encoding will likely garble the data or raise this error.
- Resolving string comparison issues with `unicodedata.normalize()`: Sometimes, two strings that look identical will fail an equality check (`==`). This occurs because some characters have multiple valid representations. Using `unicodedata.normalize()` to convert both strings to a standard form before comparing them ensures that the comparison treats canonically equivalent strings as equal, regardless of how they happen to be composed.
Handling UnicodeEncodeError when printing to terminals
A UnicodeEncodeError pops up when your terminal doesn't recognize a character you're trying to print. This often occurs on systems not set up for UTF-8. The following code demonstrates this by trying to display a rare symbol, which may fail.
```python
exotic_text = "Here's a rare character: \U0001F9FF"  # 🧿 (Nazar Amulet)
print(exotic_text)  # May cause UnicodeEncodeError on some terminals
```
The print() function fails when the terminal's character set doesn't include the Nazar Amulet symbol. This mismatch triggers a UnicodeEncodeError. The code below shows how to handle this gracefully without crashing your program.
```python
exotic_text = "Here's a rare character: \U0001F9FF"  # 🧿 (Nazar Amulet)
try:
    print(exotic_text)
except UnicodeEncodeError:
    print(exotic_text.encode('ascii', 'replace').decode('ascii'))
```
To prevent a crash, the code wraps the print() call in a try...except block. If a UnicodeEncodeError happens, the except block runs. It uses encode('ascii', 'replace') to convert the string to bytes, replacing any characters the terminal can't display with a placeholder. Then, decode('ascii') turns it back into a string for safe printing. This ensures your program doesn't stop when encountering unexpected characters during output.
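Rather than catching the exception, you can also pre-encode with one of the built-in error handlers; which one fits depends on whether you want placeholders, omission, or escape codes:

```python
exotic_text = "Here's a rare character: \U0001F9FF"  # 🧿 (Nazar Amulet)

# Three built-in error handlers for characters the target encoding can't represent
print(exotic_text.encode('ascii', 'replace').decode('ascii'))           # emoji becomes '?'
print(exotic_text.encode('ascii', 'ignore').decode('ascii'))            # emoji is dropped
print(exotic_text.encode('ascii', 'backslashreplace').decode('ascii'))  # emoji becomes \U0001f9ff
```

`'backslashreplace'` is often the best choice for logs, since the escape code preserves enough information to recover the original character later.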
Fixing UnicodeDecodeError when working with bytes and decode()
A UnicodeDecodeError is a common roadblock when converting bytes back into text. This error signals a mismatch between the original encoding and the one you're using with the decode() method. The code below shows what happens when you get it wrong.
```python
data = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'  # UTF-8 encoded Russian text
text = data.decode('latin-1')  # Wrong encoding
print(text)
```
The code runs without raising here, but because the data bytes are UTF-8 encoded Russian text and are decoded as latin-1, the output is unreadable mojibake; a stricter codec like ascii would raise a UnicodeDecodeError outright. See the correct way to decode these bytes in the example below.
```python
data = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'  # UTF-8 encoded Russian text
text = data.decode('utf-8')
print(text)  # Prints: Привет
```
The fix is simple: you must use the same encoding to decode bytes that was used to create them. Since the data bytes represent UTF-8 text, calling data.decode('utf-8') correctly translates them back into a readable string. You'll often encounter this error when reading files or fetching web content where the encoding isn't specified. It's crucial to match your decoding method to the source's format to avoid garbled text.
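When you can't be sure of the source encoding, the errors parameter of decode() lets you degrade gracefully instead of crashing. A sketch of both failure modes:

```python
data = b'\xd0\x9f\xd1\x80\xd0\xb8'  # Bytes of unknown origin

# Strict decoding with the wrong codec raises UnicodeDecodeError
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    print(f"Strict decode failed: {exc.reason}")

# 'replace' substitutes U+FFFD (�) for undecodable bytes instead of raising
print(b'\xff\xfehi'.decode('utf-8', errors='replace'))  # ��hi
```

The replacement character marks where data was lost, so downstream code can still detect that the input wasn't clean UTF-8.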
Resolving string comparison issues with unicodedata.normalize()
It's a classic Unicode puzzle: two strings look identical but fail an equality check with the == operator. This is due to different underlying byte representations. Using unicodedata.normalize() solves this. The following code illustrates the problem you'll learn to fix.
```python
string1 = "café"        # é as a single character
string2 = "cafe\u0301"  # e + combining acute accent
print(string1 == string2)  # False
```
The == operator returns False because Python sees two different strings. The first, "café", has four characters, while the second, "cafe\u0301", has five. The code below shows how to get Python to treat them as equivalent.
```python
import unicodedata

string1 = "café"        # é as a single character
string2 = "cafe\u0301"  # e + combining acute accent
norm1 = unicodedata.normalize('NFC', string1)
norm2 = unicodedata.normalize('NFC', string2)
print(norm1 == norm2)  # True
```
The fix is to convert both strings to a standard form before comparing them. By calling unicodedata.normalize('NFC', ...) on each string, you ensure they use the same underlying representation. After normalization, the == operator works as expected and returns True.
Keep an eye out for this problem when you're processing text from different sources, such as user input or files, where character encoding might be inconsistent.
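A common pattern is to combine normalization with casefold() so comparisons ignore both representation and case differences. A sketch of a hypothetical `equivalent()` helper:

```python
import unicodedata

def equivalent(a, b):
    """Compare two strings, ignoring composition and case differences."""
    norm = lambda s: unicodedata.normalize('NFC', s).casefold()
    return norm(a) == norm(b)

print(equivalent("café", "CAFE\u0301"))  # True: NFC and casefold align them
print(equivalent("café", "cafe"))        # False: the accent is a real difference
```

`casefold()` is preferable to `lower()` for comparisons because it also handles special cases like the German 'ß', which folds to 'ss'.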
Real-world applications
With a grasp on troubleshooting, you can now build robust applications, like validating international usernames or cleaning text with unicodedata.normalize().
Validating user input with the Unicode-aware isalnum() method
When validating input like usernames from international users, Python's isalnum() method simplifies the check by automatically recognizing letters and numbers from any script, not just English.
```python
def validate_username(username):
    """Validate username containing only letters and numbers"""
    for char in username:
        # Check if the character is a letter or digit in any script
        if not char.isalnum():
            return False
    return True

usernames = ["john_doe", "user123", "marie-claire", "张伟", "नरेंद्र"]
for name in usernames:
    status = "valid" if validate_username(name) else "invalid"
    print(f"Username '{name}' is {status}")
```
The validate_username function checks if a string contains only letters and numbers. It loops through each character and uses the isalnum() method to see if it's alphanumeric.
- If a character is not alphanumeric—like an underscore or a hyphen—the function immediately returns `False`.
- Only if every character passes this check does the function complete its loop and return `True`, confirming the username is valid.
Creating a text cleaner with unicodedata.normalize() and combining()
A practical application of these tools is creating a text cleaning function that uses unicodedata.normalize() to separate base characters from their accents, followed by unicodedata.combining() to strip those accents away for consistent processing.
```python
import unicodedata

def clean_text_for_processing(text):
    # Normalize to decomposed form
    text = unicodedata.normalize('NFD', text)
    # Remove diacritical marks
    clean_text = ""
    for char in text:
        if not unicodedata.combining(char):
            clean_text += char
    # Convert to lowercase for better matching
    clean_text = clean_text.lower()
    return clean_text

phrases = ["Café", "Cafè", "CAFÉ"]
cleaned = [clean_text_for_processing(p) for p in phrases]
print("Original phrases:", phrases)
print("Cleaned for processing:", cleaned)
print("All match after cleaning:", len(set(cleaned)) == 1)
```
This function standardizes text to make different variations match. It's a three-step process for cleaning strings so they can be compared reliably.
- First, `unicodedata.normalize('NFD', text)` separates combined characters, like an accented letter, into their fundamental parts.
- Next, it loops through the string and uses `unicodedata.combining()` to identify and discard any diacritical marks.
- Finally, `lower()` converts the result to lowercase for case-insensitive matching.
This ensures strings like "Café" and "CAFÉ" are treated as identical for data processing or search features.
Get started with Replit
Turn these concepts into a real tool. Give Replit Agent a prompt, like “build a text cleaner that normalizes Unicode” or “create a utility to convert file encodings between UTF-8 and UTF-16.”
The agent writes the code, tests for errors, and deploys your app automatically. Start building with Replit to bring your project to life.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.