
Character encoding issues plague applications handling international text. From mojibake (garbled characters) to data corruption, encoding problems can destroy data integrity and user experience. Understanding encoding fundamentals and knowing recovery techniques is essential for any developer working with text data. This guide covers encoding conversions from a senior developer's perspective.

Why Encoding Matters

Proper encoding handling enables:

  1. Data Integrity: Text stored and retrieved correctly
  2. Internationalization: Support for all languages
  3. Interoperability: Data exchange between systems
  4. User Experience: No garbled characters
  5. Legal Compliance: Proper handling of user data

Understanding Character Encodings

Common Encodings

Encoding              | Bytes/Char | Characters | Use Case
----------------------|------------|------------|------------------------
ASCII                 | 1          | 128        | English only
ISO-8859-1 (Latin-1)  | 1          | 256        | Western European
Windows-1251 (cp1251) | 1          | 256        | Cyrillic
Windows-1252          | 1          | 256        | Western European (Windows)
UTF-8                 | 1-4        | 1,112,064  | Universal (recommended)
UTF-16                | 2 or 4     | 1,112,064  | Windows internals
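The byte-width differences in the table can be checked directly with Python's built-in codecs (a quick illustration; the encoding names follow Python's aliases):

```python
# Compare how many bytes one Cyrillic character needs per encoding
text = "Я"  # Cyrillic capital Ya, U+042F

print(len(text.encode("cp1251")))     # 1 byte in the single-byte Cyrillic code page
print(len(text.encode("utf-8")))      # 2 bytes in UTF-8
print(len(text.encode("utf-16-le")))  # 2 bytes in UTF-16

# ASCII and Latin-1 cannot represent Cyrillic at all
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError:
    print(f"ISO-8859-1 cannot encode {text!r}")
```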

UTF-8 Byte Patterns

UTF-8 uses variable-length encoding:

Range | Binary Pattern
------------------|------------------------------------
U+0000 to U+007F | 0xxxxxxx (1 byte)
U+0080 to U+07FF | 110xxxxx 10xxxxxx (2 bytes)
U+0800 to U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
U+10000 to U+10FFFF| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes)
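The four length classes above can be verified with one sample code point from each range (a small standard-library sketch):

```python
# One sample code point from each UTF-8 length class
samples = [
    ("A", 1),   # U+0041: 0xxxxxxx
    ("Я", 2),   # U+042F: 110xxxxx 10xxxxxx
    ("€", 3),   # U+20AC: 1110xxxx 10xxxxxx 10xxxxxx
    ("😀", 4),  # U+1F600: 11110xxx then three continuation bytes
]
for ch, expected_len in samples:
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected_len
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()}")
```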

How Mojibake Happens

Original: "Привет" (Russian "Hello")
UTF-8 bytes: D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82
Misinterpretation as cp1251:
D0 → Р
9F → џ
D1 → С ...
Result: "Привет" → "РџСЂРёРІРµС‚" (mojibake)
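The corruption above can be reproduced in a couple of lines (Python here purely for illustration):

```python
# Encode Russian text as UTF-8, then wrongly decode the bytes as cp1251
original = "Привет"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" ").upper())     # D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82

garbled = utf8_bytes.decode("cp1251")  # wrong decoder -> mojibake
print(garbled)                         # РџСЂРёРІРµС‚
```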

Database Encoding Issues

MySQL Encoding Configuration

-- Check current settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- Proper database creation
CREATE DATABASE mydb
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
-- Table with correct encoding
CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    email VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Connection settings (SET NAMES already sets the three variables below)
SET NAMES 'utf8mb4';
SET character_set_client = 'utf8mb4';
SET character_set_connection = 'utf8mb4';
SET character_set_results = 'utf8mb4';

PostgreSQL Encoding

-- Check encoding
SHOW server_encoding;
SHOW client_encoding;
-- Create database with UTF-8
CREATE DATABASE mydb
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8'
    TEMPLATE template0; -- needed when the locale differs from template1's
-- Set client encoding
SET client_encoding TO 'UTF8';

Recovering Corrupted Data

MySQL: Recover cp1251 from UTF-8 Mojibake

When Cyrillic text passes through the wrong encoding, "Привет" becomes "РџСЂРёРІРµС‚":

-- Diagnostic: Check what the data looks like
SELECT
    id,
    text_column,
    HEX(text_column) AS hex_value
FROM broken_table
LIMIT 10;
-- Recovery query
SELECT
    id,
    CAST(
        CONVERT(
            CAST(CONVERT(text_column USING cp1251) AS BINARY)
            USING utf8
        )
        AS CHAR CHARACTER SET cp1251
    ) COLLATE cp1251_general_ci AS fixed_text
FROM broken_table
WHERE text_column LIKE '%Р%'; -- 'Р'/'С' pairs indicate cp1251 mojibake
-- Update corrupted data
UPDATE broken_table
SET text_column = CAST(
    CONVERT(
        CAST(CONVERT(text_column USING cp1251) AS BINARY)
        USING utf8
    )
    AS CHAR CHARACTER SET cp1251
) COLLATE cp1251_general_ci
WHERE text_column REGEXP '^(Р|С)';

Understanding the Recovery Process

The conversion chain works because:

  1. CONVERT(text_column USING cp1251) - Re-encodes the stored mojibake as cp1251, which reproduces the original UTF-8 byte sequence
  2. CAST(... AS BINARY) - Strips the charset label, leaving raw bytes
  3. CONVERT(... USING utf8) - Decodes those bytes as UTF-8, restoring the real text
  4. The final cast labels the result with the charset of the destination column
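The same round trip can be sketched in Python to make the byte flow explicit (cp1251 stands in for the wrongly assumed charset):

```python
# Reverse the corruption: re-encode the mojibake, then decode correctly
garbled = "РџСЂРёРІРµС‚"          # "Привет" after a cp1251 misread

raw = garbled.encode("cp1251")   # steps 1-2: recover the original raw bytes
fixed = raw.decode("utf-8")      # step 3: reinterpret those bytes as UTF-8
print(fixed)                     # Привет
```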

PostgreSQL: Encoding Conversion

-- Convert column encoding (restrict to rows showing mojibake)
UPDATE broken_table
SET text_column = convert_from(
    convert_to(text_column, 'WIN1251'),
    'UTF8'
)
WHERE text_column ~ '^(Р|С)';
-- Note: convert_to() returns bytea, so it already supplies the
-- binary intermediate; text has no direct cast to bytea

PHP Encoding Handling

Detection and Conversion

<?php
// Detect encoding
function detectEncoding(string $text): string
{
    $encodings = ['UTF-8', 'Windows-1251', 'ISO-8859-1', 'KOI8-R'];
    foreach ($encodings as $encoding) {
        if (mb_check_encoding($text, $encoding)) {
            // Verify by round-trip conversion
            $converted = mb_convert_encoding($text, 'UTF-8', $encoding);
            $back = mb_convert_encoding($converted, $encoding, 'UTF-8');
            if ($back === $text) {
                return $encoding;
            }
        }
    }
    return 'unknown';
}

// Convert to UTF-8
function toUtf8(string $text, ?string $fromEncoding = null): string
{
    if ($fromEncoding === null) {
        $fromEncoding = mb_detect_encoding($text, ['UTF-8', 'Windows-1251', 'ISO-8859-1'], true);
    }
    if ($fromEncoding === false || $fromEncoding === 'UTF-8') {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fromEncoding);
}

// Fix double-encoded UTF-8
function fixDoubleEncoding(string $text): string
{
    // Check if it looks like double-encoded UTF-8
    if (preg_match('/[\xC0-\xDF][\x80-\xBF]/', $text)) {
        $fixed = mb_convert_encoding($text, 'Windows-1251', 'UTF-8');
        if (mb_check_encoding($fixed, 'UTF-8')) {
            return $fixed;
        }
    }
    return $text;
}
?>

Database Connection

<?php
// PDO with UTF-8
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4',
    $username,
    $password,
    [
        PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci",
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]
);

// MySQLi with UTF-8
$mysqli = new mysqli('localhost', $username, $password, 'mydb');
$mysqli->set_charset('utf8mb4');

// Verify connection charset
if ($mysqli->character_set_name() !== 'utf8mb4') {
    throw new RuntimeException('Failed to set UTF-8 encoding');
}
?>

HTTP Headers and HTML

<?php
// Set response encoding
header('Content-Type: text/html; charset=UTF-8');
// For JSON responses
header('Content-Type: application/json; charset=UTF-8');
echo json_encode($data, JSON_UNESCAPED_UNICODE);
?>
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

Python Encoding Handling

from typing import Optional

import chardet  # third-party: pip install chardet

# Read files, falling back through likely encodings
def read_file_safely(path: str) -> str:
    # iso-8859-1 decodes any byte sequence, so it must come last
    encodings = ['utf-8', 'cp1251', 'koi8-r', 'iso-8859-1']
    for encoding in encodings:
        try:
            with open(path, 'r', encoding=encoding) as f:
                content = f.read()
            # Verify by encoding back
            content.encode(encoding)
            return content
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
    raise ValueError(f"Could not decode {path} with any known encoding")

# Convert encodings
def convert_to_utf8(text: bytes, source_encoding: Optional[str] = None) -> str:
    if source_encoding:
        return text.decode(source_encoding)
    # Fall back to statistical detection
    detected = chardet.detect(text)
    return text.decode(detected['encoding'] or 'utf-8')

# Fix mojibake
def fix_mojibake(text: str) -> str:
    try:
        # Common pattern: UTF-8 interpreted as cp1251
        return text.encode('cp1251').decode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text

JavaScript/Node.js Encoding

const fs = require('fs');
const iconv = require('iconv-lite'); // third-party: npm install iconv-lite

// Decode a raw buffer into a JS string
function toUtf8(buffer, encoding = 'win1251') {
    return iconv.decode(buffer, encoding);
}

// Reinterpret a string's bytes: encode as `from`, decode as `to`
// (useful for undoing mojibake)
function convertEncoding(text, from, to) {
    const buffer = iconv.encode(text, from);
    return iconv.decode(buffer, to);
}

// Read file with specific encoding
function readFileWithEncoding(path, encoding = 'utf-8') {
    const buffer = fs.readFileSync(path);
    return iconv.decode(buffer, encoding);
}

// Express.js middleware for UTF-8 responses
app.use((req, res, next) => {
    res.setHeader('Content-Type', 'application/json; charset=utf-8');
    next();
});

Command Line Tools

iconv

# Convert file encoding
iconv -f CP1251 -t UTF-8 input.txt > output.txt
# Transliterate characters the target encoding cannot represent
# (only relevant when converting TO a narrower encoding)
iconv -f UTF-8 -t CP1251//TRANSLIT input.txt > output.txt
# Check file encoding
file -bi document.txt
# Output: text/plain; charset=utf-8
# Batch convert files
for f in *.txt; do
    iconv -f CP1251 -t UTF-8 "$f" > "${f%.txt}_utf8.txt"
done

MySQL Command Line

# Import with encoding
mysql --default-character-set=utf8mb4 -u user -p database < dump.sql
# Export with encoding
mysqldump --default-character-set=utf8mb4 database > dump.sql

Prevention Best Practices

Always Use UTF-8

# Database: UTF-8
# Files: UTF-8, normally without a BOM (add one only if a specific Windows tool requires it)
# HTTP: Content-Type with charset
# HTML: <meta charset="UTF-8">
# API: JSON with UTF-8
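Applied to file I/O, "always specify the encoding" looks like this (a minimal sketch using a temporary file rather than relying on the platform default):

```python
import os
import tempfile

# Always pass encoding= instead of relying on the platform default
text = "Привет, мир"
path = os.path.join(tempfile.mkdtemp(), "data.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    assert f.read() == text  # round-trips cleanly when both sides agree
```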

Validate Input

<?php
function validateUtf8Input(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        // Try to fix or reject
        $fixed = mb_convert_encoding($input, 'UTF-8', 'auto');
        if (!mb_check_encoding($fixed, 'UTF-8')) {
            throw new InvalidArgumentException('Invalid UTF-8 input');
        }
        return $fixed;
    }
    return $input;
}
?>

Key Takeaways

  1. Default to UTF-8: Use utf8mb4 in MySQL, UTF-8 everywhere else
  2. Set encoding explicitly: Never assume default encoding
  3. Match client and server: Database connection must match database encoding
  4. Validate input: Check encoding before processing
  5. Test with real data: Include international characters in test data
  6. Document encoding: Note expected encoding in APIs and file formats

Character encoding is a solved problem when handled consistently—the complexity comes from dealing with legacy systems and corrupted data. Master these recovery techniques and you'll save countless hours debugging encoding issues.

 
 
 
Copyright © 1999 — 2026
ZK Interactive