
Character encoding issues plague applications handling international text. From mojibake (garbled characters) to data corruption, encoding problems can destroy data integrity and user experience. Understanding encoding fundamentals and knowing recovery techniques is essential for any developer working with text data. This guide covers encoding conversions from a senior developer's perspective.

Why Encoding Matters

Proper encoding handling enables:

  1. Data Integrity: Text stored and retrieved correctly
  2. Internationalization: Support for all languages
  3. Interoperability: Data exchange between systems
  4. User Experience: No garbled characters
  5. Legal Compliance: Proper handling of user data

Understanding Character Encodings

Common Encodings

Encoding              | Bytes/Char | Characters | Use Case
----------------------|------------|------------|------------------------
ASCII                 | 1          | 128        | English only
ISO-8859-1 (Latin-1)  | 1          | 256        | Western European
Windows-1251 (cp1251) | 1          | 256        | Cyrillic
Windows-1252          | 1          | 256        | Western European (Windows)
UTF-8                 | 1-4        | 1,112,064  | Universal (recommended)
UTF-16                | 2 or 4     | 1,112,064  | Windows internals
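The byte-width differences in the table can be checked directly with Python's built-in codecs (a quick illustration; the encoding names follow Python's aliases):

```python
# Compare how many bytes one Cyrillic character needs per encoding
text = "Я"  # Cyrillic capital Ya, U+042F

print(len(text.encode("cp1251")))     # 1 byte in the single-byte Cyrillic code page
print(len(text.encode("utf-8")))      # 2 bytes in UTF-8
print(len(text.encode("utf-16-le")))  # 2 bytes in UTF-16

# ASCII and Latin-1 cannot represent Cyrillic at all
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError:
    print(f"ISO-8859-1 cannot encode {text!r}")
```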

UTF-8 Byte Patterns

UTF-8 uses variable-length encoding:

Range | Binary Pattern
------------------|------------------------------------
U+0000 to U+007F | 0xxxxxxx (1 byte)
U+0080 to U+07FF | 110xxxxx 10xxxxxx (2 bytes)
U+0800 to U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx (3 bytes)
U+10000 to U+10FFFF| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (4 bytes)
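The four length classes above can be verified with one sample code point from each range (a small standard-library sketch):

```python
# One sample code point from each UTF-8 length class
samples = [
    ("A", 1),   # U+0041: 0xxxxxxx
    ("Я", 2),   # U+042F: 110xxxxx 10xxxxxx
    ("€", 3),   # U+20AC: 1110xxxx 10xxxxxx 10xxxxxx
    ("😀", 4),  # U+1F600: 11110xxx then three continuation bytes
]
for ch, expected_len in samples:
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected_len
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()}")
```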

How Mojibake Happens

Original: "Привет" (Russian "Hello")
UTF-8 bytes: D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82
Misinterpretation as cp1251:
D0 → Р
9F → џ
D1 → С ...
Result: "Привет" → "РџСЂРёРІРµС‚" (mojibake)
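The corruption above can be reproduced in a couple of lines (Python here purely for illustration):

```python
# Encode Russian text as UTF-8, then wrongly decode the bytes as cp1251
original = "Привет"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" ").upper())     # D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82

garbled = utf8_bytes.decode("cp1251")  # wrong decoder -> mojibake
print(garbled)                         # РџСЂРёРІРµС‚
```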

Database Encoding Issues

MySQL Encoding Configuration

-- Check current settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- Proper database creation
CREATE DATABASE mydb
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
-- Table with correct encoding
CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
    email VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Connection settings (SET NAMES already sets the three variables below)
SET NAMES 'utf8mb4';
SET character_set_client = 'utf8mb4';
SET character_set_connection = 'utf8mb4';
SET character_set_results = 'utf8mb4';

PostgreSQL Encoding

-- Check encoding
SHOW server_encoding;
SHOW client_encoding;
-- Create database with UTF-8
CREATE DATABASE mydb
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8'
    TEMPLATE template0; -- needed when the locale differs from template1's
-- Set client encoding
SET client_encoding TO 'UTF8';

Recovering Corrupted Data

MySQL: Recover cp1251 from UTF-8 Mojibake

When Cyrillic text passes through the wrong encoding, "Привет" becomes "РџСЂРёРІРµС‚":

-- Diagnostic: Check what the data looks like
SELECT
    id,
    text_column,
    HEX(text_column) AS hex_value
FROM broken_table
LIMIT 10;
-- Recovery query
SELECT
    id,
    CAST(
        CONVERT(
            CAST(CONVERT(text_column USING cp1251) AS BINARY)
            USING utf8
        )
        AS CHAR CHARACTER SET cp1251
    ) COLLATE cp1251_general_ci AS fixed_text
FROM broken_table
WHERE text_column LIKE '%Р%'; -- 'Р'/'С' pairs indicate cp1251 mojibake
-- Update corrupted data
UPDATE broken_table
SET text_column = CAST(
    CONVERT(
        CAST(CONVERT(text_column USING cp1251) AS BINARY)
        USING utf8
    )
    AS CHAR CHARACTER SET cp1251
) COLLATE cp1251_general_ci
WHERE text_column REGEXP '^(Р|С)';

Understanding the Recovery Process

The conversion chain works because:

  1. CONVERT(text_column USING cp1251) - Re-encodes the stored mojibake as cp1251, which reproduces the original UTF-8 byte sequence
  2. CAST(... AS BINARY) - Strips the charset label, leaving raw bytes
  3. CONVERT(... USING utf8) - Decodes those bytes as UTF-8, restoring the real text
  4. The final cast labels the result with the charset of the destination column
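The same round trip can be sketched in Python to make the byte flow explicit (cp1251 stands in for the wrongly assumed charset):

```python
# Reverse the corruption: re-encode the mojibake, then decode correctly
garbled = "РџСЂРёРІРµС‚"          # "Привет" after a cp1251 misread

raw = garbled.encode("cp1251")   # steps 1-2: recover the original raw bytes
fixed = raw.decode("utf-8")      # step 3: reinterpret those bytes as UTF-8
print(fixed)                     # Привет
```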

PostgreSQL: Encoding Conversion

-- Convert column encoding (restrict to rows showing mojibake)
UPDATE broken_table
SET text_column = convert_from(
    convert_to(text_column, 'WIN1251'),
    'UTF8'
)
WHERE text_column ~ '^(Р|С)';
-- Note: convert_to() returns bytea, so it already supplies the
-- binary intermediate; text has no direct cast to bytea

PHP Encoding Handling

Detection and Conversion

<?php
// Detect encoding
function detectEncoding(string $text): string
{
    $encodings = ['UTF-8', 'Windows-1251', 'ISO-8859-1', 'KOI8-R'];
    foreach ($encodings as $encoding) {
        if (mb_check_encoding($text, $encoding)) {
            // Verify by round-trip conversion
            $converted = mb_convert_encoding($text, 'UTF-8', $encoding);
            $back = mb_convert_encoding($converted, $encoding, 'UTF-8');
            if ($back === $text) {
                return $encoding;
            }
        }
    }
    return 'unknown';
}

// Convert to UTF-8
function toUtf8(string $text, ?string $fromEncoding = null): string
{
    if ($fromEncoding === null) {
        $fromEncoding = mb_detect_encoding($text, ['UTF-8', 'Windows-1251', 'ISO-8859-1'], true);
    }
    if ($fromEncoding === false || $fromEncoding === 'UTF-8') {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fromEncoding);
}

// Fix double-encoded UTF-8
function fixDoubleEncoding(string $text): string
{
    // Check if it looks like double-encoded UTF-8
    if (preg_match('/[\xC0-\xDF][\x80-\xBF]/', $text)) {
        $fixed = mb_convert_encoding($text, 'Windows-1251', 'UTF-8');
        if (mb_check_encoding($fixed, 'UTF-8')) {
            return $fixed;
        }
    }
    return $text;
}
?>

Database Connection

<?php
// PDO with UTF-8
$pdo = new PDO(
    'mysql:host=localhost;dbname=mydb;charset=utf8mb4',
    $username,
    $password,
    [
        PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci",
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]
);

// MySQLi with UTF-8
$mysqli = new mysqli('localhost', $username, $password, 'mydb');
$mysqli->set_charset('utf8mb4');

// Verify connection charset
if ($mysqli->character_set_name() !== 'utf8mb4') {
    throw new RuntimeException('Failed to set UTF-8 encoding');
}
?>

HTTP Headers and HTML

<?php
// Set response encoding
header('Content-Type: text/html; charset=UTF-8');
// For JSON responses
header('Content-Type: application/json; charset=UTF-8');
echo json_encode($data, JSON_UNESCAPED_UNICODE);
?>
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>

Python Encoding Handling

from typing import Optional

import chardet  # third-party: pip install chardet

# Read files, falling back through likely encodings
def read_file_safely(path: str) -> str:
    # iso-8859-1 decodes any byte sequence, so it must come last
    encodings = ['utf-8', 'cp1251', 'koi8-r', 'iso-8859-1']
    for encoding in encodings:
        try:
            with open(path, 'r', encoding=encoding) as f:
                content = f.read()
            # Verify by encoding back
            content.encode(encoding)
            return content
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
    raise ValueError(f"Could not decode {path} with any known encoding")

# Convert encodings
def convert_to_utf8(text: bytes, source_encoding: Optional[str] = None) -> str:
    if source_encoding:
        return text.decode(source_encoding)
    # Fall back to statistical detection
    detected = chardet.detect(text)
    return text.decode(detected['encoding'] or 'utf-8')

# Fix mojibake
def fix_mojibake(text: str) -> str:
    try:
        # Common pattern: UTF-8 interpreted as cp1251
        return text.encode('cp1251').decode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        return text

JavaScript/Node.js Encoding

const fs = require('fs');
const iconv = require('iconv-lite'); // third-party: npm install iconv-lite

// Decode a raw buffer into a JS string
function toUtf8(buffer, encoding = 'win1251') {
    return iconv.decode(buffer, encoding);
}

// Reinterpret a string's bytes: encode as `from`, decode as `to`
// (useful for undoing mojibake)
function convertEncoding(text, from, to) {
    const buffer = iconv.encode(text, from);
    return iconv.decode(buffer, to);
}

// Read file with specific encoding
function readFileWithEncoding(path, encoding = 'utf-8') {
    const buffer = fs.readFileSync(path);
    return iconv.decode(buffer, encoding);
}

// Express.js middleware for UTF-8 responses
app.use((req, res, next) => {
    res.setHeader('Content-Type', 'application/json; charset=utf-8');
    next();
});

Command Line Tools

iconv

# Convert file encoding
iconv -f CP1251 -t UTF-8 input.txt > output.txt
# Transliterate characters the target encoding cannot represent
# (only relevant when converting TO a narrower encoding)
iconv -f UTF-8 -t CP1251//TRANSLIT input.txt > output.txt
# Check file encoding
file -bi document.txt
# Output: text/plain; charset=utf-8
# Batch convert files
for f in *.txt; do
    iconv -f CP1251 -t UTF-8 "$f" > "${f%.txt}_utf8.txt"
done

MySQL Command Line

# Import with encoding
mysql --default-character-set=utf8mb4 -u user -p database < dump.sql
# Export with encoding
mysqldump --default-character-set=utf8mb4 database > dump.sql

Prevention Best Practices

Always Use UTF-8

# Database: UTF-8
# Files: UTF-8, normally without a BOM (add one only if a specific Windows tool requires it)
# HTTP: Content-Type with charset
# HTML: <meta charset="UTF-8">
# API: JSON with UTF-8
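Applied to file I/O, "always specify the encoding" looks like this (a minimal sketch using a temporary file rather than relying on the platform default):

```python
import os
import tempfile

# Always pass encoding= instead of relying on the platform default
text = "Привет, мир"
path = os.path.join(tempfile.mkdtemp(), "data.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    assert f.read() == text  # round-trips cleanly when both sides agree
```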

Validate Input

<?php
function validateUtf8Input(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        // Try to fix or reject
        $fixed = mb_convert_encoding($input, 'UTF-8', 'auto');
        if (!mb_check_encoding($fixed, 'UTF-8')) {
            throw new InvalidArgumentException('Invalid UTF-8 input');
        }
        return $fixed;
    }
    return $input;
}
?>

Key Takeaways

  1. Default to UTF-8: Use utf8mb4 in MySQL, UTF-8 everywhere else
  2. Set encoding explicitly: Never assume default encoding
  3. Match client and server: Database connection must match database encoding
  4. Validate input: Check encoding before processing
  5. Test with real data: Include international characters in test data
  6. Document encoding: Note expected encoding in APIs and file formats

Character encoding is a solved problem when handled consistently—the complexity comes from dealing with legacy systems and corrupted data. Master these recovery techniques and you'll save countless hours debugging encoding issues.

 
 
 
Copyright © 1999 — 2026
ZK Interactive