Text Encoding

Text encoding defines how characters are represented as bytes in computer systems. In Linux, understanding encoding is crucial for proper file handling, internationalization, and avoiding data corruption when working with different languages and character sets.

Key Concepts

Character Set: Collection of characters (ASCII, Unicode)
Encoding: Method to represent characters as bytes (UTF-8, ISO-8859-1)
Locale: System setting defining language, region, and encoding
BOM: Byte Order Mark - optional marker at file start
Code Point: Numeric value assigned to each character
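
The difference between a code point and its encoded bytes is easier to see with a byte dump. The sketch below prints the bytes of the character é (code point U+00E9) in two encodings. The helper name bytes_of is hypothetical, and the example assumes od and iconv are installed.

```shell
#!/bin/sh
# bytes_of: hypothetical helper that prints stdin as space-separated hex bytes.
bytes_of() {
    od -An -tx1 | tr -d '\n' | sed 's/^ *//; s/  */ /g; s/ *$//'
}

# U+00E9 (é) written as explicit UTF-8 bytes in octal, so the script
# does not depend on the encoding of the editor that saved it.
utf8=$(printf '\303\251' | bytes_of)
latin1=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | bytes_of)

echo "UTF-8:      $utf8"     # c3 a9
echo "ISO-8859-1: $latin1"   # e9
```

The same code point takes two bytes in UTF-8 and one byte in ISO-8859-1, which is why a file must be read with the encoding it was written in.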

Command Syntax

file [options] filename - Detect file encoding
iconv [options] -f from-encoding -t to-encoding file - Convert a file between encodings
locale [options] - Display locale information

Common Options

-i - Show MIME encoding (with file command)
-f - Source encoding (iconv)
-t - Target encoding (iconv)
-o - Output file (iconv)
-c - Skip invalid characters (iconv)
-l - List available encodings (iconv)

Practical Examples

Example 1: Check file encoding

file -i document.txt
document.txt: text/plain; charset=utf-8

Shows that the file appears to contain UTF-8 encoded text. The result is a guess based on the file's contents, not a guarantee.

Example 2: Convert encoding

iconv -f ISO-8859-1 -t UTF-8 -o new.txt old.txt

Converts file from Latin-1 to UTF-8 encoding
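
When a file has to be converted in place, a slightly more defensive approach is to convert into a temporary file and replace the original only if iconv succeeds. The following is a minimal sketch of that idea. The function name convert_to_utf8 and the .bak suffix are assumptions for illustration, not part of iconv.

```shell
#!/bin/sh
# convert_to_utf8: hypothetical helper, not part of iconv.
# Usage: convert_to_utf8 SOURCE_ENCODING FILE
convert_to_utf8() {
    from=$1
    file=$2
    tmp=$(mktemp) || return 1
    # Write to a temporary file first; iconv exits non-zero on invalid input.
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        # Keep the original as a backup, then overwrite the file's contents.
        cp -p "$file" "$file.bak" && cat "$tmp" > "$file"
        status=$?
    else
        echo "conversion failed, $file left unchanged" >&2
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

For example, convert_to_utf8 ISO-8859-1 old.txt would leave the converted text in old.txt and the untouched original in old.txt.bak.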

Example 3: List available encodings

iconv -l | head -10
ANSI_X3.4-1968
ANSI_X3.4-1986
ASCII
BIG5
BIG5-HKSCS

Shows supported character encodings
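
Because the list is long and its layout differs between iconv implementations, a script may prefer to test a single encoding name directly. The sketch below does that by converting empty input. It assumes iconv exits with a non-zero status for an unknown encoding name, as GNU iconv does, and the function name encoding_supported is hypothetical.

```shell
#!/bin/sh
# encoding_supported: hypothetical helper; succeeds if iconv accepts NAME
# as a source encoding.
encoding_supported() {
    # Converting empty input only tests whether the encoding name is known.
    iconv -f "$1" -t UTF-8 < /dev/null > /dev/null 2>&1
}

for enc in UTF-8 ISO-8859-1 NO-SUCH-ENCODING; do
    if encoding_supported "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not supported"
    fi
done
```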

Example 4: Check system locale

locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"

Displays current locale settings

Example 5: Set locale temporarily

export LC_ALL=en_US.UTF-8

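Overrides every locale category for the current shell session and the programs started from it. The named locale must be installed; locale -a lists the available ones.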
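
A locale variable can also be set for a single command by placing the assignment in front of it, which leaves the rest of the session untouched. The sketch below uses sort with the C locale as an illustration; the sample words are made up for the example.

```shell
#!/bin/sh
# Sample input, invented for this example.
words=$(printf 'b\nA\na\nB\n')

# The C locale compares raw byte values, so uppercase ASCII letters
# sort before lowercase ones.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort | paste -sd ' ' -)
echo "C locale: $c_sorted"    # A B a b

# The assignment applied only to that one sort command.
echo "LC_ALL in this shell: ${LC_ALL:-unset}"
```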

Use Cases

Converting files between different character sets
Fixing corrupted text from wrong encoding
Preparing files for international distribution
Debugging character display issues
Processing legacy files with old encodings

Related Commands

hexdump - View raw bytes in files
od - Octal/hex dump of file contents
chardet - Python tool for encoding detection
uconv - ICU encoding converter (alternative)
localectl - Control system locale settings

Tips & Troubleshooting

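UTF-8 is the default encoding on most modern Linux distributions
Always back up files before converting their encoding
The -c flag makes iconv drop characters it cannot convert instead of stopping with an error, so use it only when losing those characters is acceptable
Check the $LANG and $LC_* environment variables when text displays incorrectly
Some terminals or fonts may not display all characters
A BOM at the start of a UTF-8 file can confuse scripts and tools on Linux that do not expect it
Test converted files thoroughly before deployment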
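
One simple way to test a converted file is to ask iconv to decode it as UTF-8 and discard the output. The following is a sketch of that check. The function name is_valid_utf8 is hypothetical, and the approach assumes iconv reports invalid byte sequences with a non-zero exit status, as GNU iconv does.

```shell
#!/bin/sh
# is_valid_utf8: hypothetical helper; succeeds if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Report the status of every file named on the command line.
for f in "$@"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: NOT valid UTF-8"
    fi
done
```

This only confirms that the bytes form valid UTF-8. It cannot tell whether the text was converted from the right source encoding, so a visual check of a few known words is still worthwhile.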

Common Encoding Issues

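Mojibake: Garbled text produced when bytes are decoded with the wrong encoding
Question marks: Characters that do not exist in the target character set and were replaced
Missing characters: Characters dropped during an incomplete conversion
BOM problems: UTF-8 files saved on Windows often start with a BOM that Linux tools do not expect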
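
Two of these issues can be reproduced and undone in a few lines. The sketch below detects and strips a UTF-8 BOM (the bytes EF BB BF), then creates mojibake on purpose and reverses it. The helper names has_bom and strip_bom are hypothetical, and head -c is assumed to be available, as it is in GNU coreutils.

```shell
#!/bin/sh
# has_bom: hypothetical helper; succeeds if FILE starts with the UTF-8 BOM.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom: hypothetical helper; prints FILE to stdout without a leading BOM.
strip_bom() {
    if has_bom "$1"; then
        tail -c +4 "$1"    # start output at the fourth byte
    else
        cat "$1"
    fi
}

# Mojibake in miniature: the UTF-8 bytes for é (c3 a9) misread as Latin-1
# and re-encoded, which yields the four bytes c3 83 c2 a9.
garbled=$(printf '\303\251' | iconv -f ISO-8859-1 -t UTF-8)

# Undo the damage by reversing the wrong conversion.
repaired=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)
```

Reversing a bad conversion like this works only when no bytes were lost along the way, which is one more reason to keep a backup of the original file.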