Tokenization Explained: Why Does Chinese Cost More?

A few days ago, my friend complained to me: "Look at this - the same text costs 30 cents in Chinese when calling the GPT API, but only 10 cents when translated to English. Isn't this obvious discrimination against Chinese users?"
I chuckled while sipping my coffee, "Buddy, I used to think the same way. But after researching for a while, I found it's not discrimination at all - there are deep technical principles behind it. Today let me explain tokenization to you, and you'll understand why Chinese is inherently more 'expensive' than English."
1. What is a Token?
1.1 Not the Token You Think
Many people hear "token" and think of blockchain or cryptocurrency - I used to think the same. But in large language models, tokens are completely different.
In simple terms, a token is the smallest unit of text that a large model "consumes." Just as we break sentences into words when reading, a large model has to split your input text into chunks called tokens before it can understand it.
1.2 An Intuitive Example
Let's see how the same sentence gets tokenized in different languages:
English: "Hello, how are you today?"
- Token count: 7
- Roughly split as: ["Hello", ",", " how", " are", " you", " today", "?"]
Chinese: "你好,今天过得怎么样?"
- Token count: about 12
- Basically one character per token: ["你", "好", ",", "今", "天", "过", "得", "怎", "么", "样", "?"]
See the problem? The same meaning takes nearly twice as many tokens in Chinese. That's why the bill is higher.
2. Technical Principles of Tokenization
2.1 BPE Algorithm: The Core of Tokenization
Most mainstream large models use the BPE (Byte Pair Encoding) algorithm for tokenization. Its core idea (a toy Python sketch follows this list):
- Start with characters: Initially, each character is a token
- Count frequencies: Find the most frequent character pairs in training data
- Merge high-frequency pairs: Combine frequent character pairs into new tokens
- Iterate repeatedly: Keep repeating this process until reaching the preset vocabulary size
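Here is a minimal toy sketch of that merge loop in Python. It is purely illustrative: real tokenizers like GPT's operate on bytes, train on enormous corpora, and build vocabularies of tens of thousands of tokens.

from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with characters: every word becomes a list of single characters
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair appears
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

print(train_bpe("low lower lowest low low", 3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]

Notice what this implies for the two languages: English letter pairs like ('l', 'o') repeat constantly, so long merged tokens emerge quickly, while the huge Chinese character set spreads frequency across far more pairs.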
2.2 Why is English More "Cost-Effective"?
2.2.1 Character Set Size Differences
English Character Set:
- 26 letters (52 with uppercase/lowercase)
- 10 digits
- Common punctuation marks
- Total: less than 100 basic characters
Chinese Character Set:
- 3000+ common Chinese characters
- Tens of thousands of rare characters
- Punctuation marks
- Total: tens of thousands of basic characters
The larger the character set, the harder it is for the BPE algorithm to find high-frequency character combinations, which leads to lower tokenization efficiency.
2.2.2 Language Characteristic Differences
English Advantages:
- Clear word boundaries: "machine learning" → ["machine", " learning"]
- High repetition of common words: "the", "and", "is", etc.
- Root and affix patterns: "learn", "learning", "learned" share common parts
Chinese Challenges:
- No natural word segmentation: "机器学习" → ["机", "器", "学", "习"]
- Flexible combination methods: "学习机器" vs "机器学习"
- High semantic density: one Chinese character often contains a complete concept
2.3 Actual Tokenization Comparison
Let me demonstrate using GPT-4's tokenizer:
English text: "Artificial intelligence is transforming our world"
- Token count: 8
- That's 6 words in 8 tokens, about 0.75 words per token
Chinese text: "人工智能正在改变我们的世界"
- Token count: 12
- That's 13 characters in 12 tokens, about 1.1 characters per token
Same meaning, Chinese needs 50% more tokens!
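You can reproduce this comparison yourself with OpenAI's tiktoken library, which ships the GPT-4 tokenizer (exact counts may differ slightly from the rounded numbers above, depending on the tokenizer version):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # resolves to the cl100k_base encoding

for text in [
    "Artificial intelligence is transforming our world",
    "人工智能正在改变我们的世界",
]:
    print(len(enc.encode(text)), "tokens:", text)

Running the same loop over your own prompts is the quickest way to predict how a Chinese workload will compare to an English one before you deploy anything.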
3. Different Models' Tokenization Strategies
3.1 GPT Series: English-First
OpenAI's GPT series models are primarily trained on English data, with tokenization strategies clearly favoring English:
# GPT-4 Tokenizer Example
English: "machine learning" → 2 tokens
Chinese: "机器学习" → 4 tokens
3.2 Claude: Relatively Balanced
Anthropic's Claude does better with multilingual support:
# Claude Tokenizer Example
English: "machine learning" → 2 tokens
Chinese: "机器学习" → 3 tokens
3.3 Chinese Models: Chinese-Friendly
Chinese models like ERNIE and Qwen are specifically optimized for Chinese:
# ERNIE Tokenizer Example
English: "machine learning" → 3 tokens
Chinese: "机器学习" → 2 tokens
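The GPT numbers above can be checked with tiktoken as shown earlier. For Chinese-oriented models, most open ones publish their tokenizers on Hugging Face; here's a quick sketch using Qwen as an example (the model ID is only one possibility, and counts vary between model versions):

from transformers import AutoTokenizer

# Any open model with a published tokenizer works here; Qwen2 is one example.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

for text in ["machine learning", "机器学习"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(text, "->", len(ids), "tokens")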
4. Specific Cost Difference Calculations
4.1 GPT-4 API Pricing Analysis
Using GPT-4 as an example (2024 prices):
- Input: $0.03/1K tokens
- Output: $0.06/1K tokens
Cost comparison for same content:
Assuming a 1000-word article:
English version:
- About 1200 tokens
- Cost: $0.036 (about ¥0.26)
Chinese version:
- About 2000 tokens
- Cost: $0.06 (about ¥0.43)
Chinese version costs about 67% more!
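The arithmetic is simple enough to put into a few lines of Python (input price only, as above):

GPT4_INPUT_PRICE = 0.03 / 1000  # dollars per input token (2024 GPT-4 pricing)

def input_cost(tokens: int) -> float:
    return tokens * GPT4_INPUT_PRICE

en, zh = input_cost(1200), input_cost(2000)
print(f"English: ${en:.3f}, Chinese: ${zh:.3f}, difference: +{zh / en - 1:.0%}")
# English: $0.036, Chinese: $0.060, difference: +67%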
4.2 Real Project Data: Lessons I Learned
Last year I built a multilingual customer service bot for a company. I didn't consider this token difference initially, and was shocked when the monthly bill came:
| Language | Avg Tokens/Conversation | Monthly Cost (10K conversations) |
|---|---|---|
| English | 150 | $45 |
| Chinese | 240 | $72 |
| Japanese | 280 | $84 |
| Arabic | 320 | $96 |
See the pattern? The further a language is from English, the more it costs. When the boss asked why the Chinese customer service bot cost so much more than the English one, I had no good answer at the time.
5. How to Reduce Chinese Token Costs?
5.1 Choose the Right Model
Chinese models are great:
- ERNIE: Specifically optimized for Chinese, high token efficiency
- Qwen: Alibaba's product, reasonably priced
- GLM: From Tsinghua, really good Chinese understanding
For international models:
- Ranked by Chinese-friendliness: Claude first, GPT-4 second, GPT-3.5 last
5.2 Optimize Input Text
Reduce redundancy:
# Before optimization
"Please help me analyze the specific solution to this problem"
# After optimization
"Please analyze this problem's solution"
Use concise expressions:
# Before optimization
"In this situation, we need to consider various different factors"
# After optimization
"Consider various factors"
5.3 Mixed Language Strategy
For technical documentation, consider Chinese-English mixing:
# Pure Chinese (12 tokens)
"使用机器学习算法进行数据分析"
# Chinese-English mix (8 tokens)
"使用machine learning进行数据分析"
5.4 Batch Processing
Combine multiple small requests into one large request:
# Low efficiency (system prompt overhead for each call)
for question in questions:
    response = call_api(system_prompt + question)

# High efficiency (process multiple questions at once)
batch_questions = "\n".join(questions)
response = call_api(system_prompt + batch_questions)
6. Future Development Trends
6.1 Multilingual Tokenization Optimization
Major companies are optimizing multilingual support:
- Google's SentencePiece: a language-agnostic subword tokenizer that works directly on raw text, so unsegmented languages like Chinese are handled more evenly
- Meta's XLM-RoBERTa: a multilingual model whose shared tokenizer was trained on text from about 100 languages
- OpenAI's new versions: GPT-5 reportedly will significantly improve Chinese tokenization
6.2 Dynamic Tokenization
Future models might automatically adjust tokenization strategies based on language:
# Possible future API call
response = call_api(
    text="你好世界",
    language="zh",  # Automatically optimize Chinese tokenization
    optimize_tokens=True
)
6.3 Cost Transparency
More platforms are providing token estimation tools:
# Token cost estimation
estimate = estimate_cost(
    text="Your text content",
    model="gpt-4",
    language="zh"
)
print(f"Estimated cost: ${estimate.cost}")
print(f"Token count: {estimate.tokens}")
7. Practical Tool Recommendations
7.1 Token Calculators
Online tools:
- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- Hugging Face Tokenizers: Support multiple models
Python libraries:
import tiktoken
# GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Your text")
print(f"Token count: {len(tokens)}")
7.2 Cost Monitoring
class TokenCostTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0

    def track_request(self, tokens, model="gpt-4"):
        cost = self.calculate_cost(tokens, model)
        self.total_tokens += tokens
        self.total_cost += cost
        return cost

    def calculate_cost(self, tokens, model):
        rates = {
            "gpt-4": 0.03 / 1000,  # $0.03 per 1K tokens
            "gpt-3.5": 0.002 / 1000,
            "claude": 0.008 / 1000
        }
        return tokens * rates.get(model, 0.03 / 1000)
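Usage is straightforward; for example, plugging in the token counts from the article comparison in section 4.1:

tracker = TokenCostTracker()
tracker.track_request(1200, model="gpt-4")  # English version of the article
tracker.track_request(2000, model="gpt-4")  # Chinese version of the article
print(f"Total: {tracker.total_tokens} tokens, ${tracker.total_cost:.3f}")
# Total: 3200 tokens, $0.096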
7.3 Text Optimization Assistant
import re

def optimize_chinese_text(text):
    """Optimize Chinese text to reduce token consumption"""
    # Collapse runs of repeated punctuation into a single mark
    text = re.sub(r'[,。!?;:]{2,}', lambda m: m.group()[0], text)
    # Replace wordy expressions with shorter equivalents
    replacements = {
        '在这种情况下': '此时',    # "in this situation" -> "at this point"
        '根据以上分析': '综上',    # "based on the above analysis" -> "in summary"
        '需要注意的是': '注意',    # "it should be noted that" -> "note"
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
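A quick example (only the patterns listed in the replacement table are touched):

text = "在这种情况下，需要注意的是成本问题。"
print(optimize_chinese_text(text))
# 此时，注意成本问题。("At this point, note the cost issue.")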
8. Summary
Tokenization mechanisms make Chinese more "expensive" than English - this isn't discrimination, but determined by technical characteristics:
- Root cause: Large Chinese character set, complex language characteristics
- Cost difference: Chinese typically costs 30-70% more than English
- Optimization strategies: Choose appropriate models, optimize text, batch processing
- Future trends: Multilingual optimization, dynamic tokenization, cost transparency
After all this explanation, there's one simple principle: don't blindly chase the newest, most expensive models. For Chinese applications, domestic models are often more cost-effective with comparable understanding capabilities.
My current strategy is simple: prioritize Chinese models for Chinese business, only consider GPT for English or multilingual needs. This saves quite a bit of money each month.
What to do with the saved money? Buy coffee, upgrade equipment, or treat the team to dinner - isn't that better than paying OpenAI? 😄
Have you encountered cost issues when using large language model APIs? Feel free to share your experiences and optimization tips in the comments!