Tokenization Explained: Why Does Chinese Cost More?

Large Language Models · Tokenization · API Costs · Chinese Processing · BPE Algorithm · GPT · Claude · Technical Principles · Cost Optimization · Multilingual
Laofu Code

A few days ago, a friend complained to me: "Look at this - the same text costs 30 cents when I call the GPT API in Chinese, but only 10 cents when it's translated to English. Isn't that obvious discrimination against Chinese users?"

I chuckled while sipping my coffee, "Buddy, I used to think the same way. But after researching for a while, I found it's not discrimination at all - there are deep technical principles behind it. Today let me explain tokenization to you, and you'll understand why Chinese is inherently more 'expensive' than English."

1. What is a Token?

1.1 Not the Token You Think

Many people hear "token" and think of blockchain or cryptocurrency - I used to think the same. But in large language models, tokens are completely different.

In simple terms, a token is the smallest unit of text that a large model "consumes". Just as we break sentences into words when reading, a large model has to split your input into chunks - tokens - before it can understand it.

1.2 An Intuitive Example

Let's see how the same sentence gets tokenized in different languages:

English: "Hello, how are you today?"

  • Token count: 7
  • Roughly split as: ["Hello", ",", " how", " are", " you", " today", "?"]

Chinese: "你好,今天过得怎么样?"

  • Token count: 11
  • Basically one character per token: ["你", "好", ",", "今", "天", "过", "得", "怎", "么", "样", "?"]

See the problem? The same meaning takes over 50% more tokens in Chinese - and that's exactly why the bill is higher.

2. Technical Principles of Tokenization

2.1 BPE Algorithm: The Core of Tokenization

Most mainstream large models use the BPE (Byte Pair Encoding) algorithm for tokenization. The core idea, sketched in the toy example after this list, is:

  1. Start with characters: Initially, each character is a token
  2. Count frequencies: Find the most frequent character pairs in training data
  3. Merge high-frequency pairs: Combine frequent character pairs into new tokens
  4. Iterate repeatedly: Keep repeating this process until reaching the preset vocabulary size
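
To make that concrete, here's a toy version of the merge loop. Real tokenizers add byte-level handling, pre-tokenization rules, and train on huge corpora, so treat this as an illustration of the idea, not an actual tokenizer:

from collections import Counter

def train_bpe(corpus, num_merges):
    # Start with each character as its own symbol.
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = []
        for w in words:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return merges

print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]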

2.2 Why is English More "Cost-Effective"?

2.2.1 Character Set Size Differences

English Character Set:

  • 26 letters (52 with uppercase/lowercase)
  • 10 digits
  • Common punctuation marks
  • Total: less than 100 basic characters

Chinese Character Set:

  • 3000+ common Chinese characters
  • Tens of thousands of rare characters
  • Punctuation marks
  • Total: tens of thousands of basic characters

The larger the base character set, the harder it is for the BPE algorithm to find high-frequency combinations to merge, so tokenization efficiency drops.

2.2.2 Language Characteristic Differences

English Advantages:

  • Clear word boundaries: "machine learning" → ["machine", " learning"]
  • High repetition of common words: "the", "and", "is", etc.
  • Shared roots and affixes: "learn", "learning", "learned" have common parts

Chinese Challenges:

  • No natural word segmentation: "机器学习" → ["机", "器", "学", "习"]
  • Flexible word order: "学习机器" vs "机器学习" use the same characters but mean different things
  • High semantic density: a single Chinese character often carries a complete concept

2.3 Actual Tokenization Comparison

Let me demonstrate using GPT-4's tokenizer:

English text: "Artificial intelligence is transforming our world"

  • Token count: 8
  • That's roughly 1.3 tokens per word

Chinese text: "人工智能正在改变我们的世界"

  • Token count: 12
  • That's roughly one token per character

Same meaning, Chinese needs 50% more tokens!
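
You can reproduce this comparison yourself with OpenAI's tiktoken library. The exact counts shift a bit between tokenizer versions, but the gap doesn't go away:

import tiktoken

# GPT-4 tokenizer (cl100k_base)
enc = tiktoken.encoding_for_model("gpt-4")
en = "Artificial intelligence is transforming our world"
zh = "人工智能正在改变我们的世界"
print("English tokens:", len(enc.encode(en)))
print("Chinese tokens:", len(enc.encode(zh)))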

3. Different Models' Tokenization Strategies

3.1 GPT Series: English-First

OpenAI's GPT series models are primarily trained on English data, with tokenization strategies clearly favoring English:

# GPT-4 Tokenizer Example
English: "machine learning" → 2 tokens
Chinese: "机器学习" → 4 tokens

3.2 Claude: Relatively Balanced

Anthropic's Claude does better with multilingual support:

# Claude Tokenizer Example
English: "machine learning" → 2 tokens
Chinese: "机器学习" → 3 tokens

3.3 Chinese Models: Chinese-Friendly

Chinese models like ERNIE and Qwen are specifically optimized for Chinese:

# ERNIE Tokenizer Example
English: "machine learning" → 3 tokens
Chinese: "机器学习" → 2 tokens

4. Specific Cost Difference Calculations

4.1 GPT-4 API Pricing Analysis

Using GPT-4 as an example (2024 prices):

  • Input: $0.03/1K tokens
  • Output: $0.06/1K tokens

Cost comparison for same content:

Assuming a 1000-word article:

English version:

  • About 1200 tokens
  • Cost: $0.036 (about ¥0.26)

Chinese version:

  • About 2000 tokens
  • Cost: $0.06 (about ¥0.43)

The Chinese version costs about 67% more!
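
If you want to double-check that figure, the arithmetic fits in a few lines (input-token price only; output tokens add more on top):

PRICE_PER_1K_INPUT = 0.03  # USD per 1K input tokens (GPT-4, 2024 list price)

def input_cost(tokens):
    return tokens / 1000 * PRICE_PER_1K_INPUT

en_cost = input_cost(1200)  # English version
zh_cost = input_cost(2000)  # Chinese version
print(en_cost, zh_cost, f"{zh_cost / en_cost - 1:.0%} more")
# 0.036 0.06 67% more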

4.2 Real Project Data: Lessons I Learned

Last year I built a multilingual customer service bot for a company. I didn't consider this token difference initially, and was shocked when the monthly bill came:

Language   Avg Tokens/Conversation   Monthly Cost (10K conversations)
English    150                       $45
Chinese    240                       $72
Japanese   280                       $84
Arabic     320                       $96

See? The further from English, the more expensive. The boss asked why Chinese customer service cost so much more than English, and I was completely confused at the time.

5. How to Reduce Chinese Token Costs?

5.1 Choose the Right Model

Chinese models are great:

  • ERNIE: Specifically optimized for Chinese, high token efficiency
  • Qwen: Alibaba's product, reasonably priced
  • GLM: From Tsinghua, really good Chinese understanding

For international models:

  • Ranked by Chinese token efficiency: Claude is the most Chinese-friendly, GPT-4 comes second, GPT-3.5 is the worst

5.2 Optimize Input Text

Reduce redundancy:

# Before optimization
"Please help me analyze the specific solution to this problem"

# After optimization
"Please analyze this problem's solution"

Use concise expressions:

# Before optimization
"In this situation, we need to consider various different factors"

# After optimization
"Consider various factors"

5.3 Mixed Language Strategy

For technical documentation, consider Chinese-English mixing:

# Pure Chinese (12 tokens)
"使用机器学习算法进行数据分析"

# Chinese-English mix (8 tokens)
"使用machine learning进行数据分析"

5.4 Batch Processing

Combine multiple small requests into one large request:

# Low efficiency (system prompt overhead for each call)
for question in questions:
    response = call_api(system_prompt + question)

# High efficiency (process multiple questions at once)
batch_questions = "\n".join(questions)
response = call_api(system_prompt + batch_questions)
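
Here's roughly what the batched version looks like with the official openai Python client. The model name and prompts are placeholders, so treat it as a sketch rather than production code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = "You are a concise assistant. Answer each numbered question."
questions = ["什么是token?", "为什么中文更贵?", "怎么省钱?"]

# One request instead of N: the system prompt is only paid for once.
batch = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder - use whatever model fits your budget
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": batch},
    ],
)
print(response.choices[0].message.content)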

6. Future Trends

6.1 Multilingual Tokenization Optimization

Major companies are optimizing multilingual support:

  • Google's SentencePiece: Better multilingual support
  • Meta's XLM-RoBERTa: a shared SentencePiece vocabulary trained jointly across roughly 100 languages
  • OpenAI's new versions: GPT-5 reportedly will significantly improve Chinese tokenization

6.2 Dynamic Tokenization

Future models might automatically adjust tokenization strategies based on language:

# Possible future API call
response = call_api(
    text="你好世界",
    language="zh",  # Automatically optimize Chinese tokenization
    optimize_tokens=True
)

6.3 Cost Transparency

More platforms are providing token estimation tools:

# Token cost estimation
estimate = estimate_cost(
    text="Your text content",
    model="gpt-4",
    language="zh"
)
print(f"Estimated cost: ${estimate.cost}")
print(f"Token count: {estimate.tokens}")

7. Practical Tool Recommendations

7.1 Token Calculators

Python libraries:

import tiktoken

# GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Your text")
print(f"Token count: {len(tokens)}")

7.2 Cost Monitoring

class TokenCostTracker:
    """Accumulates token usage and estimated (input-token) cost across API calls."""

    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0.0

    def track_request(self, tokens, model="gpt-4"):
        # Record one request's token count and return its estimated cost.
        cost = self.calculate_cost(tokens, model)
        self.total_tokens += tokens
        self.total_cost += cost
        return cost

    def calculate_cost(self, tokens, model):
        # Illustrative input-token rates in USD per token; check current pricing.
        rates = {
            "gpt-4": 0.03 / 1000,  # $0.03 per 1K tokens
            "gpt-3.5": 0.002 / 1000,
            "claude": 0.008 / 1000,
        }
        return tokens * rates.get(model, 0.03 / 1000)
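
Usage is straightforward (the token counts below are just the ones from the article example):

tracker = TokenCostTracker()
tracker.track_request(1200, model="gpt-4")  # English article
tracker.track_request(2000, model="gpt-4")  # Chinese article
print(f"Total: {tracker.total_tokens} tokens, ${tracker.total_cost:.3f}")
# Total: 3200 tokens, $0.096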

7.3 Text Optimization Assistant

import re

def optimize_chinese_text(text):
    """Optimize Chinese text to reduce token consumption"""
    # Collapse runs of repeated punctuation into a single mark
    text = re.sub(r'[,。!?;:]{2,}', lambda m: m.group()[0], text)
    
    # Replace redundant expressions
    replacements = {
        '在这种情况下': '此时',
        '根据以上分析': '综上',
        '需要注意的是': '注意',
    }
    
    for old, new in replacements.items():
        text = text.replace(old, new)
    
    return text
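
A quick way to sanity-check the savings; the sample sentence is made up, and the exact counts depend on which tokenizer you measure with:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
original = "在这种情况下,需要注意的是模型的选择。"
optimized = optimize_chinese_text(original)
print(optimized)  # "此时,注意模型的选择。"
print(len(enc.encode(original)), "->", len(enc.encode(optimized)))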

8. Summary

Tokenization mechanisms make Chinese more "expensive" than English - this isn't discrimination, but determined by technical characteristics:

  1. Root cause: Large Chinese character set, complex language characteristics
  2. Cost difference: Chinese typically costs 30-70% more than English
  3. Optimization strategies: Choose appropriate models, optimize text, batch processing
  4. Future trends: Multilingual optimization, dynamic tokenization, cost transparency

After all this explanation, there's one simple principle: don't blindly chase the newest, most expensive models. For Chinese applications, domestic models are often more cost-effective with comparable understanding capabilities.

My current strategy is simple: prioritize Chinese models for Chinese business, only consider GPT for English or multilingual needs. This saves quite a bit of money each month.

What to do with the saved money? Buy coffee, upgrade equipment, or treat the team to dinner - isn't that better than paying OpenAI? 😄


Have you encountered cost issues when using large language model APIs? Feel free to share your experiences and optimization tips in the comments!

