Prompt Engineering for Translation: Getting LLMs to Translate Like Professionals
System prompts, formality controls, glossary injection, few-shot examples, and markup preservation techniques for high-quality LLM translation.
Using ChatGPT or Claude for translation by typing "translate this to Spanish" works surprisingly well for casual use. But it leaves a lot of quality on the table. With the right prompting, you can get output that approaches professional human translation — and for some content types, matches it.
Here's what actually moves the needle.
The baseline system prompt
A bare "translate to German" prompt produces generic, middle-of-the-road translations. A good system prompt sets the ground rules:
```
You are a professional translator specializing in software documentation.
Translate the following text from English to German.

Rules:
- Use formal register (Sie, not du)
- Preserve all Markdown formatting exactly
- Do not translate text inside code blocks or inline code
- Do not translate URLs or file paths
- Keep brand names and product names in English
- Use German technical terminology where established terms exist
  (e.g., "Datenbank" not "Database", but keep "API" as "API")
- Output only the translated text with no explanations or notes
```
Every sentence in that prompt prevents a specific class of error I've seen in production. Let's break them down.
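Wired into an actual request, the system prompt rides alongside the source text in a chat-style payload. A minimal sketch — the payload shape follows OpenAI-style chat APIs, but the `build_request` helper and the model name are illustrative, not a specific provider's SDK:

```python
SYSTEM_PROMPT = """You are a professional translator specializing in software documentation.
Translate the following text from English to German.
...
Output only the translated text with no explanations or notes."""


def build_request(source_text, model="gpt-4o"):
    """Assemble a chat-style request payload.

    The exact client call depends on your provider; only the
    messages structure is shown here."""
    return {
        "model": model,
        "temperature": 0,  # translation wants deterministic output
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": source_text},
        ],
    }
```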
Formality control
Many languages have formal and informal registers. German has du/Sie, French has tu/vous, Japanese has multiple levels of keigo (honorific speech), and Korean has seven speech levels.
Without explicit instruction, LLMs default to a mix — sometimes formal, sometimes informal, sometimes switching mid-document. Specify the register:
```
Use formal register throughout:
- German: Sie-Form
- French: vouvoiement
- Japanese: です/ます form (desu/masu)
- Korean: 합쇼체 (formal polite)
```
For user-facing product copy, formal is usually safer. For developer docs, informal often reads better. For marketing, it depends on the brand voice. The point is to decide explicitly rather than letting the model guess.
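In a pipeline, that decision works better as an explicit lookup than as ad-hoc prose in each prompt. A sketch — the table entries and function name are illustrative:

```python
# Map (language, register) to an explicit prompt instruction.
FORMALITY_INSTRUCTIONS = {
    ("de", "formal"): "Use formal register (Sie-Form) throughout.",
    ("de", "informal"): "Use informal register (du-Form) throughout.",
    ("fr", "formal"): "Use vouvoiement throughout.",
    ("ja", "formal"): "Use です/ます (desu/masu) form throughout.",
}


def formality_instruction(lang, register="formal"):
    """Return the register instruction to append to the system prompt.

    Raising on unknown combinations forces the decision to be made
    explicitly rather than letting the model guess."""
    try:
        return FORMALITY_INSTRUCTIONS[(lang, register)]
    except KeyError:
        raise ValueError(f"No formality rule defined for {lang}/{register}")
```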
Glossary injection
Technical products have terminology that must be translated consistently. "Workspace" should always be "Arbeitsbereich" in German, not sometimes "Arbeitsplatz" or "Arbeitsumgebung."
Inject a glossary in the system prompt:
```
Use this glossary for consistent terminology:
- workspace → Arbeitsbereich
- deployment → Bereitstellung
- pipeline → Pipeline (keep in English)
- repository → Repository (keep in English)
- pull request → Pull Request (keep in English)
- branch → Branch (keep in English)
- dashboard → Dashboard (keep in English)
- endpoint → Endpunkt
```
Keep the glossary under 50 terms — LLMs follow shorter, focused glossaries more reliably than exhaustive ones. Prioritize terms where inconsistency would confuse users or where the translation is non-obvious.
For larger glossaries, a two-pass approach works: first identify which glossary terms appear in the source text, then include only those in the prompt.
```python
relevant_terms = {k: v for k, v in glossary.items() if k.lower() in source_text.lower()}
```
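A fuller sketch of the two-pass idea, rendering only the matched terms into a prompt section. The function name and output format are illustrative:

```python
def build_glossary_section(glossary, source_text):
    """Return a prompt section containing only the glossary terms
    that actually appear in the source text."""
    lowered = source_text.lower()
    relevant = {k: v for k, v in glossary.items() if k.lower() in lowered}
    if not relevant:
        return ""
    lines = [f"- {src} → {dst}" for src, dst in relevant.items()]
    return "Use this glossary for consistent terminology:\n" + "\n".join(lines)
```

Note that substring matching is deliberately crude — "workspace" also matches "workspaces" — which is usually what you want here: better to include a glossary term unnecessarily than to miss an inflected form.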
Few-shot examples
If you have existing human translations that match your desired style, include them as examples:
```
Here are examples of the translation style to follow:

English: "Click Save to apply your changes."
German: "Klicken Sie auf Speichern, um Ihre Änderungen zu übernehmen."

English: "Your deployment is in progress. This usually takes 2-3 minutes."
German: "Ihre Bereitstellung wird durchgeführt. Dies dauert in der Regel 2–3 Minuten."

Now translate the following:
```
Three to five examples are enough. More than that and you're spending tokens without proportional quality gains. Choose examples that demonstrate:
- The correct formality level
- How to handle UI element names (bold, quotes, etc.)
- Technical terminology preferences
- Sentence length and structure preferences
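Assembling the few-shot block programmatically from a store of (source, translation) pairs keeps the style consistent across requests. A minimal sketch, with an illustrative function name:

```python
def build_few_shot_prompt(examples, source_text, max_examples=5):
    """Assemble a translation prompt from few-shot example pairs.

    `examples` is a list of (english, german) tuples drawn from
    existing human translations; only the first `max_examples`
    are used, since more adds tokens without proportional gain."""
    parts = ["Here are examples of the translation style to follow:", ""]
    for en, de in examples[:max_examples]:
        parts.append(f'English: "{en}"')
        parts.append(f'German: "{de}"')
        parts.append("")
    parts.append("Now translate the following:")
    parts.append(source_text)
    return "\n".join(parts)
```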
Markup preservation
This is the most common failure mode when using LLMs for translation in production. The model "helpfully" reformats your Markdown, changes HTML tags, or breaks JSX syntax.
Explicit instructions help but aren't sufficient alone:
```
CRITICAL: Preserve all markup exactly. This includes:
- Markdown characters: **, [], (), #, -, >
- Code fences and inline code
- HTML tags, template variables, etc.
```
For reliability, use a three-step process:

1. Pre-process: Replace markup with numbered placeholders
2. Translate: Send the cleaned text with placeholders
3. Post-process: Restore markup from placeholders
```python
import re

def protect_markup(text):
    placeholders = {}
    counter = [0]

    def replace(match):
        counter[0] += 1
        key = f"__PH{counter[0]}__"
        placeholders[key] = match.group(0)
        return key

    # Protect code blocks
    protected = re.sub(r'```[\s\S]*?```', replace, text)
    # Protect inline code
    protected = re.sub(r'`[^`]+`', replace, protected)

    # Protect link URLs, keeping the link text translatable
    def replace_url(match):
        counter[0] += 1
        key = f"__PH{counter[0]}__"
        placeholders[key] = match.group(2)
        return f'[{match.group(1)}]({key})'

    protected = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', replace_url, protected)
    # Protect template variables
    protected = re.sub(r'\{[^}]+\}', replace, protected)

    return protected, placeholders


def restore_markup(translated, placeholders):
    for key, value in placeholders.items():
        translated = translated.replace(key, value)
    return translated
```
This is more robust than relying on the LLM to preserve markup, though it does remove context that might help translation (like knowing that {userName} is a person's name).
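The round trip looks like this, reduced here to template variables only:

```python
import re

def protect_vars(text):
    """Swap {template} variables for numbered placeholders."""
    placeholders = {}

    def replace(match):
        key = f"__PH{len(placeholders) + 1}__"
        placeholders[key] = match.group(0)
        return key

    return re.sub(r'\{[^}]+\}', replace, text), placeholders


def restore_vars(text, placeholders):
    """Swap the placeholders back in after translation."""
    for key, value in placeholders.items():
        text = text.replace(key, value)
    return text


protected, mapping = protect_vars("Hello {userName}, you have {count} messages.")
# protected == "Hello __PH1__, you have __PH2__ messages."
restored = restore_vars(protected, mapping)
# restored == "Hello {userName}, you have {count} messages."
```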
Temperature and sampling
For translation, use temperature 0 or very close to it. Translation has a relatively narrow range of "correct" outputs, and higher temperatures introduce unnecessary variation:
- Temperature 0: Deterministic, consistent output
- Temperature 0.1-0.3: Slight variation, sometimes produces more natural phrasing
- Temperature 0.7+: Too much variation, inconsistent terminology
Chunking strategy
LLMs have context windows, and translation uses roughly 2x the input length in tokens (input + output). For long documents, you need to chunk — but where you split matters enormously.
Bad: Split every N tokens regardless of content boundaries.
Good: Split on paragraph boundaries, and include the previous paragraph as context:

```python
def chunk_for_translation(paragraphs, max_tokens=2000):
    chunks = []
    current_chunk = []
    current_tokens = 0
    prev_tail = ""  # last paragraph of the previous chunk, used as context

    for para in paragraphs:
        para_tokens = count_tokens(para)  # e.g. via tiktoken
        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append({"context": prev_tail, "translate": current_chunk})
            prev_tail = current_chunk[-1]
            current_chunk = [para]
            current_tokens = para_tokens
        else:
            current_chunk.append(para)
            current_tokens += para_tokens

    if current_chunk:
        chunks.append({"context": prev_tail, "translate": current_chunk})

    return chunks
```
Post-translation validation
Even with good prompts, validate the output:
```python
import re

def validate_translation(source, translated):
    errors = []

    # Check placeholder preservation
    source_placeholders = re.findall(r'\{[^}]+\}', source)
    translated_placeholders = re.findall(r'\{[^}]+\}', translated)
    if set(source_placeholders) != set(translated_placeholders):
        errors.append("Placeholder mismatch")

    # Check markup balance
    for marker in ['**', '`', '[', ']']:
        if source.count(marker) != translated.count(marker):
            errors.append(f"Unbalanced markup: {marker}")

    # Check for untranslated content (heuristic)
    if len(translated) < len(source) * 0.5:
        errors.append("Translation suspiciously short")

    # Check for added explanations
    if "Note:" in translated and "Note:" not in source:
        errors.append("Model may have added explanatory text")

    return errors
```
If validation fails, retry with a more explicit prompt. If it fails repeatedly, flag for human review.
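That retry-then-escalate loop can be sketched as follows — the escalation message, function names, and return shape are illustrative:

```python
def translate_with_validation(source, translate_fn, validate_fn, max_attempts=3):
    """Retry translation with an increasingly explicit prompt;
    flag for human review if validation keeps failing."""
    extra = ""
    for _ in range(max_attempts):
        translated = translate_fn(source, extra_instructions=extra)
        errors = validate_fn(source, translated)
        if not errors:
            return {"text": translated, "needs_review": False}
        # Escalate: repeat the preservation rules more forcefully
        extra = (
            "CRITICAL: preserve every {placeholder} and every markup "
            "character exactly as in the source."
        )
    return {"text": translated, "needs_review": True}
```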
Putting it together
Services like auto18n handle most of this plumbing internally — glossary management, markup protection, formality settings, quality validation. But if you're building your own translation pipeline on raw LLM APIs, these are the techniques that separate "passable" from "professional-grade" output.
The single highest-ROI technique is the glossary. Consistent terminology is the number one thing professional translators enforce, and it's the easiest thing to automate with prompt engineering. Start there, add formality controls, then layer on few-shot examples if you need tighter style control.