Machine Translation

Ankit kumar
5 min read · Dec 10, 2024


ML System Design

fig 1.

Project Scope

Business objective:

The goal is to build a multilingual system capable of translating input into any specified language. From a business perspective, the objective is to enhance user engagement and trust by providing seamless and reliable translations, encouraging users to adopt and continue using the product.

This multilingual system can be integrated into any social media application that requires language translation.

Clarifying questions to ask:

  1. Upstream/Downstream Tasks: Clarify if there are domain-specific requirements or dependencies to consider while designing the system.
  2. Real-Time Constraints: Determine if the system needs to work in real-time, requiring low latency and potentially smaller model sizes.
  3. Languages Scope: Will the system support hundreds of languages, including high-resource and low-resource ones?
  4. Dataset Availability: Confirm whether labeled datasets exist, including parallel translations for both high- and low-resource languages, with expert-produced translations for the low-resource ones.
  5. Baseline Systems: Check for any existing systems or business/offline metrics to serve as a baseline for improvements.

Functional Requirements:

The ML system should translate text accurately, including low-resource languages, run efficiently, and thereby enhance user engagement.

Non-Functional Requirements:

  1. Scalability: Handle billions of users with low latency and real-time performance.
  2. Availability: Ensure high system uptime.
  3. Computational Resources: Optimize resource use (GPUs, CPUs) and balance model size with data size within the computation budget.
  4. Tooling: System monitoring, debugging tools, and alerts for production issues; research experiment tracking and model registries via tools like Weights & Biases or MLflow.

Offline & Online Metrics

Offline Metrics: ROUGE, BLEU, chrF++, BERTScore, or LASER-based similarity scores computed with multilingual models, complemented by human evaluation.
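To make the offline metrics concrete, here is a toy sentence-level BLEU: the geometric mean of modified n-gram precisions with a brevity penalty. This is a simplified sketch for intuition only; real evaluations should use a library such as sacrebleu, which handles tokenization, corpus-level aggregation, and smoothing properly.

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)       # smooth zeros
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = simple_bleu("the cat sat on the mat", "the cat sat on the mat")  # perfect match -> 1.0
```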

Online Metrics: Measure real-world system performance, balancing business goals and user experience.

  1. User interactions (implicit): Retention rate and user-system interaction statistics.
  2. User feedback (explicit)
  3. Task success rate
  4. System Performance Metrics: Latency, Throughput, Cost per Translation Request.

System Workflow Overview

fig 2.

Data Pipeline

1. Data Collection

Sources:

  • Parallel Data: Sentence pairs from public datasets, web scraping, data mining (e.g., LASER for bitext mining), and crowdsourcing for low-resource languages.
  • Monolingual Data: Used for back-translation when parallel data is scarce.
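Bitext mining pairs up sentences from two monolingual corpora by embedding them in a shared multilingual space (as LASER does) and keeping high-similarity matches. The sketch below uses plain cosine similarity over toy embedding vectors; production LASER mining uses margin-based scoring over approximate nearest neighbors (e.g. FAISS), not this brute-force loop.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_bitext(src_embs, tgt_embs, threshold=0.8):
    """Pair each source sentence with its best-scoring target sentence,
    keeping only pairs above a similarity threshold."""
    pairs = []
    for i, se in enumerate(src_embs):
        scores = [cosine(se, te) for te in tgt_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

# Toy 2-d "embeddings": sentence 0 aligns with target 0, 1 with 1.
mined = mine_bitext([[1.0, 0.0], [0.0, 1.0]],
                    [[0.9, 0.1], [0.1, 0.9]])
```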

Challenges:

  • Low-resource languages require data mining, back-translation, and crowdsourcing.
  • Domain-specific needs may necessitate curated data.

2. Data Annotation and Labels

  • Ensure sentence pairs are consistent and accurate, especially for low-resource languages.
  • Human labellers validate alignment and translation quality.

3. Features Involved

  • User Input: Text to translate.
  • Languages: Source and target languages, including low-resource ones.
  • Context: Domain-specific information or embeddings.
  • Quality Indicators: Sentence length, alignment, and language detection.

4. Data Preprocessing

  • Tokenization: Use subword approaches like Byte Pair Encoding (BPE) or SentencePiece for robust handling of multiple languages.
  • Cleaning: Remove noisy sentences, non-text elements, or mismatched pairs.
  • Normalization: Standardize casing, punctuation, and special characters.
  • Language detection and sentence alignment: Detect the language of each side and verify that parallel sentences are aligned as intended.
  • Use language-specific preprocessing for complex languages (e.g., Chinese segmentation or Arabic diacritics).
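The core idea behind BPE-style subword tokenization is simple: starting from characters, repeatedly merge the most frequent adjacent symbol pair. A minimal learning loop over a toy word-frequency dictionary (SentencePiece and similar libraries implement this, plus unigram LM variants, at scale):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.
    Each word is split into symbols plus an end-of-word marker;
    each iteration fuses the most frequent adjacent symbol pair."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
# The shared stem "low" is learned first: ('l','o') then ('lo','w').
```

Because merges are learned from frequency alone, the same procedure works for any language, which is what makes subword vocabularies robust across a multilingual system.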

Augmentation:

  • Back-translation and transfer learning from related high-resource languages.
  • Cross-lingual embeddings for low-resource languages.
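The back-translation recipe is: run monolingual target-language text through a reverse (target-to-source) model, then train on the resulting (synthetic source, real target) pairs. The sketch below stubs the reverse model with a placeholder function, since a real one would be a trained NMT system; the stub name and tag are hypothetical.

```python
def reverse_translate(sentence):
    """Stub for a trained target->source NMT model (hypothetical).
    A real system would run reverse-direction inference here."""
    return "<synthetic-source> " + sentence

def back_translate(monolingual_target):
    """Create synthetic parallel pairs from monolingual target text.
    Training uses (synthetic source, real target): the decoder always
    sees clean, human-written target-language sentences."""
    return [(reverse_translate(t), t) for t in monolingual_target]

pairs = back_translate(["guten Tag", "wie geht es dir"])
```

The key design point is that the noisy, machine-generated side is the *source*; the model still learns to produce fluent target text.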

5. Data Splitting

  • Typical split: 70% training, 15% validation, 15% testing.
  • Use temporal splits for time-sensitive data (e.g., news).
  • Ensure domain-specific splitting to generalize across multiple domains.
  • Prevent data leakage by strictly separating datasets.
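A temporal split, as suggested above for time-sensitive data, simply sorts by timestamp before cutting, so validation and test examples come strictly after everything seen in training. A minimal sketch with the 70/15/15 ratios:

```python
def temporal_split(examples, train_frac=0.70, val_frac=0.15):
    """Split time-stamped examples chronologically so no future text
    leaks into training; the remainder after train+val becomes test."""
    ordered = sorted(examples, key=lambda ex: ex["timestamp"])
    n = len(ordered)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

data = [{"timestamp": t, "text": f"sentence {t}"} for t in range(20)]
train, val, test = temporal_split(data)
```

A random split of the same data would scatter future sentences into training, inflating offline metrics relative to production performance.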

6. Synthetic Data Generation

  • Use LLMs to generate new sentence pairs while maintaining context.
  • Validate generated data consistency with another LLM.

7. Distribution Analysis

  • Analyze data distribution for consistency and diversity.
  • Check for bias and under-representation of specific groups.

8. Data Privacy and Security

  • Anonymization: Encrypt or anonymize PII to safeguard privacy.
  • Compliance: Adhere to regulations like GDPR and CCPA.
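A first line of defense for anonymization is rule-based masking of obvious PII patterns before text enters the training corpus. The regexes below are deliberately simple illustrations; a production pipeline would combine rules with a trained NER model and audit its coverage.

```python
import re

# Simple illustrative patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def anonymize(text):
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

masked = anonymize("Contact jane.doe@example.com or +1 (555) 123-4567.")
```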

9. Pipeline Automation (ETL)

  • Automate the Extract, Transform, Load (ETL) process for efficient handling of new data.
  • Regularly update datasets to reflect current trends and maintain system relevance.

Model Pipeline

fig. 3

1. Baselines and Starting Point:

  • Use open-source baselines or pre-trained models such as NLLB or LASER 2/3 for reference.
  • Fine-tune pre-trained models to align with specific objectives and datasets.
  • Set up initial benchmarks considering computation and latency constraints.

2. Translation Approaches:

Single-to-Single Approach (Direct Models):

  • Pros: High-quality translations, domain fine-tuning possible.
  • Cons: Requires separate models for each language pair; scaling and cross-lingual transfer are challenging.

Pivot-Based Approach:

  • Pros: Reduces dependency on parallel data for low-resource languages.
  • Cons: Quality degradation due to error propagation, increased latency.

Single Multilingual Model (Many-to-Many):

  • Pros: Efficient scaling, cross-lingual knowledge transfer, and zero/few-shot capabilities.
  • Cons: Requires strategies for data imbalance and model complexity. Risk of interference and catastrophic forgetting.

3. Model Architectures:

  • Sequence-to-Sequence Models (Seq2Seq): Early encoder-decoder models with limited ability to capture long-range dependencies.
  • RNNs/LSTMs: Improved sequence modeling but struggle with long-range dependencies and parallelization.
  • Transformers: Self-attention mechanisms for parallel processing, better handling of long dependencies, and superior performance.
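The operation that gives Transformers their edge is scaled dot-product attention: softmax(QKᵀ/√d_k)V, where every position attends to all others in parallel instead of stepping through the sequence as an RNN does. A minimal NumPy sketch of a single (unmasked, single-head) attention call:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head.
    Each query row produces a weighted mix of the value rows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                                       # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 target positions querying...
K = rng.normal(size=(6, 8))   # ...6 source positions
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
```

In a real translation model this runs per head with learned projections, plus causal masking on the decoder side; the sketch omits both for clarity.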

4. Training Strategies:

  • Self-Supervised Learning (SSL): Train on monolingual corpora to improve low-resource language performance.
  • Back-Translation: Generate synthetic parallel data for low-resource languages.
  • Knowledge Distillation: Use large teacher models to train smaller, efficient student models.
  • Conditional Compute (MoE): Sparsely activate sub-models to reduce interference and scale capacity efficiently.
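For knowledge distillation, the student is trained against the teacher's temperature-softened output distribution rather than the hard label alone. A minimal sketch of the soft-target cross-entropy term (in practice it is combined with the standard hard-label loss and scaled by T²):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.
    The temperature > 1 exposes the teacher's relative preferences over
    non-argmax classes ('dark knowledge') to the student."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

The loss is minimized exactly when the student matches the teacher's distribution, not just its top prediction.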

5. Loss Function and Optimizer:

  • Loss: Cross-Entropy for sequence-to-sequence tasks.
  • Optimizer: AdamW for better convergence and regularization in large-scale transformers.
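NMT transformers almost always pair this cross-entropy loss with label smoothing: the target token keeps probability 1 − ε and the remaining ε is spread uniformly over the vocabulary, discouraging over-confident predictions. A sketch for a single token position:

```python
import math

def label_smoothed_ce(log_probs, target, epsilon=0.1):
    """Cross-entropy against the smoothed target distribution
    q(k) = (1 - eps) * [k == target] + eps / V."""
    V = len(log_probs)
    nll = -log_probs[target]             # loss on the true token
    uniform = -sum(log_probs) / V        # mean loss vs. a uniform target
    return (1 - epsilon) * nll + epsilon * uniform
```

With ε = 0 this reduces to the ordinary negative log-likelihood; with ε > 0 the model is penalized for pushing all probability mass onto one token.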

6. Challenges and Mitigations:

  • Low-Resource Languages: Use back-translation, cross-lingual embeddings, and transfer learning.
  • Interference: Apply conditional compute or curriculum learning.
  • Bias and Safety: Incorporate diverse data, content moderation, and factuality checks.

7. Post-Processing:

  • Address toxicity, hallucination, bias, and relevance to the input context.

8. Iterative Improvements:

  • Perform regular error analysis on false positives/negatives.
  • Revisit the data pipeline for improved datasets or enhance the model pipeline.

9. Compression and Deployment:

  • Apply quantization, model pruning, and knowledge distillation for efficiency.
  • Distributed Training: Use DDP or ZeRO for large-scale training.
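The essence of post-training quantization is mapping float weights onto a small integer range with a scale factor. A minimal symmetric int8 sketch; frameworks (e.g. PyTorch dynamic quantization) do this per-tensor or per-channel with calibration, which this toy version omits:

```python
def quantize_int8(weights):
    """Symmetric quantization: one scale maps floats into [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per weight is at most scale / 2."""
    return [qi * scale for qi in q]

q, scale = quantize_int8([0.5, -1.0, 0.25, 0.9])
recovered = dequantize(q, scale)
```

Storing 8-bit integers plus one scale cuts weight memory roughly 4x versus float32, at the cost of the small rounding error visible in `recovered`.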

10. Advanced Considerations:

Bias Mitigation and Robustness in Machine Translation Models:

  • Diverse Data: Use datasets that represent various demographics, cultures, and languages to reduce bias.
  • Post-Training Techniques: Apply bias detection algorithms and adversarial training to minimize biased outputs.
  • Content Moderation: Use a parallel classifier to detect harmful or biased content in translations.
  • Factuality Checking: Integrate external knowledge sources like Wikipedia or knowledge graphs to verify the accuracy of translations.
  • Tradeoff: Ensuring safety and reducing bias can compromise the model’s fluency and diversity in responses, requiring a balance between fairness and naturalness.

This pipeline combines state-of-the-art techniques with scalability, addressing challenges like low-resource languages and computational constraints while ensuring fairness, robustness, and efficiency.

For Model Evaluation, Deployment, Serving & Monitoring

References:

  1. https://ai.meta.com/research/no-language-left-behind/
  2. https://arxiv.org/pdf/2205.12654
  3. https://arxiv.org/pdf/2210.05033
  4. https://github.com/facebookresearch/LASER
