Translating Text with M2M100: A Practical Guide
This article demonstrates how to use the M2M100 model from Hugging Face to translate text between languages. It's a great example for intermediate learners.
By James Lau, Indie App Developer (Self-employed)
Introduction
This blog post explores how to translate text using the M2M100 model from Hugging Face Transformers. M2M100 is a powerful multilingual translation model capable of translating between many different languages.
What is M2M100?
M2M100 (Many-to-Many multilingual translation) is a transformer-based neural machine translation model developed by Facebook AI (now Meta AI). It is trained on a large corpus of parallel text covering 100 languages and can translate directly between any pair of them, without pivoting through English.
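As a quick sanity check, you can list the language codes the model supports directly from its tokenizer. This is a minimal sketch, assuming the lang_code_to_id mapping exposed by M2M100Tokenizer in recent versions of transformers:

from transformers import M2M100Tokenizer

# Load the tokenizer that ships with the 418M-parameter checkpoint
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# lang_code_to_id maps each supported language code (e.g. "en", "zh", "fr") to a token id
codes = sorted(tokenizer.lang_code_to_id)
print(len(codes))   # should report 100 supported languages
print(codes[:10])   # a few example language codes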
Prerequisites
Before we start, make sure you have the following installed:
pip install transformers torch sentencepiece langid
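To confirm the packages installed correctly, you can run a quick check before moving on. This snippet is only a verification step, not part of the translation code:

import transformers
import torch
import langid

print("transformers", transformers.__version__)
print("torch", torch.__version__)
print(langid.classify("hello world"))  # e.g. ('en', score)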
Code Explanation
The provided Python code demonstrates a simple translation process. Let’s break it down step by step:
- Import Libraries: We import transformers for the M2M100 model, langid to detect the source language, and timing_decorator (from a local decorator_helper module) to measure execution time; a simple implementation of this decorator is sketched after this list.
- Model Initialization: We initialize the M2M100ForConditionalGeneration model with the facebook/m2m100_418M pre-trained weights. This line downloads the model if it is not already cached on your system.
- Language Detection: The detect_language function uses the langid library to identify the language of the input text.
- Translation Function: The translate function takes the text and the target language as input. It first detects the source language, sets it on the tokenizer, and encodes the text. It then generates the translation with the model, forcing the first generated token to be the target-language token, and finally decodes the generated tokens to produce the translated text.
- Main Function: The main function defines a sample text in Chinese ("不要插手巫師的事,因為他們既狡猾且容易發怒。", roughly "Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") and calls the translate function to render it in English. The translated text is then printed to the console.
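The timing_decorator comes from a local decorator_helper module rather than a published package, so you need to supply it yourself. Here is a minimal sketch of what such a module might contain; the implementation below is an assumption, not the author's original code:

# decorator_helper.py - hypothetical minimal implementation
import time
from functools import wraps

def timing_decorator(func):
    """Print how long the wrapped function takes to run."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.2f}s")
        return result
    return wrapper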
Real-World Applications
This technique can be applied to various real-world scenarios, such as:
- Customer Support: Automatically translate customer inquiries from different languages to provide faster and more efficient support.
- Global Content Creation: Translate articles, marketing materials, and other content into multiple languages to reach a wider audience.
- International Communication: Facilitate communication between people who speak different languages.
Conclusion
The M2M100 model provides a powerful and convenient way to translate text between languages. It’s an excellent tool for anyone working with multilingual data or needing to communicate with people from different cultural backgrounds. By understanding the underlying principles and techniques, you can leverage this technology to solve a wide range of real-world problems.
Source Code
Here is the complete Python code:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import langid
from decorator_helper import timing_decorator  # local helper module (see sketch above)

model_name = "facebook/m2m100_418M"
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
model.eval()
tokenizer = M2M100Tokenizer.from_pretrained(model_name)

def detect_language(text):
    lang, _ = langid.classify(text)
    return lang

def translate(text, target_lang="en"):
    # Detect the source language
    source_lang = detect_language(text)
    print(f"Detected source language: {source_lang}")

    # Prepare the text with the source language token
    tokenizer.src_lang = source_lang
    encoded_text = tokenizer(text, return_tensors="pt")

    # Generate the translation, forcing the first generated token
    # to be the target-language token
    generated_tokens = model.generate(
        **encoded_text,
        forced_bos_token_id=tokenizer.get_lang_id(target_lang),
    )
    translated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)
    return translated_text

@timing_decorator
def main():
    text = "不要插手巫師的事,因為他們既狡猾且容易發怒。"
    translated_text = translate(text=text)
    print(translated_text)

if __name__ == "__main__":
    main()
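Because translate accepts a target_lang argument, the same script can produce output in any language M2M100 supports. As a small usage sketch (assuming "fr" and "de" as example target codes, both of which this checkpoint covers), you could replace the body of main with:

# Translate the same sentence into several target languages
text = "不要插手巫師的事,因為他們既狡猾且容易發怒。"
for lang in ["en", "fr", "de"]:
    print(lang, "->", translate(text, target_lang=lang))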