Translating Text with M2M100: A Practical Guide

This article demonstrates how to use the M2M100 model from Hugging Face to translate text between languages. It's a great example for intermediate learners.

Authors
  • James Lau, Indie App Developer (self-employed)

This blog post explores how to translate text using the M2M100 model from Hugging Face Transformers. M2M100 is a multilingual translation model that can translate directly between any pair of the 100 languages it supports.

What is M2M100?

M2M100 (Many-to-Many multilingual translation) is a transformer-based neural machine translation model developed by Facebook AI. It’s trained on a large dataset of parallel text covering 100 languages, and it can translate directly between any pair of them without going through English as an intermediate step.

Prerequisites

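Before we start, make sure you have the following installed. Besides langid, the code needs transformers, PyTorch, and sentencepiece (used by the M2M100 tokenizer):

pip install transformers torch sentencepiece langid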
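
If you are not sure whether everything is in place, a quick check like the one below can help. This is only a sketch: the REQUIRED list is my assumption about what the script needs at import time (transformers, torch, and sentencepiece alongside langid), and missing_packages is a hypothetical helper name, not part of any library.

```python
import importlib.util


def missing_packages(names):
    """Return the import names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for this post's script; adjust to your own setup.
REQUIRED = ["transformers", "torch", "sentencepiece", "langid"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisites are installed.")
```

The check only confirms that each package can be found; it does not import the packages or verify their versions.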

Code Explanation

The provided Python code demonstrates a simple translation process. Let’s break it down step by step:

  1. Import Libraries: We import the necessary libraries: transformers for the M2M100 model and tokenizer, langid to detect the source language, and timing_decorator to measure the execution time. Note that timing_decorator comes from decorator_helper, a local helper module that is not included in this post.
  2. Model Initialization: We load M2M100ForConditionalGeneration and M2M100Tokenizer from the facebook/m2m100_418M pre-trained checkpoint. The from_pretrained call downloads the files if they’re not already cached on your system. Calling model.eval() puts the model in inference mode.
  3. Language Detection: The detect_language function uses the langid library to identify the language of the input text. langid.classify returns a language code and a score; we keep only the code.
  4. Translation Function: The translate function takes the text and the target language as input. It first detects the source language and sets it as tokenizer.src_lang, so the tokenizer adds the source language token when it encodes the text. It then generates the translation with the model, passing the target language’s token ID as forced_bos_token_id so that the output starts in the target language. Finally, it decodes the generated tokens to produce the translated text.
  5. Main Function: The main function defines a sample text in Chinese (“不要插手巫師的事,因為他們既狡猾且容易發怒。”) and calls the translate function, which translates into English by default. The translated text is then printed to the console.
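
One thing the steps above gloss over: the language detector and the translation model do not cover exactly the same set of languages, so the detected code may be one the tokenizer does not recognize (Latin, "la", is one example I believe langid can return but M2M100 does not cover). A minimal sketch of a guard follows. SUPPORTED_LANGS and resolve_source_lang are hypothetical names, and the set is a small illustrative subset, not M2M100's real language list.

```python
# Hypothetical subset of language codes, for illustration only.
# In practice you would build this set from the tokenizer's own language
# table (for instance a mapping such as lang_code_to_id, if your
# transformers version exposes it).
SUPPORTED_LANGS = {"en", "zh", "fr", "de", "es", "ja"}


def resolve_source_lang(detected, supported=SUPPORTED_LANGS, fallback="en"):
    """Return `detected` if it is supported, otherwise `fallback`."""
    if fallback not in supported:
        raise ValueError(f"fallback language {fallback!r} is not supported")
    return detected if detected in supported else fallback


if __name__ == "__main__":
    print(resolve_source_lang("zh"))  # supported, returned unchanged
    print(resolve_source_lang("la"))  # not in the subset, falls back
```

Inside translate, the result of detect_language could be passed through such a guard before it is assigned to tokenizer.src_lang. Falling back to a default is a design choice; raising an error instead may be better when a wrong guess would silently produce a bad translation.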

Real-World Applications

This technique can be applied to various real-world scenarios, such as:

  • Customer Support: Automatically translate customer inquiries from different languages to provide faster and more efficient support.
  • Global Content Creation: Translate articles, marketing materials, and other content into multiple languages to reach a wider audience.
  • International Communication: Facilitate communication between people who speak different languages.

Conclusion

The M2M100 model provides a powerful and convenient way to translate text between languages. It’s an excellent tool for anyone working with multilingual data or needing to communicate with people from different cultural backgrounds. By understanding the underlying principles and techniques, you can leverage this technology to solve a wide range of real-world problems.

Source Code

Here is the complete Python code:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import langid
# decorator_helper is a local helper module (not a PyPI package); it is not shown in this post
from decorator_helper import timing_decorator

model_name = "facebook/m2m100_418M"
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
model.eval()

tokenizer = M2M100Tokenizer.from_pretrained(model_name)

def detect_language(text):
    lang, _ = langid.classify(text)
    return lang

def translate(text, target_lang="en"):
    # Detect the source language
    source_lang = detect_language(text)

    print(f"Detected source language: {source_lang}")

    # Prepare the text with the language tokens
    tokenizer.src_lang = source_lang
    encoded_text = tokenizer(text, return_tensors="pt")

    # Generate translation
    # Unpack the encoded inputs and force the first generated token to be the target language token
    generated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id(target_lang))
    translated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

    return translated_text

@timing_decorator
def main():
    text = "不要插手巫師的事,因為他們既狡猾且容易發怒。"
    translated_text = translate(text=text)
    print(translated_text)

if __name__ == "__main__":
    main()
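
The listing imports timing_decorator from a local decorator_helper module whose contents are not given here. If you don't have that module, something along these lines should work as a stand-in. This is my own minimal sketch built on the standard library, not the author's original helper, and the output format is an assumption.

```python
import functools
import time


def timing_decorator(func):
    """Print how long each call to `func` takes, then return its result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned normally or raised an exception
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.3f} s")
    return wrapper


@timing_decorator
def add(a, b):
    """Tiny example function used to show the decorator in action."""
    return a + b
```

Saving the timing_decorator function in a file named decorator_helper.py next to the script lets the import in the listing resolve. Keep in mind that the first run of main also includes any time spent on language detection setup, so it may be slower than later runs.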