Fine-Tuning T5-Small Model for a Completely New Language: Limbu

Introduction

Natural Language Processing (NLP) is expanding its reach into underserved languages. In this blog post, we'll explore how to fine-tune the T5-Small model to translate from English into Limbu, a Tibeto-Burman language spoken in Nepal and neighboring regions.

Preparing the Data

We created an English-Limbu translation dataset in JSON format, containing over 1,500 pairs. Below is a sample of the data:

[
    {
        "id": 1,
        "translation": {
            "en": "hi",
            "lim": "ᤜᤠᤤ ॥"
        }
    },
    {
        "id": 2,
        "translation": {
            "en": "Let's eat.",
            "lim": "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"
        }
    },
    {
        "id": 3,
        "translation": {
            "en": "We saw it.",
            "lim": "ᤀᤏᤡᤃᤧ ᤁᤴ ᤏᤡᤔᤠᤏᤠ ॥"
        }
    },
    ...
]

The dataset was saved as limbu-english.json.
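For reference, a file in this format can be assembled with Python's built-in json module. The sketch below is illustrative only and assumes the sentence pairs have already been collected as (English, Limbu) tuples:

import json

# Hypothetical list of collected sentence pairs (English, Limbu)
pairs = [
    ("hi", "ᤜᤠᤤ ॥"),
    ("Let's eat.", "ᤀᤠᤏᤡ᤹ ᤆᤠᤶ ॥"),
]

# Build records matching the structure shown above
records = [
    {"id": i + 1, "translation": {"en": en, "lim": lim}}
    for i, (en, lim) in enumerate(pairs)
]

with open("limbu-english.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=4)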

Setting Up the Environment

Install the required libraries in Google Colab:

!pip install transformers datasets evaluate sacrebleu
!pip install transformers[sentencepiece]
!pip install sentencepiece

Load the dataset:

from datasets import load_dataset

path = 'limbu-english.json'
translations = load_dataset('json', data_files=path)
translations = translations["train"].train_test_split(test_size=0.2)
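A quick check confirms that the 80/20 split produced separate training and test sets:

print(translations)
print(len(translations["train"]), "training pairs and", len(translations["test"]), "test pairs")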

Loading the Pretrained Model

We initialized the T5-Small model:

from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Tokenizing the Dataset

We trained a new tokenizer on the Limbu side of the corpus, starting from the pretrained T5 tokenizer, and then tokenized the dataset with a translation prefix:

def get_training_corpus():
    # Yield batches of Limbu sentences so the tokenizer can learn the script
    dataset = translations["train"]
    for start_idx in range(0, len(dataset), 1000):
        yield [item['lim'] for item in dataset[start_idx:start_idx + 1000]['translation']]

# Train a new tokenizer (vocabulary size 52,000) from the existing T5 tokenizer
lim_tokenizer = tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

source_lang = "en"
target_lang = "lim"
prefix = "translate English to Limbu: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return lim_tokenizer(inputs, text_target=targets, max_length=128, truncation=True)

tokenized_translations = translations.map(preprocess_function, batched=True)
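One caveat worth noting: the newly trained tokenizer has a 52,000-token vocabulary, which is larger than t5-small's original embedding matrix, so the model's token embeddings should be resized before training. A minimal sketch, assuming the model and tokenizer defined above:

# Grow the embedding matrix so every id produced by lim_tokenizer maps to a row
model.resize_token_embeddings(len(lim_tokenizer))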

Preparing for Training

The tokenized data was prepared for the TensorFlow model:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=lim_tokenizer, model=checkpoint, return_tensors="tf")

tf_train_set = model.prepare_tf_dataset(
    tokenized_translations["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_translations["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Training the Model

We used AdamWeightDecay for optimization:

from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)
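As an optional variation, transformers also provides create_optimizer, which pairs AdamWeightDecay with a warmup-and-decay learning-rate schedule. A sketch, where the step counts are rough placeholders derived from the batch size and epoch count used below:

from transformers import create_optimizer

# Rough total number of optimizer steps: batches per epoch * epochs
num_train_steps = (len(tokenized_translations["train"]) // 16) * 500

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=100,  # placeholder warmup length
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)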

Let's define the metric to observe while training. The compute_metrics function decodes the model's predictions and labels, then scores them with SacreBLEU (installed earlier):

import numpy as np
import evaluate
from transformers.keras_callbacks import KerasMetricCallback

# SacreBLEU scores detokenized text directly
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    # Strip whitespace and wrap each reference in a list, as SacreBLEU expects
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = lim_tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace the -100 padding used for labels before decoding
    labels = np.where(labels != -100, labels, lim_tokenizer.pad_token_id)
    decoded_labels = lim_tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != lim_tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

# predict_with_generate decodes real generated sequences rather than raw logits
metric_callback = KerasMetricCallback(
    metric_fn=compute_metrics, eval_dataset=tf_test_set, predict_with_generate=True
)

These metrics also appear in the training logs, but we will push the model and its logs to the Hugging Face Hub instead. First, log in:

from huggingface_hub import notebook_login

notebook_login()

and add a callback that pushes checkpoints and logs to the Hub during training:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(output_dir="eng-limbu-t5-001", tokenizer=lim_tokenizer)

callbacks = [
    metric_callback,
    push_to_hub_callback,
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10),
]
history = model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=500, callbacks=callbacks)
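If you also want a local copy alongside what the Hub callback uploads, the fine-tuned weights and tokenizer can be saved explicitly (the directory name below just reuses the output directory from above):

# Save the model and the Limbu tokenizer to a local folder
model.save_pretrained("eng-limbu-t5-001")
lim_tokenizer.save_pretrained("eng-limbu-t5-001")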

Visualizing Training Progress

We visualized the training and validation loss:

import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

Testing the Model

We tested the model pushed to the Hub using the pipeline API:

from transformers import pipeline

translator = pipeline("text2text-generation", model="bedus-creation/eng-limbu-t5-001")
result = translator("translate English to Limbu: Hello")
print(result)

Evaluating with BLEU Score

Finally, we computed a BLEU score as a rough check of translation quality. The prediction should be the model's Limbu output, compared against the reference Limbu translation:

import evaluate

bleu = evaluate.load("bleu")

# The prediction is the model's Limbu output for "hi";
# the reference is the ground-truth translation from the dataset
prediction = translator("translate English to Limbu: hi")[0]["generated_text"]

predictions = [prediction]
references = [["ᤜᤠᤤ ॥"]]

results = bleu.compute(predictions=predictions, references=references)
print(results)
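Since sacrebleu is already installed, the same comparison can be scored with it as well; SacreBLEU is the more common choice for reporting translation quality because it applies a standardized tokenization:

sacrebleu = evaluate.load("sacrebleu")
results = sacrebleu.compute(predictions=predictions, references=references)
print(results["score"])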

Conclusion

Fine-tuning the T5-Small model for Limbu demonstrates the potential of NLP in preserving and advancing underrepresented languages. With more training data and optimization, such models can become invaluable tools for language preservation and cross-cultural communication.
