SwabianGPT: How I fine-tuned an LLM for dialect translation

1/2/2025

Summary

A practical journey through fine-tuning a LLAMA 3.1 model to translate between Standard German and Swabian dialect, covering data preparation with synthetic context, QLoRA-based training techniques, and real-world evaluation results.

Introduction

In October 2024, I began pursuing a Bachelor's degree in Data Science and Artificial Intelligence. As part of our initial coursework, we were tasked with developing a Machine Learning or AI project to gain practical experience in the field.
Having previously studied the architecture and mechanisms of Transformers and Large Language Models (LLMs), I developed a strong interest in this area. This led me to focus my project on model fine-tuning, specifically developing a translation system between Standard German and Swabian dialect.
When selecting a model architecture, I considered specialized translation models like T5. However, given my interest in exploring LLM fine-tuning techniques, I opted for the LLAMA 3.1 8B model. This aligned better with my learning objectives, even though a specialized translation model might achieve superior performance on the task itself.

Data Preparation

To start the fine-tuning process, I needed a suitable dataset. Fortunately, I discovered www.schwaebisch-schwaetza.de, a website hosting a comprehensive dictionary with over 12,000 Standard German to Swabian dialect translations. The website administrator generously shared this data with me in CSV format.
However, the dataset presented two main challenges: it was unstructured, and it lacked contextual information around the word pairs, which is crucial for the LLM to learn dialectal patterns effectively. Here's an example of the original dataset:
Swabian: A blaus Mol
Standard German: Bluterguss
To add context, I decided to use state-of-the-art LLMs to create synthetic context around the original words. I used the Claude API and xAI's Grok API, prompting them to generate natural sentences around each word pair. Here's the same example after this step:
Swabian: Du hosch ja a blaus Mol am Arm! Wa isch denn do bassiert?
Standard German: Du hast ja einen schlimmen Bluterguss am Arm! Was ist denn da passiert?
This improved the data quality significantly, though some translations weren't perfect. I uploaded the enhanced dataset to Hugging Face but unfortunately can't share it publicly due to my agreement with the website owner.
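To illustrate what this context-generation step might look like, here is a minimal sketch using the Anthropic Python SDK. The model name, prompt wording, file name, and CSV column names are illustrative assumptions, not the exact setup I used:

```python
import csv
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def add_context(swabian: str, german: str) -> str:
    """Ask the model to wrap one word pair into a natural parallel sentence pair."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                f"Schreibe einen kurzen, natürlichen Beispielsatz auf Schwäbisch, "
                f"der '{swabian}' enthält, und denselben Satz auf Hochdeutsch "
                f"mit '{german}'. Format: Schwäbisch: ... / Hochdeutsch: ..."
            ),
        }],
    )
    return message.content[0].text

# Assumed file and column names for the original wordbook export.
with open("wordbook.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):  # columns: 'swabian', 'german'
        print(add_context(row["swabian"], row["german"]))
```

The same loop works for any chat-completion API; only the client call changes.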

SFT-Fine-Tuning

For the main part of the project, I divided the fine-tuning process into two stages: SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization). But before diving into these methods, let me explain QLoRA, the key concept that makes efficient fine-tuning possible.
Traditionally, fine-tuning an LLM requires updating all of its parameters on the new data, which demands significant computational resources. LoRA takes a different approach: it freezes the original weights and adds a small number of trainable parameters on top of them. This makes fine-tuning highly efficient, enabling training on a single GPU.

The "Q" in QLoRA stands for Quantization, which refers to reducing the precision of the model weights. Standard model weights are typically stored as 32-bit floating-point numbers, but this precision can be reduced to 16-bit, 8-bit, 4-bit, or even experimental 1-bit representations. While reducing precision may slightly impact model quality, it dramatically decreases the computational resources required for both training and inference. These two concepts underpin both fine-tuning methods I employed.

For the SFT phase, I enhanced the previously created dataset by adding an instruction prompt to each data point, such as "Translate the following sentence into Standard German: {Swabian sentence}", or vice versa for the opposite direction.

I implemented the training using UnslothAI's notebook for LLAMA 3.1 8B. UnslothAI is a framework that efficiently implements these fine-tuning techniques and streamlines the training process; it is easy to use and integrates well with Hugging Face. The training ran on Google Colab for 2 epochs and reached a final loss of approximately 0.8, which I considered acceptable for this translation task. A condensed sketch of this setup follows below.
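This sketch is based on the public Unsloth QLoRA notebooks; the hyperparameters, dataset file, and column name are assumptions rather than my exact configuration:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load LLAMA 3.1 8B in 4-bit precision (the "Q" in QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Freeze the base weights and attach small trainable LoRA adapters.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # rank of the LoRA matrices
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Assumed file: each row's 'text' column holds the instruction prompt plus translation.
dataset = load_dataset("csv", data_files="sft_dataset.csv")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=2,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```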

DPO-Fine-Tuning

You may have encountered ChatGPT's thumbs-up/thumbs-down feedback interface; that kind of feedback feeds into this second stage of fine-tuning. Traditionally, this stage used RLHF (Reinforcement Learning from Human Feedback), where a reward model is trained on human comparisons of output pairs and the LLM is then optimized against it. DPO (Direct Preference Optimization) follows the same principle of learning from human preferences, but optimizes the model directly on the preference pairs, without a separate reward model.
DPO fine-tuning requires a preference dataset containing three key columns: prompt, chosen, and rejected. To create this dataset efficiently, I selected 300 examples from my initial dataset and generated two model responses for each prompt, saving them in a CSV file (a sketch of this generation step follows after the example below).

I then manually evaluated each pair of outputs, placing the better response in the "chosen" column and the alternative in the "rejected" column. While this manual evaluation was tedious, it was necessary to ensure quality. Alternative approaches like community voting could work but would also be time-consuming.

Here's an example from the final dataset:
• Prompt: "Übersetze ins Hochdeutsche: A blaus Mol" ("Translate into Standard German: A blaus Mol")
• Chosen: "Bluterguss"
• Rejected: "Ein blauer Fleck"
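As referenced above, generating the two candidate responses per prompt could look roughly like this; the model repository name, file names, and sampling settings are illustrative assumptions:

```python
import csv
from transformers import pipeline

# Hypothetical repo name for the SFT model from the previous stage.
generator = pipeline("text-generation", model="your-account/swabian-sft")

# In practice: 300 prompts sampled from the SFT dataset.
prompts = ["Übersetze ins Hochdeutsche: A blaus Mol"]

with open("dpo_candidates.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "response_a", "response_b"])
    for prompt in prompts:
        # Sample twice with temperature > 0 so the two candidates differ.
        a, b = generator(prompt, max_new_tokens=100, do_sample=True,
                         temperature=0.8, num_return_sequences=2)
        writer.writerow([prompt, a["generated_text"], b["generated_text"]])
```

The resulting CSV is then relabeled by hand into the prompt/chosen/rejected format.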
I uploaded the curated preference dataset to Hugging Face. Fortunately, UnslothAI provides a Colab notebook specifically designed for DPO fine-tuning, allowing me to maintain consistency in my development framework. I conducted the training for 3 epochs and published the resulting model to Hugging Face.
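For reference, the DPO training step with trl might look roughly like this. The exact API differs between trl versions, and the hyperparameters here are assumptions, not my actual settings; `model` and `tokenizer` are the SFT-tuned model and tokenizer from the previous stage:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import DPOTrainer

# Assumed file with the curated columns: 'prompt', 'chosen', 'rejected'.
dataset = load_dataset("csv", data_files="dpo_dataset.csv")["train"]

trainer = DPOTrainer(
    model=model,
    ref_model=None,     # with LoRA adapters, the frozen base model acts as reference
    beta=0.1,           # strength of the preference signal
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=5e-6,
        output_dir="outputs-dpo",
    ),
)
trainer.train()
```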

Evaluation and Conclusion

Rather than implementing formal benchmarks, I evaluated the model through practical testing, conducting various translation attempts. While the model often produced correct translations, it wasn't consistently accurate. Some translation errors can be attributed to dataset quality issues, particularly in the samples generated by Claude and Grok, which occasionally contained inaccurate or misleading translations.
Let me demonstrate this with a real example from a Swabian song: "Oinr isch immer der Arsch, und er woiß id mol warum." (roughly: "Someone is always the ass, and he doesn't even know why.") The model produced varying translations:
• Translation 1: "Einer ist immer der Arsch, und er weiß nicht mal warum." This is an excellent translation that captures the meaning perfectly.
• Translation 2: "Ein Mann ist immer der Arsch, und er weiß nicht mal warum." This translation is less accurate, as it interprets "Oinr" specifically as "Ein Mann" (a man) rather than the more general "Einer" (someone).
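For completeness, this is roughly how such a test translation can be run against the fine-tuned model; the repository name is a placeholder, and the prompt format mirrors the SFT instructions:

```python
from unsloth import FastLanguageModel

# Placeholder repo name; the published model lives on Hugging Face.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-account/swabian-gpt-dpo",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switches the model into generation mode

prompt = "Übersetze ins Hochdeutsche: Oinr isch immer der Arsch, und er woiß id mol warum."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```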
A more fundamental challenge lies in the nature of the Swabian dialect itself. As a primarily spoken language without standardized spelling or usage, Swabian varies significantly across regions. Even as a lifelong resident of a Swabian-speaking area in Germany, I encountered unfamiliar expressions while reviewing the dataset. This regional variation makes creating a comprehensive, "optimal" dataset particularly challenging.

Several potential improvements could enhance the model's performance:
• Using a specialized translation model as the base model could provide better results.
• Incorporating audio capabilities could significantly improve performance, though acquiring suitable audio datasets presents its own challenges.
• Future enhancements could include implementing a RAG (Retrieval-Augmented Generation) pipeline, where translations from the dataset are stored in a vector database; the system would then retrieve relevant translations and incorporate them into the prompt before generating a response (see the sketch after this list).
• Another promising direction would be developing an agential system capable of self-review and improvement. Such a system could autonomously search web resources or databases for correct translations when encountering unfamiliar terms.
These directions represent exciting areas for future exploration and development.
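To make the RAG idea above concrete, here is a minimal sketch using sentence-transformers and FAISS. The embedding model choice and the tiny in-memory "database" are illustrative assumptions:

```python
import faiss  # pip install faiss-cpu sentence-transformers
from sentence_transformers import SentenceTransformer

# Tiny illustrative "database" of known translation pairs.
pairs = [
    ("A blaus Mol", "Bluterguss"),
    ("Oinr isch immer der Arsch", "Einer ist immer der Arsch"),
]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = embedder.encode([swabian for swabian, _ in pairs])

# Index the Swabian side so similar dialect phrases can be looked up.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

def build_prompt(query: str, k: int = 1) -> str:
    """Retrieve the closest known pairs and prepend them to the prompt."""
    _, idx = index.search(embedder.encode([query]), k)
    examples = "\n".join(f"{pairs[i][0]} -> {pairs[i][1]}" for i in idx[0])
    return (f"Bekannte Übersetzungen:\n{examples}\n\n"
            f"Übersetze ins Hochdeutsche: {query}")

print(build_prompt("I hab a blaus Mol am Arm"))
```

The retrieved pairs act as in-context examples, grounding the model on vocabulary it may not have seen during training.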

Resources & Code

The code for this project, including the fine-tuning notebooks and model configurations, is available in my GitHub repository: SwabianGPT. While you can use the code as a reference or inspiration for similar projects, please note that reproducing the exact results won't be possible, since the training dataset cannot be shared due to licensing agreements.

The repository includes:

• Data preparation scripts
• Fine-tuning notebooks for both SFT and DPO
• Model configurations and hyperparameters
• Example inference code

Final words

I hope you enjoyed reading this article.
If you have any questions or remarks, feel free to contact me.
Your Mario 💚