Large Language Models

LeoLM (Linguistically Enhanced Open Language Model), the first comprehensive suite of German-language Large Language Models


Sven

September 28th, 2023

~ 4 min read

German language enthusiasts and researchers, rejoice! A groundbreaking development has arrived in the world of language models. LAION e.V. has released LeoLM (Linguistically Enhanced Open Language Model), the first comprehensive suite of German-language Foundation Language Models. Developed in collaboration with HessianAI on their new supercomputer 42, LeoLM is set to revolutionize German open-source and commercial LLM (Large Language Model) research.

LeoLM is built on Llama-2, a powerful model family known for its strong English-language capabilities. Thanks to a compute grant on HessianAI's supercomputer 42, LeoLM extends Llama-2's prowess into German through continued pretraining on a large corpus of high-quality German text. This initiative brings two foundation models, LeoLM-7B and LeoLM-13B, into the limelight, with LeoLM-70B on the horizon. Additionally, LeoLM offers a collection of exceptionally proficient German and bilingual chat models.

Enhancing Proficiency in German

To ensure LeoLM's proficiency in German, a meticulous stage-2 pretraining methodology was employed. The Llama-2 models were originally pretrained on 2 trillion tokens of predominantly English text. The LeoLM models were initialized from Llama-2 weights and then trained further on a large German text corpus of 65 billion tokens, built from carefully filtered and deduplicated web text from the OSCAR-2301 corpus. Notably, the methodology focuses on mitigating catastrophic forgetting, i.e. the loss of previously learned knowledge and capabilities.
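Conceptually, this stage-2 setup amounts to resuming causal language model training from the Llama-2 checkpoint, but on German text. The following is a minimal sketch of that idea using HuggingFace Transformers; the dataset path, sequence length, and hyperparameters are illustrative placeholders, not the actual LeoLM training configuration.

```python
# Minimal sketch of stage-2 (continued) pretraining: initialize from Llama-2
# weights and keep training with the causal LM objective on German text.
# Dataset path and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"  # starting checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder: any cleaned, deduplicated German text file with one document per line.
german_corpus = load_dataset("text", data_files={"train": "german_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = german_corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="leolm-stage2-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,  # illustrative; a conservative LR helps limit forgetting
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```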

Finetuning Datasets: A Multifaceted Approach

To enable LeoLM to excel at chat and instruction-following tasks, a diverse range of high-quality instruction datasets was translated from English to German. The translation was carried out with OpenAI's gpt-3.5-turbo API, using prompts designed to preserve the integrity of complex instructions containing code, equations, or formatted data. Additionally, datasets from the MultilingualSIFT project, such as FreedomIntelligence/evol-instruct-deutsch and FreedomIntelligence/alpaca-gpt4-deutsch, were used. The inclusion of German poems and songs written by GPT-4 further augmented LeoLM's creative writing capabilities.
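As an illustration of such a translation step, the snippet below sends one English instruction to gpt-3.5-turbo and asks for a German rendering that leaves code and equations untouched. The system prompt is an assumption for illustration, not the exact prompt used to build LeoLM's datasets.

```python
# Illustrative sketch of translating a single instruction with gpt-3.5-turbo.
# The system prompt is a plausible assumption, not the exact one used for LeoLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_instruction(english_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the following instruction from English to German. "
                    "Keep code blocks, equations, and formatted data exactly as they are."
                ),
            },
            {"role": "user", "content": english_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(translate_instruction("Write a Python function that reverses a string."))
```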

Evaluation and Results: A Thorough Analysis

Evaluating the performance of language models, and chat models in particular, is a complex process. For LeoLM, a comprehensive evaluation approach was adopted, combining multiple-choice benchmarks with automatic evaluation methods. Translated versions of established English benchmarks were used to assess LeoLM's capabilities in German. The results show impressive improvements in benchmark scores on German tasks alongside a slight reduction in scores on English tasks. The mean increase on the German benchmarks significantly outweighs the decrease on the English benchmarks, demonstrating LeoLM's ability to learn a new language without forgetting previously acquired knowledge.
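For the multiple-choice portion of such an evaluation, a common scoring scheme (shown here purely as an illustration, not necessarily the exact harness behind LeoLM's reported numbers) is to compare the model's log-likelihood of each answer option given the question and pick the highest-scoring one:

```python
# Illustrative multiple-choice scoring: choose the answer option to which the
# model assigns the highest log-likelihood, conditioned on the question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LeoLM/leo-hessianai-7b"  # base model from the suite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given its preceding context.
    logprobs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    token_logprobs = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum only over the tokens belonging to the answer choice
    # (assumes the prompt tokenization is a prefix of the full tokenization).
    return token_logprobs[0, prompt_ids.shape[1] - 1:].sum().item()

question = "Frage: Welche Stadt ist die Hauptstadt von Deutschland?\nAntwort:"
options = [" Berlin", " Hamburg", " München"]
scores = [choice_logprob(question, option) for option in options]
print(options[int(torch.tensor(scores).argmax())])
```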

Qualitative Results: Unleashing LeoLM's Potential

While benchmarks provide valuable insights, they can sometimes feel abstract. To experience LeoLM's capabilities firsthand, demos are available for users to interact with LeoLM-7B and LeoLM-13B. Alternatively, users can run the models themselves using HuggingFace Transformers. By exploring LeoLM's abilities, users can witness its potential for various tasks and applications.
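For running the models locally, a minimal sketch with HuggingFace Transformers could look like the following. The ChatML-style prompt and the system message are assumptions based on the chat models' documented format; consult the model card for the exact template and recommended generation settings.

```python
# Minimal sketch: run LeoLM/leo-hessianai-7b-chat locally with Transformers.
# The ChatML-style prompt and system message below are assumptions; the model
# card is the authoritative reference for the exact format.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="LeoLM/leo-hessianai-7b-chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "<|im_start|>system\nDu bist ein hilfreicher Assistent.<|im_end|>\n"
    "<|im_start|>user\nErkläre kurz, was ein Sprachmodell ist.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(result[0]["generated_text"][len(prompt):])
```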

A Pioneering Step for German Language Research

The release of LeoLM marks a significant milestone for the German open-source research community. LeoLM not only introduces a suite of German Foundation Language Models but also establishes a comprehensive evaluation approach tailored specifically to German language models. By showing that large-scale continued pretraining is possible without significant forgetting or loss of previous capabilities, LeoLM paves the way for language acquisition in pretrained models. Furthermore, LeoLM's availability under a permissive license empowers the German research community and reduces dependence on closed-source commercial models.

In conclusion, LeoLM is set to revolutionize German-language LLM research, offering a suite of powerful models and a comprehensive evaluation approach. With LeoLM, the doors to innovative research and widespread adoption of German language models are wide open. So, dive in and discover the possibilities with LeoLM!

Links:
Try out LeoLM/leo-hessianai-7b-chat and LeoLM/leo-hessianai-13b-chat on HuggingFace Spaces!

LAION e.V. - Announcement