
TU Dresden: Multilingual and open source – OpenGPT-X research project publishes large AI language model

November 26, 2024. OpenGPT-X is now making its large AI language model available for download. Following the launch of the European LLM Leaderboard in mid-July, the consortium of the research project, which is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) and includes TU Dresden, has published the underlying model “Teuken-7B”. It was trained from scratch on all 24 official languages of the EU and comprises seven billion parameters. The free model serves as a technological basis that can be adapted, extended and specialized to implement a wide range of generative artificial intelligence (AI) applications.

Photo: Robert Gommlich

Generative AI from a strong network – with a European perspective

Teuken-7B has been released as a freely usable open source model with a European perspective. Ten partners worked closely together in the BMWK-funded joint project OpenGPT-X under the leadership of the Fraunhofer Institutes for Intelligent Analysis and Information Systems (IAIS) and for Integrated Circuits (IIS). Among them is TU Dresden with ZIH and ScaDS.AI Dresden/Leipzig, the two departments of its Center for Interdisciplinary Digital Sciences (CIDS).

“I am delighted about today’s publication of the Gaia-X-based AI language model Teuken-7B and congratulate the OpenGPT-X project on reaching this important milestone. Teuken-7B also enables the secure use of sensitive company data, as the Gaia-X standards guarantee data storage and processing in accordance with the highest European data protection and security regulations. Innovations such as these strengthen digital sovereignty, competitiveness and also the resilience of Germany and Europe. This is why the BMWK is funding the project with around 14 million euros,” says Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.

TU Dresden, alongside Forschungszentrum Jülich, provided infrastructure for the project. It also supported the setup and installation for the model training runs and evaluations. Training efficiency was analyzed and optimized, for example by examining GPU utilization and comparing different parallelization strategies. The trained models were evaluated on a range of capabilities, including logical reasoning and translation. The results can be viewed in the European LLM Leaderboard published earlier.

Highlights of the language model

Improved tokenizer increases the efficiency of language models in non-English languages

OpenGPT-X placed great emphasis on the (energy-)efficient use of computing resources during model development and conducted intensive research on the tokenizer in particular. As a central element of large AI language models, tokenizers break words down into individual word components (tokens). The fewer tokens a text is split into, the faster a language model generates its answer.
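
The effect of the vocabulary on token counts can be illustrated with a minimal Python sketch. It is not the Teuken-7B tokenizer: the greedy longest-match splitting and both vocabularies are hypothetical and were invented purely for illustration. The sketch only shows why a vocabulary that covers German word components splits a German word into fewer tokens than an English-centric one.

```python
def greedy_tokenize(word, vocab):
    """Split `word` left to right into the longest pieces found in `vocab`.

    Characters that no vocabulary entry covers become single-character
    tokens, so the pieces always join back to the original word.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabularies (assumptions, not taken from any real model).
ENGLISH_CENTRIC_VOCAB = {"data", "protect", "ion", "en", "sch", "ver", "ord", "ung"}
MULTILINGUAL_VOCAB = ENGLISH_CENTRIC_VOCAB | {"daten", "schutz", "verordnung"}

word = "datenschutzverordnung"
english_tokens = greedy_tokenize(word, ENGLISH_CENTRIC_VOCAB)
multilingual_tokens = greedy_tokenize(word, MULTILINGUAL_VOCAB)

print(len(english_tokens), english_tokens)
print(len(multilingual_tokens), multilingual_tokens)
```

Because a language model produces its answer one token at a time, the shorter token sequence also means fewer generation steps for the same text.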

Access via the European Gaia-X infrastructure

The project was funded as part of the BMWK funding program “Innovative and practical applications and data spaces in the Gaia-X digital ecosystem” with the aim of enabling players in the Gaia-X ecosystem to develop innovative language applications and translate them into concrete application scenarios in their respective domains.

Free use for research-related and commercial purposes

Developers can download Teuken-7B from Hugging Face free of charge and work with it in their own development environment. The model has already been optimized for chat applications through instruction tuning, a technique for adapting large AI language models so that they correctly understand instructions from users. The model is available in two versions: one for research purposes and one under the “Apache 2.0” license, which companies can also use for commercial purposes and integrate into their own AI applications.

Further links

Model download and model cards: https://huggingface.co/openGPT-X
Technical information, benchmarks and research results on OpenGPT-X: https://opengpt-x.de/en/models/teuken-7b
Publications from OpenGPT-X: https://opengpt-x.de/news-de
European LLM Leaderboard: https://huggingface.co/spaces/openGPT-X/european-llm-leaderboard
Feedback and technical questions: https://discord.com/invite/RvdHpGMvB3
Make a demo appointment: www.iais.fraunhofer.de/opengpt-x
Gaia-X: https://gaia-x-hub.de/

About OpenGPT-X

The OpenGPT-X project started on January 1, 2022 with funding from the BMWK of around 14 million euros and will end on March 31, 2025. The ten project partners are Fraunhofer IAIS, Fraunhofer IIS, Forschungszentrum Jülich, KI Bundesverband, TU Dresden, DFKI, IONOS, Aleph Alpha, ControlExpert and WDR. Under the leadership of Fraunhofer IAIS and Fraunhofer IIS, the project is researching the entire value chain of generative AI: from the highly scalable, GPU-based infrastructure and data for training large language models, through the development of the models, to productive application in the form of prototypes and proofs of concept (PoCs).

The overarching goal of the project was to develop its own large AI language model, make it available as open source for research and companies, and gear it in particular to the multilingual needs of Europe. With the release of Teuken-7B, the project has achieved this goal and provides an alternative, originating from public research, for future scientific investigations and commercial applications of generative AI.

About the CIDS

As a unifying element across all research and teaching areas, digitalization is a central strategic focus of TU Dresden, as digital transformation is changing organizational structures, processes and products. In science, it offers new opportunities to research forward-looking solutions and make a contribution to society. Adapting to new technologies such as edge and cloud computing is therefore universally necessary today. The required capabilities, which are closely linked to HPC, big data, data analytics and AI, make it necessary for future infrastructures to act dynamically and autonomously, optimizing resource usage while preserving data sovereignty for users.

The Center for Interdisciplinary Digital Sciences (CIDS) underlines TU Dresden’s commitment to being a leader in the fields of digitalization, HPC and AI and positions it as a competitive center for interdisciplinary research and innovation. With its two departments ZIH and ScaDS.AI Dresden/Leipzig, CIDS integrates two competence centers for HPC and AI.
