Teuken-7B is currently one of the few AI language models trained multilingually from scratch as a base model. Beyond its multilingualism, its highlights include a multilingual tokenizer developed in the project, which makes both training and operation more efficient, and its integration into the infrastructure of the European Gaia-X ecosystem. Its release as an open-source model allows companies and organizations to operate their own adapted models in real applications. This addresses the need in both science and industry for transparent and customizable solutions in generative AI.
Generative AI from a strong network – with a European perspective
Teuken-7B was launched as a freely usable open source model with a European perspective. Ten partners, including TU Dresden with the two CIDS departments ZIH and ScaDS.AI Dresden/Leipzig, worked closely together in the BMWK-funded joint project OpenGPT-X under the leadership of the Fraunhofer Institutes for Intelligent Analysis and Information Systems IAIS and for Integrated Circuits IIS.
“I am delighted about today’s publication of the Gaia-X-based AI language model Teuken-7B and congratulate the OpenGPT-X project on reaching this important milestone. Teuken-7B also enables the secure use of sensitive company data, as the Gaia-X standards guarantee data storage and processing in accordance with the highest European data protection and security regulations. Innovations such as these strengthen digital sovereignty, competitiveness and also the resilience of Germany and Europe. This is why the BMWK is funding the project with around 14 million euros,” says Dr. Franziska Brantner, Parliamentary State Secretary at the BMWK.
TU Dresden, alongside Forschungszentrum JĂĽlich, provided computing infrastructure for the project and supported the setup and installation for the model training runs and evaluations. Training efficiency was examined and optimized, for example via GPU utilization and various parallelization strategies. The trained models were evaluated with respect to their various capabilities, including logical reasoning and translation quality. The results can be viewed in the previously published leaderboard.
Highlights of the language model
Improved tokenizer increases the efficiency of language models in non-English languages
OpenGPT-X placed great emphasis on the (energy-)efficient use of computing resources during model development and conducted intensive research on the tokenizer in particular. As a central component of large AI language models, the tokenizer breaks text down into word components (tokens). The fewer tokens a text is split into, the less compute the model needs to process it and the faster it generates an answer.
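The effect can be illustrated with a toy comparison. This is only a sketch: the greedy segmentation below and the vocabulary fragments shown are invented for illustration and are not Teuken-7B's actual tokenizer or vocabulary.

```python
# Toy illustration of tokenizer efficiency (NOT the real Teuken-7B tokenizer):
# an English-centric vocabulary fragments a German word into many pieces,
# while a multilingual vocabulary covers it with fewer, larger tokens.

def tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match segmentation against a known vocabulary;
    unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (v for v in sorted(vocab, key=len, reverse=True)
             if text.startswith(v, i)),
            text[i],  # fallback: emit a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens

word = "Sprachmodell"  # German for "language model"
english_centric = ["Spr", "ach", "mod", "ell"]  # assumed, illustrative splits
multilingual = ["Sprach", "modell"]             # assumed, illustrative splits

print(tokenize(word, english_centric))  # ['Spr', 'ach', 'mod', 'ell'] -> 4 tokens
print(tokenize(word, multilingual))     # ['Sprach', 'modell'] -> 2 tokens
```

Because a model generates text one token at a time, halving the token count for the same text roughly halves the generation steps, which is why a vocabulary tuned to European languages pays off in both training and inference.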
Access via the European Gaia-X infrastructure
The project was funded as part of the BMWK funding program “Innovative and practical applications and data spaces in the Gaia-X digital ecosystem” with the aim of enabling players in the Gaia-X ecosystem to develop innovative language applications and translate them into concrete application scenarios in their respective domains.
Free use for research-related and commercial purposes
Developers can download Teuken-7B from Hugging Face free of charge and work with it in their own development environment. The model has already been optimized for chat applications through instruction tuning, which adapts large AI language models so that they correctly follow user instructions. The model is available in a version for research purposes and a version under the Apache 2.0 license, which companies can also use commercially and integrate into their own AI applications.
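A minimal usage sketch with the `transformers` library is shown below. The model identifier and the `"User"` role name are taken from the conventions on the Hugging Face model cards and should be verified there before use; they are assumptions of this sketch, not part of the press release.

```python
def load_teuken(model_id: str = "openGPT-X/Teuken-7B-instruct-research-v0.4"):
    """Download tokenizer and model from Hugging Face.

    Requires the `transformers` library and network access. The model id
    above names the research-license instruct variant; an Apache-2.0
    variant for commercial use is published alongside it (check the
    model card for exact names).
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    return tokenizer, model

# Instruction-tuned models expect a chat-style prompt: a list of role/content
# messages that the tokenizer's chat template turns into model input. The
# role name "User" follows the model card's convention (an assumption here).
messages = [{"role": "User", "content": "Wer bist du?"}]
```

In a full run, `messages` would be passed through the tokenizer's chat template and then to `model.generate`; the model cards document the recommended template per language.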
About OpenGPT-X
The OpenGPT-X project started on January 1, 2022 with funding from the BMWK of around 14 million euros and will end on March 31, 2025. The ten project partners are Fraunhofer IAIS, Fraunhofer IIS, Forschungszentrum JĂĽlich, KI Bundesverband, TU Dresden, DFKI, IONOS, Aleph Alpha, ControlExpert and WDR. Under the leadership of Fraunhofer IAIS and Fraunhofer IIS, the project covers the entire value chain of generative AI: from the highly scalable, GPU-based infrastructure and the data for training large language models, through the development of the models themselves, to productive use in the form of prototypes and proofs of concept (PoCs).
The overarching goal of the project was to develop a large AI language model of its own, released open source for research and industry and geared in particular to the multilingual needs of Europe. With the release of Teuken-7B, the project has achieved this goal, providing a publicly available, research-based alternative for future scientific investigations and commercial applications of generative AI.
About the CIDS
As a unifying element across all research and teaching areas, digitalization is a central strategic focus of TU Dresden, as digital transformation is changing organizational structures, processes and products. In science, it opens up new opportunities to research forward-looking solutions and to contribute to society. Adapting to new technologies such as edge and cloud computing is therefore indispensable today. The required capabilities, closely linked to HPC, big data, data analytics and AI, demand future infrastructures that act dynamically and autonomously, optimizing resource usage while preserving data sovereignty for users.
The Center for Interdisciplinary Digital Sciences (CIDS) underlines TU Dresden’s commitment to being a leader in the fields of digitalization, HPC and AI and positions it as a competitive center for interdisciplinary research and innovation. With its two departments ZIH and ScaDS.AI Dresden/Leipzig, CIDS integrates two competence centers for HPC and AI.
– – – – – –
Further links
👉 https://tu-dresden.de
Photo: Robert Gommlich