Language technology is essential for the survival of small languages – researchers using supercomputers to develop Finnish language models
Assistant Professors Sampo Pyysalo and Filip Ginter from the University of Turku’s Department of Information Technology, Finland are part of the TurkuNLP (Natural Language Processing) research group, which will be among the first research groups to test out the GPU partition of the LUMI supercomputer. The aim of the group is to develop Finnish language models to support both cutting-edge research in the field and the development and use of Finnish-language applications based on artificial intelligence.
– Nowadays, language models underpin all AI systems for language processing. Generative language models have been a focus in recent years, especially the Generative Pre-trained Transformer 3 (GPT-3) developed by OpenAI. This model has broken new ground in many ways: the texts it produces are very difficult to distinguish from texts written by humans, Pyysalo explains.
AI for language is special: it cannot be developed into one universal model as, for example, machine vision can. The uniqueness of the Finnish language also poses challenges when developing models.
– If the goal is to make language models or AI that understand Finnish, they must be made in Finnish. Because Finnish is a relatively small language area, it attracts very little interest from large international commercial operators such as Google, Facebook and Baidu, who have developed the most advanced English and Chinese language models in the world, Pyysalo continues.
Are all the Moomins in the valley?
Currently, the most advanced language model is the probability-based GPT-3 model. Given some input text, the model predicts which words are likely to follow. It can help with tasks such as machine translation and document classification. The aim of Pyysalo and Ginter’s research group is to develop Finnish language models towards the GPT-3 level.
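As a rough illustration of what "probability-based" prediction of the following word means, the sketch below counts which words follow each word in a tiny made-up text and turns the counts into probabilities. This is a deliberately simplified bigram model, not how GPT-3 or the TurkuNLP models work (those are large neural networks); the function names and the sample text are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def next_word_probabilities(model, word):
    """Turn the follower counts of `word` into a probability distribution."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the word never appeared with a follower in training
    return {w: c / total for w, c in counts.items()}


# Tiny made-up corpus, for illustration only (hypothetical data).
corpus = "the moomins live in the valley and the moomins like the sea"
model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "the")

# Print the candidate next words, most likely first.
for candidate, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{candidate}: {p:.2f}")
```

A model like this only sees one word of context and only knows word pairs that occur in its training text, which is one reason why modern language models need both far more parameters and far more data.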
– Above all, this model lays the foundations for the next generation of language technology applications in Finnish. It is hoped that our research will provide a better basis for almost all language technology applications, as well as opening the way for applications that have not previously been possible in Finnish. We are also working together with Aalto University’s speech recognition researchers, Pyysalo says.
– With this language model, we are also moving towards AI-based language understanding. For example, the Finnish sayings ‘he doesn’t have all his Moomins in the valley’ and ‘he isn’t the sharpest pencil in the pencil case’ mean the same thing; people get this, but the language models used today do not. Understanding such connections through artificial intelligence can, among other things, help search engines to cope with different search queries in Finnish. If we have a good Finnish language model, these kinds of applications can also be developed in Finnish, Ginter says.
Tens of billions of parameters
The group has made earlier language models using the artificial intelligence capacity of CSC’s Puhti supercomputer, but the GPU performance of the machine is still very limited compared to LUMI’s AI capacity. LUMI’s vast GPU computing capacity is needed to further develop these language models based on deep learning neural networks.
– The increase in size from model to model is exponential. The language model we constructed on the Puhti supercomputer had 110 million parameters. The model to be derived in the LUMI pilot project is aiming for tens of billions of parameters; each parameter is a neural network weight set in training, Pyysalo continues.
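To put the jump in scale into numbers, the following back-of-the-envelope sketch compares the 110 million parameters mentioned above with a model of 10 billion parameters. Note the assumptions: 10 billion is taken here only as the lower bound of "tens of billions", and two bytes per parameter (16-bit weights) is an illustrative storage format, not something stated by the researchers.

```python
PUHTI_MODEL_PARAMS = 110_000_000     # from the article
LUMI_PILOT_PARAMS = 10_000_000_000   # assumption: lower bound of "tens of billions"
BYTES_PER_PARAM = 2                  # assumption: 16-bit floating-point weights


def growth_factor(old_params, new_params):
    """How many times larger the new model is than the old one."""
    return new_params / old_params


def weights_size_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the stored weights in gigabytes (10**9 bytes)."""
    return params * bytes_per_param / 1e9


factor = growth_factor(PUHTI_MODEL_PARAMS, LUMI_PILOT_PARAMS)
puhti_gb = weights_size_gb(PUHTI_MODEL_PARAMS)
lumi_gb = weights_size_gb(LUMI_PILOT_PARAMS)

print(f"Growth factor: about {factor:.0f}x")
print(f"Weights only: {puhti_gb:.2f} GB vs {lumi_gb:.0f} GB")
```

Under these assumptions the larger model is roughly 90 times the size of the earlier one, and its weights alone would occupy about 20 GB, before counting the additional memory that training requires.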
Indeed, language technology is one of the scientific disciplines that are making increasing use of computational methods.
– Computational methods have been moving forward in our field at a really, really fast pace. Even five years ago, we didn’t have any sense of where we would now find ourselves. Many great strides forward have been made, Pyysalo explains.
Collecting massive data sets
The development of language models is based on huge data sets, which are used to train deep neural networks to create a new language model. Ginter has been working in the field since the early 2000s, and a project which he previously led went through the entire Finnish-language Internet and collected it as data to use for developing language models.
– We were the first to go and collect data in Finnish. We downloaded from the Internet as much Finnish as possible: a total of 8 billion words. Even then, I realised there are basically no data sets available for the Finnish language, Ginter recalls.
In addition to the entire Finnish-language Internet, there are also texts from many other sources. The real problem is that there is simply not enough written Finnish available as source material for developing an entire GPT-3 model. Data has been compiled from sources such as the Kielipankki (Language Bank), maintained by CSC and FIN-CLARIN, which makes available the news archives of the Finnish Broadcasting Company Yle and the Finnish News Agency STT and the online discussions from the Suomi24 website from the last twenty years. In addition, the research group is collaborating with the Finnish National Library.
Saviours of the Finnish language
For the Finnish language, this research – and the development of language models in general – is extremely valuable.
– Language technology is essential to the survival of small languages, says Pyysalo, who has spent two decades working in the field.
After the LUMI pilot project, the group will continue to develop the language model in the LUMI Extreme Scale project, for which the group was granted 2 million GPU hours from the share of LUMI capacity reserved for Finnish researchers. The language model being developed in this project will aim for one hundred billion parameters.
The research group is also involved in the High Performance Language Technologies project of the Horizon Europe framework programme, which will begin next autumn. The project will produce language models for all EU languages and has received 3 million GPU hours on LUMI for this task.
– If we succeed in developing a new language model, then Finnish will be in a pretty good position: a fairly small language which is nevertheless the subject of one of the largest language models in the world. Furthermore, our models are freely available for research and commercial use, Pyysalo concludes.
Author: Anni Jakobsson, CSC, Finland