Pretrained Language Models (PLMs) have achieved state-of-the-art performance across a wide range of Natural Language Understanding tasks in recent years. This success has relied on vast amounts of training data, which are not readily available for low-resource languages. Ancient Greek, despite its historical and cultural significance, has limited publicly available data, posing a major obstacle to effective language modelling. The BabyLM Challenge is a shared task for the development of sample-efficient architectures. These data-efficient models show promise for reducing the reliance of language models on extensive training data. In this study, we explore the suitability of BabyLMs for the low-resource context of Ancient Greek. We train an ELC-BERT (winner of the first BabyLM Challenge), a GPT-BERT (winner of the second BabyLM Challenge), and a baseline BERT model. We finetune all models on Part-of-Speech tagging and dependency parsing. Our results show that GPT-BERT comfortably outperforms both ELC-BERT and BERT across these tasks. Furthermore, GPT-BERT outperforms an existing Ancient Greek model trained on substantially more data, emphasising its potential as a data-efficient architecture for Ancient Greek language modelling.
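The abstract mentions fine-tuning the pretrained encoders on part-of-speech tagging; the snippet below is a minimal, hypothetical sketch of what such a token-classification setup might look like with the Hugging Face Transformers library. It is not the authors' code: the checkpoint bert-base-multilingual-cased is only a stand-in for the Ancient Greek encoders trained in the paper, and the tag set and example sentence are illustrative.

# Hypothetical sketch: fine-tuning a BERT-style encoder for Ancient Greek
# part-of-speech tagging as token classification (Hugging Face Transformers).
# "bert-base-multilingual-cased" is a placeholder for the paper's encoders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "DET", "ADP", "PRON", "CCONJ", "PUNCT"]
label2id = {tag: i for i, tag in enumerate(POS_TAGS)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(POS_TAGS),
    id2label={i: tag for tag, i in label2id.items()},
    label2id=label2id,
)

def encode(example):
    # Align word-level tags with subword tokens: only the first subword of
    # each word carries a label; the rest get -100 and are ignored by the loss.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            labels.append(-100)
        else:
            labels.append(label2id[example["upos"][word_id]])
        prev = word_id
    enc["labels"] = labels
    return enc

# Toy example (Iliad 1.1 opening words), purely for illustration.
example = {"tokens": ["μῆνιν", "ἄειδε", "θεά"],
           "upos":   ["NOUN", "VERB", "NOUN"]}
features = encode(example)
print(features["labels"])  # word-aligned label ids, with -100 on extra subwords

In a full setup, a treebank of such examples would be mapped through encode and passed to a standard training loop; the same encoder could then be reused with a biaffine or similar head for dependency parsing.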
A slideshow presentation describing the training, evaluation, and results of BERT and BabyLM architectures for Ancient Greek Language Modelling.