BabyLMs for Sample-Efficient Language Modelling of Ancient Greek

Investigating the suitability of BabyLM architectures for language modelling in the low-resource context of Ancient Greek.

By: Dimitri Dalakas, Stelio Dalakas, Kimon Christelis

Supervised by: Francois Meyer


About

Abstract

Pretrained Language Models (PLMs) have achieved state-of-the-art performance across a wide range of Natural Language Understanding tasks in recent years. This success has relied on vast amounts of training data, which are not readily available for low-resource languages. Ancient Greek, despite its historical and cultural significance, has limited publicly available data, posing a major obstacle to effective language modelling. The BabyLM Challenge is a shared task for the development of sample-efficient architectures, and these data-efficient models show promise for reducing the reliance of language models on extensive training data. In this study, we explore the suitability of BabyLMs for the low-resource context of Ancient Greek. We train an ELC-BERT model (winner of the first BabyLM Challenge), a GPT-BERT model (winner of the second BabyLM Challenge), and a baseline BERT model, and finetune all three on Part-of-Speech tagging and dependency parsing. Our results show that GPT-BERT comfortably outperforms both ELC-BERT and BERT across these tasks. Furthermore, GPT-BERT outperforms an existing Ancient Greek model trained on substantially more data, emphasising its potential as a data-efficient architecture for Ancient Greek language modelling.
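
To make the finetuning setup concrete, the sketch below shows how a pretrained BERT-style encoder could be finetuned for Ancient Greek Part-of-Speech tagging with Hugging Face Transformers. This is an illustration only, not the project's code: the checkpoint path is a placeholder, and the `universal_dependencies` / `grc_proiel` dataset identifier is an assumed example of an Ancient Greek Universal Dependencies treebank.

```python
# Illustrative sketch (not the authors' implementation): finetuning a
# BERT-style encoder for Ancient Greek POS tagging with Hugging Face.
# The checkpoint path is a placeholder; the treebank ID is an assumed example.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "path/to/ancient-greek-babylm-checkpoint"  # hypothetical checkpoint

# An Ancient Greek UD treebank supplies (tokens, upos) pairs for each sentence.
dataset = load_dataset("universal_dependencies", "grc_proiel")
num_labels = dataset["train"].features["upos"].feature.num_classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=num_labels
)

def tokenize_and_align(example):
    # Tokenise pre-split words into subwords; label only the first subword of
    # each word and mask the rest with -100 so the loss ignores them.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, prev_word = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev_word:
            labels.append(-100)
        else:
            labels.append(example["upos"][word_id])
        prev_word = word_id
    enc["labels"] = labels
    return enc

encoded = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="grc-pos-tagger", num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The same token-classification recipe would apply to any of the three encoders compared in the study; only the pretrained checkpoint loaded at the top would change.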
