Machine Translation (MT) has improved markedly in recent years. Nonetheless, MT systems' performance degrades when training data is limited, as is the case for low-resource languages. South African languages, being under-resourced, have achieved low performance in machine translation. To address this issue, we train Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models and compare two data augmentation techniques against a baseline model on two Nguni languages, IsiZulu and IsiXhosa. The first technique uses target-side monolingual data to augment the parallel data via back-translation; the second trains a multilingual model on a joint set of bilingual corpora containing both IsiXhosa and IsiZulu. The SMT models were built with the KenLM and GIZA++ toolkits, and the NMT models with the Fairseq toolkit. We evaluate all models with the Bilingual Evaluation Understudy (BLEU) metric. For English-to-IsiXhosa translation, both the multilingual and back-translation models outperformed the baseline in the NMT setting, and a similar result held in the SMT setting. For English-to-IsiZulu translation in the SMT setting, however, the baseline outperformed the multilingual system, while the back-translation system yielded the best BLEU scores for both IsiZulu and IsiXhosa. In the NMT setting, the IsiZulu baseline model was observed to overfit the small dataset and thus achieved the best performance, with the back-translation system outperforming the multilingual models for both IsiXhosa and IsiZulu.
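To make the back-translation step concrete, the sketch below illustrates the general idea in Python. It is a minimal sketch, not the paper's implementation: the `translate_to_source` helper is a hypothetical stand-in for a trained target-to-source model (in practice, e.g., a Fairseq checkpoint), and the file handling is illustrative.

```python
# Minimal sketch of back-translation data augmentation (illustrative only).

def translate_to_source(target_sentence: str) -> str:
    """Hypothetical target->source translator (e.g., IsiZulu -> English).

    In practice this would wrap a trained reverse MT model.
    """
    raise NotImplementedError("Plug in a trained target->source model here.")

def back_translate(monolingual_path: str,
                   parallel_src: list[str],
                   parallel_tgt: list[str]) -> tuple[list[str], list[str]]:
    """Pair target-side monolingual sentences with synthetic source
    translations, then concatenate them with the true parallel corpus."""
    synthetic_src, synthetic_tgt = [], []
    with open(monolingual_path, encoding="utf-8") as f:
        for line in f:
            tgt = line.strip()
            if not tgt:
                continue
            synthetic_src.append(translate_to_source(tgt))  # synthetic English side
            synthetic_tgt.append(tgt)                       # real target-side sentence
    # Augmented training data: real pairs followed by synthetic pairs.
    return parallel_src + synthetic_src, parallel_tgt + synthetic_tgt
```

The key point the sketch captures is that the target side of the synthetic pairs is always genuine monolingual text; only the source side is machine-generated.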
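For the evaluation step, BLEU can be computed with the sacrebleu package, as in the snippet below. The abstract does not specify which BLEU implementation was used, so this is one possible choice, and the example sentences are placeholders.

```python
import sacrebleu  # pip install sacrebleu

# Placeholder system outputs and references (one reference stream).
hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]

# Corpus-level BLEU over all hypothesis/reference pairs.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```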