Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag


  •  Abraham Nedjo    
  •  Huang Degen    

Abstract

This paper presents our study of exploiting a language's word-class information, augmented with some rule-based processing, for phrase-based Statistical Machine Translation (SMT). In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Oromo, this problem can be particularly severe. In addition, the use of different symbols for the 'hudhaa' (the diacritical marker) in Oromo introduces another severe data-sparsity problem. In this work, we apply a fine tokenization of words that accounts for the intra-word behavior of words containing the hudhaa, together with POS tags, to modify the Oromo input, and we examine how this improves an Oromo-English machine translation system. The models were trained on a very small parallel corpus (a size usually considered insufficient for a standard SMT system), and the quality of the parallel corpus was also poor, with both translation and spelling errors. Yet, our final system achieves a BLEU score of 2.88, as compared to 2.56 for the baseline system.
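The preprocessing idea described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' actual pipeline): it assumes that the hudhaa appears in text as any of several apostrophe-like Unicode characters, normalizes them to one symbol, and then splits a word at a word-internal hudhaa so that the two halves become separate tokens, reducing vocabulary sparsity.

```python
import re

# Assumed set of characters writers use for the hudhaa: ASCII apostrophe,
# right/left single quotation marks, grave accent, and acute accent.
HUDHAA_VARIANTS = "'\u2019\u2018\u0060\u00b4"

def normalize_hudhaa(text: str) -> str:
    """Map every hudhaa variant to the ASCII apostrophe."""
    return re.sub("[%s]" % re.escape(HUDHAA_VARIANTS), "'", text)

def tokenize_hudhaa_word(word: str) -> list:
    """Fine tokenization (illustrative): split a word at a word-internal
    hudhaa so both halves are observed as separate tokens."""
    word = normalize_hudhaa(word)
    if "'" in word[1:-1]:          # hudhaa in word-internal position
        i = word.index("'", 1)     # first internal occurrence
        return [word[:i], word[i:]]
    return [word]
```

For example, a word written with a curly quote such as `ba’e` would first be normalized to `ba'e` and then split into the tokens `ba` and `'e`, so that spelling variants of the same word collapse onto the same token sequence before alignment.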



This work is licensed under a Creative Commons Attribution 4.0 License.
  • ISSN(Print): 1913-8989
  • ISSN(Online): 1913-8997
  • Started: 2008
  • Frequency: quarterly


