Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag
- Abraham Nedjo
- Huang Degen
Abstract
This paper presents our study of exploiting word class information, augmented with some rule-based processing, for phrase-based Statistical Machine Translation (SMT). In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult because of data sparsity: most words in a given corpus occur at most a handful of times. For a highly inflected language such as Oromo, this problem can be particularly severe. In addition, the ‘hudhaa’ (the diacritical marker) in Oromo is written with several different symbols, which introduces a further severe data sparsity problem. In this work, we modify the Oromo input using POS tags and a fine tokenization that takes into account the intra-word behavior of words containing hudhaa, and we show how this improves an Oromo-English machine translation system. The models were trained on a very small parallel corpus (a size usually considered unacceptable for a normal SMT system), and the quality of the corpus was poor in terms of both translation and spelling errors. Nevertheless, our final system achieves a BLEU score of 2.88, compared to 2.56 for the baseline system.
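The preprocessing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the set of characters treated as hudhaa variants, the canonical symbol, the `word|TAG` output form, the example strings, and the tag names are all assumptions made here for illustration. The sketch maps every assumed variant that occurs between two letters to one symbol, tokenizes so that a word-internal hudhaa stays inside its token, and attaches a POS tag to each token.

```python
import re

# Assumed characters that may stand in for hudhaa in raw text.
# The abstract does not list them, so this set is illustrative only.
HUDHAA_VARIANTS = "\u2018\u2019\u02bc\u0060\u00b4"
CANONICAL_HUDHAA = "'"  # assumed canonical symbol

# A variant counts as hudhaa only when it sits between two word characters.
_VARIANT_RE = re.compile(r"(?<=\w)[" + re.escape(HUDHAA_VARIANTS) + r"](?=\w)")
# A token is a word with optional internal hudhaa, or one punctuation mark.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def normalize_hudhaa(text):
    """Replace each word-internal hudhaa variant with the canonical symbol."""
    return _VARIANT_RE.sub(CANONICAL_HUDHAA, text)


def tokenize(text):
    """Normalize hudhaa, then split into word and punctuation tokens."""
    return _TOKEN_RE.findall(normalize_hudhaa(text))


def attach_pos(tokens, tags):
    """Join each token with its POS tag as 'word|TAG' (assumed format)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok}|{tag}" for tok, tag in zip(tokens, tags))


if __name__ == "__main__":
    tokens = tokenize("har\u2019a, bu`aa.")
    # The tags below are hypothetical placeholders, not output of a real tagger.
    print(attach_pos(tokens, ["ADV", "PUNCT", "NOUN", "PUNCT"]))
```

Collapsing the variants before training means that spellings differing only in the hudhaa symbol count as one word type, which is the kind of sparsity reduction the abstract refers to. Restricting the replacement to word-internal positions leaves ordinary quotation marks around a word untouched.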
- DOI:10.5539/cis.v8n1p119