Project description

The machine translation task has evolved through various effective approaches over time, including rule-based, example-based, statistical, and neural methods. In general, they can be divided into two major categories: rule-based and corpus-based (also called data-driven) (Forcada 2017). Neural machine translation (NMT), a corpus-based approach, is currently the state of the art.

In order to enhance NMT performance, numerous papers explore model designs or training techniques; however, this model-centric mindset has some inherent shortcomings. Firstly, complex models require large amounts of energy and time. Secondly, it is sometimes impossible to improve performance just by tweaking models (DeepLearningAI 2021). As TAUS argues, complex models cannot outperform more and better data (TAUS Videos 2021). This is especially true for low-resource language pairs, since data is crucial to corpus-based approaches. Unfortunately, there is relatively little research on the data side of machine learning (DeepLearningAI 2021).

Hence, this research aims to develop and summarize different methods for enhancing neural machine translation from and to Chinese, focusing on the data used to train MT systems. It has two major concerns: the volume of the parallel corpus and its quality. Desired results could include catalogues of reliable corpus resources and crawlable bilingual/multilingual websites, a better way to crawl parallel corpora on the Internet, methods to augment existing parallel corpora without adding new bitext, and methods to clean existing parallel corpora to achieve better machine translation results.
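As a concrete illustration of the corpus-cleaning direction, the sketch below filters sentence pairs with simple heuristics: length bounds, a length-ratio test for likely misalignments, and removal of untranslated copies. The function name, thresholds, and whitespace tokenization are illustrative assumptions only; Chinese text would need character- or segmenter-based counting instead of `str.split`.

```python
def clean_bitext(pairs, min_len=1, max_len=200, max_ratio=3.0):
    """Filter (source, target) sentence pairs with simple heuristics.

    Thresholds are illustrative; real pipelines tune them per language pair.
    """
    kept = []
    for src, tgt in pairs:
        src_tok, tgt_tok = src.split(), tgt.split()
        # Drop empty or overly long sentences on either side.
        if not (min_len <= len(src_tok) <= max_len):
            continue
        if not (min_len <= len(tgt_tok) <= max_len):
            continue
        # Drop pairs with implausible length ratios (likely misalignments).
        ratio = max(len(src_tok), len(tgt_tok)) / min(len(src_tok), len(tgt_tok))
        if ratio > max_ratio:
            continue
        # Drop pairs where source and target are identical (untranslated copies).
        if src.strip() == tgt.strip():
            continue
        kept.append((src, tgt))
    return kept
```

Such rule-based filters are cheap to run over crawled bitext before any model-based scoring is applied.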

Although the question is understudied, some papers address similar issues. For example, Fadaee, Bisazza, and Monz (2017) proposed a data augmentation method to tackle the rare-word problem in NMT. Wang et al. (2018) also show how data can be purposely selected to reduce training noise for NMT models. Similar papers can be the starting point for our research.
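To make the augmentation idea concrete, here is a drastically simplified sketch in the spirit of substitution-based augmentation: synthetic pairs are created by swapping a rare word (and its translation) into existing sentence pairs. This is not Fadaee et al.'s actual method, which relies on language models and word alignments; the `rare_dict` interface is a hypothetical hand-built resource used only for illustration.

```python
import random

def augment_rare(pairs, rare_dict, n_copies=1, seed=0):
    """Naively create synthetic pairs by swapping in rare-word translations.

    rare_dict maps a (src_word, tgt_word) pair occurring in the corpus to a
    list of rare (src_word, tgt_word) replacements. Hypothetical interface.
    """
    rng = random.Random(seed)
    synthetic = []
    for src, tgt in pairs:
        src_tok, tgt_tok = src.split(), tgt.split()
        for (s, t), repls in rare_dict.items():
            if s in src_tok and t in tgt_tok:
                rs, rt = rng.choice(repls)
                new_src = [rs if w == s else w for w in src_tok]
                new_tgt = [rt if w == t else w for w in tgt_tok]
                for _ in range(n_copies):
                    synthetic.append((" ".join(new_src), " ".join(new_tgt)))
    return pairs + synthetic
```

The appeal of such methods is that they add training signal for rare vocabulary without collecting any new bitext.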

To conduct our experiments, we choose Chinese↔English, Chinese↔Spanish, and Chinese↔Catalan as the language pairs. They represent high-, low-, and extremely low-resource settings, respectively. This decision also takes the current MT research landscape into account: we want to shift away from the English-centric, small-linguistic-distance research paradigm. Besides, we will use the vanilla Transformer model proposed by Vaswani et al. (2017) for all training. The corpora used to train the different systems will be constructed from various available machine translation datasets as well as from web crawling done during the research.
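For reference, keeping the model fixed means all systems share one hyperparameter set. The values below are those reported for the "base" Transformer in Vaswani et al. (2017); collecting them in a dictionary like this is only an illustrative convention, not part of any specific toolkit.

```python
# Hyperparameters of the "base" Transformer (Vaswani et al. 2017),
# held constant across all our systems so that only the data varies.
TRANSFORMER_BASE = {
    "num_layers": 6,       # identical encoder and decoder stack depth
    "d_model": 512,        # embedding / hidden size
    "d_ff": 2048,          # feed-forward inner size
    "num_heads": 8,        # attention heads
    "d_k": 512 // 8,       # per-head key/query/value dimension
    "dropout": 0.1,
    "label_smoothing": 0.1,
}
```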



MORE INFORMATION

If you are interested in this offer, fill in the PDF with your details and send it to doctorats.industrials.recerca@gencat.cat