Project description
Over time, machine translation has seen a range of effective approaches, including rule-based, example-based, statistical, and neural methods. In general, they can be divided into two top-level categories: rule-based and corpus-based (also called data-driven) (Forcada 2017). Neural machine translation (NMT), a corpus-based approach, is currently the state of the art.
To improve NMT performance, numerous papers explore model designs or training techniques. However, this model-centric mindset has some inherent shortcomings. First, complex models require large amounts of energy and time. Second, it is sometimes impossible to improve performance just by tweaking models (DeepLearningAI 2021). In TAUS's words, complex models cannot outperform more and better data (TAUS Videos 2021). This is especially true for low-resource language pairs, because data is crucial to corpus-based approaches. Unfortunately, there is relatively little research on the data side of machine learning (DeepLearningAI 2021).
Hence, this research aims to develop and summarize different methods for neural machine translation between English and Chinese, focusing on the data used to train MT systems. It has two major concerns: the volume of the parallel corpus and its quality. Desired outcomes include catalogs of reliable corpus resources and of crawlable bilingual or multilingual websites, a better way to crawl parallel corpora on the Internet, methods to augment existing parallel data without adding new bitext, and methods to clean existing parallel data.
Although the area is understudied, some papers address similar questions. For example, Fadaee, Bisazza, and Monz propose a data augmentation method to tackle the rare-word problem in NMT (Fadaee, Bisazza, and Monz 2017). Wang et al. (2018) also confirm that purposely selected data can reduce training noise for NMT models. Papers like these can be the starting point for our research.
To conduct our experiments, we choose Chinese <> English, Spanish, and Catalan as the language pairs. They represent language pairs with sufficient, low, and extremely low resources, respectively. This decision also takes the current MT research landscape into account: we want to shift away from the English-centric, small-linguistic-distance research paradigm. In addition, we will use the vanilla Transformer model proposed by Vaswani et al. (2017) for all training. The corpora used to train the different systems will be constructed from various available machine translation datasets as well as from web crawling done during the research.
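To make the data-quality concern concrete, the following is a minimal sketch of the kind of rule-based cleaning pass that could be applied to crawled or merged bitext before training. It is an illustration only, not the method this project will adopt: the function name clean_bitext, the heuristics chosen, and the max_ratio threshold are all assumptions introduced here, and a character-length ratio is only a crude proxy for a pair such as Chinese and English.

```python
def clean_bitext(pairs, max_ratio=6.0):
    """Filter (source, target) sentence pairs with simple heuristics.

    Hypothetical helper for illustration. `max_ratio` is an arbitrary
    assumed threshold on the character-length ratio and would need
    tuning per language pair.
    """
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop pairs where the target is an untranslated copy of the source.
        if src == tgt:
            continue
        # Drop pairs whose lengths differ too much to be plausible translations.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Hello world", "你好，世界"),
        ("Hello world", "你好，世界"),          # duplicate
        ("Copyright 2021", "Copyright 2021"),  # untranslated copy
        (" Good morning ", "早上好"),
    ]
    for src, tgt in clean_bitext(sample):
        print(src, "|||", tgt)
```

Heuristics like these remove only the most obvious noise. Misaligned but similarly sized pairs pass through unchanged, which is why data selection approaches such as the one cited above remain relevant.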