Project description

Understanding the topics covered by scientific publications is a problem that natural language processing research has focused on for more than twenty years. With this motivation, several initiatives have tried to reduce the complexity of scientific output and extract knowledge from it by applying language technologies to its textual content, with proposals ranging from automatic text summarization, entity recognition, relation extraction and automatic question answering to automatic text classification. Natural language text is an extremely rich source of information, although extracting knowledge from it can be time-consuming and highly challenging because of its unstructured nature.

Data sources on science and technology are growing in number, size, coverage, quality and richness. Governments and public bodies are opening up the data on their science and innovation policies, linking individual projects to their scientific, technological and socioeconomic results. In addition, data related to public demand, such as public procurement, innovation procurement and policy planning documents (related to societal challenges, the promotion of technologies or the transformation of sectoral practices), are increasingly accessible. The texts associated with these data contain a large amount of information detailing current challenges, proposed or demonstrated advances, the technologies used and the expected impact of the research and innovation process.

Research in the field of Natural Language Processing has progressed very rapidly in recent years. The incorporation of techniques based on deep learning and the emergence of state-of-the-art pre-trained language models (such as BERT, GPT and their successors) have changed the rules of the game in just a few years. These advances have reduced feature engineering effort, made it possible to train more generalizable systems, improved performance and reduced the need for computational resources. The application of these techniques has also been particularly relevant to scientific literature and patent documents, and even to specific domains such as the biomedical or clinical ones. Scientific and technical documents of this type present a set of specific challenges due to their complexity, including the difficulty of covering the concepts specific to each domain, the disambiguation of acronyms and the detection of negation, among others.

To this end, SIRIS Academic and Pompeu Fabra University propose the creation of an Industrial Doctorate position to explore and develop natural language processing and machine learning methodologies for mapping science, technology and innovation activities across different domains, using heterogeneous textual records obtained from different repositories (such as scientific publications, research projects, patents, news or social networks). The systems to be developed will have to deal with emerging domains that lack a clear definition, societal challenges, or topics framed from a value chain perspective, in order to classify the R&D&I activities that respond to these fields and assign these categories to text units such as sentences, paragraphs or documents.



MORE INFORMATION

If you are interested in the offer, fill in the PDF with your details and send it to doctorats.industrials.recerca@gencat.cat