NLP for the classification of science, technology and innovation literature.

Status: Awarded

Business environment: SIRIS Academic

Academic environment: Pompeu Fabra University

Municipality: Barcelona

Field: PE6 Computer Science and Informatics

Qualification required: Degree in Computer Engineering. Master's Degree in Natural Language Processing or Computational Linguistics.

Project description

Understanding the topics covered by scientific publications has been a focus of natural language processing research for more than twenty years. With this motivation, several initiatives have sought to reduce complexity and extract knowledge from scientific output by applying language technologies to its textual content, with proposals ranging from automatic summarization, entity recognition and relation extraction to automatic question answering and text classification. Natural language text is an extremely rich source of information, although extracting knowledge from it can be time-consuming and highly challenging because of its unstructured nature.

Data sources on science and technology are growing in number, size, coverage, quality and richness. Governments and public bodies are opening up the data on their science and innovation policies, linking individual projects to their scientific, technological and socio-economic results. In addition, data related to public demand, such as public procurement, innovation procurement and policy planning documents (concerning societal challenges, the promotion of technologies or the transformation of sectoral practices), are increasingly accessible. The texts associated with these data contain a large amount of information detailing current challenges, proposed or demonstrated advances, the technologies used and the expected impact of the research and innovation process.

Research in natural language processing has progressed very rapidly in recent years. The adoption of deep learning techniques and the emergence of state-of-the-art pre-trained language models (such as BERT, GPT and their successors) have changed the rules of the game in just a few years. These advances have reduced feature engineering effort, made it possible to train more generalizable systems, improved performance and reduced the need for computational resources. These techniques have also proved especially relevant for scientific literature and patents, and even for specific domains such as the biomedical or clinical ones. Scientific and technical documents of this kind pose specific challenges because of their complexity, including covering domain-specific concepts, disambiguating acronyms and detecting negation, among others.

To this end, SIRIS Academic and Pompeu Fabra University propose the creation of an Industrial Doctorate position for the exploration and development of natural language processing and machine learning methodologies for mapping science, technology and innovation activities across different domains, using heterogeneous textual records obtained from different repositories (such as scientific publications, research projects, patents, news or social networks). The systems to be developed will have to deal with emerging domains that lack a clear definition, societal challenges, or topics viewed from a value chain perspective, in order to classify the R&D&I activities that address these fields and assign these categories to text units such as sentences, paragraphs or documents.

MORE INFORMATION

If you are interested in this offer, fill in the PDF with your details and send it to doctorats.industrials.recerca@gencat.cat

Copyright 2024 © Industrial Doctorates of the Generalitat
