Project description

Textual information appears in a wide range of contexts beyond traditional documents. Over the last five years, the computer vision community has turned its attention to reading systems that operate on images acquired under unconstrained conditions, such as scene images, video sequences, born-digital images, wearable-camera and lifelog feeds, and social media images.

Text content is not always present in scene images, but when it is, it tends to be important for understanding the scene. Recent statistics indicate that about 50% of urban scene images contain some form of textual information. When present, text offers high-level semantic information that, more often than not, is orthogonal to the information acquired by analysing the visual content of the scene. As such, it requires explicit and accurate recognition, while properly modelling the interaction between the visual and textual domains.

The availability of large-scale scene-text datasets, combined with the latest developments in the Deep Learning paradigm, should enable holistic contextual reasoning between scene text and the rest of the scene content.

Building on the research group's long trajectory in Robust Reading Systems, the objective of this PhD project will be to develop a unified scene-understanding model in which visual, textual and user information provide mutual context for each other and for the holistic interpretation of scene images.

At EURECAT, the research results of this PhD thesis will have a strong impact on the company's activities on understanding the role of images in online social networking. In particular, we will exploit the correlation between what is said on social networks and the knowledge extracted from the holistic interpretation of images developed in the thesis.