Project description

Many studies in the healthcare field are observational, i.e., empirical and non-experimental (the researchers do not intervene in the data-generating process), and the corpus of observational data in this field keeps growing rapidly. Traditionally, statistical methods that aim to find correlations in the data have been the most widely used techniques for data analysis in this type of study. Recently, machine-learning techniques, which also work by finding correlations, have become very popular. This has brought enhanced predictive capabilities and has partly shifted the focus of the analysis from description to prediction[1,2,3].

These approaches, nevertheless, do not explicitly take into account a fundamental property of the data-generating process: causal relationships. These relationships may be of great interest to researchers, as many studies indeed aim to answer fundamentally causal questions: did the implementation of a protocol of interest cause the desired effect on the population of interest? How will a specific individual react when treated with the protocol, or how would another individual have reacted if left untreated? Do genes or living habits cause a given disease? Approaches that ignore causal relationships are subject to an epistemological limitation, and trying to answer causal questions using correlation as a proxy for causation is a poor strategy[4].

The objective of this thesis is twofold. On the one hand, to benchmark do-calculus-based[5,6,7,8] against machine-learning-based[9,10,11] causal analysis algorithms, comparing aspects such as computational performance and versatility. On the other hand, to develop a general-purpose algorithm (for healthcare settings) that employs a combination of them, with the aim of computing joint probability distributions of variables of interest under a set of assumptions and conditions. This task will be carried out using open-source programming languages and specialized libraries (for example, the dowhy library in Python) and state-of-the-art computing resources. The output will be tested and validated on various datasets managed by AQuAS and the GCAT cohort[12], and will be used to try to answer relevant healthcare-related questions.
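To make the contrast between correlation and causation concrete, the following minimal sketch compares an observational quantity, P(Y=1 | X=x), with an interventional one, P(Y=1 | do(X=x)), computed with the backdoor adjustment formula. It is only an illustration, not the algorithm to be developed in the thesis. The counts, the variable names (severity Z, treatment X, recovery Y) and the helper functions are hypothetical, and the adjustment is valid only under the assumption that Z is the sole confounder of X and Y and that every (Z, X) combination is observed.

```python
from fractions import Fraction

# Hypothetical counts (not real data), keyed by (z, x, y):
#   z = severity (0 mild, 1 severe), x = treated (0/1), y = recovered (0/1).
COUNTS = {
    (0, 1, 1): 90,  (0, 1, 0): 10,
    (0, 0, 1): 320, (0, 0, 0): 80,
    (1, 1, 1): 200, (1, 1, 0): 200,
    (1, 0, 1): 40,  (1, 0, 0): 60,
}


def total(counts, z=None, x=None, y=None):
    """Sum the counts of all cells matching the given (optional) values."""
    return sum(
        n
        for (cz, cx, cy), n in counts.items()
        if (z is None or cz == z)
        and (x is None or cx == x)
        and (y is None or cy == y)
    )


def p_y_given_x(counts, x):
    """Observational quantity P(Y=1 | X=x)."""
    return Fraction(total(counts, x=x, y=1), total(counts, x=x))


def p_y_do_x(counts, x):
    """Backdoor adjustment: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z).

    Assumes Z satisfies the backdoor criterion for (X, Y).
    """
    n = total(counts)
    result = Fraction(0)
    for z in sorted({cz for (cz, _, _) in counts}):
        p_z = Fraction(total(counts, z=z), n)
        p_y_xz = Fraction(
            total(counts, z=z, x=x, y=1), total(counts, z=z, x=x)
        )
        result += p_y_xz * p_z
    return result


naive_diff = p_y_given_x(COUNTS, 1) - p_y_given_x(COUNTS, 0)
causal_diff = p_y_do_x(COUNTS, 1) - p_y_do_x(COUNTS, 0)
print("observational difference:", float(naive_diff))      # -0.14
print("backdoor-adjusted difference:", float(causal_diff))  # 0.1
```

In this hypothetical dataset the treatment looks harmful when judged by correlation alone (a difference of -0.14), because severe cases are treated more often, whereas the adjusted estimate indicates a benefit (+0.1). Libraries such as dowhy automate the identification and estimation steps for more general causal graphs.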

1. A guide to deep learning in healthcare, Esteva et al.

2. Guidelines for reinforcement learning in healthcare, Gottesman et al.

3. High-performance medicine: the convergence of human and artificial intelligence, Topol.

4. Causality, Pearl.

5. The Do-calculus revisited, Pearl.

6. Pearl’s Calculus of Intervention Is Complete, Huang, Valtorta.

7. Identification of Conditional Interventional Distributions, Shpitser, Pearl.

8. Transportability of Causal and Statistical Relations: A Formal Approach, Pearl, Bareinboim.

9. The Blessings of Multiple Causes, Wang, Blei.

10. Learning Representations for Counterfactual Inference, Johansson et al.

11. A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms, Bengio et al.

12. GCAT|Genomes for life: a prospective cohort study of the genomes of Catalonia, Obón-Santacana et al.