The topic has sparked controversy worldwide. With ever more data becoming available every day, are clinical trials still the only, or the best, way to reach robust conclusions about the efficacy and safety of clinical procedures such as the diagnosis and treatment of disease? Or can they be replaced by new methods that promise greater comprehensiveness at lower cost?

Well-known data scientists have published innovative methods capable of extracting information on health interventions from health data stored in administrative databases, clinical records and other electronic systems. They claim that clinical trials are no longer necessary and that statistical associations can actually tell us what works. However, this is not the opinion of traditional researchers. Who is right?

Pedro Pereira Rodrigues, principal investigator of the thematic line Health Data and Decision Sciences & Information Technologies of CINTESIS – Center for Health Technology and Services Research, a University of Porto unit, analyzes the pros and cons of using Data Science and Big Data in healthcare in an article published in the prestigious International Journal of Data Science and Analytics. He identifies three controversies, each tied to a question: what data can we reuse, how should we analyze it, and whom do we trust to do so?

“Perhaps the most compelling argument for reusing health data in research is that it allows for a very broad set of data collected during a very long follow-up period to be analyzed, with a cost that is much lower than that of a randomized clinical trial,” says the investigator.

Data science can be more comprehensive and cheaper, but can it replace conventional methods? Can the analysis of large databases be an alternative to conventional research carried out mainly by the pharmaceutical industry?

Although there are situations where data science cannot replace clinical trials (for example, when testing a new drug against a standard treatment), Pedro Pereira Rodrigues says that in some cases the results obtained from analyzing electronic health data may even be more valuable than clinical trial results. Such analyses include “real world” patients, namely complex patients with multiple conditions and multiple medications, and are potentially free of commercial bias.

In addition, these analyses may address important questions that cannot be answered in any other way, such as the effect of low-dose radiation exposure on the long-term risk of developing cancer.

The third controversy concerns data protection, given fears of “privacy breaches” and “breaches of trust”, especially with the new, much more restrictive European directives. The researcher points to the other side of the coin. He cites situations in which it is difficult or impossible to obtain informed consent, such as with severely mentally ill, unconscious or deceased patients, and the bias that could arise when research relies only on data from patients who volunteer access. He states that “failure to share health data will necessarily lead to worse decision-making processes and poorer health outcomes.”

According to the researcher, the solution involves regulatory mechanisms that certify the reliability of all participants. “Data scientists should create tools to help individuals protect their privacy, empower them to have control over what happens to their personal data, and at the same time maximize their benefits for society,” he advises.

The article “Three controversies in health data science” was written in collaboration with Professor Niels Peek of the University of Manchester, in the United Kingdom, and also presents the results of a peer-to-peer discussion at the first European Data Science Conference, held in Luxembourg in November 2016.