Data science on drug-resistant tuberculosis in Brazil

Rejane Sobrino Pinheiro

Project name

Data science on drug-resistant tuberculosis in Brazil


The project's hypothesis was that individual and social characteristics associated with health services-related factors interact in a nonlinear and complex way, contributing significantly to treatment dropout and/or development of drug-resistant tuberculosis (DR-TB). One of the main consequences is increased resistance to treatment. The main objective of this project was to improve DR-TB surveillance tools through linkage techniques, data mining, and machine learning to (1) identify factors associated with unfavorable outcomes of DR-TB, (2) study the patient's trajectory in the health services network, considering primary care units and hospitals, and (3) analyze access to diagnosis of DR-TB via introduction of the rapid molecular test in the health services network, while also developing computational tools to support TB and DR-TB surveillance using existing information systems. The project proposed a systems-based model to improve TB and DR-TB management, surveillance, and treatment.

How the study was conducted

The Special Tuberculosis Treatment Information System (SITE-TB) is the main data source for surveillance of drug-resistant tuberculosis (DR-TB) in Brazil. Starting from SITE-TB, records were linked to the following databases so that patients could be followed through the services network over time: Disease Notification System-TB (SINAN-TB: surveillance and history of tuberculosis, 2008 to 2016), Laboratory Environment Manager (GAL: data from smear diagnoses, culture, and rapid molecular testing), and Mortality Information System (SIM: database from 2013 to 2016). These databases were linked using deterministic and probabilistic techniques and analyzed with data mining and machine learning algorithms to develop a DR-TB risk model. The methods included regression models, association rules, decision trees, graphs, random forest, and XGBoost. A TB cascade of care and a concordance analysis between notifications (SINAN-TB) and laboratory data (GAL) were also developed.
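The combination of deterministic and probabilistic linkage mentioned above can be illustrated with a minimal sketch. It is not the project's actual procedure: the field names (`name`, `mother_name`, `birth_date`), the weights, and the 0.85 threshold are all assumptions, and a weighted string similarity stands in for a full probabilistic match weight.

```python
from difflib import SequenceMatcher

# Hypothetical linkage fields and weights; the source does not specify them.
WEIGHTS = {"name": 0.5, "mother_name": 0.3, "birth_date": 0.2}


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).lower().split())


def deterministic_match(a, b):
    """Exact agreement on every linkage field after normalization."""
    return all(normalize(a[f]) == normalize(b[f]) for f in WEIGHTS)


def similarity_score(a, b):
    """Weighted string similarity in [0, 1]; a simple stand-in for a probabilistic weight."""
    return sum(
        w * SequenceMatcher(None, normalize(a[f]), normalize(b[f])).ratio()
        for f, w in WEIGHTS.items()
    )


def link(record, candidates, threshold=0.85):
    """Return (candidate, method), trying exact matching first, then the similarity score."""
    for c in candidates:
        if deterministic_match(record, c):
            return c, "deterministic"
    if not candidates:
        return None, None
    best = max(candidates, key=lambda c: similarity_score(record, c))
    if similarity_score(record, best) >= threshold:
        return best, "probabilistic"
    return None, None
```

In this sketch, pairs that score below the threshold are simply left unlinked; in practice such borderline pairs are usually sent to manual review rather than discarded.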

Main results

A total of 980 individuals were included in the evaluation of associated factors and DR-TB outcomes at the national level. Low schooling and illicit drug use were important factors for treatment dropout, and bilateral pulmonary TB was associated with treatment failure. The factors most associated with death were age (over 60 years) and TB-HIV coinfection.

Considering only MDR-TB patients, the predictive models for unfavorable outcomes with the lowest error were nonlinear (random forest and XGBoost, with less than 30% of the error observed in traditional regressions). The best accuracy rates in the linear models held only for the cure outcome, the most prevalent in the database. When classifying patients at high risk for unfavorable outcomes, XGBoost misclassified fewer than 10% of cases, compared with 77% for the linear model.
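The contrast described above, where a model looks accurate overall because it predicts the most prevalent outcome well while failing on high-risk patients, can be made concrete by reporting the error rate per outcome class next to overall accuracy. The sketch below uses invented toy labels, not project data, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of all patients whose outcome was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_error(y_true, y_pred):
    """Share of each true outcome class that was misclassified."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: wrong[label] / totals[label] for label in totals}


# Toy labels (assumption): 8 cures, 2 dropouts, and a model that always predicts "cure".
y_true = ["cure"] * 8 + ["dropout"] * 2
y_pred = ["cure"] * 10

overall = accuracy(y_true, y_pred)        # 0.8
errors = per_class_error(y_true, y_pred)  # {"cure": 0.0, "dropout": 1.0}
```

Here the overall accuracy of 0.8 hides the fact that every dropout is misclassified, which is why per-class error is the more informative figure for rare unfavorable outcomes.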

An algorithm was developed using association rules for mining "valid" rules (less likely to be obtained by chance) to identify special patient profiles related to various unfavorable outcomes in DR-TB. The trajectory patterns of DR-TB patients were analyzed with data from the city of Rio de Janeiro. Data showed that patients were relatively young (half under 40 years of age) with low schooling levels. DR-TB follow-up data in SITE-TB showed that 28% involved primary resistance.
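One way to read "valid" rules is to pair the usual association-rule metrics (support, confidence, lift) with a test of whether the observed co-occurrence could plausibly arise by chance. The source does not say which test the project's algorithm uses; the sketch below assumes a one-sided hypergeometric (Fisher-type) tail probability, and the attribute names in the usage note are invented.

```python
from math import comb


def rule_metrics(transactions, antecedent, consequent):
    """Metrics for the rule antecedent -> consequent over a list of item sets."""
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)   # records matching the profile
    c = sum(consequent in t for t in transactions)   # records with the outcome
    both = sum(antecedent <= t and consequent in t for t in transactions)

    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if a and c else 0.0

    # Probability of seeing at least `both` co-occurrences if the profile and
    # the outcome were independent (hypergeometric upper tail).
    p_value = sum(
        comb(c, k) * comb(n - c, a - k) for k in range(both, min(a, c) + 1)
    ) / comb(n, a)

    return {"support": support, "confidence": confidence, "lift": lift, "p_value": p_value}
```

For example, `rule_metrics(records, {"low_schooling", "drug_use"}, "dropout")` would score a hypothetical profile against dropout, where `records` is a list of Python sets of attribute labels. When many rules are mined at once, the p-values would also need a multiple-testing correction before a rule is called valid.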

TB notifications in the SINAN database (2012-2018) were modeled as graphs. Most patients with more than one notification had only one previous treatment, while 15% had at least two prior treatments. The shortest time from previous treatment to resistance was 1.5 years. Graph modeling made it possible to visualize the various patient trajectories between health units until detection of resistance. Neighborhoods with high TB burden showed short trajectories and a large presence of primary healthcare units in detection and treatment monitoring, suggesting that most patients gain access to primary units for detecting drug resistance.
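The graph modeling of trajectories described above can be sketched as follows: each patient's notifications are ordered by date, consecutive health units become directed edges, and edge counts show the most common transitions. The tuple layout `(patient_id, date, unit)` and the unit names in the usage note are assumptions for illustration, not the SINAN schema.

```python
from collections import Counter, defaultdict


def build_trajectories(notifications):
    """Group (patient_id, iso_date, unit) tuples into date-ordered unit sequences."""
    by_patient = defaultdict(list)
    for patient_id, iso_date, unit in notifications:
        by_patient[patient_id].append((iso_date, unit))
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return {pid: [unit for _, unit in sorted(events)] for pid, events in by_patient.items()}


def transition_graph(trajectories):
    """Count directed edges (from_unit, to_unit) between consecutive notifications."""
    edges = Counter()
    for units in trajectories.values():
        for src, dst in zip(units, units[1:]):
            edges[(src, dst)] += 1
    return edges


def trajectory_lengths(trajectories):
    """Number of notifications per patient, a simple proxy for trajectory length."""
    return {pid: len(units) for pid, units in trajectories.items()}
```

Calling `transition_graph(build_trajectories(rows))` on rows such as `("p1", "2012-01-10", "UBS B")` yields weighted edges that can be passed to a graph library for visualization. Representing the graph as an edge counter keeps the sketch dependency-free.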

Linkage of the SINAN database to laboratory data (GAL) showed that units with physicians specialized in family and community medicine were more likely to identify respiratory symptoms, perform diagnoses with bacteriological confirmation, and show shorter delay in treatment initiation.

An R version of a script originally written in SQL was developed to improve the quality of SINAN variables, as a tool to support TB and DR-TB surveillance. The tool was tested in Rio de Janeiro and showed significant differences in the HIV testing, severe AIDS, and diabetes variables. The dropout rate more than doubled after correcting records with unknown closure status and transfers.

A systems-based, multi-layered model for TB and DR-TB management and control was also developed, aiming to guide analyses, actions, and policies and to highlight the knowledge gaps for further research. The model describes connections of various orders, providing an overview of the interrelations between entities and subsystems and of the complexity that arises from circular relations with feedback loops. The model emphasizes the excessive relations between the individual and the operational and "administrative" health interface, limited relations with the territory, and intense management relations with few subsystems, revealing low capillarity. Further development of the subsystem related to the health team is in progress, with implementation of continuing education.

Why it is innovative

The proposal's innovation lies in the use of database linkage techniques with data mining and machine learning to establish a predictive risk model, and in the study of patients' trajectories through the health services network for TB diagnosis and treatment. The innovation also involves the design of a systems-based, multidirectional model for management and continuing education.

Implications for the Brazilian health system

TB remains a significant public health problem in Brazil. Identifying interrelations between factors and potential obstacles can reveal new knowledge for better control of TB and DR-TB. Traditional techniques tend to identify general patient profiles but are insufficient to mitigate the problem of TB and DR-TB control. Identifying the most vulnerable population subgroups, places, and health units allows for specific actions to be implemented and monitored. The creation of alarms with the early identification of subgroups at higher risk of resistance or unfavorable outcomes in DR-TB treatment can benefit the care of the most vulnerable patients and support DR-TB surveillance and management.

Next steps

The project intends to discuss the results with health services managers, with the further aim of readjusting and systematizing the data in the different systems. It would be useful to develop an information center that combines the project's analyses with upgrades to the databases, including data recovery and updating, cleaning, and standardization. In this context, automated linkage protocols using machine learning techniques can support the classification of doubtful links, increasing accuracy and allowing the process to advance. We also intend to develop alarms for patients' risk classification and for non-compliance in detection, diagnosis, and treatment. Finally, it is essential to deepen the systems model for TB care in primary healthcare units and invest in strategies that favor continuing education.

Published studies

No published studies.

Article about the project

No related article.

Related projects

No related projects.