Exploring Global Trends in Tuberculosis Incidence Rates - with GetTBinR

In November I attended Epidemics, which is a conference focused on modelling infectious diseases. There was a lot of great work and perhaps most excitingly a lot of work being offered as R packages.

I’ve recently begun wrapping all my analytical work in R packages, as it makes producing reproducible research a breeze! Unfortunately all of this work is still making it’s way towards publication and for a variety of reasons can’t be shared until it has passed this hurdle. As I didn’t want to be left out my only alternative was to produce an R package during my downtime at the conference - from which getTBinR was born. Over the Christmas period more work has been done and the package is now available from CRAN.

GetTBinR wraps the World Health Organisation Tuberculosis data, which includes data on Tuberculosis incidence rates, mortality rates, case reporting rates and several other metrics by country from 2000 until 2016 with varying levels of completeness. It also provides some general purpose functions for mapping and plotting the data, with the aim of kick starting any exploratory analysis of the WHO Tuberculosis data.

For this post I will be quickly diving into global trends in Tuberculosis incidence rates and exploring whether Tuberculosis eradication is on the horizon.

Now to get started, the first step is to get the package (for this post we are using the development version from GitHub) and to load the other packages required for this analysis.

install.packages("devtools")
devtools::install_github("seabbs/GetTBinR")
library(getTBinR)
install.packages("tidyverse")
library(tidyverse)
install.packages("viridis")
library(viridis)

The package is loaded, time to get the data. We download both the data itself and it’s accompanying data dictionary.

tb_burden <- get_tb_burden(download_data = TRUE, save = TRUE)
dict <- get_data_dict(download_data = TRUE, save = TRUE)

We want to explore incidence rates so we need to find them in the data. We can do this using the getTBinR::search_data_dict function,

search_data_dict(dict = dict, def = "incidence", verbose = FALSE) %>% 
  knitr::kable()
variable_name dataset code_list definition
e_inc_100k Estimates Estimated incidence (all forms) per 100 000 population
e_inc_100k_hi Estimates Estimated incidence (all forms) per 100 000 population, high bound
e_inc_100k_lo Estimates Estimated incidence (all forms) per 100 000 population, low bound
e_inc_rr_num Estimates Estimated incidence of rifampicin resistant TB (absolute number)
e_inc_rr_num_hi Estimates Estimated incidence of rifampicin resistant TB (absolute number): high bound
e_inc_rr_num_lo Estimates Estimated incidence of rifampicin resistant TB (absolute number): low bound
e_inc_tbhiv_100k Estimates Estimated incidence of TB cases who are HIV-positive per 100 000 population
e_inc_tbhiv_100k_hi Estimates Estimated incidence of TB cases who are HIV-positive per 100 000 population, high bound
e_inc_tbhiv_100k_lo Estimates Estimated incidence of TB cases who are HIV-positive per 100 000 population, low bound
e_inc_tbhiv_num Estimates Estimated incidence of TB cases who are HIV-positive
e_inc_tbhiv_num_hi Estimates Estimated incidence of TB cases who are HIV-positive, high bound
e_inc_tbhiv_num_lo Estimates Estimated incidence of TB cases who are HIV-positive, low bound

The first hit, e_inc_100k, is the TB incidence rate, with the next two variables being the lower and upper bounds of this estimate. For a quick overview lets map country specific incidence rates in 2016 using the getTBinR function map_tb_burden,

map_tb_burden(df = tb_burden, dict = dict,
              metric = "e_inc_100k", verbose = FALSE) +
  labs(title = "Map of Global Tuberculosis Incidence Rates - 2016",
       subtitle = "", caption = "Source: World Health Organisation")

This shows that incidence rates are highly heterogeneous between regions. To get an better understanding of this lets plot incidence rates by WHO region,

tb_inc_region <- tb_burden %>% 
  group_by(year, g_whoregion) %>% 
  summarise_at(.vars = vars(e_inc_num, e_inc_num_lo, e_inc_num_hi, e_pop_num),
               .funs = funs(sum(as.numeric(.), na.rm = T))) %>% 
  mutate_at(.vars = vars(e_inc_num, e_inc_num_lo, e_inc_num_hi),
            .funs = funs(inc_rate = . / e_pop_num * 1e5)) %>% 
  mutate(g_whoregion = ifelse(is.na(g_whoregion), "Asia", g_whoregion))

plot_tb_inc_region <- function(df = NULL, title = NULL, subtitle = NULL, scales = NULL) {
  df %>% 
      ggplot(aes(x = year, y = e_inc_num_inc_rate, col = g_whoregion)) +
  geom_point() +
  geom_linerange(aes(ymin = e_inc_num_lo_inc_rate, ymax = e_inc_num_hi_inc_rate)) +
  geom_line() +
  scale_colour_viridis(discrete = TRUE) +
  labs(title = title, subtitle = subtitle, 
       x = "Year", y = "Tuberculosis Incidence Rates (per 100,000 population)",
       caption = "Source: World Health Organisation") +
  theme_minimal() +
  theme(legend.position = "none") +
  facet_wrap(~g_whoregion, scales = scales)
}

tb_inc_region %>% 
  plot_tb_inc_region(title = "Global Tuberculosis Incidence Rates",
                     subtitle = "By WHO region, with a fixed y axis",
                     scales = "fixed")

We see that incidence rates are much higher in Africa, and in Asia, than in other regions, and that incidence rates in the Americas and Europe are the lowest. This chart has a fixed y axis which makes it hard to see trends over time within regions, if we repeat it with a free y axis the trends over time become more apparent,

tb_inc_region %>% 
  plot_tb_inc_region(title = "Global Tuberculosis Incidence Rates",
                     subtitle = "By WHO region, with a variable y axis",
                     scales = "free_y")

This plot shows that Tuberculosis incidence rates are decreasing in all regions, which is a great sign for the elimination of Tuberculosis. However whilst this is true on the regional level it may not be true for all countries in the data set, something that is required to truly eradicate Tuberculosis. To explore this we find the countries that had higher incidence rates in 2016 than in 2000.

countries_inc_up <- tb_burden %>% 
  filter(year %in% c(2000, 2016)) %>% 
  group_by(country) %>% 
  arrange(desc(e_inc_100k)) %>% 
  slice(1) %>% 
  filter(year == 2016) %>% 
  pull(country)

high_inc_countries <- tb_burden %>% 
  filter(year == 2016) %>% 
  group_by(country) %>% 
  summarise(e_inc_100k = max(e_inc_100k)) %>% 
  ungroup %>% 
  arrange(desc(e_inc_100k)) %>% 
  slice(1:20) %>% 
  pull(country) %>% 
  unique

high_inc_up_countries <- intersect(countries_inc_up, high_inc_countries)

This results in a list of 44 countries all of which had higher incidence rates in 2016 than in 2000. Of these countries 8 were in the top 20 countries by incidence rate in 2016. This can plotted below using getTBinR::plot_tb_burden_overview,

plot_tb_burden_overview(df = tb_burden,
                        dict = dict, 
                        metric = "e_inc_100k", 
                        countries = high_inc_up_countries,
                        verbose = FALSE) +
  labs(title = "Tuberculosis Incidence Rates from 2000-2016",
       subtitle = "Showing countries with the highest incidence rates in which incidence rates are increasing",
       caption = "Source: World Health Organisation")

GetTBinR also supplies another function (plot_tb_burden) that can be used to visualise this,

plot_tb_burden(df = tb_burden,
               dict = dict,
               metric = "e_inc_100k", countries = high_inc_up_countries, 
               facet = "country", scales = "free_y", verbose = FALSE) +
  labs(title = "Tuberculosis Incidence Rates from 2000-2016",
       subtitle = "Showing countries with the highest incidence rates in which incidence rates are increasing",
       caption = "Source: World Health Organisation")

A possible cause of this may be that reporting of Tuberculosis notifications has improved over time, to understand this we first find the required variable in the data,

search_data_dict(dict = dict, def = "detection", verbose = FALSE) %>% 
  knitr::kable()
variable_name dataset code_list definition
c_cdr Estimates Case detection rate (all forms), percent
c_cdr_hi Estimates Case detection rate (all forms), percent, high bound
c_cdr_lo Estimates Case detection rate (all forms), percent, low bound

We then plot the case detection rate over time in the countries of interest,

plot_tb_burden(df = tb_burden,
               dict = dict,
               metric = "c_cdr", countries = high_inc_up_countries, 
               facet = "country", scales = "free_y", verbose = FALSE) +
  labs(title = "Tuberculosis Detection Rates from 2000-2016",
       subtitle = "Showing countries with the highest incidence rates in which incidence rates are increasing",
       caption = "Source: World Health Organisation")

The plot above shows that for some countries such as Mozambique and South Africa the increases in incidence rates may be linked to increases in the cases detection rate. However, it is clear that this is not the case for all countries, in particular the Congo which has seen increasing incidence rates and falling case detection rates.

This post has highlighted the fact that whilst Tuberculosis is declining globally this is not the case in all countries. In order for the global eradication of Tuberculosis to be a success it is vital that resources are targeted at those countries which are struggling to reduce their incidence rates.

The WHO Tuberculosis data contains many variables not mentioned in this post and has the potential to answer a number of other questions. See the package site for additional package features not mentioned in this post. See my gists for code snippets to get you started. Feature requests are welcome, preferably by filing a GitHub issue, as are suggestions for additional data sources to include.

Related

comments powered by Disqus