Exploring Tuberculosis Monitoring Indicators in England; Using Dimension Reduction and Clustering

Introduction

I recently attended the Public Health Research and Science Conference, run by Public Health England (PHE), at the University of Warwick. I was mainly there to present some work that I have been doing (along with my co-authors) estimating the direct effects of the 2005 change in BCG vaccination policy on Tuberculosis (TB) incidence rates (slides), but it was also a great opportunity to see what research is being done within, and partnered with, PHE. The standout work for me was the nascent data science work being undertaken within PHE, which is currently focused on the fingertipsR R package by Sebastian Fox. fingertipsR provides an easy interface to the fingertips API, which contains data on a large variety of public health issues and can be explored interactively online (see previous link).

In this post we will focus on data on Tuberculosis (TB), which is predominantly a respiratory disease and, if left untreated, kills approximately half of those infected. The majority of cases are symptomless (known as latent TB), with 10% of latent cases progressing to active disease. It is thought that individuals are at the highest risk of progressing to active disease immediately after infection, with the risk diminishing after several years. However, individuals who have carried the disease for many years can, and do, progress to active TB disease. This makes the control and management of TB on a population scale challenging, as cases may either be due to recent transmission or to the activation of latent infections carried for many years. HIV infection is known to drastically increase the likelihood of progression to active TB disease, and TB is the leading cause of death among people living with HIV (source). Globally, 10.4 million people fell ill with TB in 2016 alone, with 1.7 million deaths (source).

In England, TB incidence rates have declined drastically over the course of the last century, with the introduction of BCG vaccination, effective TB treatments and improved standards of living. However, over the previous two decades incidence rates have remained relatively stable, with incidence becoming increasingly concentrated in at-risk communities, such as the homeless population and the non-UK born living in high-density urban areas (source). It has recently been recognised that TB interventions in England must be collaborative and consistent across the country (source), as this approach has proven effective in other countries such as the USA.

This post uses dimension reduction (Principal Component Analysis (PCA)) and clustering (Partitioning Around Medoids (PAM)) to explore the Tuberculosis monitoring indicators available from the fingertips API, using the fingertipsR R package, for England. It aims to use hypothesis-free techniques to generate clusters of counties with similar characteristics of TB indicators. This may help to identify regional variation in Tuberculosis monitoring indicators and possibly provide a framework for future improvements to TB control efforts. It also seeks to act as an example of dimension reduction and clustering analysis. Therefore, comments to improve this aspect of the post would be greatly appreciated (paper links, package recommendations, methodology improvements etc.)!

Packages

The first step is to load the packages required for the analysis; we do this using the fantastic pacman package.

if (!require(pacman)) install.packages("pacman"); library(pacman)
p_load("viridis")
p_load("broom")
p_load("knitr")
p_load("ggfortify")
p_load("gpclib")
p_load("rgdal")
p_load("raster")
p_load("maptools")
p_load("purrr")
p_load("FactoMineR")
p_load("cluster")
p_load("scales")
p_load("fingertipsR")
p_load("tidyverse")
p_load("knitr")
p_load("ggthemes")
p_load_gh("thomasp85/patchwork", dependencies = TRUE)

Data

As discussed in the introduction we are using fingertipsR as our data source. The first step is to investigate the data profiles provided by the package which mention TB.

profs <- profiles()

sel_profs <- profs[grepl("TB", profs$ProfileName),]

kable(sel_profs)
| ProfileID | ProfileName | DomainID | DomainName |
|---|---|---|---|
| 86 | TB Strategy Monitoring Indicators | 1938132814 | Key Indicators |
| 86 | TB Strategy Monitoring Indicators | 1938133208 | LTBI programme monitoring |

We find that there are two profiles: key indicators and LTBI programme monitoring (screening the at-risk population for latent TB). We use the indicators function to explore the variables available in each profile.

tb_inds <- indicators(ProfileID = sel_profs$ProfileID)

kable(tb_inds)
| IndicatorID | IndicatorName | DomainID | DomainName | ProfileID | ProfileName |
|---|---|---|---|---|---|
| 91359 | TB incidence in England | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91361 | TB incidence (three year average) | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91365 | Proportion of pulmonary TB cases that were culture confirmed | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91366 | Proportion of culture confirmed TB cases with drug susceptibility testing reported for the four first line agents | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91367 | Proportion of drug sensitive TB cases who had completed a full course of treatment by 12 months | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91368 | Proportion of drug sensitive TB cases who were lost to follow up at last reported outcome | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91369 | Proportion of drug sensitive TB cases who had died at last reported outcome | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91373 | Proportion of TB cases offered an HIV test | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91374 | Proportion of drug sensitive TB cases with at least one social risk factor who completed treatment within 12 months | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91375 | Proportion of culture confirmed TB cases with any first line drug resistance | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91450 | Proportion of pulmonary TB cases starting treatment within two months of symptom onset | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |
| 91451 | Proportion of pulmonary TB cases starting treatment within four months of symptom onset | 1938132814 | Key Indicators | 86 | TB Strategy Monitoring Indicators |

This gives us 12 TB indicators, which we can now extract using the fingertips_data function combined with a call to purrr::map. This results in 12 tibbles, with the first being empty (TB incidence rates).

tb_df <- tb_inds$IndicatorID %>% map(~fingertips_data(IndicatorID = .))
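
As a quick sanity check, we can count the rows returned for each indicator (a minimal sketch, assuming tb_df as defined above); the first tibble comes back empty.

## Rows returned per indicator - the first indicator returns no data
tb_df %>% map_int(nrow)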

The key indicator, to which we will match all of the remaining data, is the three-year average TB incidence rate (as the annual TB incidence rate is missing and would also be more susceptible to noise). The following code extracts this at the county level, renames the value variable as the recent incidence rate and pulls the overall count of cases. According to the fingertips website, local authorities and CCGs with fewer than 20 TB cases per year have had all data for the indicators (apart from the three-year average TB incidence) suppressed to avoid deductive disclosure. We can therefore filter out these counties now to avoid issues with missing data later. We also adjust the time period to represent the final year of each rolling average.

tb_inc <- tb_df[[2]] %>% 
  filter(AreaType %in% "County & UA") %>% 
  dplyr::select(AreaName, Sex, Age, Timeperiod,
         rec_inc_rate = Value, rec_inc = Count) %>% 
  filter(rec_inc >= 20) %>% 
  mutate(Timeperiod = Timeperiod %>% 
           str_split(" - ") %>% 
           map_chr(first) %>% 
           as.numeric %>% 
           {. + 2} %>% 
           as.character) %>% 
  dplyr::select(-rec_inc)

Looking through the other tibbles they all have the same structure - we can write a function using this knowledge to speed up data extraction.

tb_df_extraction <- function(tb_df, var_name, area_type = "County & UA") {
  df <- tb_df %>% 
    filter(AreaType %in% area_type) %>% 
    dplyr::select(AreaName, Sex, Age, Value, Timeperiod) %>% 
    rename_at(.vars = vars(Value), funs(paste0(var_name)))
  
  return(df)
}

We now extract data for all remaining indicators, rename variables with meaningful names, join into a single tibble and then left join onto the TB incidence rate tibble. Data is only available aggregated for all ages and genders so we also drop these variables here. Finally we clean up Timeperiod into years.

var_names <- c("prop_pul_cc", "prop_cc_ds_front", "prop_ds_treat_com_12",
               "prop_ds_lost_to_follow", "prop_ds_died",
               "prop_tb_offered_hiv_test", "prop_ds_rf_treat_com_12",
               "prop_cc_dr_front", "prop_p_start_treat_2_m_sym",
               "prop_p_start_treat_4_m_sym"
)

extracted_tb <- map2(tb_df[-(1:2)], var_names, ~tb_df_extraction(.x, .y)) %>% 
  reduce(full_join, by = c("AreaName", "Sex", "Age", "Timeperiod"))

com_tb_df <- tb_inc %>% 
  left_join(extracted_tb, by = c("AreaName", "Sex", "Age", "Timeperiod")) %>% 
  mutate(year = Timeperiod %>% as.numeric) %>% 
  mutate_if(is.character, as.factor) %>% 
  dplyr::select(-Timeperiod) %>% 
  filter(Sex %in% "Persons", Age %in% "All ages") %>% 
  dplyr::select(-Sex, -Age)

com_tb_df
## # A tibble: 2,026 x 13
##    AreaName     rec_inc_rate prop_pul_cc prop_cc_ds_front prop_ds_treat_c…
##    <fct>               <dbl>       <dbl>            <dbl>            <dbl>
##  1 Hartlepool          10.0         NA                NA            NA    
##  2 Middlesbrou…        16.1         72.7             100.           82.6  
##  3 Stockton-on…         5.99        NA                NA            NA    
##  4 Blackburn w…        37.7         61.8             100.           80.7  
##  5 Blackpool            7.95        NA                NA            NA    
##  6 Kingston up…         5.06        NA                NA            NA    
##  7 East Riding…         3.28        NA                NA            NA    
##  8 Derby               19.3         66.7             100.           76.7  
##  9 Leicester           71.1         62.6             100.            0.546
## 10 Nottingham          16.9         80.0             100.           73.0  
## # ... with 2,016 more rows, and 8 more variables:
## #   prop_ds_lost_to_follow <dbl>, prop_ds_died <dbl>,
## #   prop_tb_offered_hiv_test <dbl>, prop_ds_rf_treat_com_12 <dbl>,
## #   prop_cc_dr_front <dbl>, prop_p_start_treat_2_m_sym <dbl>,
## #   prop_p_start_treat_4_m_sym <dbl>, year <dbl>

Next, we need to check the completeness of the data. Ideally, we would use the most recent year of data for our clustering analysis but this may not be possible if variables are highly missing. The following function calculates the proportion of missing data in each year for each variable.

get_frac_missing <- function(df) {
  df %>% 
    nest() %>% 
    mutate(missing = map(data,~map_dfr(. ,~sum(is.na(.))/length(.)))) %>% 
    dplyr::select(-data) %>% 
    unnest(missing) 
}

## Get the proportion missing per variable by year
tb_miss_per_year <- com_tb_df %>% 
  group_by(year) %>% 
  get_frac_missing %>% 
  mutate_all(~round(., 2)) %>% 
  arrange(year) 

tb_miss_per_year %>% 
  mutate(year = as.character(year)) %>% 
  t %>%
  kable
| year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AreaName | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| rec_inc_rate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| prop_pul_cc | 0.36 | 0.36 | 0.34 | 0.31 | 0.33 | 0.30 | 0.35 | 0.27 | 0.33 | 0.31 | 0.27 | 0.34 | 0.34 | 0.39 | 0.39 |
| prop_cc_ds_front | 0.36 | 0.36 | 0.34 | 0.31 | 0.33 | 0.30 | 0.35 | 0.27 | 0.33 | 0.31 | 0.27 | 0.34 | 0.34 | 0.39 | 0.39 |
| prop_ds_treat_com_12 | 0.36 | 0.36 | 0.34 | 0.31 | 0.33 | 0.30 | 0.35 | 0.27 | 0.33 | 0.31 | 0.27 | 0.34 | 0.34 | 0.39 | 1.00 |
| prop_ds_lost_to_follow | 0.36 | 0.36 | 0.34 | 0.31 | 0.33 | 0.30 | 0.35 | 0.27 | 0.33 | 0.31 | 0.27 | 0.34 | 0.34 | 0.39 | 1.00 |
| prop_ds_died | 0.36 | 0.36 | 0.34 | 0.31 | 0.33 | 0.30 | 0.35 | 0.27 | 0.33 | 0.31 | 0.27 | 0.34 | 0.34 | 0.39 | 1.00 |
| prop_tb_offered_hiv_test | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.28 | 0.34 | 0.34 | 0.39 | 0.39 |
| prop_ds_rf_treat_com_12 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| prop_cc_dr_front | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| prop_p_start_treat_2_m_sym | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.31 | 0.28 | 0.35 | 0.34 | 0.39 | 0.39 |
| prop_p_start_treat_4_m_sym | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.31 | 0.28 | 0.35 | 0.34 | 0.39 | 0.39 |

We see that data completeness increases with time but that some variables are completely missing (e.g. prop_ds_rf_treat_com_12 and prop_cc_dr_front). We therefore drop these variables and then identify which year has the lowest amount of missing data across all remaining variables (by looking at mean missingness).

## Drop full missing variables
tb_partial_miss_year <- tb_miss_per_year %>% 
  select_if(~!sum(.) == length(.))

## Full missing variables
com_miss_vars <- setdiff(names(tb_miss_per_year), names(tb_partial_miss_year))

## Which year has the most complete data
tb_complete_years_all_vars <- com_tb_df %>% 
  group_by(year) %>% 
  nest() %>% 
  mutate(missing = map(data,~mean(colSums(is.na(.))/nrow(.)))) %>% 
  dplyr::select(-data) %>% 
  unnest(missing) %>% 
  mutate(missing = round(missing, 2)) %>% 
  arrange(year)

kable(tb_complete_years_all_vars)
year missing
2002 0.57
2003 0.57
2004 0.56
2005 0.55
2006 0.55
2007 0.54
2008 0.56
2009 0.53
2010 0.56
2011 0.43
2012 0.35
2013 0.39
2014 0.39
2015 0.43
2016 0.58

The above table indicates that 2016 has a high proportion of missing data. From the previous table we saw that this was partially due to some variables being completely missing. The next best option is 2015 - this has a slightly higher proportion of missing data than some earlier years, but has no variables that are completely missing and is the most recent usable year. The final question is to what extent missingness is still related to TB incidence rates. The following table investigates this by looking at what happens as counties are excluded using varying incidence rate cut-offs.

com_tb_df %>%
  filter(year == 2015) %>% 
  dplyr::select(-map_dbl(com_miss_vars, contains)) %>% 
  mutate(inc_rate_lower = list(seq(2, 20, 2))) %>% 
  unnest(inc_rate_lower) %>% 
  group_by(year, inc_rate_lower) %>% 
  filter(rec_inc_rate > inc_rate_lower) %>% 
  nest() %>% 
  mutate(missing = map(data,~mean(colSums(is.na(.))/nrow(.)))) %>% 
  dplyr::select(-data, -year) %>% 
  unnest(missing) %>% 
  kable
| inc_rate_lower | missing |
|---|---|
| 2 | 0.3117647 |
| 4 | 0.2598291 |
| 6 | 0.1934066 |
| 8 | 0.1014085 |
| 10 | 0.0387097 |
| 12 | 0.0148148 |
| 14 | 0.0000000 |
| 16 | 0.0000000 |
| 18 | 0.0000000 |
| 20 | 0.0000000 |

The choice of incidence rate cut-off is somewhat arbitrary. However, it appears that a cut-off of at least 10 (per 100,000) is sufficient to deal with the majority of missing data. This is also a sensible cut-off as it represents the World Health Organisation's 2035 target for TB eradication (source). This means that our analysis will focus on counties with relatively high incidence rates in comparison to the median in England. Using everything we have learnt about the quality of the indicator data, we can now construct the near-final analysis dataset.

tb_df_2015 <- com_tb_df %>% 
  dplyr::select(-map_dbl(com_miss_vars, contains)) %>% 
  filter(year == 2015) %>% 
  filter(rec_inc_rate > 10)

tb_df_2015 
## # A tibble: 62 x 11
##    AreaName     rec_inc_rate prop_pul_cc prop_cc_ds_front prop_ds_treat_c…
##    <fct>               <dbl>       <dbl>            <dbl>            <dbl>
##  1 Blackburn w…         29.0        71.4            100.              87.5
##  2 Derby                14.0        72.7            100.              71.9
##  3 Leicester            41.8        81.2             97.6             78.4
##  4 Nottingham           17.1        82.1             97.1             80.4
##  5 Stoke-on-Tr…         12.1        66.7            100.              80.8
##  6 Bristol              20.6        53.1             95.1             79.0
##  7 Swindon              10.7        57.1            100.              75.0
##  8 Peterborough         23.2        66.7            100.              69.0
##  9 Luton                30.1        80.0            100.              82.4
## 10 Reading              34.7        64.3             95.0             97.0
## # ... with 52 more rows, and 6 more variables:
## #   prop_ds_lost_to_follow <dbl>, prop_ds_died <dbl>,
## #   prop_tb_offered_hiv_test <dbl>, prop_p_start_treat_2_m_sym <dbl>,
## #   prop_p_start_treat_4_m_sym <dbl>, year <dbl>

The final step is to deal with the remaining missing data. As all variables, except the incidence rate, are missing for some counties, we cannot reliably impute them. We therefore drop these counties and make a note of those for which data was not available.

tb_clean_2015 <- tb_df_2015 %>% 
  drop_na() %>% 
  select(-year)

missing_regions <- setdiff(tb_df_2015$AreaName %>% as.character, tb_clean_2015$AreaName %>% as.character)

missing_regions
## [1] "Wokingham" "Bedford"   "Bury"

This leaves us with a tidy and complete dataset of TB monitoring indicators for 59 counties in 2015.
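
As a quick check (a minimal sketch, assuming tb_clean_2015 as defined above):

## Confirm the dimensions of the analysis dataset
dim(tb_clean_2015)

## Check there is no remaining missing data
sum(is.na(tb_clean_2015))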

Dimension reduction

We are now ready to conduct some clustering analysis on this data. The first step is to reduce the dimensionality of the data using principal component analysis (PCA). We use the estim_ncp function (which uses a method outlined in this paper) from the FactoMineR package to estimate the number of principal components required. We then perform PCA (using prcomp) and plot the variance explained by each component as a check on estim_ncp. All of the following analysis is done using nested tibbles and so can be easily generalised to higher dimensional use cases.

tb_pca <- tb_clean_2015 %>% 
  nest() %>% 
  mutate(
    numeric_data = map(data, ~select_if(., is.numeric) %>% 
                         as.data.frame()),
    optimal_pca_no = map(numeric_data, ~estim_ncp(., 
                                                  scale = TRUE, 
                                                  ncp.min = 2, 
                                                  ncp.max = 10)) %>% 
      map_dbl(~.$ncp),
    pca = map(numeric_data, ~prcomp(.x, 
                                    center = TRUE, 
                                    scale = TRUE)),
    pca_data = map(pca, ~.$x),
    pca_aug = map2(pca, data, ~augment(.x, data = .y)))

We find that the optimal number of principal components is 2. We can also plot the percentage of variance explained in order to evaluate this choice.

## Variance explained
var_exp <- tb_pca %>% 
  dplyr::select(-optimal_pca_no) %>% 
  unnest(pca_aug) %>% 
  summarize_at(.vars = vars(contains("PC")), .funs = funs(var)) %>% 
  gather(key = pc, value = variance) %>% 
  mutate(var_exp = variance/sum(variance) * 100,
         cum_var_exp = cumsum(var_exp),
         pc = str_replace(pc, ".fitted", "") %>% 
           str_replace("PC", ""))
var_exp %>% 
  rename(
    `Variance Explained` = var_exp,
    `Cumulative Variance Explained` = cum_var_exp
  ) %>% 
  gather(key = key, value = value, `Variance Explained`, `Cumulative Variance Explained`) %>%
  mutate(key = key %>% 
           factor(levels  = c("Variance Explained", 
                              "Cumulative Variance Explained"))) %>% 
  mutate(value = value / 100) %>% 
  mutate(pc = factor(pc, levels = as.character(1:max(var_exp$pc %>% as.numeric)))) %>% 
  ggplot(aes(pc, value, group = key)) + 
  geom_point(size = 2, alpha = 0.8) + 
  geom_line(size = 1.1, alpha = 0.6) + 
  facet_wrap(~key, scales = "free_y") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 1, 0.05), lim = c(0, NA),
                     minor_breaks = NULL, labels = percent) +
  labs(
    title = "Variance Explained by Principal Component",
    subtitle = paste0("The optimal number of principal components suggested by estim_ncp was ",
                      tb_pca$optimal_pca_no, " which explains ", round(var_exp$cum_var_exp[[2]], 0), "% of the data."),
    x = "Principal Component",
    y = "Variance Explained (%)",
    caption = "@seabbs Source: Public Health England (fingertipsR)"
  )

The above plot shows that only 42% of the variance in the data is explained by the first two principal components (PCs), even though the estim_ncp function suggested that this was the optimal number. This indicates that there is a large amount of noise in the data, with a large amount of non-systematic between-county variation. Another approach, using the ‘elbow’ (the change from a decreasing to a stable amount of variance explained), would estimate that 8 PCs are required to explain the variance in the data.
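
As a quick cross-check of the elbow, base R's screeplot function can be applied directly to the fitted prcomp object (a minimal sketch, assuming the tb_pca tibble from above):

## Base R scree plot as an alternative check on the number of components
screeplot(tb_pca$pca[[1]], type = "lines",
          main = "Scree plot of the TB indicator PCA")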

Clustering of TB Monitoring Indicators

We can now perform clustering, using the partitioning around medoids (PAM) algorithm, on the first two principal components. This approach should be more stable than K-means and also has the benefit of producing a metric (the average silhouette width) which can be used to estimate the number of clusters that provides the best fitting model. The outline of the PAM algorithm is as follows (a toy implementation is sketched after the list):

  1. Randomly select k observations as the initial medoids.
  2. Assign each observation to the closest medoid.
  3. Swap each medoid and non-medoid observation, computing the dissimilarity cost.
  4. Select the configuration that minimizes the total dissimilarity.
  5. Repeat steps 2 through 4 until there is no change in the medoids.
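
To make these steps concrete, here is a deliberately naive base R sketch of the algorithm (for intuition only - the analysis below uses the far more efficient and robust cluster::pam; toy_pam and its arguments are purely illustrative):

## Toy PAM: minimises the total distance from each observation to its medoid
## Illustrative only - use cluster::pam for real analyses
toy_pam <- function(x, k) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  ## Total dissimilarity cost of a candidate set of medoids
  total_cost <- function(meds) sum(apply(d[, meds, drop = FALSE], 1, min))
  medoids <- sample(n, k) ## Step 1: random initial medoids
  repeat {
    best <- medoids
    ## Steps 3-4: try every medoid/non-medoid swap, keep the cheapest
    for (m in seq_len(k)) {
      for (o in setdiff(seq_len(n), medoids)) {
        cand <- replace(medoids, m, o)
        if (total_cost(cand) < total_cost(best)) best <- cand
      }
    }
    ## Step 5: stop once no swap improves the total cost
    if (identical(sort(best), sort(medoids))) break
    medoids <- best
  }
  ## Step 2 (final): assign each observation to its closest medoid
  list(medoids = medoids,
       clustering = apply(d[, medoids, drop = FALSE], 1, which.min))
}

## Example usage on a standard dataset
set.seed(1234)
toy_pam(scale(mtcars), k = 2)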

Again, we use an approach that makes use of nested tibbles; this should be easy to generalise to other use cases.

## Perform pam on pca data 1 to 10 groups
tb_pca_pam <- tb_pca %>%
  mutate(centers = list(2:10)) %>% 
  unnest(centers, .preserve = everything()) %>% 
  dplyr::select(-centers, centers = centers1) %>% 
  group_by(centers) %>% 
  mutate(
    pam = map(pca_data,
              ~ pam(x = .x[, 1:optimal_pca_no], k = centers, stand = TRUE)),
    clusters = map(pam, ~.$clustering),
    avg_silhouette_width = map(pam, ~.$silinfo$avg.width),
    data_with_clusters = map2(.x = data, .y = clusters, ~mutate(.x, cluster = factor(.y, ordered = TRUE)))
  ) %>% 
  ungroup

tb_pca_pam
## # A tibble: 9 x 11
##   optimal_pca_no data   numeric_data  pca   pca_data pca_aug centers pam  
##            <dbl> <list> <list>        <lis> <list>   <list>    <int> <lis>
## 1             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       2 <S3:…
## 2             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       3 <S3:…
## 3             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       4 <S3:…
## 4             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       5 <S3:…
## 5             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       6 <S3:…
## 6             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       7 <S3:…
## 7             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       8 <S3:…
## 8             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…       9 <S3:…
## 9             2. <tibb… <data.frame … <S3:… <dbl [5… <data.…      10 <S3:…
## # ... with 3 more variables: clusters <list>, avg_silhouette_width <list>,
## #   data_with_clusters <list>

To assess the optimal number of clusters we can plot the average silhouette width. This indicates that two clusters are optimal, although this estimate may not be that robust: the average silhouette width is low (0.38), and 6, 7, and 8 clusters have comparable average silhouette widths. In general we prefer the parsimonious model, and therefore we will only investigate two clusters for the remainder of this post. Note that this does not test for no clustering (i.e. a single cluster).
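
For reference, the silhouette width of an observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean dissimilarity to the other members of its own cluster and b(i) is its mean dissimilarity to the nearest other cluster; widths near 1 indicate confident assignments, while widths near 0 indicate observations sitting between clusters. The per-observation widths are stored in each fitted pam object (a minimal sketch, assuming tb_pca_pam from above):

## Per-observation silhouette widths for the two cluster solution
tb_pca_pam %>% 
  filter(centers == 2) %>% 
  pull(pam) %>% 
  first %>% 
  pluck("silinfo", "widths") %>% 
  head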

## Get max silhouette width
max_silhouette_width <- tb_pca_pam %>% 
  dplyr::select(centers, avg_silhouette_width) %>% 
  unnest(avg_silhouette_width) %>% 
  arrange(desc(avg_silhouette_width)) %>% 
  slice(1)
  
## Plot average silhouette width
tb_pca_pam %>% 
  dplyr::select(centers, avg_silhouette_width) %>% 
  unnest(avg_silhouette_width) %>% 
  ggplot(aes(x = centers, y = avg_silhouette_width)) +
  geom_line(size = 2, alpha = 0.4) +
  geom_point(size = 3, alpha = 0.8) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1, 10, 1), minor_breaks = NULL) +
  scale_y_continuous(limits = c(NA, NA), breaks = seq(0, 1, 0.01), minor_breaks = NULL) +
  labs(title = "Average Silhouette Width by Number of PAM Clusters",
       subtitle = paste0("The optimal number of clusters identifed by avg. silhouette width was ",
                      max_silhouette_width$centers,
                      " with an avg. silhouette width of ", 
                      round(max_silhouette_width$avg_silhouette_width, 2)
       ),
       x = "Clusters",
       y = "Avg. Silhouette Width",
       caption = "@seabbs Source: Public Health England (fingertipsR)")

Results

We can now explore the clusters we have identified. A useful way to do this is to visualise the first two principal components, overlaid with the original variable loadings and the clusters we have identified.

## Plot clusters
pca_plot <- tb_pca_pam %>% 
  filter(centers == max_silhouette_width$centers) %>% 
  select(data_with_clusters, pca) %>% 
  mutate(pca_graph = map2(.x = pca, 
                          .y = data_with_clusters,
                          ~ autoplot(.x, x = 1, y = 2, 
                                     loadings = TRUE, loadings.label = TRUE,
                                     loadings.label.repel = TRUE,
                                     loadings.label.size = 2, loadings.alpha = 0.8,
                                     loadings.label.vjust = -1, data = .y, 
                                     label = TRUE, label.label = "AreaName",
                                     label.size = 1.5, label.vjust = -1, 
                                     alpha = 0.3, frame = TRUE, 
                                     frame.type = 'convex', frame.alpha= 0.05,
                                     colour = "cluster", size = "rec_inc_rate") +
                            theme_minimal() +
                            labs(x = paste0("Principal Component 1 (Variance Explained: ",
                                            round(var_exp$var_exp[[1]], 1), "%)"),
                                 y = paste0("Principal Component 2 (Variance Explained: ",
                                            round(var_exp$var_exp[[2]], 1), "%)")) +
                            guides(colour=guide_legend(title = "Cluster", ncol = 2), 
                                   fill=guide_legend(title= "Cluster", ncol = 2),
                                   size = guide_legend(title = "TB Incidence Rate (per 100,000 population)",
                                                       ncol = 2)) +
                            scale_colour_viridis(option = "viridis", 
                                                 discrete = TRUE, end = 0.5) +
                            scale_fill_viridis(option = "viridis", 
                                               discrete = TRUE, end = 0.5) +
                            theme(legend.position = "bottom", 
                                  legend.box = "horizontal") +
                            labs(
                              title = "Tuberculosis Data in England; First Two Principal Components",
                              subtitle = "The arrows are variable loadings and points are counties coloured by cluster membership",
                              caption = "@seabbs Source: Public Health England (fingertipsR)"
                            )
  )) %>% 
  pull(pca_graph) %>% 
  first


pca_plot

From this we see that the clusters are generally split by incidence rates, with lower incidence rate counties also having a higher proportion of cases that either die or are lost to follow up. The higher incidence counties have a higher proportion of culture confirmed pulmonary cases and more cases that complete treatment within 12 months. It appears that the proportion of cases that start treatment within 2 and 4 months varies over both clusters. We can also see that the proportion lost to follow up is inversely related to the proportion offered HIV tests, with a higher proportion lost to follow up corresponding to a reduced proportion of cases being offered HIV tests.
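
This reading of the biplot can be cross-checked numerically against the variable loadings themselves (a minimal sketch, assuming the tb_pca tibble from earlier):

## Variable loadings (the rotation matrix) for the first two components
loadings <- tb_pca %>% 
  pull(pca) %>% 
  first %>% 
  pluck("rotation")

round(loadings[, 1:2], 2)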

Another way of summarising the between-cluster differences is to summarise the data by cluster, as in the following plot. This approach may not scale to data with a large number of clusters; for that, a faceted ridge plot (ggridges) would probably offer a better solution (a sketch is given after the plot).

sum_tb_df <- tb_pca_pam %>% 
  filter(centers == max_silhouette_width$centers) %>% 
  pull(data_with_clusters) %>% 
  map(~ gather(., key = "Variable", value = "value", -AreaName, -cluster)) %>% 
  first %>% 
  rename(Cluster = cluster) 

plot_cluster_diff <- sum_tb_df %>% 
  ggplot(aes(x = Variable, y = value, col = Cluster, fill = Cluster)) +
  geom_violin(draw_quantiles = c(0.025, 0.5, 0.975), alpha = 0.2, scale = "width") +
  geom_jitter(position = position_jitterdodge(), alpha = 0.3) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_y_continuous(breaks = seq(0, 100, 5), minor_breaks = NULL) +
  scale_colour_viridis(option = "viridis", discrete = TRUE, end = 0.5) +
  scale_fill_viridis(option = "viridis", discrete = TRUE, end = 0.5) +
  labs( 
    title = "Tuberculosis in England; Summarised by Cluster",
    subtitle = "Violin plots are scaled by width, with the 2.5%, 50% and 97.5% quantiles shown.",
    x = "Variable",
    y = "Incidence rate (per 100,000) for rec_int_rate, otherwise proportion (0-100%)",
    caption = "@seabbs Source: Public Health England (fingertipsR)")

plot_cluster_diff
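
As noted above, with a larger number of clusters a ridge plot may scale better than violins. A minimal sketch using ggridges (assuming it is installed via pacman) and the sum_tb_df data from above:

## Ridge plot alternative for summarising many clusters
p_load("ggridges")

sum_tb_df %>% 
  ggplot(aes(x = value, y = Variable, fill = Cluster)) +
  geom_density_ridges(alpha = 0.4) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  scale_fill_viridis(option = "viridis", discrete = TRUE, end = 0.5) +
  labs(title = "Tuberculosis in England; Summarised by Cluster",
       x = "Value", y = "Variable")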

To explore this further we can calculate the percentage difference between clusters for several of the variables’ summary statistics. The following table does this for the mean, the median, the 2.5% quantile, and the 97.5% quantile.

sum_tb_df %>% 
  group_by(Cluster, Variable) %>% 
  summarise(mean = mean(value), median = median(value),
            lll = quantile(value, 0.025),
            hhh = quantile(value, 0.975)) %>% 
  group_by(Variable) %>% 
  mutate_if(is.numeric, .funs = funs((lag(.) - .)/ .)) %>% 
  drop_na %>% 
  mutate_if(is.numeric, .funs = funs(paste0(round(. * 100, 1), "%"))) %>% 
  ungroup %>% 
  mutate(Variable = factor(Variable, levels = rev(.$Variable))) %>% 
  arrange(Variable) %>% 
  rename(Mean = mean, Median = median, `2.5% Quantile` = lll, `97.5% Quantile` = hhh) %>% 
  select(-Cluster) %>% 
  kable
| Variable | Mean | Median | 2.5% Quantile | 97.5% Quantile |
|---|---|---|---|---|
| rec_inc_rate | 46.2% | 26.5% | -7.6% | 123.1% |
| prop_tb_offered_hiv_test | 2.4% | 4.5% | 2.6% | 0% |
| prop_pul_cc | 16.2% | 15.4% | 32.5% | 17.5% |
| prop_p_start_treat_4_m_sym | 11.8% | 13.3% | 28.5% | 9.2% |
| prop_p_start_treat_2_m_sym | 12.1% | 9.2% | -28.2% | 21% |
| prop_ds_treat_com_12 | 11.1% | 11.1% | 11.8% | 10.3% |
| prop_ds_died | -56% | -51.6% | -100% | -61.2% |
| prop_cc_ds_front | 0.1% | 0% | 0.6% | 0% |

From the plot and table above we see that cluster 1 contains the counties with higher incidence rates and therefore has a higher median incidence rate. At the same time, it has a large interquartile range but a similar lower quartile to cluster 2. We also see that in cluster 1 a greater number of counties are not offering HIV tests to all cases, although it appears the majority of counties in both clusters are offering HIV tests to 100% of TB cases. In cluster 2 there is a large reduction in the proportion of pulmonary cases that were culture confirmed. In addition, fewer pulmonary cases start treatment within 2 or 4 months of developing symptoms in comparison to cluster 1. There is also a moderate increase in the proportion of cases that are lost to follow up, and a large increase in the proportion of cases that died, in cluster 2 compared to cluster 1. There is little difference between clusters in the proportion of culture confirmed cases with drug susceptibility testing reported, with the majority of counties reporting on 100% of cases.

A more visual way to understand the clustering of TB in England, based on the data we have extracted using the fingertipsR package, is to plot the cluster membership for each county on a map. We do this using data on the outlines of the counties in England from the UK Data Service (Darren L Dahly kindly pointed me in the direction of this data - thank you!).

## Make the plot into a function as the ggplot2 object is very large and causes git issues
## (i.e. it is easier to remake the plot than to transfer it between code chunks)
tb_cluster_map <- function(tb_pca_pam) {
  ## Some issues here with extracting the codes from the sp file
  ## Solved using the following maptools functions - improvements appreciated!
  gpclibPermit()

england_counties <- shapefile("../../static/data/shapefiles/england-2011-ct-shape/england_ct_2011.shp") %>%
  fortify(region = "code") %>% 
  as_tibble

england_urban_areas <- shapefile("../../static/data/shapefiles/england-urb-2001-shape/england_urb_2001.shp") %>% 
  fortify(region = "name") %>% 
  as_tibble %>% 
  filter(id %in% c("Greater London Urban Area", 
                   "Greater Manchester Urban Area",
                   "Bristol Urban Area",
                   "West Midlands Urban Area",
                   "Milton Keynes Urban Area"))

## Make custom positions for urban area labels
urban_area_labels <- england_urban_areas %>%
  group_by(id) %>% 
  slice(100) %>% 
  ungroup() %>% 
  mutate(long = long - 200000,
         lat = lat + 20000)
  
  
tb_cluster_results <- tb_pca_pam %>% 
  filter(centers == max_silhouette_width$centers) %>% 
  pull(data_with_clusters) %>% 
  first

tb_cluster_results <- tb_df[[2]] %>% 
              dplyr::select(AreaName, AreaCode, AreaType) %>% 
  filter(AreaType %in% "County & UA") %>% 
              unique %>% 
  left_join(tb_cluster_results,
            by = "AreaName") %>% 
  left_join(england_counties, by = c("AreaCode" = "id"))

   tb_cluster_results %>% 
  rename(Cluster = cluster) %>% 
  drop_na(Cluster) %>% 
  dplyr::select(long, lat, Cluster, group) %>% 
  ggplot( 
                 aes(x = long, 
                     y = lat,
                     fill = Cluster)) +
    geom_polygon(data = england_urban_areas, 
                 aes(group = group, fill = NULL),
                 alpha = 0.4) +
    geom_path(data = tb_cluster_results, 
              aes(group = group, fill = NULL), 
              alpha = 0.4) +
    geom_polygon(data = tb_cluster_results, 
                 aes(group = group, fill = NULL),
                 alpha = 0.1) +
    geom_polygon(aes(group = group), alpha = 0.6) +
    geom_line(data = urban_area_labels %>% 
                bind_rows(urban_area_labels %>% 
                            mutate(long = long + 200000, 
                                   lat = lat - 20000)),
              aes(fill = NA, group = id), alpha = 0.8) + 
    geom_label(data = urban_area_labels,
              aes(label = id), fill = "grey") +
    scale_fill_viridis(option = "viridis", discrete = TRUE,
                       end = 0.5) +
    coord_equal() +
    theme_map() +
    theme(legend.position = "bottom") +
    labs(title = "Tuberculosis Monitoriing Indicators; Map of County Level Clusters in England",
         subtitle = "Using data from 2015 - only counties with incidence rates above 10 per 100,000 population and complete data are shown",
         caption = "Selected urban areas are shown (dark grey) and labelled.
@seabbs Source: Public Health England (fingertipsR)
Contains National Statistics data © Crown copyright and database right 2018. 
         Contains OS data © Crown copyright and database right 2018")
}

plot_tb_cluster_map <- tb_cluster_map(tb_pca_pam)

ggsave("../../static/img/fingertips/map-tb-clust.png",
       plot_tb_cluster_map, width = 8, height = 8, dpi = 330)

plot_tb_cluster_map

## Remove objects to reduce stored chunk size
rm("plot_tb_cluster_map")

In the map above we see that cluster 1 is mainly made up of counties in London, Birmingham, and the North West of England. Cluster 2 accounts for the majority of counties that are not within these large urban areas, as well as the remaining counties in Birmingham and the North West and a single county in London. As expected, we see that the majority of counties in England have been excluded from our analysis due to low incidence rates or missing data.

Summary and Wrap-up

In this post we have explored the fingertipsR R package and the data on TB monitoring indicators for counties in England that is provided through the fingertips API. We found that there was a large amount of non-random missing data, much of which was due to censoring to prevent deductive disclosure. However, several variables were entirely missing, and even once counties with low incidence rates were removed there were still several counties with little TB monitoring data available. Substantial improvements could be made here to improve any future analysis or monitoring using this dataset.

Once counties and variables with missing data had been removed, we found that there was substantial non-systematic variation between counties, with only 42% of variation explained by the optimal number of principal components. The large amount of between-county variation we observed reinforces the need for the collaborative TB strategy in England, which was launched in 2015. Hopefully, once data become available for 2016 and 2017, these will show a decrease in variation between counties.

We found that, after dimension reduction, the data best supported two clusters. One cluster contained the majority of high incidence counties, which also had a higher proportion of culture confirmed pulmonary TB cases and more cases completing treatment in the first 12 months. This cluster was mainly centred around the greater London area but also included counties in Birmingham and the North West. The second cluster contained mainly counties not in these large urban areas but did contain the remaining counties in the North West and Birmingham area, as well as a single county in London. This cluster had a lower proportion of cases that started treatment within 2 and 4 months of developing symptoms, as well as having a greater proportion of cases that were lost to follow up and that died.

This analysis was limited by the TB monitoring indicators available, the large amount of missing data for the variables that were available, and the lack of high quality, up-to-date data. There was a large amount of non-systematic variation that was excluded from our clustering analysis, and the clusters that we did identify cannot be considered highly robust. We did not test for a single cluster, which may have produced a model more consistent with the data. A single clustering algorithm was used, and although it is considered a robust approach, further validation of these results using another (or multiple) clustering methods is necessary.
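
For instance, the gap statistic is defined for a single cluster (k = 1), unlike the average silhouette width, and so could be used to formally assess whether the data support any clustering at all. A minimal sketch using cluster::clusGap (not run here; it assumes the tb_pca object from above and the number of bootstrap samples B is illustrative):

## Gap statistic for 1 to 10 clusters, using PAM as the clustering function
gap_stat <- clusGap(tb_pca$pca_data[[1]][, 1:2],
                    FUNcluster = function(x, k) {
                      list(cluster = pam(x, k, cluster.only = TRUE))
                    },
                    K.max = 10, B = 50)

gap_stat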

Hopefully this post has proved useful both for informing you about the present state of TB in England and as an example of dimension reduction and clustering analysis. This analysis serves as a framework for a similar analysis that I am carrying out as part of my PhD, so any comments, suggested improvements etc. would be greatly appreciated.

To wrap-up this post we use the patchwork package to quickly compose a storyboard of the results from the clustering analysis carried out above. See here for a full size version.

plot_tb_cluster_map <- tb_cluster_map(tb_pca_pam)

storyboard <- plot_tb_cluster_map | ( pca_plot / plot_cluster_diff)

ggsave("../../static/img/fingertips/storyboard-fingertips-tb-clust.png",
       storyboard, width = 20, height = 16, dpi = 330)

storyboard

## Remove to reduce chunk size as over 100mb git limit
rm("storyboard", "plot_tb_cluster_map")
