Chapter 4 The epidemiology of tuberculosis and the role of BCG vaccination in England

4.1 Introduction

Although the characteristics of tuberculosis (TB) in England have been reported elsewhere,[2,21] and key risk factors such as non-UK birth status have been identified,[45] little attention has been given to the role of BCG vaccination. In particular, there is little information available regarding the demographics of vaccinated versus unvaccinated cases and the impact of BCG vaccination on TB outcomes in England has not been explored. There has also only been limited reporting of the age distribution, and trends over time, in incidence rates stratified by UK birth status.

In this chapter I explore the epidemiology of TB in England using routine datasets with a particular focus on the impact of missing data, the mechanisms underlying that missing data, seasonal trends, the role of age, UK birth status and BCG status. I also estimate incidence rates, stratified by UK birth status and age, which I use to identify trends in TB incidence over time. Finally I report TB outcomes in England using case rates, again stratified by BCG status and UK birth status. These data are then used throughout this thesis to explore the impact of BCG vaccination on TB outcomes (Chapter 6), to estimate the direct impact of the 2005 change in BCG vaccination policy (Chapter 7), to parameterise a dynamic TB transmission model (Chapter 8) and to fit the same model to data (Chapter 9).

4.2 Data sources

4.2.1 Enhanced tuberculosis surveillance (ETS) system

4.2.1.1 Background

The ETS is a database that collects demographic, clinical, and microbiological data on all notified TB cases in England and is maintained by Public Health England (PHE). Notification is required by law, with health service providers having to inform PHE of all confirmed TB cases.[2] Data collection began in 2000 and was expanded, with additional variables, with the launch of a web based system in 2008.[46] It is updated annually with de-notifications, late notifications and other updates. A descriptive analysis of TB epidemiology in England is published each year, which reports on data collection, cleaning, and trends in TB incidence at both a national, and sub-national level.[2]

4.2.1.2 Data extraction and management

Data on all notifications (114,820 notifications) from the ETS system from 2000 to 2015 were obtained from PHE via an application to the TB monitoring team. Data fields included: notification date, age, PHE centre, occupation, ethnic group, UK birth status, years since entry to the UK, date of symptom onset, date of presentation, date of diagnosis, date of treatment start, date of treatment end, date of death, pulmonary TB status, culture status, sputum smear status, drug resistance, BCG vaccination status, year of vaccination, outcome at 12 months, overall outcome, and cause of death. Notifications were assessed for identifiability and the data release was conditional on the raw data not being shared further. Invalid entries were replaced with missing values unless otherwise noted, with character variables stored as factors using their most common entry as the baseline. Notifications from Scotland, Northern Ireland and Wales were dropped from the dataset. Several variables were created, or modified, for use in further analysis, Table 4.1 summarises these variables. The code used for data cleaning is available as an R package¹⁴.

Table 4.1: Variables derived or modified from the ETS system for use in the analyses throughout this thesis.
Created/modified variable	Description
Years since BCG	Derived using year of vaccination and year of notification. Categorised into $\leq 10$ and 11+ due to the evidence of waning protection for the BCG vaccine.[28]
Age at BCG	Derived using year of vaccination and age at vaccination. Categorised into < 1, 1 to 11, 12 to 16 and $16+$ to capture historic vaccination policy.[47]
Successful treatment	For cases that had a recorded date of starting treatment, with their outcome recorded at the latest available follow up. Those that completed treatment are defined as successfully treated: treatment failure is defined as those that stopped treatment, were lost to follow up, those that died during follow up from TB, those that died during follow up where TB contributed to their death, or those who were still on treatment. Those that were not evaluated were treated as missing.
Mortality	Assessed via follow up at 12 and 24 months: mortality is defined as cases with an overall outcome of death, and survival is defined as those that completed treatment, were still on treatment, or stopped treatment. Those that were lost to follow up, or not evaluated were treated as missing
TB mortality	For cases with an overall outcome of died, and whose cause of death was known to be TB or to be related to TB. Those that were known to have not died, or who were known to have died from a cause other than from TB were defined to have not died from TB.
Death due to TB	Death due to TB is defined as those that died directly from TB, or where TB had contributed to their death with death not due to TB being cases that died from any other cause. Conditioned on all-cause mortality, for cases with a known cause of death.

4.2.1.3 Structure of the ETS

The ETS is in a wide format with each notification having a single row, and with each unique variable having a single column. This structure means that the progression of TB in each individual is captured by a series of dates rather than as a series of events. As notifications are not linked to a unique patient I.D it is possible that individuals are duplicated within the ETS, with multiple notifications. These recurrent notifications have been flagged within the data extract by the TB section at PHE. The majority of variables are factors, with a significant minority of numeric and date variables.

4.2.1.4 Data completeness

Missing data can take several forms, data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[48] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that is MAR, and MNAR may lead to biases when analysing the data, however it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[48]

As the ETS is aggregated across England, from a variety of sources, some level of missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[49] and data missing for a notified case. The former is particularly problematic as apart from using comparative studies the characteristics of those that are not notified is unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data (Figure 4.1, Table 4.2) but care must be taken to account for nested variables such as date of death and year of BCG vaccination. After accounting for nested variables, there was high completeness for common demographic variables such as sex, age, ethnic group and UK birth status. More problematically, BCG status and year of BCG status have a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008. Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete. Comparing pre 2009 and post 2008 in Table 4.2 (and by inspecting Figure 4.1) there are also issues of changing completeness over time,[2,50] if this is not accounted for than it may lead to spurious trends. Figure 4.1 also indicates that there are multiple groups of variables that share a common pattern of missing data.

$Summary plot of missing data in the extract of the ETS data used in this thesis. Due to the large size of the dataset, the data has been sub-sampled with only 20\% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelat), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.$

Figure 4.1: Summary plot of missing data in the extract of the ETS data used in this thesis. Due to the large size of the dataset, the data has been sub-sampled with only 20% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelat), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.

Table 4.2: Breakdown of missing data from the ETS prior to the web based system (pre 2009) and post (post 2008) by variable, ordered by the percentage missing for a subset of variables. The following subset of variables are shown year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelat), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e data of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.
	Pre 2009		Post 2008
Variable	Missing (N)	Missing (%)	Missing (N)	Missing (%)
natquintile	63175	100.0	8120	15.7
bcgvaccyr	62479	98.9	31421	60.8
bcgvacc	61916	98.0	17133	33.2
datediag	45557	72.1	10303	19.9
sputsmear	32912	52.1	32094	62.1
timesinceent	29084	46.0	18670	36.2
anyres	27485	43.5	20995	40.7
occat	24870	39.4	5513	10.7
symptonset	23937	37.9	12829	24.8
txenddate	18711	29.6	1137	2.2
prevdiag	13204	20.9	3148	6.1
starttreatdate	9151	14.5	2127	4.1
tomdeathrelat	7539	11.9	1191	2.3
ukborn	6230	9.9	1825	3.5
overalloutcome	6044	9.6	0	0.0
startedtreat	4242	6.7	602	1.2
ethgrp	2811	4.4	1229	2.4
dateofdeath	1235	2.0	357	0.7
pulmextrapulm	177	0.3	213	0.4
sex	101	0.2	110	0.2
phec	32	0.1	0	0.0
age	25	0.0	0	0.0
caserepdate	0	0.0	0	0.0
year	0	0.0	0	0.0
culture	0	0.0	0	0.0

For nested variables with rare outcomes an alternative approach for estimating the proportion of missing data is to first filter the data for the top level variable required for the nested variable to be defined and to then compute the proportion of these notifications that are missing data for the outcome of interest. For the date of starting treatment this approach leads to an estimate of 5.9% (6434/108410) being missing, which is more complete than previously estimated. For cases that are known to have completed treatment 16.5% (13804/83891) are missing a date for the end of treatment. In notifications that are known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death. In any analysis where these variables are used the missing data for these variables will need to be carefully adjusted for. In particular, if cause of death is used it must be clearly stated that it is highly missing and results based on this variable should be properly caveated.

4.2.1.5 Drivers of Variable completeness

As previously discussed, missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. This is of particular importance for variables that have high levels of missingness, as any introduced bias is likely to have a greater impact on the overall results, and for variables that are used extensively in analyses later in this thesis. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis, although any associations may themselves be caused by an external factor. In the following section I explore variables associated with data being missing for several key variables including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment. All of these variables were shown to have high levels of missing data in the previous section and they will all be used extensively throughout this thesis.

In order to explore the drivers of missing data I have reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. Unlike classic approaches to missing data, such as multiple imputation by chained regression (MICE),[51] this is not an imputation. The details of the approach are discussed below.

4.2.1.5.1 Method

In order to reformulate missing data as a logistic regression I took the following steps:

For the variable of interest create a new temporary binary variable, called data status, that is “Missing” when the variable of interest is missing and “Complete” when it is not. Specify “Complete” as the baseline.
For nested variables exclude notifications that do not have the top level outcome required by the variable of interest. An example of this is excluding cases that did not die, or have a missing overall outcome, when investigating TB mortality.
Specify the hypothesised drivers of missingness for the variable of interest. These should be variables with a reasonable hypothesis for how they would drive missingness in the variable of interest. They must also be relatively complete as this approach does not impute missing confounder data.
Fit a logistic regression model with the temporary data status variable as the outcome, adjusting for the hypothesised drivers of missingness.
Exponentiate the returned coefficients, and confidence intervals so that they represent Odds Ratios (ORs).
Refit the model, dropping each variable in turn and then comparing the updated model with the full model using a likelihood ratio test.
Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[52] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.

For all outcomes considered I adjusted for the same set of demographic variables that were both highly complete and also plausibly linked to missingness for all outcomes considered. These were: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Complete case analysis has been used, with the dataset limited to notifications from 2010 and on-wards as socio-economic status was not collected prior to this.

4.2.1.5.2 BCG status

It is clear that BCG status is missing with a MAR mechanism for the variables considered (Table 4.3). BCG data missingness is strongly associated with year of notification, sex age, ethnic group, and socio-economic status. It appears that after adjusting for other variables data completeness increased from 2010 until 2012 but has since showed no clear trend. Men appear to be more likely than women to have a missing BCG status, with the non-UK born also being more likely than the UK born to be missing BCG status. The proportion of those missing BCG status increases with age, with those aged 65+ being over 4 times more likely to be missing BCG status than those aged 0-14 years old. There is also evidence to suggest that notifications in the lowest socio-economic group are more likely to have a missing BCG status but there was no clear evidence of a trend across socio-economic quintiles. The White ethnic group was more likely to have a missing BCG status than any other ethnic group.

Table 4.3: Results from a logistic regression model with data completeness (Complete/Missing) for BCG vaccination as an outcome, adjusted for: year, sex, age (grouped as 0-14 year olds, 15-45 year olds and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). Socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that BCG status is missing at random for the variables considered.
Variable	Category	Missing (N)	Notifications (41659)	Odds Ratio	P value
Year	2010	31.3% (2235)	7143		1.27e-08
	2011	29.8% (2319)	7781	0.93 (0.87, 1.00)
	2012	27.9% (2164)	7755	0.85 (0.79, 0.91)
	2013	27.1% (1907)	7034	0.79 (0.74, 0.86)
	2014	30.1% (1907)	6327	0.91 (0.85, 0.98)
	2015	29.7% (1668)	5619	0.89 (0.82, 0.96)
Sex	Female	27.4% (4847)	17664		8.74e-11
	Male	30.6% (7353)	23995	1.16 (1.11, 1.21)
Age	0-14	13.1% (235)	1793		1.67e-157
	15-44	26.0% (6557)	25235	2.10 (1.82, 2.43)
	45-64	32.8% (2964)	9026	2.84 (2.45, 3.30)
	65+	43.6% (2444)	5605	4.42 (3.80, 5.15)
Ethnic group	White	35.4% (2959)	8359		2.15e-41
	Black-Caribbean	24.6% (228)	928	0.62 (0.52, 0.72)
	Black-African	27.3% (1966)	7204	0.73 (0.67, 0.80)
	Black-Other	24.1% (89)	369	0.65 (0.51, 0.83)
	Indian	25.9% (2805)	10848	0.62 (0.58, 0.68)
	Pakistani	33.2% (2258)	6806	0.89 (0.82, 0.97)
	Bangladeshi	27.9% (469)	1680	0.71 (0.62, 0.80)
	Chinese	33.6% (166)	494	0.88 (0.72, 1.07)
	Mixed / Other	25.3% (1260)	4971	0.65 (0.59, 0.71)
UK birth status	Non-UK Born	29.5% (9104)	30880		7.2e-18
	UK Born	28.7% (3096)	10779	0.75 (0.70, 0.80)
Socio-economic status	1	30.7% (4948)	16131		4.88e-08
	2	26.8% (3383)	12621	0.84 (0.80, 0.89)
	3	29.2% (1905)	6530	0.92 (0.86, 0.98)
	4	30.1% (1142)	3796	0.91 (0.84, 0.99)
	5	31.8% (822)	2581	0.94 (0.85, 1.03)

4.2.1.5.3 Year of BCG vaccination

As for BCG status, year of BCG vaccination is also clearly missing with MAR mechanisms for the variables considered (Table ??). As for BCG status men were more likely to have a missing year of BCG vaccination as were the non-UK born. Older notifications were again more likely to have missing data, with those aged 65+ being more than 2 times more likely to have a missing year of vaccination. However, unlike BCG vaccination status, year of notification shows a clear trend of increasing data completeness from 2010 until 2015. Additionally, for year of BCG vaccination the White ethnic group is more likely to have complete data than any other ethnic group, with those of Black-Caribbean descent being over 3 times more likely to have a missing year of BCG vaccination. Socio-economic status is highly associated with year of vaccination being missing but there is little clear evidence of a trend. The second, and third, poorest quintiles were more likely to have a missing year of vaccination. Whilst the richest, and second richest quintiles were less likely to have a missing year of vaccination.

4.2.1.5.4 Date of death

For date of death there is some evidence that data is missing with an MAR mechanism for ethnic group and socio-economic status, with little evidence for any other association (Table 4.4). These associations should be interpreted carefully due to the strength of the evidence when compared to the number of tests conducted, there is a high likelihood of a type 1 error. Whilst the confidence intervals were wide for all ethnic groups there was some weak indication that the White ethnic group were more likely to have a complete date of death than other ethnic groups. Similarly, those in the lowest socio-economic group were somewhat more likely to have a complete date of death than other quintiles. The reduction in the levels of evidence found for case of death may be linked to the reduction in power for this outcome, as mortality is a rare outcome.

Table 4.4: Results from a logistic regression model with data completeness (Complete/Missing) for date of death as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that there is some evidence that date of death is missing at random for ethnic group, with weaker evidence for all other variables.
Variable	Category	Missing (N)	Notifications (1883)	Odds Ratio	P value
Year	2010	16.6% (53)	320		0.0876
	2011	15.9% (52)	327	0.95 (0.62, 1.46)
	2012	14.5% (51)	351	0.81 (0.53, 1.25)
	2013	13.5% (42)	312	0.73 (0.46, 1.14)
	2014	9.5% (30)	317	0.52 (0.32, 0.84)
	2015	13.3% (34)	256	0.69 (0.43, 1.11)
Sex	Female	14.8% (97)	657		0.609
	Male	13.5% (165)	1226	0.93 (0.70, 1.23)
Age	0-14	10.0% (1)	10		0.929
	15-44	15.7% (31)	198	1.90 (0.32, 36.43)
	45-64	14.6% (68)	465	1.92 (0.33, 36.42)
	65+	13.4% (162)	1210	1.95 (0.34, 37.04)
Ethnic group	White	11.1% (102)	920		0.00373
	Black-Caribbean	21.7% (10)	46	1.58 (0.67, 3.51)
	Black-African	20.1% (27)	134	1.49 (0.76, 2.94)
	Black-Other	20.0% (1)	5	1.59 (0.08, 11.72)
	Indian	17.4% (64)	367	1.08 (0.62, 1.92)
	Pakistani	8.0% (20)	249	0.50 (0.25, 0.99)
	Bangladeshi	22.7% (10)	44	1.65 (0.67, 3.87)
	Chinese	14.3% (3)	21	0.89 (0.19, 3.00)
	Mixed / Other	25.8% (25)	97	1.99 (1.01, 3.92)
UK birth status	Non-UK Born	16.6% (167)	1004		0.133
	UK Born	10.8% (95)	879	0.67 (0.40, 1.14)
Socio-economic status	1	11.4% (79)	695		0.0265
	2	18.3% (86)	470	1.67 (1.19, 2.35)
	3	16.2% (48)	296	1.49 (0.99, 2.22)
	4	12.7% (30)	237	1.21 (0.75, 1.90)
	5	10.3% (19)	185	0.95 (0.54, 1.62)

4.2.1.5.5 Cause of death

For cause of death there is less evidence of an MAR mechanism, with little evidence of an association for year, sex, age, or socio-economic group (Table 4.5). There was, however, strong evidence of an association with ethnic group and very weak evidence of an association with UK birth status. The White ethnic group was less likely to have an incomplete cause of death when compared to the majority of other identified ethnic groups but there was evidence to suggest that cause of death was more likely to be missing in those identifying as being of Black-Caribbean, Black-Other, Indian and Bangladeshi descent. The confidence intervals for these estimates were wide, indicating that these estimates may not be reliable. There was again some weak evidence to suggest that the UK born were more likely to be missing a cause of death than the non-UK born, which reverses the trend observed in the other variables explored. The reduction in the levels of evidence found for case of death may be linked to the reduction in power for this outcome, as mortality is a rare outcome.

Table 4.5: Results from a logistic regression model with data completeness (Complete/Missing) for cause of death as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that cause of death is missing at random for ethnic group and UK birth status, with little evidence for any other variables
Variable	Category	Missing (N)	Notifications (1883)	Odds Ratio	P value
Year	2010	45.0% (144)	320		0.724
	2011	45.6% (149)	327	1.03 (0.75, 1.41)
	2012	45.3% (159)	351	1.02 (0.75, 1.39)
	2013	43.9% (137)	312	0.99 (0.72, 1.37)
	2014	44.8% (142)	317	0.96 (0.70, 1.32)
	2015	38.7% (99)	256	0.80 (0.57, 1.12)
Sex	Female	44.7% (294)	657		0.628
	Male	43.7% (536)	1226	0.95 (0.78, 1.16)
Age	0-14	50.0% (5)	10		0.116
	15-44	35.4% (70)	198	0.64 (0.17, 2.48)
	45-64	43.0% (200)	465	0.90 (0.24, 3.44)
	65+	45.9% (555)	1210	0.96 (0.25, 3.67)
Ethnic group	White	48.2% (443)	920		0.000704
	Black-Caribbean	21.7% (10)	46	0.40 (0.18, 0.82)
	Black-African	45.5% (61)	134	1.41 (0.85, 2.36)
	Black-Other	20.0% (1)	5	0.41 (0.02, 2.87)
	Indian	35.7% (131)	367	0.83 (0.55, 1.27)
	Pakistani	49.4% (123)	249	1.47 (0.95, 2.29)
	Bangladeshi	27.3% (12)	44	0.60 (0.27, 1.26)
	Chinese	52.4% (11)	21	1.64 (0.64, 4.23)
	Mixed / Other	39.2% (38)	97	1.00 (0.58, 1.72)
UK birth status	Non-UK Born	40.1% (403)	1004		0.072
	UK Born	48.6% (427)	879	1.41 (0.97, 2.07)
Socio-economic status	1	43.7% (304)	695		0.345
	2	40.0% (188)	470	0.93 (0.72, 1.18)
	3	42.9% (127)	296	0.98 (0.74, 1.31)
	4	49.8% (118)	237	1.24 (0.91, 1.69)
	5	50.3% (93)	185	1.21 (0.86, 1.71)

4.2.1.5.6 Date of symptom onset

For date of symptom onset there was strong evidence of an MAR mechanism for all variables considered, except for sex (Table 4.6). As found previously, the likelihood of date of symptom onset being missing reduced with year of notification. Children (0-14 years old) were more likely to have a missing date of symptom onset than any other age group as were those in any socio-economic quintile when compared to the poorest group. UK born cases were more likely to have a complete date of symptom onset than non-UK born cases, with the White ethnic group being more likely to have a missing date of symptom onset than most other ethnic groups.

Table 4.6: Results from a logistic regression model with data completeness (Complete/Missing) for date of symptom onset as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that date of symptom onset is missing not at random for the variables for all variables considered, except for sex.
Variable	Category	Missing (N)	Notifications (41659)	Odds Ratio	P value
Year	2010	34.0% (2426)	7143		0
	2011	30.1% (2339)	7781	0.83 (0.78, 0.89)
	2012	24.2% (1878)	7755	0.61 (0.57, 0.66)
	2013	17.5% (1233)	7034	0.41 (0.38, 0.44)
	2014	11.8% (744)	6327	0.25 (0.23, 0.28)
	2015	6.9% (390)	5619	0.14 (0.13, 0.16)
Sex	Female	22.0% (3894)	17664		0.93
	Male	21.3% (5116)	23995	1.00 (0.95, 1.05)
Age	0-14	38.1% (684)	1793		3.59e-73
	15-44	20.5% (5182)	25235	0.35 (0.31, 0.39)
	45-64	20.7% (1870)	9026	0.37 (0.33, 0.42)
	65+	22.7% (1274)	5605	0.43 (0.38, 0.49)
Ethnic group	White	20.9% (1749)	8359		3.98e-09
	Black-Caribbean	23.1% (214)	928	1.04 (0.88, 1.23)
	Black-African	23.0% (1654)	7204	0.89 (0.80, 0.98)
	Black-Other	18.7% (69)	369	0.79 (0.60, 1.04)
	Indian	22.2% (2404)	10848	0.86 (0.79, 0.94)
	Pakistani	19.2% (1305)	6806	0.75 (0.68, 0.83)
	Bangladeshi	23.9% (401)	1680	1.05 (0.91, 1.20)
	Chinese	18.8% (93)	494	0.74 (0.58, 0.94)
	Mixed / Other	22.6% (1121)	4971	0.93 (0.83, 1.03)
UK birth status	Non-UK Born	21.9% (6774)	30880		5.44e-12
	UK Born	20.7% (2236)	10779	0.77 (0.71, 0.83)
Socio-economic status	1	19.9% (3218)	16131		5e-17
	2	22.9% (2888)	12621	1.22 (1.15, 1.29)
	3	24.2% (1578)	6530	1.33 (1.24, 1.43)
	4	22.0% (837)	3796	1.20 (1.09, 1.31)
	5	18.9% (489)	2581	1.00 (0.89, 1.12)

4.2.1.5.7 Date of diagnosis

For date of diagnosis there was again strong evidence for an MAR mechanism for all variables considered, except for sex for which there was very weak evidence (Table 4.7). Increasing completeness was found for year of notification as seen previously, as was an increased likelihood of missing data in males and the non-UK born. The White ethnic group was less likely to be missing data on the data of diagnosis as compared to the majority of other ethnic groups, as were the poorest socio-economic group compared to all other socio-economic quintiles. Children (0-14 years old) were again more likely to be missing data than adults in any age group.

Table 4.7: Results from a logistic regression model with data completeness (Complete/Missing) for date of diagnosis onset as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. The model indicates that date of diagnosis is missing at random for the variables for all variables considered, except for sex.
Variable	Category	Missing (N)	Notifications (41659)	Odds Ratio	P value
Year	2010	26.9% (1918)	7143		1.65e-283
	2011	22.3% (1736)	7781	0.78 (0.72, 0.84)
	2012	18.8% (1458)	7755	0.63 (0.58, 0.68)
	2013	12.9% (909)	7034	0.41 (0.37, 0.44)
	2014	10.4% (659)	6327	0.32 (0.29, 0.35)
	2015	7.4% (415)	5619	0.22 (0.19, 0.24)
Sex	Female	16.9% (2984)	17664		0.0296
	Male	17.1% (4111)	23995	1.06 (1.01, 1.12)
Age	0-14	19.4% (348)	1793		0.000164
	15-44	17.8% (4504)	25235	0.76 (0.67, 0.87)
	45-64	15.9% (1434)	9026	0.73 (0.64, 0.84)
	65+	14.4% (809)	5605	0.72 (0.62, 0.84)
Ethnic group	White	12.5% (1043)	8359		2.91e-67
	Black-Caribbean	25.2% (234)	928	2.21 (1.87, 2.61)
	Black-African	21.9% (1577)	7204	1.49 (1.34, 1.66)
	Black-Other	17.9% (66)	369	1.32 (0.98, 1.74)
	Indian	18.0% (1957)	10848	1.09 (0.99, 1.21)
	Pakistani	11.8% (805)	6806	0.75 (0.67, 0.84)
	Bangladeshi	21.5% (361)	1680	1.57 (1.35, 1.82)
	Chinese	13.4% (66)	494	0.82 (0.61, 1.07)
	Mixed / Other	19.8% (986)	4971	1.32 (1.18, 1.48)
UK birth status	Non-UK Born	18.4% (5696)	30880		6.07e-16
	UK Born	13.0% (1399)	10779	0.71 (0.65, 0.77)
Socio-economic status	1	14.4% (2317)	16131		1.05e-45
	2	19.6% (2469)	12621	1.48 (1.39, 1.58)
	3	20.3% (1325)	6530	1.62 (1.50, 1.75)
	4	17.0% (645)	3796	1.37 (1.24, 1.52)
	5	13.1% (339)	2581	1.07 (0.94, 1.21)

4.2.1.5.8 Date of starting treatment

For date of starting treatment there is little evidence that missing data is associated with any variable considered, except for year of notification (Table 4.8). Variable completeness improved year-on-year, with a 96% drop in missing data in 2015 compared to 2010.

Table 4.8: Results from a logistic regression model with data completeness (Complete/Missing) for date of starting treatment as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables. There is little evidence that the missing data for the date of starting treatment is associated with any variable considered, except for year of notification.
Variable	Category	Missing (N)	Notifications (40977)	Odds Ratio	P value
Year	2010	3.5% (244)	7020		2.4e-70
	2011	3.2% (242)	7655	0.91 (0.76, 1.08)
	2012	2.5% (187)	7628	0.69 (0.57, 0.84)
	2013	2.2% (154)	6923	0.63 (0.51, 0.77)
	2014	0.8% (51)	6239	0.23 (0.17, 0.31)
	2015	0.1% (8)	5512	0.04 (0.02, 0.08)
Sex	Female	2.2% (383)	17439		0.83
	Male	2.1% (503)	23538	0.99 (0.86, 1.13)
Age	0-14	3.0% (54)	1783		0.157
	15-44	2.2% (539)	25000	0.72 (0.53, 0.98)
	45-64	2.0% (180)	8896	0.68 (0.49, 0.95)
	65+	2.1% (113)	5298	0.69 (0.49, 0.99)
Ethnic group	White	2.3% (182)	8055		0.423
	Black-Caribbean	2.2% (20)	916	0.89 (0.54, 1.39)
	Black-African	1.9% (139)	7140	0.73 (0.55, 0.96)
	Black-Other	3.0% (11)	368	1.33 (0.67, 2.38)
	Indian	2.1% (230)	10707	0.86 (0.67, 1.10)
	Pakistani	2.4% (158)	6721	0.92 (0.72, 1.19)
	Bangladeshi	2.2% (37)	1665	0.88 (0.59, 1.29)
	Chinese	1.7% (8)	483	0.68 (0.30, 1.33)
	Mixed / Other	2.1% (101)	4922	0.86 (0.64, 1.15)
UK birth status	Non-UK Born	2.1% (646)	30481		0.763
	UK Born	2.3% (240)	10496	0.97 (0.79, 1.18)
Socio-economic status	1	2.3% (364)	15884		0.517
	2	2.1% (263)	12422	0.92 (0.78, 1.08)
	3	2.0% (131)	6435	0.89 (0.72, 1.09)
	4	1.9% (70)	3712	0.83 (0.63, 1.07)
	5	2.3% (58)	2524	1.04 (0.77, 1.37)

4.2.1.5.9 Date of ending treatment

For date of ending treatment there is evidence that missing data is associated with year of notification and weaker evidence of an association with ethnic group and socio-economic status, with little evidence for any other variable. As found previously, variable completeness increased over time. There was some evidence that poorest socio-economic group was more likely to be missing the date of ending treatment but the evidence for this was mixed. The White ethnic group was more somewhat likely to be missing date of treatment ending than most other ethnic groups.

Table 4.9: Results from a logistic regression model with data completeness (Complete/Missing) for date of ending treatment as an outcome, adjusted for: year, sex, age (grouped as grouped as 0-14 year olds, 15-44 year olds, 45-64 year olds, and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic group 1 indicates the most deprived quintile. Notifications from 2010 onwards were included as socio-economic status was not collected before this. Complete case analysis was used. Odds ratios shown are adjusted for all explanatory variables.
Variable	Category	Missing (N)	Notifications (33606)	Odds Ratio	P value
Year	2010	2.9% (182)	6171		2.52e-14
	2011	2.6% (177)	6855	0.88 (0.71, 1.08)
	2012	2.4% (164)	6882	0.80 (0.64, 0.99)
	2013	1.5% (97)	6298	0.51 (0.39, 0.65)
	2014	1.2% (66)	5341	0.40 (0.30, 0.53)
	2015	1.4% (28)	2059	0.45 (0.30, 0.66)
Sex	Female	2.1% (311)	14630		0.859
	Male	2.1% (403)	18976	1.01 (0.87, 1.18)
Age	0-14	2.7% (44)	1617		0.711
	15-44	2.0% (419)	21027	0.83 (0.60, 1.18)
	45-64	2.3% (165)	7272	0.88 (0.62, 1.27)
	65+	2.3% (86)	3690	0.83 (0.56, 1.23)
Ethnic group	White	2.9% (176)	6076		0.00931
	Black-Caribbean	2.8% (21)	753	1.01 (0.62, 1.57)
	Black-African	1.9% (114)	6071	0.69 (0.52, 0.93)
	Black-Other	2.3% (7)	306	0.88 (0.37, 1.78)
	Indian	1.7% (150)	8842	0.66 (0.51, 0.87)
	Pakistani	2.5% (140)	5668	0.94 (0.72, 1.22)
	Bangladeshi	1.3% (18)	1409	0.48 (0.28, 0.78)
	Chinese	2.8% (11)	396	1.09 (0.54, 1.99)
	Mixed / Other	1.9% (77)	4085	0.75 (0.54, 1.02)
UK birth status	Non-UK Born	1.9% (480)	25174		0.153
	UK Born	2.8% (234)	8432	1.17 (0.94, 1.45)
Socio-economic status	1	2.4% (308)	13080		0.000621
	2	1.7% (170)	10266	0.72 (0.60, 0.87)
	3	1.9% (100)	5265	0.82 (0.65, 1.03)
	4	2.8% (84)	2994	1.19 (0.92, 1.52)
	5	2.6% (52)	2001	1.07 (0.78, 1.44)

4.2.1.6 Biases in the ETS

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[53] Additionally, as the data has not been collected with a specific analysis in mind there maybe issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possbile, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period, research questions must therefore be either limited to active TB patients, or when extended to the general population the differing population demographics must be accounted for. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[54] Validation studies would be required to account for this, which is beyond the scope of this thesis.

4.2.1.7 Date variables in the ETS

For analyses that aim to reproduce temporal trends in TB incidence, such as dynamic modelling studies, it is important to understand which variables represent the most accurate date of contact with the health system and more generally on what scale date variables can be considered reliable. In the ETS extract used in this thesis there are several date variables that encode useful information including: the date of notification, the date of symptom onset, the date of diagnosis, the date of starting treatment, the date of completing treatment, and the date of death. In the following section I explore these variables using counts and proportions aggregated to the nearest year, month and day. These summary measures are displayed graphically using scatterplots (with trend lines) and violin plots. Violin plots are a compact method of plotting continuous distributions across multiple categories. They are effectively mirrored density plots and can be interpreted similarly. Here the underlying data has also been plotted.

As seen in the previous section (Section 4.2.1), many of these variables have a large proportion of missing data, with date of notification and date of starting treatment having the least amount of missing data. It is also likely that some of the dates recorded are inaccurate or systematically biased. The date of notification represents the simplest variable to use to represent when a case can be defined to have occurred as it is complete for all records. Unfortunately, cases may be notified at any stage of active TB, from initially becoming symptomatic to post-mortem diagnosis and notification. Despite this limitation, date of notification can be used as a baseline on which to judge other date variables and some of these limitations may be mitigated by aggregating data by month or by year. Figure 4.2 a.) shows the number of TB notifications by year and Figure 4.2 b.) shows the number of TB notifications by month. These figures indicate that aggregating by year, rather than month, reduces the level of noise in the estimates and makes the trend over time easier to identify. This is an acceptable approximation if inference is being drawn on the scale of years. For shorter term processes, such as the duration of treatment which is generally considered to take approximately 6 months (Chapter 2), aggregating by year would reduce the accuracy of the estimated parameter. There is some evidence of a seasonal trend in notifications (Figure 4.2 c.)), with a higher proportion of cases notified in the May, June and July than in the rest of the year. This seasonality would have to be accounted for if conducting analysis on a monthly scale and date of notification was being used as the date of first contact with the health system. There is little evidence that date of notification varies by day of the month (Figure 4.2 d.)).

a.) and b.) show notifications over time by date of notification in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases notified in a given month for each year, with some evidence of a seasonal trend. d.) Shows the proportion of cases notified on a given day for each month, there is little evidence of between day variation in cases notified.

Figure 4.2: a.) and b.) show notifications over time by date of notification in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases notified in a given month for each year, with some evidence of a seasonal trend. d.) Shows the proportion of cases notified on a given day for each month, there is little evidence of between day variation in cases notified.

An alternative measure is to use the date of symptom onset. Unfortunately there are multiple issues with this measure, the first of which being is that it is only 68.0% (78054/114820) complete across the data extract. Additionally, completeness changes with time, with 65.7% (3969/6044) complete in 2000, 60.4% (4720/7809) complete in 2008, and 87.7% (5677/6472) complete in 2014. This could lead to spurious trends in the number of cases. Perhaps most importantly the date of symptom onset is highly susceptible to recall bias with the majority of cases becoming symptomatic on the first of each month (Figure 4.3 d.)), with some evidence that a greater number of cases occur in January than would be expected (Figure 4.3 c.)). Theses biases may also be the result of the defaults used during data entry with the first of the month or the first month of the year being used when the exact date is not known. Another possible measure of the number of cases is the date of diagnosis, this should be a more reliable variable than the date of symptom onset, as it does not rely on the recall of the case. However it is only 51.3% (58960/114820) complete across the dataset, with strong evidence of increasing completeness going from 11.6% (699/6044) complete in 2000, to 89.4% (5786/6472) complete in 2014. This trend would be hard to properly account for in any analysis and therefore this variable should not be used as a primary measure.

a.) and b.) show notifications over time by date of symptom onset in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases notified in a given month for each year, with some evidence of a seasonal trend and a higher proportion of cases reporting symptoms starting in January than would be expected. d.) Shows the proportion of cases notified on a given day for each month, with a much higher proportion of cases reproting symptoms on the first of the month than would be expected. On both the scale of months and years there is some evidence of recall bias, with the first month, or first day, reporting higher proportions of cases than would be expected.

Figure 4.3: a.) and b.) show notifications over time by date of symptom onset in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases notified in a given month for each year, with some evidence of a seasonal trend and a higher proportion of cases reporting symptoms starting in January than would be expected. d.) Shows the proportion of cases notified on a given day for each month, with a much higher proportion of cases reproting symptoms on the first of the month than would be expected. On both the scale of months and years there is some evidence of recall bias, with the first month, or first day, reporting higher proportions of cases than would be expected.

The date of starting treatment should be a more reliable contact date as it records an official contact with the health system. Indeed it was 75.7% (4464/5899) complete in 2000 which increased year-on-year to 98.8% (5612/5680) complete in 2015. This increasing completeness may lead to a temporal bias if not properly adjusted for when evaluating the date of starting treatment over time. As for the data of notification there is some evidence of a seasonal trend for date of starting treatment (Figure 4.4 c.)), with a peak of cases starting treatment in May, June and July. However, this seasonal trend is difficult to identify when cases starting treatment are visualised by month over time (Figure 4.4 b.)). Unlike the date of symptom onset there is little evidence of recall bias by month, or by day (Figure 4.4 c.) and d.)).

a.) and b.) show notifications over time by date of starting treatment in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases starting treatment in a given month for each year, with some evidence of a seasonal trend. d.) Shows the proportion of cases starting treatment on a given day for each month, with little evidence of between day variation. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.

Figure 4.4: a.) and b.) show notifications over time by date of starting treatment in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases starting treatment in a given month for each year, with some evidence of a seasonal trend. d.) Shows the proportion of cases starting treatment on a given day for each month, with little evidence of between day variation. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.

The date of ending treatment does not appear to display similar seasonality (Figure 4.5 c.)). This maybe because treatment time varies between individuals and this dilutes the seasonality observed for the date of starting treatment. As noted previously, there was some evidence of recall bias when the proportion of those ending treatment was examined on a day of the month basis, with a larger proportion ending treatment on the first of the month than on any other day (Figure 4.5 d.)). Also as previously noted, these biases may also be the result of the defaults used during data entry. There were also several outlier months in which all notifications were reported as having their treatment on the same date. This is highly unlikely and may indicate an additional data quality issue. The date of ending treatment was not recorded in 2000, or 2001, and was highly missing for the first several years after collection began (45.4% (2593/5712) complete in 2002 and 58.7% (3475/5921) complete in 2003). From 2009 it was over 90% complete, reaching 97.7% (5359/5486) complete in 2013. As for the other data variables discussed this increasing completeness over time may lead to a bias if not accounted for in future analyses.

$a.) and b.) show notifications over time by date of treatment ending in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. d.) Shows the proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. d.) also contains several clear outliers with data from some months indicating that 100\% of notifications had their treatment on the same day. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.$

Figure 4.5: a.) and b.) show notifications over time by date of treatment ending in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases finishing treatment in a given month for each year, with little evidence of a seasonal trend. d.) Shows the proportion of cases finishing treatment on a given day for each month, with a much higher proportion of cases finishing treatment on the first of the month than would be expected. d.) also contains several clear outliers with data from some months indicating that 100% of notifications had their treatment on the same day. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.

Finally, date of death displays little evidence of seasonal variation or recall bias (Figure 4.6 c.) and d.)) but has a strong temporal trend for data completeness, with a year-on-year increase. Data was not collected in 2000 and was only 11.8% (199/1689) complete in 2001, data completeness remained below 20% until 2005 when it increased to 38.3% (353/921). This can be seen as a discontinuity when deaths are aggregated by year and plotted (Figure 4.6 a.)). Missing data also masks a drop in notified cases that died, which fell from 1451 in 2005 to 921 in 2006. In comparison, only 352 cases in 2005 and 353 cases in 2006 had a date of death. Data completeness has remained below 80% with increases in data completeness decreasing year-on-year.

a.) and b.) show notifications over time by date of death in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases who died in a given month for each year, with no evidence of a seasonal trend. d.) Shows the proportion of cases who died on a given day for each month, with little evidence of between day variation. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.

Figure 4.6: a.) and b.) show notifications over time by date of death in the ETS, with a.) aggregated by year and b.) aggregated by month. A trendline has been produced using a locally weighted regression model. Both of these plots show the same overall trend, but b.) contains a large amount of apparent noise. c.) Shows the proportion of cases who died in a given month for each year, with no evidence of a seasonal trend. d.) Shows the proportion of cases who died on a given day for each month, with little evidence of between day variation. Data is only shown from 2001 until 2015 and prior to 2001 this variable was not recorded and it is not complete for 2015.

4.2.2 Demographic data

4.2.2.1 Background

Demographic data used in this thesis is drawn from two main sources: mid-year resident populations, by single year of age, downloaded from the Office for National Statistics (ONS) website for 2000 to 2015 and population estimates from the yearly April to June Labour Force Survey (LFS) stratified by single year of age and UK birth status¹⁵. The LFS is a study of the employment circumstances of the UK population and provides the official measures of employment and unemployment in the UK. It also records other details such as ethnicity and UK birth status which may be used, along with population weightings, to estimate the UK and non-UK born population.

4.2.2.2 Data management

The mid-year population estimates were transformed from wide format into tidy data,[55] with the population estimates from 2000 being reformatted to match those from 2001 on-wards. Data from the LFS was available by year, so each dataset was separately imported into R.[56] Reporting practices have changed with time so the appropriate variables for age, country of origin, country of birth, and survey weight (used to make survey responses representative of the general population) were extracted from each yearly extract, standardised, and combined into a single dataset. The LFS data was then aggregated, accounting for survey weight, by year, age, and UK birth status to provide yearly estimates of the UK born/Non-UK born demographics by age. Finally 5 year age groups were defined using the single year of age.

4.2.2.3 Data structure, completeness, and biases.

Both the mid-year ONS population estimates,[57] and the LFS are assessed for performance and quality elsewhere.[58,59] However, both have several failings that it is important to note, as they could introduce bias in future analysis. Whilst the ONS mid-year, and LFS estimates compare well when aggregated by age (Figure 4.7) there is more disagreement when they are broken down by 5 year age groups (Figure 4.8). For those at working age both data sources are comparable (with approximately a 1% difference across all years). However, for children, young adults, and those who are 85+ the LFS underestimates the total population. This is particularly the case for older adults with between a 5% and 20% discrepancy for those aged 85-89 and a 25% to 45% discrepancy between those aged 90+. This could be problematic as these age groups often have the most severe outcomes to TB infection. A pragmatic approach to this is to exclude those aged 90+ from future analysis as results for this age group will be subject to large amounts of uncertainty which will be difficult to directly incorporate into the results.

Overall population estimates in England derived using ONS (Total) and LFS (Total (LFS)) demographic data. The ONS data is likely to be more reliable as the LFS data is derived using a weighted survey. After accounting for missing UK birth status both datasets provide comparable estimates of the population of England, with a clearly increasing trend over time. However, the ONS data indicates a reduction in population from 2000 until 2001 that is not seen in the LFS data. The UK born and non-UK born populations are estimated using the LFS data.

Figure 4.7: Overall population estimates in England derived using ONS (Total) and LFS (Total (LFS)) demographic data. The ONS data is likely to be more reliable as the LFS data is derived using a weighted survey. After accounting for missing UK birth status both datasets provide comparable estimates of the population of England, with a clearly increasing trend over time. However, the ONS data indicates a reduction in population from 2000 until 2001 that is not seen in the LFS data. The UK born and non-UK born populations are estimated using the LFS data.

$Percentage difference between ONS population estimates and estimates derived from the LFS by 5 year age group. For most age groups there is less than a 2\% difference over time. In older adults (85+) there is a substantially greater difference ranging from 5\% to 40\%.$

Figure 4.8: Percentage difference between ONS population estimates and estimates derived from the LFS by 5 year age group. For most age groups there is less than a 2% difference over time. In older adults (85+) there is a substantially greater difference ranging from 5% to 40%.

4.3 TB notifications

4.3.1 Overview

There were 114,820 notifications between 2000-2015 in England of which 67.6% (77669/114820) were non-UK born. Over this period notifications increased in the non-UK born from 2000 until 2011, since when they have decreased year-on-year (Figure 4.9). In the UK born, notifications remained relatively stable from 2000 until 2011, since then there has been a small decrease. Notifications with missing UK birth status have decreased year-on-year, with only 121 in 2015. The majority of cases were aged between 15-44 years old (60.2% (69106/114820)), with few cases in young children (0-14; 5.1% (5842/114820)) or older adults (65+; 14.4% (16538/114820)). Cases are heterogeneously distributed with the majority of cases in London (42.8% (49142/114820)) with the next highest number of notifications in the West Midlands (12.3% (14100/114820)). Since 2009, 47.2% (24354/51645) of notifications have been BCG vaccinated with 33.2% (17133/51645) having a missing BCG status. Of cases with a known BCG status 66.8% (34512/51645) were recorded as having been BCG vaccinated. From 2010, when collection of socio-economic status began, 38.6% (16800/43533) of cases have been in the lowest socio-economic quintile. For a more complete breakdown of notifications in the ETS see the yearly PHE TB report.[2] It should be noted that these statistics do not take into account changes in population demographics which may mask underlying changes in TB epidemiology, this is addressed in Section 4.5.

Notifications in England from 2000 to 2015 stratified by UK birth status, sourced from the ETS system. Notifications in the non-UK born doubled from 3329 in 2000 to 6021 in 2011, since when they have decreased year-on-year. In the UK born notifications have remained comparable over time, with some evidence of a decrease from 2011 until 2015. UK birth status has become increasingly complete over time with notifications without birth status dropping from 885 in 2000 to 121 in 2015.

Figure 4.9: Notifications in England from 2000 to 2015 stratified by UK birth status, sourced from the ETS system. Notifications in the non-UK born doubled from 3329 in 2000 to 6021 in 2011, since when they have decreased year-on-year. In the UK born notifications have remained comparable over time, with some evidence of a decrease from 2011 until 2015. UK birth status has become increasingly complete over time with notifications without birth status dropping from 885 in 2000 to 121 in 2015.

4.3.2 Age distribution of notifications

Notifications in the ETS are heterogeneous distributed by age as well as by UK birth status.[2] In the non-UK born the majority of cases occur in young adults with few cases in young children or older adults (Figure 4.10). Over time the distribution of cases is becoming more uniform with a reduction in the proportion of cases in young adults. In the UK born the distribution of cases is more homogeneous, although there is some evidence of a higher proportion of cases in working age adults as opposed to older adults and children. Unlike the non-UK born population there is little evidence of a change in the distribution of cases over time. 0-4 year old UK born children make up a higher proportion of cases than other UK born children. This spike is not observed in the non-UK born population. These conclusions may be biased by changes in underlying population demographics, this is addressed in Section 4.5.

Proportion of total yearly notification by 5 year age group in the ETS system in 2005, 2010 and 2015 stratified by UK birth status. Non-UK born cases have a higher proportion of young adult cases with very few cases in children or in older adults. UK born cases have a more uniform distribution of cases with some evidence of a higher proportion of cases in young adults. In the non-UK born the proportion of cases in young adults has decreased over time, with no evidence of a temporal trend in the UK born. These results are not adjusted for population demographics and therefore may be biased.

Figure 4.10: Proportion of total yearly notification by 5 year age group in the ETS system in 2005, 2010 and 2015 stratified by UK birth status. Non-UK born cases have a higher proportion of young adult cases with very few cases in children or in older adults. UK born cases have a more uniform distribution of cases with some evidence of a higher proportion of cases in young adults. In the non-UK born the proportion of cases in young adults has decreased over time, with no evidence of a temporal trend in the UK born. These results are not adjusted for population demographics and therefore may be biased.

4.4 Population Demographics in England

Underlying trends in population demographics can be important factors in driving changes in infectious disease dynamic, so it is important to understand these trends before conducting further analysis. England has an increasing population (Figure 4.7), driven by small increases in the UK born population, and larger increases in the non-UK born population. The increase in the non-UK born population is mainly in young adults, with a reduction in the proportion of the non-UK born population that are older (Figure 4.11). In the UK born the proportion of the population that is in late middle age has increased, with the proportion of younger adults decreasing. The proportion of those aged 75+ has remained constant over time in both the UK born and non-UK born populations.

The changes in population demographics, for both the UK and non-UK born, from 2000 to 2015 may have directly impacted the number and age distribution of TB notifications. In the previous section, it appeared that a higher proportion of cases were in young adults in the non-UK born than in other age groups. Figure 4.7 indicates that this maybe due to a higher proportion of the non-UK born population being young adults. Additionally, Figure 4.7 indicates that proportion of the non-UK born population that were young adults has decreased over time, this mirrors the trend in the age distribution of notifications observed in Figure 4.10 and is likely to be driving part of this trend. In the UK born the population has become older in general, this is not clearly reflected in the age distribution of notifications (Figure 4.10). This may indicate changes in the risk of developing TB.

Figure 4.11: The estimate proprotion of the population in each 5 year age group stratified by UK birth status for 2000, 2008, and 2016.

4.5 TB incidence rates

4.5.1 Motivation

As discussed in Section 4.3.2 and Section 4.4, changes in underlying population demographics may mask or bias trends in TB notifications. To account for this, incidence rates, which indicate the incidence of TB for a standard population size, may be used. Whilst TB incidence rates are available in the yearly PHE TB report,[2] they are limited in detail and do not report age stratified, or UK born stratified incidence rates across years. Estimating these incidence rates will allow for novel analyses to be conducted later in this thesis that explore population adjusted trends in TB. The method used to estimate incidence rates is first outlined, then overall trends in incidence rates, stratified by UK birth status are explored. Finally trends in incidence rates, stratified by age and UK birth status, are investigated.

4.5.2 Method

Age-specific incidence rates were calculated as follows:

\[\begin{equation} \textit{Incidence rate (over time period, t and age, a)} = \frac{\textit{Number of cases (t,a)}}{Population(t, a)} \tag{4.1} \end{equation}\]

Age-standardised rates were calculated using the epiR package for R,[60] using the average age distribution of England from 2000-2015 as the standard population to allow comparison between years. Those aged 90+ were excluded as demographic data for this population were unreliable. The code used to calculate incidence rates is available online as an R package (tbinenglanddataclean¹⁶; see Chapter 1).

4.5.3 Overall trends in TB incidence rates

Incidence has varied with time, increasing from 11.6 per 100,000 people (95% CI 11.3, 11.9) in 2000 to a maximum of 15.6 per 100,000 people (95% CI 15.3, 15.9) in 2011, since when it has decreased to a low of 10.5 per 100,000 people (95% CI 10.2, 10.8) in 2015 (Figure 4.12). This may indicate that TB control efforts are proving effective in preventing TB outbreaks, or may be driven by changes in the composition of those immigrating to England. It also highlights the lack of progress in reducing TB burden in England over the previous two decades, with little evidence of a decrease in overall incidence rates from 2000 until 2015. In the non-UK born incidence rates increased dramatically from 2000 to 2005, since when they have fallen consistently. This may be driven by a change in the composition of the non-UK born population or it may be the result of increased screening of those entering the UK. In comparison, incidence rates fell in the UK born from 2000 until 2005 and then increased until 2012, since when they too have decreased year-on-year. This may indicate that incidence rates in the two populations are linked, with incidence rates in the non-UK born driving incidence rates in the UK born with some time lag. Alternatively, it may be that incidence rates in the two populations are only weakly linked, or not linked at all. In this scenario the TB endemic in England would actually be two nearly separate endemics, each with different drivers. These scenarios can be differentiated using trends in age-specific incidence rates, and with statistical (Chapter 7)) and dynamic modelling (Chapter 8).

$Age standardised incidence rates (by 100,000 population) for all notified TB cases from 2000-2015. Overall incidence rates are shown, along with incidence rates in the UK and non-UK born populations. Point estimates are given along with 95\% confidence intervals for each incidence rate estimate. Trends over time are highlighed by linking points with a line. Incidence rates increased over time from 2000 until 2011, since when they have falled year-on-year. This appears to be driven by increasing incidence rates in the non-UK born from 2000 until 2005, since when they have fallen year-on-year. This trend was not observed in the UK born, in which incidence rates fell from 2000 until 2005 and then increased from 2005 until 2012. As in the non-UK born they have since fallen year-on-year.$

Figure 4.12: Age standardised incidence rates (by 100,000 population) for all notified TB cases from 2000-2015. Overall incidence rates are shown, along with incidence rates in the UK and non-UK born populations. Point estimates are given along with 95% confidence intervals for each incidence rate estimate. Trends over time are highlighed by linking points with a line. Incidence rates increased over time from 2000 until 2011, since when they have falled year-on-year. This appears to be driven by increasing incidence rates in the non-UK born from 2000 until 2005, since when they have fallen year-on-year. This trend was not observed in the UK born, in which incidence rates fell from 2000 until 2005 and then increased from 2005 until 2012. As in the non-UK born they have since fallen year-on-year.

4.5.4 Age stratified incidence rates

Stratifying incidence rates into age groups (children (0-14), adults (15-64) and older adults (65+)) it is clear that the trends observed in the age adjusted overall incidence rates are not seen in all age groups (Figure 4.13). In the 65+ age group there was evidence of a year-on-year decrease in incidence rates from 14.3 per 100,000 people (95% CI 13.5, 15.1) in 2002, to 8.7 per 100,000 people (95% CI 8.1, 9.3) in 2015. In comparison, in the 15-64 year old age group, which represents the majority of cases, incidence rates rose year-on-year to a maximum of 19.5 per 100,000 people (95% CI 19.0, 20.0) in 2011 and then fell year-on-year to 13.3 per 100,000 people (95% CI 12.9, 13.7) in 2015. In children (0-14) incidence rates peaked earlier, with an incidence rate of 3.5 per 100,000 people (95% CI 3.1, 3.9) in 2000 which increased to 5.0 per 100,000 people (95% CI 4.5, 5.5) in 2007. Since when they have decreased to in 2.2 per 100,000 people (95% CI 2.0, 2.6) in 2015.

$Incidence rates (by 100,000 population) for all notified TB cases from 2000-2015, stratified by age group (children (0-14), adults (15-64) and older adults (65+)) and UK birth status. Point estimates are given along with 95\% confidence intervals for each incidence rate estimate. Trends over time are highlighed by linking points with a line. Incidence rates declined overall in children over time. In adults incidnce rates incrseased until 2011 and have since fallen. In older adults incidence rates consistently fell. In the non-UK born, incidence rate also fell in childen but peaked earler in adults and showed little evidence of a downwards trends in older adults until 2013. In the UK born, incidence rates increased in children until 2008, since when they havell fallen. Incidence rates also increased over time in UK born adults until 2012 but has consistently fallen in UK born older adults.$

Figure 4.13: Incidence rates (by 100,000 population) for all notified TB cases from 2000-2015, stratified by age group (children (0-14), adults (15-64) and older adults (65+)) and UK birth status. Point estimates are given along with 95% confidence intervals for each incidence rate estimate. Trends over time are highlighed by linking points with a line. Incidence rates declined overall in children over time. In adults incidnce rates incrseased until 2011 and have since fallen. In older adults incidence rates consistently fell. In the non-UK born, incidence rate also fell in childen but peaked earler in adults and showed little evidence of a downwards trends in older adults until 2013. In the UK born, incidence rates increased in children until 2008, since when they havell fallen. Incidence rates also increased over time in UK born adults until 2012 but has consistently fallen in UK born older adults.

Further stratifying incidence rates, by both age group and UK birth status, it is clear that the contribution of the non-UK born dominates that of the UK born in adults (15-64) but that the reverse is true in older adults (65+) and trends appear to be similar in children (0-14), regardless of UK birth status (Figure 4.13). In the non-UK born, incidence rates have fallen year-on-year in children but increased from 2000 until 2005 in adults, since when they have decreased. In non-UK born older adults there is less clear evidence of a trend over time, although incidence rates have fallen, as in other populations, from 2011 on-wards. In the UK born, incidence rates increased in children from 2000 until 2008, since when they too have consistently fallen. UK born adults had increasing incidence rates year-on-year until 2012 but incidence rates have since fallen to pre 2000 levels. In older UK born adults incidence rates have consistently fallen, more rapidly from 2000 until 2008 and since 2014.

Another approach to explore trends in age stratified incidence rates is to visualise them across 5 year age groups, for a selected subset of years. This can be seen in Figure 4.14 stratified by UK birth status. This figure indicates that TB incidence in the non-UK born has been driven by high incidence rates in young adults. Incidence rates in this population increased dramatically between 2000 and 2005 and then fell in all age groups, except 20-24 years old by 2010. In 2015 there was little evidence of this peak in young adults but a secondary spike in much older adults (75+) remained. In the UK born, incidence rates increased with age in 2000, this trend has weakened over time, with a secondary peak developing in young adults (with a 5 year lag when compared to the peak observed in non-UK born adults). In 2015, incidence rates in the UK born were largely homogeneous except for a gradual increase in much older adults (75+), and lower incidence rates in children. 0-4 year old children have remained at greater risk of TB, compared to other children across the time period for which data is available. There is some evidence that incidence rates fell in this group after the introduction of BCG vaccination in 2005, with incidence rates in older children (5-9) also having fallen by 2015.

$Age-specific incidence rates (by 100,000 population) grouped into 5 year age categories for 2000, 2005, 2010 and 2010, stratified by UK birth status. Point estimates are given along with 95\% confidence intervals for each incidence rate estimate. Trends across age distributions are highlighted by linking points with a line. This Figure indicates that TB incidence in the non-UK born has been driven by high incidence rates in young adults. Incidence rates in this population increased dramatically between 2000 and 2005 and then fell in all age groups, except 20-24 years old by 2010. In 2015 there was little evidence of this peak in young adults but a secondary spike in much older adults (75+) remained. In the UK born, incidence rates increased with age in 2000, this trend has weakened over time, with a secondary peak developing in young adults (with a 5 year lag when compared to the peak observed in non-UK born adults). In 2015, incidence rates in the UK born were largely homogeneous except for a gradual increase in much older adults (75+), and lower incidence rates in children. 0-4 year old children have remained at greater risk of TB, compared to other children across the time period for which data is avialable.$

Figure 4.14: Age-specific incidence rates (by 100,000 population) grouped into 5 year age categories for 2000, 2005, 2010 and 2010, stratified by UK birth status. Point estimates are given along with 95% confidence intervals for each incidence rate estimate. Trends across age distributions are highlighted by linking points with a line. This Figure indicates that TB incidence in the non-UK born has been driven by high incidence rates in young adults. Incidence rates in this population increased dramatically between 2000 and 2005 and then fell in all age groups, except 20-24 years old by 2010. In 2015 there was little evidence of this peak in young adults but a secondary spike in much older adults (75+) remained. In the UK born, incidence rates increased with age in 2000, this trend has weakened over time, with a secondary peak developing in young adults (with a 5 year lag when compared to the peak observed in non-UK born adults). In 2015, incidence rates in the UK born were largely homogeneous except for a gradual increase in much older adults (75+), and lower incidence rates in children. 0-4 year old children have remained at greater risk of TB, compared to other children across the time period for which data is avialable.

4.5.5 Incidence rates in children (0-14 years old) as a proxy for TB transmission

Trends in incidence rates in UK born young children (0-14 years old) are used as a proxy for recent transmission and compared to the overall incidence rate in order to extrapolate the degree of reactivation occurring in older populations.[2] Whilst this proxy approach is limited, in that it assumes that different population groups have an equivalent risk of TB and that TB control measures are the same across age groups, it may be combined with other methods to derive a good understanding of TB transmission. In Figure 4.12 incidence rates in the UK born decreased from 2000 until 2006 and then increased until 2011, since when they have fallen. This trend was not seen in UK born children, in whom incidence rates increased over time until 2008 (Figure 4.13). The trend in UK born children diverging from that seen in the overall population may be interpreted as evidence that TB transmission increased from 2000 to 2008, and then decreased subsequently. Unfortunately, this conclusion is difficult to extrapolate to older populations as it is likely that UK born children (the segment with non-UK born parents) have more interaction with non-UK born adults than UK born adults do. Additionally, BCG vaccination of high risk UK born children was introduced in 2005, which is likely to have depressed incidence rates since then. More complex modelling approaches are required to explore this question in more detail, this is explored in greater detail later in this thesis.

4.6 TB outcomes

4.6.1 Motivation

Whilst TB outcomes are tracked in detail in the yearly PHE TB reports,[2] the role of BCG vaccination has not previously been considered. There is some evidence that BCG vaccination may reduce all-cause mortality,[33–35] TB mortality,[28] and improve treatment outcomes.[36] The evidence for this in the ETS will be explored in the following section for: all-cause mortality, TB mortality, successful treatment at 12 months, and lost to follow up. TB outcomes are also likely to vary with age and UK birth status, both of which may mask potential variation due to BCG vaccination if not accounted for. As when identifying trends in TB notifications, relying solely on case counts for TB outcomes gives a biased picture as the underlying number of cases may change. For this reason in this section I explore TB outcomes using case rates.

4.6.2 Method

Case rates were calculated as follows and confidence estimates were estimated using the prop.test function from the stats R package:

\[\begin{equation} \textit{Case rate (over time period, t and age, a)} = \frac{\textit{Number of cases with outcome of interest (t,a)}}{\textit{Number of cases with known outcome (t, a)}} \times 100 \tag{4.2} \end{equation}\]

4.6.3 All-cause mortality

In 2015 fewer UK born cases died from any cause in the ETS than in 2000 but the number of non-UK born cases dying remained stable (Figure 4.15). However, the case all-cause fatality rate indicates that the rate of all-cause deaths has increased over time in both the UK and non-UK born. There is also evidence to suggest that the case all-cause fatality rate is higher in those born in the UK than in the non-UK born and that it is higher for BCG vaccinated versus unvaccinated cases. The highest case all-cause fatality rate, regardless of UK birth status is observed in those missing UK birth status. In both populations the case all-cause fatality rate increases with age (as might be expected) but also has a secondary peak in early childhood (0-4) (Figure 4.16). The all-cause case fatality rate is higher in BCG unvaccinated cases, compared to vaccinated cases, from early adulthood until 50 years of age in the UK born but there is less evidence of a difference in the non-UK born. Young non-UK born children missing BCG status are particularly at risk of death from any cause.

$a.) Cases that died from any cause by year of notification stratified by UK birth and BCG status, b.) Case all-cause fatality rate stratified by UK birth and BCG status. Point estimates along with 95\% confidence are shown for all estimates. All-cause mortality has reduced over time in the UK born but remained stable in the non-UK born. This is also reflected in the case fatality rate with the UK born having a higher rate regardless of BCG status. The recording of BCG status has improved over time but it appears that for years with data BCG unvaccinated cases have a higher all-cause case fatality rate in both the UK and non-UK born. In both populations those missing UK birth status are more likely to die from any cause. Data is incomplete for 2015, with cases that survived being potentially more likely to be missing than those that died. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased mortality rate for 2015.$

Figure 4.15: a.) Cases that died from any cause by year of notification stratified by UK birth and BCG status, b.) Case all-cause fatality rate stratified by UK birth and BCG status. Point estimates along with 95% confidence are shown for all estimates. All-cause mortality has reduced over time in the UK born but remained stable in the non-UK born. This is also reflected in the case fatality rate with the UK born having a higher rate regardless of BCG status. The recording of BCG status has improved over time but it appears that for years with data BCG unvaccinated cases have a higher all-cause case fatality rate in both the UK and non-UK born. In both populations those missing UK birth status are more likely to die from any cause. Data is incomplete for 2015, with cases that survived being potentially more likely to be missing than those that died. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased mortality rate for 2015.

$Age distribution (in 5 year age groups) of the case all-cause mortality rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95\% confidence intervals are shown. In both populaitons the case all-cause fatality rate increases with age, and has a secondary peak in early childhood (0-4). The all-cause case fatality rate is higher in BCG unvaccinated cases, compared to vaccinated cases, from early adulthood until 50 years of age in the UK born. There is less evidence of a difference in case fatality rates in the non-UK born. Case missing BCG status are more likely to die in both populations, with young non-UK born children being particularly at risk.$

Figure 4.16: Age distribution (in 5 year age groups) of the case all-cause mortality rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95% confidence intervals are shown. In both populaitons the case all-cause fatality rate increases with age, and has a secondary peak in early childhood (0-4). The all-cause case fatality rate is higher in BCG unvaccinated cases, compared to vaccinated cases, from early adulthood until 50 years of age in the UK born. There is less evidence of a difference in case fatality rates in the non-UK born. Case missing BCG status are more likely to die in both populations, with young non-UK born children being particularly at risk.

4.6.5 Successful treatment

Successful treatment within 12 months has increased in both populations over time in terms of cases (Figure 4.19). The case successful treatment rate initially decreased for both UK and non-UK born populations but since 2012 has improved in the UK born. There is little evidence to suggest that the successful treatment rate varies by BCG status, or by UK birth status. Successful treatment rates appear to be lowest for young adults and highest for young children (Figure 4.20).

$a.) Cases that were treated successfully within 12 months by year of notification stratified by UK birth and BCG status, b.) Case successful treatment within 12 months rate stratified by UK birth and BCG status. Point estimates along with 95\% confidence are shown for all estimates. Successful treatment within 12 months has increased in both populations over time in terms of cases. The case successful treatment rate initailly decreased for both UK and non-UK born populations but since 2012 has improved in the UK born. There is little evidence to suggest that the case successful treatment rate varies by BCG status. Data is incomplete for 2015, with cases that were successfully treated being potentially less likely to be missing than those that were not. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased successful treatment rate for 2015.$

Figure 4.19: a.) Cases that were treated successfully within 12 months by year of notification stratified by UK birth and BCG status, b.) Case successful treatment within 12 months rate stratified by UK birth and BCG status. Point estimates along with 95% confidence are shown for all estimates. Successful treatment within 12 months has increased in both populations over time in terms of cases. The case successful treatment rate initailly decreased for both UK and non-UK born populations but since 2012 has improved in the UK born. There is little evidence to suggest that the case successful treatment rate varies by BCG status. Data is incomplete for 2015, with cases that were successfully treated being potentially less likely to be missing than those that were not. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased successful treatment rate for 2015.

$Age distribution (in 5 year age groups) of the case successful treatment within 12 months rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95\% confidence intervals are shown. There is little evidence that successful treatment rates differ greatly by BCG or UK birth status when stratified by age. Successful treatment rates appear to be lowest for young adults and highest for young children.$

Figure 4.20: Age distribution (in 5 year age groups) of the case successful treatment within 12 months rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95% confidence intervals are shown. There is little evidence that successful treatment rates differ greatly by BCG or UK birth status when stratified by age. Successful treatment rates appear to be lowest for young adults and highest for young children.

4.6.6 Lost to follow up

As for other outcomes discussed, cases lost to follow up has decreased over time in the UK born, but increased in the non-UK born (with incomplete data for 2015) (Figure 4.21). In all populations the case loss to follow up rate has decreased over time, although this may be biased as cases may not have had sufficient time to be classed as lost to follow up. In both populations there is little evidence to suggest variation by BCG status but the loss to follow up was higher in the non-UK born than in the UK born. This was true across all age groups, and there was again little evidence of variation due to BCG status (Figure 4.22). Young adults were the most likely to be lost follow up in both populations but this appeared to be a particular issue in the non-UK born.

$a.) Cases that were lost to follow up stratified by UK birth and BCG status, b.) Case lost to follow up rate stratified by UK birth and BCG status. Point estimates along with 95\% confidence are shown for all estimates. Loss to follow up has decreased over time in the UK born, but increased in the non-UK born (with incomplete data for 2015). The case loss to follow up rate has decreased over time for the UK born but increased for the non-UK born. In both populations there is little evidence that loss to follow up varies by BCG status. Data is incomplete for 2015, with cases that were lost to follow up being potentially less likely to be missing than those that were not. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased lost to follow up rate for 2015.$

Figure 4.21: a.) Cases that were lost to follow up stratified by UK birth and BCG status, b.) Case lost to follow up rate stratified by UK birth and BCG status. Point estimates along with 95% confidence are shown for all estimates. Loss to follow up has decreased over time in the UK born, but increased in the non-UK born (with incomplete data for 2015). The case loss to follow up rate has decreased over time for the UK born but increased for the non-UK born. In both populations there is little evidence that loss to follow up varies by BCG status. Data is incomplete for 2015, with cases that were lost to follow up being potentially less likely to be missing than those that were not. This may be the cause of the observed increase in uncertainty and may also have resulted in a biased lost to follow up rate for 2015.

$Age distribution (in 5 year age groups) of the case loss to follow up rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95\% confidence intervals are shown. There is little evidence of variation by BCG status but loss to follow up is higher in the non-UK born compared to the UK born across all age groups. Young adults are the most likely to be lost follow up in both populations but this is a particular issue in the non-UK born.$

Figure 4.22: Age distribution (in 5 year age groups) of the case loss to follow up rate presented on a square root scale. Estimates are stratified by BCG and UK birth status. Point estimates and 95% confidence intervals are shown. There is little evidence of variation by BCG status but loss to follow up is higher in the non-UK born compared to the UK born across all age groups. Young adults are the most likely to be lost follow up in both populations but this is a particular issue in the non-UK born.

4.7 Discussion

In this chapter I have explored the epidemiology of TB in England using routine datasets, with a particular focus on the impact of missing data, the mechanisms underlying that missing date, seasonal trends, the role of age, UK birth status, and BCG status. I have also estimated incidence rates, stratified by UK birth status and age, which I then used to identify trends in TB incidence over time. Finally, I explored TB outcomes in England using case rates, again stratified by BCG status and UK birth status.

In the ETS system I found a high degree of missing data for several important variables. I also found that there is likely to be strong missing at random (MAR) mechanism underlying this missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in future analysis to avoid bias. I found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood. I also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend.

As reported elsewhere, I found that TB incidence initially increased from 2000 until 2011 but has since decreased. This was mainly driven by changing incidence in the non-UK born with a slight decrease in UK born incidence in recent years. Stratifying by age, I found that non-UK born cases were more likely to be young adults than any other age group but that the age distribution of the UK born was more nearly uniform. There was some evidence that these trends in TB incidence were driven by changing population demographics, with a large increase in the young adult non-UK born population between 2000 and 2015. In general the population of England is ageing, except for the non-UK born population which is still primarily made up of young adults. This is likely to impact trends in TB over time, with more severe outcomes but potentially less TB transmission.

After estimating incidence rates, I found that TB incidence rates increased over time in the UK born from 2000 until 2005, since when they have declined year-on-year. There appeared to be some linkage between the UK born and the non-UK born with incidence rates in the UK born initially decreasing until 2005, then increasing year-on-year until 2012. Since then they have decreased, in line with the decreases seen in the non-UK born. Stratifying incidence rates by age gives insights into what may be driving these mechanisms. In the non-UK born, incidence rates have decreased over time in children (0-14), increased in adults (15-64) through to 2005 before again beginning to decrease year-on-year, and remained relatively stable in older adults (65+) until 2011 since when they have also fallen. These trends are not mirrored in the UK born with incidence rates initially increasing in children through to 2008 before beginning to decline. Incidence rates also increased in adults through to 2011 before again beginning to decline. Incidence rates in older adults dramatically decreased between 2000 and 2015 with some evidence of a decline in the rate of this decrease from 2007 on-wards. These findings indicate that current reductions in TB incidence may not be reaching the young UK born adult population, additional control measures may be required to reduce TB incidence in this population further. Finally, I explored the use of incidence rates in UK born children as a proxy of TB transmission in England. There may be issues with this method as UK born children may not be representative of the population as a whole as they may be more likely to mix with higher risk non-UK born adults and because the change of BCG vaccination policy may have depressed incidence rates in children. More work is required, using both dynamic and statistical modelling, to understand whether incidence rates in children may be reliably used to proxy TB transmission.

Using case rates, I found that there was some evidence that cases who were not BCG vaccinated may be more likely to suffer from negative TB outcomes with differences in all-cause mortality and TB mortality. These differences were observable after stratifying by UK birth status and BCG status with young adults deriving the greatest apparent benefit from BCG vaccination. TB outcomes were also generally worse in the non-UK born, except for successful treatment.

Findings from this chapter are used throughout the later chapters of this thesis. In particular, Chapter 6 uses statistical modelling to exploring the impact of BCG vaccination on TB outcomes in greater detail, Chapter 7 explores the impact of the change in BCG vaccination policy on TB incidence rates using the incidence rate estimates from this chapter, Chapter 8 uses the understanding of the ETS gained from this chapter to parameterise a dynamic TB transmission model, and Chapter 9 uses the insights gained into the date variables in the ETS to fit a dynamic TB transmission model.

4.8 Summary

In this chapter the key data sources used in this thesis have been examined in detail with a particular focus on the role of age, UK birth status and BCG vaccination status. The role of missing data and potential mechanisms driving it have also been extensively explored. Data completeness was found to increase dramatically over time for many variables, which must be accounted for in any analysis using these variables to identify temporal trends.
TB incidence rates stratified by age and UK birth status have been calculated, along with case rates for TB outcomes. These estimates were then to extensively explore trends in TB in England, identifying possible analysis questions to be addressed later in this thesis.
The code used in this chapter to import, clean and manipulate the data sources has been made accessible separately as an R package (tbinenglanddataclean; see Chapter 1), along with documentation of the required data sources and package functions. If interested in reproducing this work from the raw data please see this documentation for details.
Findings from this chapter are used throughout the later chapters of this thesis: to inform analysis questions (Chapter 6 and 7), identify variables for which missing data must be imputed (Chapter 6 and 7), to parameterise a dynamic TB transmission model (Chapter 8), and to fit a dynamic TB transmission model (Chapter 9).

References

2 Public Health England. Tuberculosis in England 2017 report ( presenting data to end of 2016 ) About Public Health England. 2017.

21 Public Health England. Tuberculosis in England: 2018. 2019;1–218.

28 Abubakar I, Pimpin L, Ariti C et al. Systematic review and meta-analysis of the current evidence on the duration of protection by bacillus Calmette-Guérin vaccination against tuberculosis. Health technology assessment 2013;17:1–372, v–vi.

33 Garly ML, Martins CL, Balé C et al. BCG scar and positive tuberculin reaction associated with reduced child mortality in West Africa: A non-specific beneficial effect of BCG? Vaccine 2003;21:2782–90.

35 Rieckmann A, Villumsen M, Sørup S et al. Vaccinations against smallpox and tuberculosis are associated with better long-term survival: a Danish case-cohort study 19712010. International journal of epidemiology 2016;0:1–11.

36 Jeremiah K, Praygod G, Faurholt-Jepsen D et al. BCG vaccination status may predict sputum conversion in patients with pulmonary tuberculosis: a new consideration for an old vaccine? Thorax 2010;65:1072–6.

45 French CE, Antoine D, Gelb D et al. Tuberculosis in non-UK-born persons, England and Wales, 2001-2003. Int J Tuberc Lung Dis 2007;11:577–84.

46 Kruijshaar M, French C, Anderson C et al. Tuberculosis in the UK, Annual report on tuberculosis surveillance and control in the UK 2007. Thorax 2007;50:703–3.

47 Public Health England. The Green Book. 2013;391–409.

48 Sterne JAC, White IR, Carlin JB et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. Bmj 2009;338:b2393–3.

49 Pillaye J, Clarke A. An evaluation of completeness of tuberculosis notification in the United Kingdom. BMC Public Health 2003;3:31.

50 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

51 van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in r. Journal of Statistical Software 2011;45:1–67.https://www.jstatsoft.org/v45/i03/

52 Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? Bmj 2001;322:226–31.

53 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

54 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

55 Wickham H. Tidy Data. Journal of Statistical Software 2014;59:1–23.

56 R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: 2016.

57 Office for National Statistics. Accuracy of official high-age population estimates, in England and Wales: an evaluation. 2016. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/ageing/methodologies/accuracyofofficialhighagepopulationestimatesinenglandandwalesanevaluation

58 NHS Trafford CCG Governing Body. Performance and Quality Report. 2015;1–47.

59 Office for National Statistics. A comparison of the 2011 Census and the Labour Force Survey ( LFS ) labour market indicators. 2012. https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/articles/acomparisonofthe2011censusandthelabourforcesurveylfslabourmarketindicators/2012-12-11

60 Stevenson M, Nunes T, Heuer C et al. epiR: Tools for the Analysis of Epidemiological Data. 2017.

Data cleaning code: https://www.samabbott.co.uk/tbinenglanddataclean/reference/clean_munge_ets_2016.html ↩
Demographic data sources: https://github.com/seabbs/tbinenglanddataclean ↩
tbinenglanddataclean: https://github.com/seabbs/tbinenglanddataclean ↩