Introduction

Most guidelines for uncomplicated UTI recommend treatment with empirical antibiotics. However, when urine is cultured, only approximately one in three women with UTI symptoms is found to have a UTI as defined by a positive bacterial culture1. Therefore, prescribing empirically may result in antibiotic overuse and contribute to the development of antimicrobial resistance. Clinicians generally base treatment decisions on symptoms, urine appearance, urine dipstick results, risk factors for development of complications and patient preference2,3. Some of these features have been combined into clinical prediction rules, but the predictive values remain suboptimal4. Therefore, the development of better diagnostic tools for UTI is essential for improving antimicrobial stewardship.

Exploratory approaches to aid UTI diagnosis have been based on serum and urinary biomarkers. The specificity of blood immune markers is limited by the possibility of cross-reactivity due to other infections or inflammatory responses. Urinary biomarkers that might reflect local immunological responses by the bladder epithelium include nerve growth factor (NGF), chemokines including IL-8/CXCL85,6 and antimicrobial peptides (AMPs), human α-defensin 5 (HD5)7 and neutrophil gelatinase-associated lipocalin (NGAL)8. However, there is a lack of comprehensive biomarker screening studies for UTI.

With an expansion in the list of potential UTI biomarkers, it is also important to identify the most useful and readily available clinical information that could assist UTI diagnosis and guide prescribing decisions at the point of care. Many studies have implemented multivariate statistical models such as logistic regression to identify UTI clinical predictors2,4. These models are bound by relationship assumptions between predictors and outcome variables. In this study we aimed to use a machine learning-based approach, in which random forest (RF) and support vector machines (SVM) were implemented to allow fewer assumptions and more complex relationships between predictors. We combined these algorithms with recursive feature elimination (RFE) to extract the best predictor(s) for uncomplicated UTI using clinical information and potential biomarkers present in urine. These analytical approaches have been widely used in medical applications, such as drug discovery, biomarker selection and early diagnosis9,10,11,12,13,14,15,16. SVM, for instance, is a supervised learning model based on statistical learning for classification and regression analysis, which finds the separating hyperplane with the maximal margin between data from different groups. RF is an ensemble learning method that constructs a multitude of decision trees17 and is a popular approach for diagnosis18 and medical decision support systems19. Both SVM and RF outperform other machine learning methods for discriminant problems20. In this study, the aim was to find the best biomarker for UTI diagnosis, thus the classification ability was an important factor in differentiating UTI groups. Also, considering the complexity of the raw data required for biomarker discovery, the ability to cope with high-dimensional data was another criterion in choosing machine learning methods. 
In RF, each split considers only a small random subset of features rather than all features, which decorrelates the trees and makes RF a strong candidate algorithm for high-dimensional data. For SVM, the separating hyperplane depends only on the support vectors rather than on all data points, which gives it a separate advantage in dealing with high-dimensional data.
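
For reference, the maximal-margin idea can be written in the standard soft-margin form. This is the textbook formulation rather than anything taken from the study itself; the symbols w, b, ξ_i and φ are introduced here purely for illustration, while C and σ play the role of the cost and kernel-width hyperparameters named in the Methods. The kernel is written in the form used by a radial basis function kernel with a single width parameter.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,

k(x, x') = \phi(x)^{\top}\phi(x') = \exp\bigl(-\sigma\,\lVert x - x'\rVert^{2}\bigr)
```

Only the observations with a non-zero multiplier in the dual problem (the support vectors) determine w, which is why the fitted boundary does not depend on every data point.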

Results

Clinical information to predict UTI

Our study cohort included 183 women who participated in the POETIC (Point of care testing for urinary tract infection in primary care) trial21. They ranged in age from 18 to 85 years, and the key UTI symptoms of urgency, frequency and dysuria were present in 84.2%, 91.8% and 77.0% of patients, respectively. The frequency of other symptoms is presented in Table 1. Following urine culture and according to the POETIC protocol22, 79 (43.2%) and 104 (56.8%) patients were classified as UTI positive and negative, respectively. Data from 128 patients (70%) were used for model training while data from 55 patients (30%) were used for testing model performance.
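
The class-preserving 70/30 split described above can be illustrated with a short sketch. It is written in Python rather than the R tooling used in the study, and the per-class rounding rule and the random seed are assumptions made for illustration; only the class counts (79 positive, 104 negative) come from the text.

```python
import random

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split indices into train/test while keeping class proportions.

    Assumption: the number of training cases per class is obtained by
    rounding len(class) * train_frac to the nearest integer.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # random assignment within each class
        n_train = round(len(idx) * train_frac)
        train_idx += idx[:n_train]
        test_idx += idx[n_train:]
    return sorted(train_idx), sorted(test_idx)

# Class counts from the text: 79 UTI positive, 104 UTI negative
labels = [1] * 79 + [0] * 104
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Under this rounding rule the two subsets contain 128 and 55 cases, matching the sizes reported above.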

Table 1 Frequency of clinical and immunological predictors.

Using only the clinical data recorded during the initial consultation, urine cloudiness was the best clinical predictor for UTI with an area under the ROC curve (AUC) of 0.72 (95% CI 0.60–0.85), positive predictive value (PPV) 0.65, negative predictive value (NPV) 0.79, positive likelihood ratio (LR+) 2.55, negative likelihood ratio (LR−) 0.37 and F1 score of 0.69 on the test data subset (Table 2). We then substituted cloudiness (measured as a binary yes/no) with a more discriminatory assessment of cloudiness (turbidity score with three categories; Table 1). This substitution resulted in a similar AUC of 0.73 (95% CI 0.60–0.85) and an improved PPV of 0.76 and LR+ of 4.38 (Table 2). Neither age nor any other clinical feature added to the predictive value of cloudiness/turbidity. RF and SVM algorithms produced similar results, except that SVM selected age plus turbidity (Table 2).

Table 2 Performance of selection and merged models on test data subset.

Urinary biomarkers to predict UTI

We previously reported correlations between bacterial infection and defined immune signatures (‘immune fingerprints’) in other scenarios23,24. To apply this knowledge to the diagnosis of uncomplicated UTI we conducted a comprehensive analysis of 42 inflammatory biomarkers in urine samples. In line with earlier observations, we found positive correlations between many of the immunological biomarkers measured (Supplementary Figure S1). As a consequence, RFE was employed to select the best biomarkers for predicting UTI. Using the RFE coupled with RF algorithm (RF + RFE), IL-1β and MMP9 were selected as the best predictors with AUC of 0.82 (95% CI 0.69–0.94) and F1 score of 0.67 on the test data subset (Table 2). The diagnostic relevance of IL-1β and MMP9 was corroborated in an independent analysis using the SVM + RFE algorithm, which resulted in the selection of the same urinary biomarkers alongside NGAL and IL-8/CXCL8, with a similar AUC and improved LR+ and F1 score, compared to the RF + RFE selection (Table 2). Adding the selected immunological biomarkers to the model with clinical features (including cloudiness or turbidity) did not improve the predictive properties (Table 2). We conclude that while urine cloudiness was the most useful clinical predictor to rule out negative cases, urinary biomarkers were particularly helpful to predict the presence of UTI in symptomatic women.

Variable UTI classification guidelines

Finally, we explored whether changing the bacterial count threshold (based on different national and European UTI guidelines) would affect the selection of clinical and immunological predictors. Using the Public Health England (PHE) guidelines25,26 to interpret urine culture results, 99 (54.1%) and 84 (45.9%) patients were UTI positive and negative, respectively. The European Association of Urology (EAU) guidelines27 classed 118 (64.5%) and 65 (35.5%) as positive and negative, respectively.

Cloudiness/turbidity remained the best clinical predictor when using the PHE or EAU definitions of UTI positivity (Supplementary Table S1). However, the selection of immunological markers varied with UTI classification and the type of machine learning algorithm employed. Using the PHE classification, the best predicting model included a combination of urine cloudiness and NGAL, which resulted in an LR+ of 4.94, an LR− of 0.25 and a good F1 score of 0.82 (Table S1). Using the EAU classification, the combination of turbidity, feeling unwell, foul smell in urine, NGAL and MMP9 resulted in a model with the best predictive properties (Table S1).

Discussion

This is one of the first studies to use machine learning methods to select clinical features and urinary immunological markers to predict culture results for uncomplicated UTI in primary care. We found that cloudiness of urine samples was the best clinical predictor of microbiologically confirmed UTI among symptomatic women, and that assessing cloudiness using a categorical turbidity scale improved the predictive properties further, particularly in identifying positive UTI. We identified a set of four urinary immunological markers (MMP9, NGAL, IL-8/CXCL8 and IL-1β), which performed slightly better than cloudiness/turbidity when used independently. Changing the definition of UTI positivity to that used by PHE and the EAU standards, and using both RF and SVM algorithms, resulted in some changes to predictors, but urine cloudiness/turbidity, and the immunological markers MMP9, IL-1β and NGAL continued to be important predictors, thereby confirming their relevance in UTI diagnosis.

While normal urine samples are usually clear, white blood cells (WBCs), red blood cells, epithelial cells, proteins, crystals, drugs and microorganisms can cause the urine to become cloudy. In uncomplicated UTI, the presence of WBCs and/or bacteria in urine can lead to urine cloudiness. This is consistent with the findings of our study where urine cloudiness/turbidity consistently came out as the best predictor of UTI. This finding is in keeping with previous studies that investigated urine appearance as part of clinical rules to predict UTI in community settings4 and catheterized patients28.

Visual assessment of urine cloudiness by health care staff is recommended in some guidelines as a step in the process of diagnosing uncomplicated UTI (for example PHE)29. Our results highlight the importance of implementing this guideline in ruling out negative UTI cases, which is helpful for antibiotic stewardship activities. Furthermore, the improvement on positive UTI prediction by using a turbidity score, instead of binary cloudiness, indicates that the assessment of the degree of cloudiness could improve the diagnosis of uncomplicated UTI within a consultation. In our study, turbidity scores were assessed by the microbiology laboratory after samples were transported from GP practices by standard post at room temperature. As urine turbidity may decrease or increase with prolonged transportation due to WBC lysis or bacterial growth, respectively, our samples were preserved in boric acid to protect WBCs and prevent bacterial growth during transportation30,31. Of note, we found no correlation between transportation time and turbidity score, indicating that boric acid preservation was sufficient to stabilise the samples (data not shown).

Cloudiness has not yet been used in other studies using machine-learning for UTI prediction. Heckerling and colleagues used neural networks with genetic algorithm feature selection to examine 212 women with suspected UTI32. While they found that cloudiness was associated with increased LR+, their genetic algorithm did not retain it for the creation of the neural network. It is possible that this reflects differences between neural networks and RF models. Alternatively, it may reflect differences in the cohort, since the ratio of cloudy:clear urines differed significantly between the two cohorts (current study cloudy:clear ratio 1.13:1, Heckerling et al. 5.84:1), suggesting an underlying difference in the data informing the model. Taylor et al. also recently used machine learning to predict UTI33. They employed the XGBoost machine learning approach with 211 clinical variables to develop models predicting UTI in an emergency department setting. These were reduced to 10 variables (including urine analysis WBCs, bacteria, blood and dysuria) based on expert knowledge and literature reviews. While this approach worked well, it is not suitable for use in primary care given the number of recommended predictors. These studies, along with ours, demonstrate the potential of machine learning algorithms to enhance diagnosis. They also show that the context of the model is vitally important for its utility and that models may need to be customised for end users’ settings.

Predictor selection methods provide an advanced statistical tool to identify markers for infectious diseases but have not yet been widely used24. Using an RFE method coupled with either RF or SVM enabled us to simultaneously screen 17 clinical predictors and 42 immunological biomarkers to identify predictors of UTI in symptomatic women in primary care. Nevertheless, we acknowledge that the relatively small sample size of our study in relation to the number of screened predictors may result in some instability of estimates and overfitting. While RFE is known to be particularly robust against overfitting34, we further reduced this risk by using repeated cross-validation together with a random hyperparameter search within each model. During cross-validation, the model was trained on the training set and validated on a subset of the training data at each iteration, which provided an estimate of the generalization performance of the model for unseen data35. Furthermore, the classifier was trained on all possible combinations of features including the full feature set, and the best combination of features (judged by the generalization performance of the model through cross-validation) was selected as the search space for the next step. Moreover, models were tested on an unseen test data set, which was randomly split off prior to model training, indicating model generalizability to an independent data set.

The most promising immunological biomarkers identified were MMP9, NGAL, IL-8/CXCL8 and IL-1β as selected by SVM + RFE, while RF + RFE selected only IL-1β and MMP9 but with lower LR+ compared to SVM + RFE. In general, RF identifies the strongest predictors while SVM tends to produce stronger models based on a larger number of weaker predictors. The fact that we used two machine learning algorithms for predictor selection increased the confidence in the markers that were selected by both algorithms. There might be potential for improvement in the future by using ensemble methods other than RF; however, given that both RF and SVM found turbidity/cloudiness, MMP9 and IL-1β to be the best predictors of UTI, it is likely that these predictors will remain the most important markers. Ideally, we would be able to verify these as predictors using a large independent cohort, and we encourage further large studies to validate our findings. It is also interesting to note that the identified immunological markers interact with each other during urological infection by restricting bacterial growth and mediating trans-epithelial movement of neutrophils36. IL-1β induces renal production of NGAL in mouse model experiments37, and NGAL modulates MMP9 activity by protecting it from degradation38. MMP9, NGAL and some interleukins have previously been studied as potential biomarkers for UTI, particularly in infants and children; however, conclusions were contradictory39,40,41,42,43,44.

Urine culture is an imperfect gold standard to identify UTI. Bacterial pathogens may die during transport, may not grow using conventional culture techniques or may be rendered unidentified due to contamination of urine samples during collection. There are also differences in opinion on the threshold used to identify significant growth, reflected in different microbiological guidelines. This has a direct impact on the reported prevalence of the disease and subsequently on the evaluation of new tools for UTI diagnosis. This has been shown in this study, as variable numbers of immunological markers were required to reach the optimum prediction depending on the underlying threshold guidelines applied.

This study involved women who participated in the POETIC trial, and who had excess urine samples available following the microbiological analyses included in the POETIC study protocol. No other selection criteria were applied and therefore this should be a relatively representative sample of women presenting in primary care with UTI. We found a slightly higher prevalence of positive UTI (43%) in our study compared to the full trial population (35%), but this is likely to be a chance finding and is unlikely to affect the generalisability of our results. Unfortunately, we were not able to compare urine cloudiness/turbidity or immunological markers with the point of care urine dipstick most commonly used in primary care, as dipstick results were not recorded in the POETIC trial. However, previous studies with similar uncomplicated UTI inclusion criteria, found that dipsticks predicted UTI culture results with a PPV between 0.63 and 0.94 and NPV between 0.20 and 0.81 depending on the diagnostic rule used (presence of nitrite, leukocytes esterase or both) and urine culture colony count threshold4,45. When dipstick results were based on leukocytes esterase results only, the maximum PPV and NPV was 0.86 and 0.72, respectively45. In our study, cloudiness achieved a comparable NPV of up to 0.79, while MMP9, NGAL, IL-8/CXCL8 and IL-1β achieved PPV of 0.82.

In conclusion, we found that urine cloudiness was the best clinical predictor of UTI among symptomatic women, and that grading cloudiness using a turbidity score may improve the predictive value further. We also found that MMP9, NGAL, IL-8/CXCL8 and IL-1β in urine may be useful predictors of UTI. These biomarkers could be used to develop a new point of care test for UTI, subject to validation of our findings in a larger population, across different age groups, using freshly collected urine and a stringent determination of cut-off levels for the individual biomarkers.

Methods

Patient population and clinical data

Clinical information and urine samples were collected as part of a two-arm randomized controlled trial, POETIC (Trial number: ISRCTN65200697)21,46. The current analysis included participants from England and Wales who had excess urine sample available following the initial POETIC microbiology experiments. The POETIC study included women who presented in primary care with at least one key UTI symptom (dysuria, urgency or frequency) that had been present for up to 14 days. Exclusion criteria were pregnancy, signs of complicated UTI, current use of antibiotics and functional or anatomical genitourinary tract abnormalities21,46. Clinical data were collected by general practitioners (GPs). Main UTI symptoms were recorded as present/absent, and their severity was measured on a scale from 0 (not affected) to 6 (as bad as possible). Severity of other symptoms such as fever, flank or abdominal pain, blood in urine, unpleasant urine smell, restricted activity and feeling unwell was also measured (Table 1). Urine cloudiness (clear/cloudy) was reported by GPs following sample examination.

Ethics

Informed consent was obtained from each patient involved in the study as part of the POETIC clinical trial (number: ISRCTN65200697). Ethical approval was given by the Research Ethics Committee (REC) For Wales recognised by the United Kingdom Ethics Committee Authority (UKECA), REC reference 12/WA/0394. This study was conducted in accordance with the principles of the Declaration of Helsinki.

Sample collection, processing and culture

Mid-stream urine samples were collected at the GP clinic in a universal container containing boric acid and sent to the microbiology laboratory (Specialised Antimicrobial Chemotherapy Unit, University Hospital of Wales, Cardiff) by post. Average time from sample collection to processing in the laboratory was 2.2 [SD = 1.4] days. Urine turbidity was scored by microbiology staff, and for the current analysis, it was categorised as: 1 (clear or slightly turbid), 2 (moderately turbid) and 3 (very turbid). Urine samples were then analysed microscopically and cultured on Columbia Blood Agar (CBA) and CHROMagar UTI Orientation media (E&O) at 34–36 °C for 18–20 hrs46. Total and species-specific colony counts were enumerated from CBA and chromogenic agar, respectively. UTI culture positivity was defined as per the POETIC study protocol (Fig. 1).

Figure 1

Urinary tract infection definitions according to POETIC study22, EAU27 and PHE25,29.

Urinary immune biomarker procedure

Cell-free urines were analyzed on a SECTOR Imager 6000 (Meso Scale Discovery) using the V-PLEX Human Cytokine 30-Plex Kit to measure levels of IL-1α, IL-1β, IL-2, IL-4, IL-5, IL-6, IL-7, IL-10, IL-12p40, IL-12p70, IL-13, IL-15, IL-16, IL-17A, IFN-γ, TNF-α, TNF-β, GM-CSF, VEGF, CCL2, CCL3, CCL4, CCL11, CCL13, CCL17, CCL22, CCL26, CXCL8 and CXCL10, and using an ultrasensitive single-plex assay for sIL-6R. Conventional ELISA kits were used to measure creatinine, cystatin C, HSA, MMP8, MMP9 and RBP4 (R&D Systems) as well as fibrinogen (Abcam). HNE was measured using a B.I.T.S. ELISA kit (Mologic); activated PGP, desmosine, FMLP and NGAL were measured using validated in-house developed ELISA kits (Mologic).

Statistical analysis

Data

Our cohort included 183 women with uncomplicated UTI symptoms. For these patients we matched 17 clinical and 70 immunological predictors using patient ID, date of birth and sample ID. There were no missing data on the outcome variable (UTI classes) or the clinical data; however, 28 immunological predictors had >5% missing data and were therefore removed from the subsequent analysis, leaving 42 immunological predictors. Remaining missing data (<5% per predictor) were imputed using Multiple Imputation by Chained Equations in the R package “mice”, using all variables except the outcome. Imputation methods were predictive mean matching, logistic regression and the proportional odds model for numeric variables, binary variables and ordered factor variables, respectively47. UTI classes were defined based on the POETIC guidelines22 for UTI classification (Fig. 1). Alternative UTI classification guidelines by PHE25,26 and the EAU27 were used in sensitivity analyses to explore whether changing the bacterial count threshold for positive UTI would change the marker selection.
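
The missing-data filter described above can be sketched as follows. This is an illustrative Python version, not the R code used in the study; the column names and values are hypothetical, and treating a predictor with exactly 5% missing values as retained is an assumption, since the text specifies only ">5%" and "<5%". The imputation step itself is not reproduced here.

```python
def drop_sparse_predictors(columns, max_missing=0.05):
    """Keep only predictors whose fraction of missing values is <= max_missing.

    `columns` maps a predictor name to its list of values, with None
    marking a missing measurement.
    """
    kept = {}
    for name, values in columns.items():
        frac_missing = sum(v is None for v in values) / len(values)
        if frac_missing <= max_missing:
            kept[name] = values
    return kept

# Hypothetical toy data: 40 samples per predictor
toy = {
    "marker_a": [1.0] * 39 + [None],        # 2.5% missing: retained for imputation
    "marker_b": [1.0] * 36 + [None] * 4,    # 10% missing: removed
}
kept = drop_sparse_predictors(toy)
print(list(kept))
```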

Analysis approach

We used the RFE Algorithm 2 on “caret” R package platform48, which was coupled with either RF49 or SVM (radial basis function kernel in “kernlab” R package)50 algorithms to select the best clinical and immunological predictors. RF + RFE and SVM + RFE models were trained on the clinical and immunological predictors separately (Fig. 2). Models were trained on all possible combinations of features including the full feature set and the best combination of features was selected (Supplementary Figure S2). Following the selection of the best clinical and immunological predictors, we aimed to evaluate the additive predictive value of the selected immunological markers on the selected clinical predictors. Thus, we merged the selected clinical and immunological predictors and used them to train RF and SVM models (Fig. 2). Merging the selected clinical and immunological markers was conducted only when a small number of immunological markers were selected.
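
The selection step can be pictured as a loop over nested feature subsets. The sketch below is in Python rather than the R setup used in the study, and it is a simplification: it ranks the features once on the full set and then scores the top-k subset for each candidate size, omitting the outer resampling loop. The `rank` and `evaluate` callables, the toy importance values and the per-feature penalty are hypothetical stand-ins for a fitted RF or SVM model and its cross-validated performance.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def rfe_select(
    features: Sequence[str],
    rank: Callable[[List[str]], List[str]],
    evaluate: Callable[[List[str]], float],
    sizes: Sequence[int],
) -> Tuple[List[str], Dict[int, float]]:
    """Simplified recursive feature elimination.

    rank:     returns the features ordered from most to least important
    evaluate: returns a performance score for a feature subset (higher is better)
    sizes:    candidate subset sizes to compare
    """
    ranked = rank(list(features))
    # Performance profile: score of the k most important features for each k
    profile = {k: evaluate(ranked[:k]) for k in sizes}
    # Pick the best-scoring size; ties are broken in favour of fewer features
    best_k = max(profile, key=lambda k: (profile[k], -k))
    return ranked[:best_k], profile

# Hypothetical importances and scoring rule, for illustration only
toy_importance = {"turbidity": 0.9, "age": 0.3, "fever": 0.1}

def toy_rank(names: List[str]) -> List[str]:
    return sorted(names, key=lambda n: toy_importance[n], reverse=True)

def toy_evaluate(subset: List[str]) -> float:
    # Reward informative features, penalise every additional feature
    return sum(toy_importance[n] for n in subset) - 0.35 * len(subset)

selected, profile = rfe_select(list(toy_importance), toy_rank, toy_evaluate, sizes=[1, 2, 3])
print(selected)
```

In a real analysis, `evaluate` would return a resampled metric such as the cross-validated AUC, so that the chosen subset size reflects performance on held-out data rather than on the data used for ranking.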

Figure 2

Flowchart of data analysis. RFE: recursive feature elimination. SVM: support vector machine. RF: random forest.

Data pre-processing

For SVM, which does not recognize nominal variables, both binary and ordinal categorical variables were transformed by integer encoding, in which naturally ordered integer numbers were assigned to the levels of the categorical variables to keep the natural order of the clinical data. In addition, continuous data were standardized to a mean of 0 and a variance of 1 for SVM models51. For RF models, categorical variables were not transformed because RF can learn directly from categorical data with no data transformation required.
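
The two SVM pre-processing steps can be sketched in a few lines. This is an illustrative Python version, not the study's R code; the turbidity level labels follow the three categories given in the sample processing section, the function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
from statistics import mean, stdev

# Ordered levels of the turbidity score, lowest first
TURBIDITY_LEVELS = ["clear or slightly turbid", "moderately turbid", "very turbid"]

def integer_encode(values, ordered_levels):
    """Map each categorical level to 1, 2, 3, ... in its natural order."""
    code = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return [code[v] for v in values]

def standardize(values):
    """Centre to mean 0 and scale to unit (sample) variance."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

encoded = integer_encode(["very turbid", "clear or slightly turbid"], TURBIDITY_LEVELS)
z = standardize([1.0, 2.0, 3.0, 4.0])
print(encoded, z)
```

Standardization matters for a radial basis function kernel because the kernel depends on distances between observations, so predictors measured on larger scales would otherwise dominate.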

Model training and testing

Our data included 183 cases that were randomly split into training (70%) and test (30%) subsets while maintaining the proportion of cases with positive UTI. For all training models, three repeats of 10-fold cross-validation were used to avoid overfitting. During cross-validation, the model was trained on the training set and validated on a subset of the training data at each iteration (cross-validation ROC curves are provided in Supplementary Figure S3). The random search method in the caret package51 was implemented to select the optimum hyperparameters (RF: number of features randomly selected for splitting at each tree node [mtry]; SVM: sigma and Cost soft margin [C]; Supplementary Table S2). Model performance was examined on the unseen test data subset. Model performance was compared using the following metrics: AUC, PPV, NPV, LR+, LR−52 and F1 Score (harmonic mean of the precision and recall, which range between 0 and 1 where higher value indicates higher performance)53. For calculating AUC, the probability threshold for a positive UTI class was set to 0.5. All analyses were performed using R software version 3.4.254.
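
The threshold-based metrics listed above all follow from a 2 × 2 confusion matrix. The sketch below, in Python rather than R, shows the standard definitions; the function name is hypothetical, and the counts in the example are illustrative values, not the confusion matrix of any model in this study. Edge cases with empty cells (division by zero) are not handled.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute PPV, NPV, LR+, LR- and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    return {
        "PPV": ppv,
        "NPV": npv,
        "LR+": sensitivity / (1 - specificity),
        "LR-": (1 - sensitivity) / specificity,
        # Harmonic mean of precision and recall
        "F1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Illustrative counts only
m = diagnostic_metrics(tp=18, fp=9, fn=6, tn=22)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```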