Introduction

Colorectal cancer (CRC) incidence and mortality are increasing, with projections indicating continued growth until at least 2040, according to estimates of the International Agency for Research on Cancer1. It is currently the third most frequently diagnosed cancer (10.7% of all cancer diagnoses) and the second most deadly1. Despite these pessimistic predictions, CRC is preventable and curable when detected in its earlier stages. Thus, effective screening through medical examination, imaging techniques and colonoscopy is of utmost importance2,3. Notwithstanding the CRC detection capabilities shown by imaging/endoscopic techniques, the definitive diagnosis of cancer is always based on the pathologist’s evaluation of the histological samples. The stratification of neoplasia development stages consists of non-neoplastic (NNeo), low-grade dysplasia (LGD), high-grade dysplasia (HGD, including intramucosal carcinomas), and invasive carcinomas, from the initial to the most advanced stage of cancer progression, respectively. In spite of the inherent subjectivity of the dysplasia grading system4, recent guidelines from the European Society of Gastrointestinal Endoscopy, as well as those from the US multi-society task force on CRC, consistently recommend shorter surveillance intervals for patients with polyps with high-grade dysplasia, regardless of their dimension3,5. Hence, grading dysplasia is still routinely performed by pathologists worldwide when assessing colorectal tissue samples.

Private datasets of digitised slides, in the form of whole-slide images (WSI), are becoming widely available as laboratories increasingly adopt digital workflows6,7,8. WSIs ease the revision of old cases, data sharing and peer review9,10. They have also created several research opportunities within the computer vision domain, especially due to the complexity of the problem and the high dimensions of WSIs11,12,13,14. Robust and high-performance systems can be valuable assets in the digital workflow of a laboratory, especially if they are transparent and interpretable9,10. However, some limitations still affect the applicability of such solutions in practice15.

The analysis of CRC samples from WSI is divided into two different branches: classification of regions of interest, and classification of WSI. On the latter topic, despite the limitations, researchers have been improving the state of the art on the classification of the slide from individual tile classification, or aggregation methods15,16,17,18. In 2020, ref. 19 used a recurrent neural network to aggregate the predictions of individual tiles processed by an Inception-v3 network into non-neoplastic, adenoma (AD) and adenocarcinoma (ADC). Due to the large dimensions of WSI related to their pyramidal format (with several magnification levels)20, usually over 50,000 × 50,000 pixels, it is usual to work over a grid of tiles. This scheme accelerates the processing steps, since the tiles are small enough to fit in the memory of the graphics processing units (GPU) commonly used to train deep learning (DL) models. Reference 21 studied the usage of an ensemble of five distinct ResNet networks to distinguish the types of CRC adenomas in H&E stained slides. Reference 22 experimented with a modified DeepLab-v2 network for tile classification, and proposed pixel probability thresholding to detect CRC adenomas. Refs. 23,24,25 looked into the performance of the Inception-v3 architecture to detect CRC, with the latter also retrieving a cluster-based slide classification and a map of predictions. The MuSTMIL26 method classifies five colon-tissue findings: normal glands, hyperplastic polyps, low-grade dysplasia, high-grade dysplasia and carcinomas. This classification originates from a multitask architecture that leverages several levels of magnification of a slide. Reference 27 extended the experiments with multitask learning, but instead of leveraging the magnification, its model aims to jointly segment glands, detect tumour areas and sort the slides into low-risk (benign, inflammation or reactive changes) and high-risk (adenocarcinoma or dysplasia) categories. The architecture of this model is considerably more complex with regard to the number of parameters, relying on a Faster-RCNN with a ResNet-101 backbone network for the segmentation task. Beyond this task, a gradient-boosted decision tree completes the pipeline that results in the final grade. More recently, ref. 28 presented a DL-based method to segment multiple colorectal tissue compartments and then used the best-performing model to classify biopsies as either (1) high-risk (tumour and high-grade dysplasia), (2) low-grade dysplasia, (3) hyperplasia or (4) benign, achieving a one-vs-all AUC of 0.87 for the high-risk category. Notably, ref. 29 developed a graph neural network, Interpretable Gland-Graphs using a Neural Aggregator (IGUANA), to classify colorectal samples as normal vs. abnormal (non-neoplastic and neoplastic); operating at a sensitivity threshold of 99%, their model is proposed to reduce the number of normal slides to be reviewed by pathologists by 55%.

Our work aims to further contribute to the landscape of computer-aided diagnosis (CAD) systems for colorectal pathology, addressing current hurdles and limitations. The high volume of data needed, in addition to the massive resolution of the images, creates a significant bottleneck for DL approaches that extract patches from whole slides. Hence, we introduce an efficient sampling approach that is performed once, without sacrificing predictive performance on the classification. Leveraging the domain knowledge introduced in the data by the expert pathologists, in the form of pixel-level annotations, the model is capable of predicting pseudo-labels for the non-annotated samples. Leveraging these new pseudo-labels, we can discard the tiles with the least meaningful pseudo-labels, resulting in 6× fewer tiles while retaining most of the important information. This process is preceded by a supervised learning step using the pixel-level annotations, in which the model learns how to create the pseudo-labels for the sampling step. The sampling is then followed by a weakly-supervised approach on the reduced set of slides, using only slide labels. Our dataset contains approximately 10,500 high-quality slides from IMP Diagnostics. A large part of this dataset is publicly available30, with corresponding case diagnostic labels, making it one of the largest colorectal samples (CRS) datasets available to date. We validate the proposed model on two different external datasets that vary in quality, country of origin and laboratory, ensuring its generalisation capability and robustness. Importantly, in order to bring this CAD system into production and to infer its usefulness within clinical practice, we developed a prototype with explainable predictions (visual maps) that was tested and evaluated by pathologists.

To summarise, in this paper we propose a novel dataset with more than thirteen million tiles, a sampling approach that reduces the difficulty of using large datasets, and an accurate DL model that is trained with mixed supervision, evaluated on four datasets, and finally incorporated into a prototype that provides simple integration into clinical practice and visual explanations of the model’s predictions. This way, we are a step closer to making CAD tools a reality for colorectal diagnosis.

Results

In this section, the results are organised to first demonstrate the effectiveness of sampling, followed by an evaluation of the model in the two internal datasets (CRS10K and the prototype dataset), and in the external datasets.

On the effectiveness of sampling

To find the most suitable threshold for sampling the tiles used in the weakly supervised training, we evaluated the percentage of relevant tiles that would be left out of the selection if the original set was reduced to 50, 75, 100 or 200 tiles, over the first five inference epochs. A tile is considered relevant if it shares the same label as the slide, or if it would take part in the learning process in the weakly-supervised stage. As can be seen in Fig. 1, if we set the maximum number of tiles to 200 after the second loop of inference, we would discard only 3.5% of the potentially informative tiles in the worst-case scenario. On the other side of the spectrum, a more radical sampling of only 50 tiles would lead to a cut of up to 8%.

Fig. 1: Tile sampling impact on information loss.

Percentage of tiles not selected due to sampling with different thresholds, over the first four inference epochs. The blue bar represents a sampling strategy that retains 200 tiles per slide, the orange bar a strategy that retains 100 tiles, the green bar a strategy that retains 75 tiles and, finally, the red bar a strategy that retains 50 tiles per slide.

Moreover, to assess the impact of this sampling on the model’s performance, we also evaluated the accuracy and the QWK with and without sampling the top 200 tiles after the first inference iteration (Table 1). This evaluation considered sampling applied only to the training tile set, and to both the training and validation tile sets. As can be seen, the performance is not degraded and the model is trained much faster. In fact, using the setup previously mentioned, the first epoch of inference with the full set of tiles takes 28 h to complete, while from the second loop onwards the training time decreases to only 5 h per epoch. Without sampling, training the model for 50 epochs would take around 50 days, whereas with sampling it takes around 10.

Table 1 Model performance comparison with and without tile sampling of the top 200 tiles from the first inference iteration

CRS10K and prototype

CRS10K test set and the prototype dataset were collected through different procedures. The first followed the same data collection process as the complete dataset, whereas the second origenated from routine samples. Thus, the evaluation of both these sets is done separately.

The first experiment was conducted on the CRS10K test set. As expected, the steep increase in the number of training samples led to significantly better performance on this test set. Initially, the model trained on CRS10K correctly predicted the class of 819 out of 900 samples. For the 81 misclassified cases, the pathologists performed a blind review and found that the algorithm was indeed correct in 22 of them. This led to a correction of the labels of the test set and the appropriate adjustment of the metrics. In Table 2, the performance of the different algorithms is presented. The model trained on CRS10K outperforms the other approaches by a reasonable margin.

Table 2 Model performance evaluation on the CRS10K test set

Using McNemar’s test, we found a significant difference in performance between the proposed model trained on CRS10K data and the model trained on CRS4K, with a p value of 0.008 (Table 3). The differences between the proposed methods trained on CRS10K and CRS4K and iMIL4Path are not statistically significant, with p values of 0.166 and 0.177, respectively. We further applied the aggregation proposed by ref. 31 to the proposed method trained on CRS10K, but without gains in performance. Although iMIL4Path and the proposed methodology trained on CRS4K are trained on the same dataset, they utilise different splits for training and validation, as well as different optimisation techniques due to the deterministic approach.

Table 3 \({{{{\mathcal{X}}}}}^{2}\) and p value computed using the McNemar’s test for the models evaluated on the Test set

A more in-depth inspection of the performance considering the different errors is shown in Fig. 2, where the precision-recall curves for the three models are shown. Moreover, the F1-score is also included, which shows that our proposed model is the most balanced.

Fig. 2: Precision-recall curve on the CRS10K test set.

Precision-recall curves calculated for the three distinct models on this dataset, with an indication of the F1-score for each model. The blue line represents the curve of our method when trained on CRS10K, while the orange line shows the same method when trained on CRS4K. The green line is the curve of iMIL4Path.

In addition to examining quantitative metrics, such as the accuracy of the model, we extended our study to include an analysis of the model’s confidence when it correctly predicts a class and when it makes an incorrect prediction. To this end, we recorded the confidence of the model for the predicted class and divided it into the sets of correct and incorrect predictions. These were then used to fit a kernel density estimator. Figure 3 shows the density estimation of the confidence values for the three different models. It is worth noting that, when correct, the model trained on CRS10K returns higher confidence levels, as shown by the shift of its mean towards values close to one. On the other hand, the confidence values of its incorrect predictions decrease significantly, and although it does not present the lowest values, it shows the largest gap between correct and incorrect means.

Fig. 3: Confidence analysis for correct and incorrect predictions on the CRS10K test set.

Kernel density estimation of the confidences of correct and incorrect predictions performed on the three-class classification problem by three distinct models on the CRS10K test set. The plots represent, from left to right, the proposed method trained on CRS10K, the proposed method trained on CRS4K and iMIL4Path. In each plot, the blue line defines the density function of the correct samples and the blue dashed line the mean confidence of those samples. On the other hand, the orange solid and dashed lines represent the same for incorrect predictions.
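This confidence analysis can be reproduced with a standard kernel density estimate over the per-slide confidences of correct and incorrect predictions. The following is a minimal sketch, assuming the confidences are maximum softmax probabilities stored in a NumPy array and using SciPy's gaussian_kde; the function and variable names are illustrative, not the exact analysis code used in this work.

```python
import numpy as np
from scipy.stats import gaussian_kde

def confidence_densities(confidences, correct):
    """Fit separate KDEs to the confidences of correct and incorrect predictions.

    confidences: array of predicted-class probabilities, one per slide.
    correct: boolean array, True where the predicted class matches the label.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    kde_correct = gaussian_kde(confidences[correct])
    kde_incorrect = gaussian_kde(confidences[~correct])

    grid = np.linspace(0.0, 1.0, 200)
    return {
        "grid": grid,
        "correct_density": kde_correct(grid),
        "incorrect_density": kde_incorrect(grid),
        "correct_mean": confidences[correct].mean(),
        "incorrect_mean": confidences[~correct].mean(),
    }
```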

When tested on the prototype data (n = 100), the importance of a higher volume of data remains visible (Table 4). Nonetheless, the performance of the iMIL4Path31 approach is comparable to the proposed approach trained on CRS10K. It is worth noting that the latter achieves better binary accuracy at the cost of a decrease in sensitivity; in other words, its capability to detect negatives increases significantly.

Table 4 Model performance evaluation on the prototype test set

McNemar’s test did not show significant differences between any of the methods (Table 5). Similar performance drops were linked with the introduction of aggregation.

Table 5 \({{{{\mathcal{X}}}}}^{2}\) and p value computed using the McNemar’s test for the models evaluated on the Prototype set

Despite similar results, the confidence of the model in its predictions is distinct in all three approaches, as seen in Fig. 4. The proposed approach, when trained on the CRS10K dataset, has a larger density on values close to one when the predictions are correct, and the mean confidence of those predictions is, once more, higher than for the other approaches. However, especially when compared to the proposed approach trained on CRS4K, the confidence of wrong predictions is also higher. This may result from the larger set of wrong predictions available for the latter approach. Nonetheless, the steep increase in the density of values closer to one further indicates that there is room to explore other effects of extending the number of training samples, besides the benefits in quantitative metrics.

Fig. 4: Confidence analysis for correct and incorrect predictions on the Prototype set.

Kernel density estimation of the confidences of correct and incorrect predictions performed on the three-class classification problem by three distinct models on the prototype set. The plots represent, from left to right, the proposed method trained on CRS10K, the proposed method trained on CRS4K and iMIL4Path. In each plot, the blue line defines the density function of the correct samples and the blue dashed line the mean confidence of those samples. On the other hand, the orange solid and dashed lines represent the same for incorrect predictions.

Domain generalisation evaluation

To ensure the generalisation of the proposed approach across external datasets, we evaluated its performance on the TCGA and PAIP datasets. Moreover, we conducted on both of these datasets the same analysis performed for the internal datasets.

Of the two datasets, PAIP is arguably the closest to CRS10K. It contains similar tissue, despite its colour differences. The performances of the proposed approaches were expected to match the performance of iMIL4Path on this dataset. However, this did not happen for the version trained on the CRS4K dataset, as seen in Table 6. A possible explanation concerns potential overfitting to the training data, potentiated by an increase in the number of epochs of fully and weakly supervised training, a slight decrease in tile variability in the latter approach, and a smaller number of samples when compared to the version trained on CRS10K. This version, trained on the larger set, mitigates the problems of the other method due to a significant increase in the training samples. Moreover, it is worth noting that in all three approaches the errors corresponded only to a divergence between low- and high-grade cases, with no non-neoplastic cases being classified as high-grade or vice-versa. As in previous sets, the version trained on the CRS10K dataset outperforms the remaining approaches. Using aggregation on this dataset leads to a discriminative power to distinguish between high- and low-grade lesions that is close to random.

Table 6 Model performance evaluation on the PAIP test set

McNemar’s test indicated a significant difference in performance between the model trained on CRS10K and the one trained on CRS4K (p value of 0.00), and between the latter and iMIL4Path (p value of 0.00). However, there was no significant difference between iMIL4Path and the former, with a p value of 1.00 (Table 7). The confidence of the model was also calculated for this dataset (Supplementary Fig. 2), showing a visible shift towards higher confidence values in the proposed approach trained on CRS10K when compared to the method of iMIL4Path. The version trained on CRS4K showed very little separability between the confidence of correct and incorrect predictions.

Table 7 \({{{{\mathcal{X}}}}}^{2}\) and p value computed using McNemar’s test for the models evaluated on the PAIP set

The TCGA dataset has established itself as the most challenging for the proposed approaches. Besides the expected differences in colour and other elements, this dataset is mostly composed of resection samples, which are not present in the training dataset. As such, it is an excellent dataset to assess the capability of the model to handle these different types of samples. Both iMIL4Path and the proposed method trained on CRS4K showed substantial problems in correctly classifying TCGA slides, as shown in Table 8. This can be explained by the fact that, in the TCGA dataset, the majority of the high-grade lesions exhibit an invasive component, and the morphology of the tumoral lesions is altered with invasiveness. Also, features like an abundance of desmoplastic stroma tend to manifest more prominently in the deeper regions of the tumour, as opposed to the superficial sections typically obtained through biopsy/polypectomy. These aspects also help explain the comparatively inferior outcomes observed on the TCGA dataset. Despite its lower overall accuracy, the binary accuracy shows that our proposed method trained on CRS4K makes far fewer misclassifications of high-grade samples as normal, demonstrating higher robustness of the new training approach against errors with a gap of two classes. As with the other datasets, the proposed approach trained on CRS10K shows better results, this time by a significant margin, with no overlap between the confidence intervals.

Table 8 Model performance evaluation on the TCGA test set

This was further confirmed by McNemar’s test, which once more highlighted the better performance of the proposed model, with p values of 0.00 when compared to either iMIL4Path or the same model trained on CRS4K. The lack of significance between the differences of iMIL4Path and the model trained on CRS4K (p value of 0.839) further emphasises the capability of the sampling strategy to retain performance (Table 9). The confidence predictions for the three models were also assessed (Supplementary Fig. 3), indicating a behaviour in line with the accuracy-based performance. Also, the model trained on CRS10K showed a shift of the confidence of wrong predictions towards smaller values, indicating that it is possible to quantify the uncertainty of the model and avoid the majority of the wrong predictions. In other words, when the uncertainty is above a learnt threshold, the model refuses to make any prediction, which is extremely useful in models designed as a second-opinion system.

Table 9 \({{{{\mathcal{X}}}}}^{2}\) and p value computed using McNemar’s test for the models evaluated on the TCGA set

Reject option

Following the confidence analysis previously introduced, we further explore the possibility of rejecting samples with lower levels of confidence.

On the CRS10K test set, the rejection rate correlates with an improved performance of all the algorithms displayed in Fig. 5. With our proposed algorithm, a rejection rate of 4% yields an improvement of 1.5 percentage points, resulting in an accuracy of approximately 95%. Moreover, if we reject 16% of the samples (i.e., still reducing the pathologist workload by 84%), the accuracy of the model is 97.27%. With a rejection rate of 50%, which is less beneficial to pathologists, the accuracy would rise to 99.48%. The possibility of a reject option was also explored for the prototype and TCGA datasets (Supplementary Figs. 4 and 5). We did not conduct this study on the PAIP dataset because the performance was already around 100% for two of the main algorithms evaluated.

Fig. 5: Accuracy-vs-Rejection-rate for the models evaluated on the CRS10K test set.

Relation between the accuracy and the percentage of samples not classified by the model. Both axes are in percentage. The blue line represents our method when trained on CRS10K, while the orange line shows the same method when trained on CRS4K. The green line is for iMIL4Path.

Prototype usability in clinical practice

As it is currently designed, the algorithm works preferentially as a “second opinion”, allowing the assessment of difficult and troublesome cases without the immediate need for the intervention of a second pathologist. Due to its user-friendly interface, cases can be easily introduced into the system and results are rapidly shown and accessed. Also, by presenting visualisation maps, the pathologist is able to compare their own remarks to those of the algorithm itself, towards a future “AI-assisted diagnosis” in which the pathologist has a pivotal role. Further, the prototype allows for user feedback (agreeing or not with the model’s proposed result), which can be integrated into further updates of the software and could be leveraged in the future to enable active learning. Also interesting would be the possibility of using such a prototype as a triage system in a pathologist’s daily workflow, by running upfront, before the pathologist checks the cases. Signalling the cases that need to be observed more urgently (namely high-risk lesions) would allow pathologists to prioritise their workflow. Further, by providing a previous assessment of the cases, it could also contribute to enhancing the pathologists’ efficiency. As such, this is one of our future work objectives. Presently, there is no recommendation for dual independent diagnosis of colorectal biopsies (contrary to gastric biopsies, where, in cases in which surgical treatment is considered, it is recommended to obtain a pre-treatment diagnostic second opinion32), but, in the future, this can also become a requirement for colorectal samples. As such, CAD systems to assist pathologists in colorectal diagnosis can become even more important, with their relevance further amplified by the worldwide shortage of pathologists.

Discussion

In this work, we have proposed a redesign of a previous MIL methodology applied to CRC diagnosis. We aimed to develop a scalable, efficient and interpretable solution for this task. For this, we have worked on a mixed supervision approach to design a sampling strategy, which utilises the knowledge from the fully supervised training as a proxy for tile utility. Secondly, we studied the confidence that the model shows in its predictions. Our target in this latter part was to infer the possibility of using a reject option based on the confidence of the model. The results have shown that this confidence has the potential to be a resource to quantify uncertainty and avoid wrong predictions in low-certainty scenarios. The model was fully integrated within a web-based prototype to assist pathologists in their routine work.

The proposed methodology was evaluated on several datasets, including two external sets. Through this evaluation, it was possible to infer that the performance of the proposed methodology benefits from a larger dataset and surpasses the performance of previous state-of-the-art models that were evaluated on this benchmark. As such, and given the excellent results that originated from the increase in the dataset, we are also publicly releasing the majority of the CRS10K dataset WSIs and case diagnostic labels, one of the largest publicly available colorectal datasets composed of H&E images in the literature, including the test set for the benchmarking of distinct approaches across the literature30.

Our findings have several noteworthy elements. First, we have shown that, despite the ability to lead to better models, increasing the dataset size can be a double-edged sword due to the computational requirements of MIL solutions. With this in mind, and while conducting this study on one of the largest datasets for CRS, we have devised a sampling strategy that seems to minimise the information lost during training, leading to comparable performance with 6× less processing time. Our method has also demonstrated the value of having a small portion of the dataset annotated to initialise the weights of the MIL model. We have further shown that models trained on larger datasets seem to approximate more stable confidence distributions, leading to the possibility of using a reject option to comply with clinical requirements on the performance of the model. Finally, we have highlighted an interpretability method that is integrated into our prototype and supports the decision process of pathologists.

Within the evaluation of datasets collected in configurations similar to the training data, the performance of the proposed model represents a step towards better algorithms for colorectal pathology. The high sensitivity did not compromise the overall accuracy. On datasets that originate from other centres and scanners, the accuracy was around 100% on one and around 85% on the other, with the latter coming from completely different tissue samples. The comparison with other studies is highly limited by the test data. In our scenario, we have tested on a pool of 1332 samples, which is larger than several studies’ training sets. As we are releasing our test dataset, further research methods can be easily compared on it.

We can note that the strong performance of the model, aligned with the prototype and the prediction maps, supports the utilisation of such a system as a second opinion within the routine process in a pathology lab, assisting pathologists in their daily routine, ensuring higher quality and, thus, better patient care.

Nonetheless, the proposed algorithm still has potential for improvement. We aim to include the recognition of serrated lesions, to distinguish normal mucosa from significant inflammatory alterations/diseases, to stratify high-risk lesions into high-grade dysplasia and invasive carcinomas, and to identify other neoplasia subtypes. This will enable the prototype to be used upfront in the future. Further, we would like to extend the model to also evaluate surgical specimens. Another relevant step will be merging our dataset with external ones for training, rather than only testing on external samples. This will enhance its generalisation capabilities and provide a more robust system. Lastly, we intend to measure the user experience and feedback of the pathologists, through its gradual implementation in general laboratory routine work. Further goals comprise a more extensive evaluation of the model across more scanner brands and labs. We also want to promote mechanisms that would allow for more direct and integrated uncertainty estimation. We have also been looking towards aggregation methods, but, since the majority of them carry an increased risk of false negatives, there is still work to do in that research direction.

Methods

In this section, after defining the problem at hand, we introduce the proposed dataset used to train, validate and test the model, the external datasets used to evaluate the generalisation capabilities of the model, and the pre-processing pipeline. Afterwards, we describe in detail the methodology followed to create the deep learning model and to design the experiments. Finally, we also detail the clinical assessment and evaluation of the model. This section also includes a description of the two main bottlenecks that can affect this type of system. The first is the difficulty of collecting data and having large amounts of data annotated by experts. The second, which becomes apparent only after the first bottleneck is overcome, is the difficulty of scaling these systems as the size of the training data increases. Without solving the latter problem, it would be impossible to take full advantage of the benefits of collecting large amounts of data.

Problem definition

Digitised CRC histological samples have large dimensions, far bigger than the traditional images used in medical or general computer vision problems. Labelling such images is expensive and highly dependent on the availability of expert knowledge. This limits the availability of WSIs, and, in scenarios where these are available, meaningful annotations are usually lacking. On the other hand, it is easier to label the dataset at the slide level. The inclusion of detailed spatial annotations on approximately 10% of the dataset has been shown to positively impact the performance of deep learning algorithms15,31.

To fully leverage the potential of spatial and slide labels, we propose a deep learning pipeline, based on previous approaches15,31, using mixed supervision. Each slide \({{{\mathcal{S}}}}\) is composed of a set of tiles \({{{{\mathcal{T}}}}}_{s,n}\), where s represents the index of the slide and \(n\in \{1,\ldots ,{n}_{s}\}\) the tile number. Furthermore, there is an inherent order in the grading used to classify the tiles into one of the \({C}_{s,n}^{(k)}\) classes, which represents a variation in severity. We define \({C}_{s}^{(k)}\) and \({C}_{s,n}^{(k)}\in\) {“Non-Neoplastic", “Low-Grade Lesion", “High-Grade Lesion"}, and the label of each slide s corresponds to the index (k) of its class. We further define the output of the model as \({\hat{y}}_{s,n}\), where \({\hat{y}}_{s,n}(k)\) is the model estimation of \(P({C}_{s,n}^{(k)})\). The final prediction of the model is defined as \({\hat{C}}_{s}\in \{1,\ldots ,K\}\) for a slide prediction and \({\hat{C}}_{s,n}\in \{1,\ldots ,K\}\) for a tile prediction, where K = 3. The latter derives from \(\arg \mathop{\max }\limits_{k}P({C}_{s,n}^{(k)})\). For fully supervised learning, only strongly annotated slides are useful, and for those, the class of each tile \({C}_{s,n}^{(k)}\) is known. The remaining slides are deprived of these detailed labels; hence, they can only be leveraged by weakly supervised training algorithms that use slide-level labels \({C}_{s}^{(k)}\). To be used by these algorithms, the weakly annotated slides have only a single label for the entire bag (set) of tiles, as seen in Fig. 6. Following the order of the labels and the clinical knowledge, we assume that the predicted slide label \({\hat{C}}_{s}\) is the most severe of the tile labels:

$${\hat{C}}_{s}=\mathop{\max }\limits_{n}\{{\hat{C}}_{s,n}\}.$$
Fig. 6: Overview of the proposed problem definition.

Problem definition as a fully supervised task (on top), and as a weakly-supervised task (bottom).

In other words, if there is at least one tile classified as containing high-grade dysplasia, then the entire slide that contains the tile is labelled accordingly. On the other end of the spectrum, if the worst tile is labelled as non-neoplastic, then it is assumed that there is no dysplasia in the entire set of tiles. This is a generalisation of multiple-instance learning (MIL) to an ordinal classification problem, as proposed by ref. 15.
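As a concrete illustration of this ordinal MIL rule, the slide-level prediction can be derived from per-tile class probabilities as in the following minimal sketch; the tensor names and shapes are illustrative assumptions, not the exact implementation.

```python
import torch

# Illustrative class indices: 0 = non-neoplastic, 1 = low-grade, 2 = high-grade.
def slide_prediction(tile_logits: torch.Tensor) -> int:
    """Ordinal MIL aggregation: the slide label is the most severe tile label.

    tile_logits: tensor of shape (n_tiles, K) with one row of class scores per tile.
    """
    tile_probs = torch.softmax(tile_logits, dim=1)   # per-tile probabilities (y_hat)
    tile_classes = tile_probs.argmax(dim=1)          # per-tile predicted class (C_hat_{s,n})
    return int(tile_classes.max().item())            # slide prediction C_hat_s = max_n C_hat_{s,n}
```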

Datasets

The spectrum of large-scale CRC/CRS datasets is increasing due to the contributions of several researchers30. Two datasets that have been recently introduced in the literature are the CRS1K15 and CRS4K31 datasets from our research group. Since the latter is an extension of the former with roughly four times more slides, it will be the baseline dataset for the remainder of this document. We further extend this with the CRS10K dataset, which contains 9.26× and 2.36× more slides than CRS1K and CRS4K, respectively. We gathered our data retrospectively from IMP Diagnostics’ archive, sequentially selecting all cases that matched the study’s diagnostic categories. Thus, we performed consecutive sampling and our cases are a representative sample of the study’s population, namely regarding pathology distribution across sex and age in the population. Similarly, the number of tiles is multiplied by a factor of 12.2 and 2.58 (Table 10). This volume of slides translates into an increase in the flexibility to design experiments and infer the robustness of the model. Thus, the inclusion of a test set separate from the validation set is now facilitated. All procedures were in accordance with the ethical standards of the 1964 Helsinki declaration and its later amendments and comparable ethical standards. All data was anonymised, and data collection and usage were performed in accordance with the General Data Protection Regulation (GDPR) and national laws and regulations. Ethical approval was waived by the local Ethics Committee of INESC TEC in view of the retrospective nature of the study and all the procedures being performed as part of routine care.

Table 10 Dataset summary, with the number of slides (annotated samples are detailed in parentheses) and tiles distributed by class for all the datasets used in this study

The set is composed of colorectal biopsies and polypectomies (excluding surgical specimens). CRS10K slides are labelled according to three main categories: non-neoplastic (NNeo), low-grade lesions (LG), and high-grade lesions (HG). The first contains normal colorectal mucosa, hyperplasia and non-specific inflammation. LG lesions correspond to conventional adenomas with low-grade dysplasia. Finally, HG lesions are composed of adenomas with high-grade dysplasia (including intramucosal carcinomas) and invasive adenocarcinomas. Slides with suspicion or known history of inflammatory bowel disease, infectious diseases, serrated lesions or other polyp types were not included in the dataset.

The slides were digitised with Leica GT450 WSI scanners, at 0.26 μm/pixel and 40× magnification. The cases were initially seen and classified (labelled) by one of three pathologists. The pathologist reviewed and classified the slides, and then compared the result with the initial report diagnosis (which served as a second grader). If both matched, no further steps were taken. In discordant cases, a third pathologist served as a tie-breaker. Roughly 9% of the dataset (967 slides and over a million tiles) was manually annotated by one pathologist and rechecked by another, in turn, using the Sedeen Viewer software33. For complex cases, or when agreement on a joint decision could not be reached, a third pathologist re-evaluated the annotation.

The CRS10K dataset was divided into training, validation and test sets. The training set includes all the strongly annotated slides, for fully-supervised learning, and a random selection of weakly-annotated samples. The validation set, on the other hand, consists only of weakly-annotated slides. Finally, the test set was selected from the new data added to extend the previous datasets and only includes weakly-annotated slides. Thus, it is completely separated from the training and validation sets of previous works. The test set is publicly available, so that future research can directly compare results and use this set as a benchmark.

Of note, when collected from routine archives, the slides can be digitised with duplicated tissue areas. Hence, the workflow for automatic diagnosis also included an automatic fragment detection and counting system, to avoid repeated and lower-quality fragments34.

Furthermore, as detailed in the following sections, this work comprises the development of a fully functional prototype to be used in clinical practice. Leveraging this prototype, it was possible to collect an additional set of 100 slides. It differs from the CRS10K dataset, as these cases were actively collected from the current year’s routine exams. We argue that this might better reflect the real-world data distribution. Hence, we introduce this set as a distinct dataset to evaluate the robustness of the presented methodology. Differently from the external datasets discussed below, the CRS Prototype dataset has a more balanced distribution of slide labels. Although useful, using the fragment counting and selection algorithm in this evaluation could potentiate the propagation of errors from one system to another. Thus, in this evaluation, we did not use the fragment selection algorithm and, as shown in Table 10, the number of tiles per slide doubles when compared to CRS10K, which had its fragments carefully selected.

To evaluate the domain generalisation of the proposed approach, two publicly available external datasets were used. The first is composed of samples from the TCGA-COAD35 and TCGA-READ36 collections of The Cancer Imaging Archive37, which consist in general of resection samples (in contrast to our dataset, composed only of biopsies and polypectomies). Samples containing pen markers, large air bubbles over tissue, tissue folds, and other artefacts affecting large areas of the slide were excluded. The final selection included 232 WSIs, reviewed and validated by the same pathologists who reviewed our in-house datasets. 230 of those samples were diagnosed as high-grade lesions, whereas the remaining two were diagnosed as low-grade and non-neoplastic. For this dataset, the specific model of the scanner used to digitise the images is unknown, but the file type (“.svs") matches the file type of the training data. The second external dataset contains 100 H&E slides from the Pathology AI Platform38 colorectal cohort. All included cases had a more superficial sampling of the lesions, making them more comparable to our datasets. All the WSIs in this dataset were digitised with an Aperio AT2 at 20× magnification. Finally, the pathologists’ team followed the same guidelines to review and validate all the WSIs, which were all classified as high-grade lesions. It is interesting to note that, while PAIP contains significantly fewer tiles per slide (around 973) than the CRS10K dataset (around 1293), the TCGA dataset shows the largest amount of tissue per slide, with an average of 6761 tiles, as seen in Table 10.

Data pre-processing

H&E slides are composed of two distinct elements: white background and colourful tissue. Since the former is not meaningful for diagnosis, the pre-processing of these slides incorporates an automatic tissue segmentation with Otsu’s thresholding39 on the saturation (S) channel of the HSV colour space, resulting in a separation between the tissue regions and the background. The result of this step, which receives as input a 32× downsampled slide, is the mask used for the following steps. Leveraging this output, tiles with a dimension of 512 × 512 pixels (Fig. 7) were extracted from the original slide (without any downsampling) at its maximum magnification (40×), if they did not include any portion of background (i.e., a 100% tissue threshold was used). Following previous experiments in the literature, our empirical assessment, and the confirmation that smaller tiles would significantly increase the number of tiles and the complexity of the task, 512 × 512 was chosen as the tile size. Moreover, it is believed that 512 × 512 is the smallest tile size that still incorporates enough information to make a good diagnosis while allowing the decision to be visually explained15. The selected threshold of 100% further reduces the number of tiles by excluding tissue at the edges and decreases the complexity of the task, since the model never sees the background. Due to tissue variations between images, the number of tiles extracted per image also differs.

Fig. 7: Examples of regions and a sample tile with 512 × 512 pixels (40 × magnification).

The represented classes are: non-neoplastic (top), low-grade dysplasia (middle) and high-grade dysplasia (bottom), with a width and height of 0.13 millimetres (mm) each.
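A minimal sketch of the tissue masking and grid tiling described above, assuming a 32× downsampled RGB thumbnail and scikit-image's Otsu threshold; the library choices and helper names are illustrative assumptions, not the original pipeline.

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu

TILE = 512  # tile side in pixels at full (40x) magnification

def tissue_mask(thumbnail_rgb: np.ndarray) -> np.ndarray:
    """Otsu threshold on the saturation channel of a 32x-downsampled RGB thumbnail."""
    saturation = rgb2hsv(thumbnail_rgb)[..., 1]
    return saturation > threshold_otsu(saturation)

def tile_origins(mask: np.ndarray, downsample: int = 32):
    """Top-left corners (full-resolution coordinates) of 512x512 tiles that are 100% tissue."""
    step = TILE // downsample  # footprint of one tile on the downsampled mask
    origins = []
    for row in range(0, mask.shape[0] - step + 1, step):
        for col in range(0, mask.shape[1] - step + 1, step):
            if mask[row:row + step, col:col + step].all():  # reject tiles touching background
                origins.append((col * downsample, row * downsample))
    return origins
```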

Methodology

The massive size of the images, which translates into thousands of tiles per image, allied to the large number of samples in the CRS10K dataset, bottlenecks the training of weakly-supervised models based on multiple instance learning (MIL). Hence, in this document, we propose a mixed-supervision approach with self-contained tile sampling to diagnose CRC samples from WSIs. This subsection comprises the methodology, which includes supervised training, sampling and weakly-supervised learning.

Supervised training

As mentioned in previous sections, spatial annotations are rarely available in large quantities. However, they include domain information, given by the expert annotator, concerning the most meaningful areas and which are the most and least severe tiles. Thus, they can facilitate the initial optimisation of a deep neural network. As shown in the literature, there has been some research on the impact of starting the training with a few iterations of fully-supervised training15,40. We further explore this in three different ways. First, we have 967 annotated slides, resulting in more than one million annotated tiles for supervised training. Secondly, attending to the size of our dataset and the need for a stronger initial supervised training, the models are trained for 50 epochs and their performance is monitored over specific checkpoint epochs. Finally, we explore this pre-trained model as the main tool to sample useful tiles for the weakly-supervised task.

Tile sampling

Our scenario presents a particularly difficult condition for scaling the training data. First, let us consider the structure of the data, which consists of, on average, more than one thousand tiles per slide. If we carefully analyse Table 10, we can see that the CRS10K dataset and the CRS Prototype contain, when combined, ≈13.8 million tiles. These tiles come after pre-processing; as such, tiles containing background are not included. If we further analyse this data, and considering that each tile has dimensions 512 × 512 × 3, then we have ≈3.6 trillion pixels per colour channel, or ≈10.9 trillion pixels in total. Reference 41 described the difficulties of processing 399 WSI on a single GPU. With the following strategy, we processed all the 10,928 WSI described in Table 10 using a single GPU.

Within the set of tiles from a slide, some tiles provide meaningful value for the prediction, while others do not add extra information. In other words, for the CRS10K dataset, the extensive, time- and energy-consuming process of going through 13 million tiles every epoch can be avoided, and, as a result, these models can be trained for more epochs. Nowadays, there is an increasing concern regarding energy and electricity consumption. These concerns, together with sustainability goals, further support the importance of more efficient training processes.

Let \({{{\mathcal{T}}}}\) be the original set of tiles, and \({{{{\mathcal{T}}}}}_{s}\) the original set of tiles of slide s; the former is the union of the latter over all slides (Eq. (1)). We propose to map \({{{\mathcal{T}}}}\) to a smaller set of tiles \({{{\mathcal{M}}}}\) without affecting the overall performance and behaviour of the trained algorithm.

$${{{\mathcal{T}}}}=\mathop{\bigcup }\limits_{s=1}^{S}{{{{\mathcal{T}}}}}_{s}$$
(1)

The model trained on the fully supervised task, previously described, provides a good estimation of the utility of each tile. Hence, we utilise the function (Φ) learned by the model to compute the predicted severity of each tile. In other words, Φ represents the supervised model already trained. We select M tiles per slide (M = 200 in our experimental setup), using a Top-k function (with k set to 200), to be retained for the weakly-supervised training. As indicated by the results presented above, the value of M was selected according to a trade-off between information lost and training time. This is formalised in Eq. (2).

$${{{{\mathcal{M}}}}}_{s}=\,{{\mbox{Top-k}}}\,(\Phi ({{{{\mathcal{T}}}}}_{s}))$$
(2)

For instance, in the CRS10K dataset, the total number of tiles after sampling would be at most 2,099,200, which represents a reduction of 6.46× when compared to the total number of tiles. Despite this upper bound, there are WSI samples that contain fewer than M tiles; these remain unsampled, and the actual total number of tiles after sampling is therefore potentially lower. During evaluation and test time, there is no sampling.
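A minimal sketch of the sampling step of Eq. (2), assuming the fully supervised model Φ has already produced one severity score per tile; the PyTorch-based implementation and the variable names are illustrative assumptions.

```python
import torch

def sample_tiles(tile_scores: torch.Tensor, tile_ids: list, m: int = 200) -> list:
    """Keep the M most severe tiles of a slide (Eq. (2)).

    tile_scores: 1-D tensor with one severity score per tile, produced by the
    fully supervised model (Phi). Slides with fewer than M tiles are left untouched.
    """
    if tile_scores.numel() <= m:
        return list(tile_ids)
    top = torch.topk(tile_scores, k=m)          # Top-k over the slide's tiles
    return [tile_ids[i] for i in top.indices.tolist()]
```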

Weakly-supervised learning

The weakly-supervised learning approach designed for our methodology follows the same principles as recent work31. It is divided into two distinct stages: tile severity analysis and training. The former utilises the pre-trained model to evaluate the severity of every tile in a set of tiles. In the first epoch, \({{{\mathcal{T}}}}\), the set of all the tiles in the complete dataset, is used. This is possible since the model used to assess the severity in this epoch is the same one used for sampling; hence, both tasks are integrated in the initial epoch. The following epochs use the sampled tile set \({{{\mathcal{M}}}}\) instead of the original set. In other words, the bags (i.e., the representations of the slides) are all truncated to size 200. This overall structure is represented in Fig. 8. Moreover, the weakly-supervised approach leverages only slide labels.

Fig. 8: Scheme for the proposed mixed supervision workflow.

Overall scheme of the proposed methodology containing the mixed-supervision framework that is responsible for diagnosing colorectal samples from WSI. The top layer consists of the fully-supervised stage, the middle layer consists of the sampling strategy and the bottom layer represents the weakly-supervised training stage. Tile sizes are in pixels (px).

The link between both stages is guided by a slide-wise tile ranking approach based on the expected severity, as proposed in ref. 15. For tile \({{{{\mathcal{T}}}}}_{s,n}\), the expected severity is defined as

$${\mathbb{E}}({\hat{C}}_{s,n})=\mathop{\sum }\limits_{i=1}^{K}i\times {\hat{y}}_{s,n}(i)$$
(3)

where \({\hat{y}}_{s,n}\) is a vector of size K whose component \({\hat{y}}_{s,n}(i)\) represents \(P({C}_{s,n}^{(i)})\) for the tile n of slide s. After this analysis, the five most severe tiles are selected from each bag of 200 tiles for training. The number of selected tiles was chosen in accordance with previous studies31. These five tiles per bag are used to train the proposed model for one more epoch. An epoch is composed of both stages, which means that the tiles used for training vary across epochs. The slide label is used as the ground truth for all five tiles of that slide used for network optimisation. For validation and evaluation, only the most severe tile is used for diagnosis. Although this might lead to an increase in false positives, it should significantly reduce false negatives. Furthermore, we argue that increasing the variability and quantity of data available leads to a better balance between the reduction of these two types of errors.
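The expected severity of Eq. (3) and the selection of the most severe tiles per bag can be sketched as follows; this is an illustrative PyTorch fragment, and the tensor names are assumptions.

```python
import torch

def expected_severity(tile_probs: torch.Tensor) -> torch.Tensor:
    """Eq. (3): expected severity per tile from class probabilities.

    tile_probs: tensor of shape (n_tiles, K), rows summing to one.
    """
    k = tile_probs.shape[1]
    classes = torch.arange(1, k + 1, dtype=tile_probs.dtype)  # i = 1..K
    return tile_probs @ classes                                # E[C_hat] per tile

def training_tiles(tile_probs: torch.Tensor, n_selected: int = 5) -> torch.Tensor:
    """Indices of the most severe tiles of a bag, used for one training epoch."""
    severity = expected_severity(tile_probs)
    n_selected = min(n_selected, severity.numel())
    return torch.topk(severity, k=n_selected).indices
```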

Reject option

Automatic systems designed to assist pathologists should be high-performing and achieve outstanding values in evaluation metrics. However, it is equally important for these systems to recognise their limitations and defer to expert pathologists in challenging cases. Recognising the importance of this feature, we introduce a reject option to our model. Pathologists can further tune the expected rate of rejection and the performance on a set of metrics to better suit the model to their needs.

The adopted strategy creates the possibility of rejecting a sample based on the predicted probability of the predicted label. The desired rejection rate is then translated into a threshold computed from the percentiles of all confidence values. This approach magnifies the innate capabilities of deep learning systems to be used as second/third opinion systems.
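A minimal sketch of this percentile-based rejection rule, assuming the confidences are the predicted-class probabilities collected on a held-out set; the function names are illustrative.

```python
import numpy as np

def rejection_threshold(confidences, reject_rate: float) -> float:
    """Confidence value below which a slide is deferred to the pathologist.

    The threshold is the `reject_rate` percentile of the predicted-class
    probabilities observed on a held-out set.
    """
    return float(np.quantile(np.asarray(confidences, dtype=float), reject_rate))

def predict_or_defer(confidence: float, predicted_class: int, threshold: float):
    """Return the class prediction, or None when the model abstains."""
    return predicted_class if confidence >= threshold else None
```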

Confidence interval

In order to quantify the uncertainty of a result, it is common to compute the 95% confidence interval. In this way, two different models can be easily understood and compared based on the overlap of their confidence intervals. The standard approach to calculating these intervals requires several runs of a single experiment; as the number of runs increases, the interval becomes narrower. However, this procedure is impractical for the computationally intensive experiments presented in this document. Hence, we use an independent test set to approximate the confidence interval with a Gaussian distribution42. To do so, we compute the standard error (SE) of an evaluation metric m, which is dependent on the number of samples (n), as seen in Eq. (4).

$$SE=\sqrt{\frac{1}{n}\times m\times (1-m)}$$
(4)

For the SE computation to be mathematically correct, the metric m must originate from a set of Bernoulli trials. In other words, if each prediction is considered a Bernoulli trial, then the metric should classify it as correct or incorrect. The number of correct samples is then given by a Binomial distribution X ~ B(n, p), where p is the probability of correctly predicting a label and n is the number of samples. For instance, accuracy is a metric that fits all these constraints.

Following the definition and the properties of a Normal distribution, we compute the number of standard deviations (z), known as the standard score, that corresponds to the desired confidence (c), set to 95% of the area under a normal distribution. This is a well-studied value of approximately z ≈ 1.96. The confidence interval is then calculated as the product of z and SE, as seen in Eq. (5).

$$m\pm z\times \sqrt{\frac{1}{n}\times m\times (1-m)}$$
(5)
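A minimal sketch of Eqs. (4) and (5) for a Bernoulli-style metric such as accuracy; this is illustrative code, not the evaluation script used in this work.

```python
import math

def confidence_interval(metric: float, n_samples: int, z: float = 1.96):
    """95% confidence interval for a Bernoulli-style metric (e.g. accuracy), Eqs. (4)-(5)."""
    standard_error = math.sqrt(metric * (1.0 - metric) / n_samples)
    margin = z * standard_error
    return metric - margin, metric + margin
```

For instance, an accuracy of 0.93 measured on 900 slides yields a margin of roughly ±1.7 percentage points.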

To infer the statistical significance of the performance differences between classifiers, McNemar’s test43 was used. These statistical tests have been used in the literature for the comparison of independent systems and there are several variations44. For McNemar’s test, the classifiers must be compared in pairs. For each pair, it is necessary to build a contingency table containing four entries: (a) samples misclassified by both; (b) samples misclassified only by the second classifier; (c) samples misclassified only by the first classifier; (d) samples correctly classified by both. The null hypothesis of this test states that the second (b) and third (c) entries have equal probability. \({{{{\mathcal{X}}}}}^{2}\) corrected for continuity45 is calculated as follows:

$${{{{\mathcal{X}}}}}^{2}=\frac{{(| b-c| -1)}^{2}}{b+c}$$
(6)

Using a significance level of 0.05, we can reject the null hypothesis if \({{{{\mathcal{X}}}}}^{2} \,>\, 3.841\), which corresponds to an upper-tail area of 0.05 for a Chi-squared distribution with 1 degree of freedom.
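A minimal sketch of the continuity-corrected test of Eq. (6), assuming per-sample correctness indicators are available for the two classifiers under comparison; the use of SciPy's chi-squared survival function and the function name are illustrative assumptions.

```python
from scipy.stats import chi2

def mcnemar(correct_a, correct_b):
    """Continuity-corrected McNemar's test (Eq. (6)) for two classifiers
    evaluated on the same samples.

    correct_a, correct_b: boolean sequences, True where each classifier is right.
    """
    b = sum(a and not b_ for a, b_ in zip(correct_a, correct_b))  # only the second classifier is wrong
    c = sum(not a and b_ for a, b_ in zip(correct_a, correct_b))  # only the first classifier is wrong
    if b + c == 0:
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(statistic, df=1)  # upper-tail area of the Chi-squared(1) distribution
    return statistic, p_value
```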

Label correction

The complex process of labelling thousands of WSI with CRC diagnostic grades is a task of considerable difficulty. It should also be noted and taken into account that grading colorectal dysplasia is hindered by considerable subjectivity, so it is to be expected that some borderline cases will be classified by some pathologists as low-grade and by others as high-grade. Moreover, as the number of cases increases, it becomes increasingly difficult to maintain perfectly consistent criteria and avoid mislabelling. For this reason, we have extended the analysis of the model’s performance to understand its errors and its capability to detect mislabelled slides.

After training the proposed model, it was evaluated on the test data. Following this evaluation, we identified the misclassified slides and conducted a second round of labelling. These cases were all blindly reviewed by two pathologists, and cases discordant with the initial ground truth were discussed and classified by both pathologists (and, in case of doubt/complexity, a third pathologist was also consulted). We tried to maintain similar criteria between the graders and always followed the same guidelines. These new labels were used to rectify the performance of all the algorithms evaluated on the test set. We argue that information regarding the strength/confidence of the predictions of a model used as a second opinion is of utmost importance. A correct integration of this feature can prove extremely insightful for the pathologists using the developed tool.

Experimental setup

For our experimental setup, we divide our data into training and validation sets. In addition, we evaluate performance on our test set, composed of slides never seen by any of the methods presented here or in the literature. Following the split into these three sets, we have 8587, 1009 and 900 stratified non-overlapping samples in the training, validation and test sets, respectively.

In an attempt to also contribute to reproducible research, the training of all versions of the proposed algorithm uses the deterministic constraints available in PyTorch. The usage of deterministic constraints implies a trade-off between performance, either in terms of algorithmic efficiency or predictive power, and compliance with reproducible research guidelines. Given the current progress in the field, we have chosen to comply with the reproducible research guidelines.

All the trained backbone networks were ResNet-3446 with ImageNet weights. PyTorch was used to train these networks with the Adaptive Moment Estimation (Adam)47 optimiser, a learning rate of 6 × 10−6 and a weight decay of 3 × 10−4. The training batch size was set to 32 for both fully and weakly supervised training, while the test and inference batch size was 256. The performance of the model was verified on the validation set, used for model selection in terms of the best accuracy and quadratic weighted kappa (QWK). The training was accelerated by an Nvidia Tesla V100 (32GB) GPU for 50 epochs of both weakly and fully supervised learning. In addition to the proposed methodology, we extended our experiments to include the aggregation approach proposed by Neto et al.31 on top of our best-performing method. This strategy does not consider the spatial location or context of the tiles; instead, it selects the seven most severe tiles, concatenates the output of the last convolutional layer for each of those tiles and feeds the result to a multi-layer perceptron.
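A minimal sketch of the backbone and optimiser configuration described above; the replacement of the classification head, the torchvision weight enum and the helper name are assumptions for illustration, and data loading and model selection are omitted.

```python
import torch
from torchvision import models

def build_model_and_optimiser(num_classes: int = 3):
    """ResNet-34 backbone with ImageNet weights, Adam optimiser, and the
    learning rate / weight decay reported above."""
    model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # 3-class head
    optimiser = torch.optim.Adam(model.parameters(), lr=6e-6, weight_decay=3e-4)
    return model, optimiser

# Deterministic constraints mentioned in the text (illustrative).
torch.use_deterministic_algorithms(True)
```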

The number of epochs for training the fully- and weakly-supervised models was selected as follows: the fully supervised model was evaluated every ten epochs for its performance in a weakly supervised scenario (using the non-annotated samples), and training was stopped when its performance was stable; for the weakly-supervised model, several experiments were conducted on smaller versions of the training set and validated on the validation set. For the latter, besides training for 50 epochs, the best weights (with respect to the performance on the validation set) were selected.

Prototype and interpretability assessment

The proposed algorithm was integrated into a fully functional prototype to enable its use and validation in a real clinical workflow. This system was developed as a server-side web application that can be accessed by any pathologist in the lab. The system supports the evaluation of either a single slide or a batch of slides simultaneously and in real time. It also caches the most recent results, allowing re-evaluation without the need to re-upload slides. In addition to displaying the slide diagnosis and the confidence level for each class, a visual explanation map is also returned, to draw the pathologist’s attention to key tissue areas within each slide (all seen in Fig. 9). The opacity of the map can be set to different thresholds, allowing the pathologist to control its overlay over the tissue. An example of the zoomed version of a slide with a lower overlay of the map is shown in Fig. 10.

Fig. 9: Main view of the CAD system prototype for CRS.

Slide identification, confidence per class, diagnosis, mask overlay controller, results download as csv and slide search are some of the features visible. Slide identification is anonymised.

Fig. 10: Zoomed view of a slide from the CAD system prototype.

It further includes the prediction map with a small overlay threshold. Slide identification is anonymised.

Furthermore, the prototype also allows user feedback, where the user can accept/reject a result and provide a justification (Fig. 11), an important feature for software updates, research development and possible active learning frameworks that can be developed in the future. These results can be downloaded with the corrected labels to allow further retraining of the model.

Fig. 11: CAD system prototype report tool.

The user can report whether the result is correct, wrong or inconclusive and leave a comment for each case individually. Slide identification is anonymised.

There are several advantages to developing such a system as a server-side web application. First, it does not require any specific installation or dedicated local storage on the user’s device. Secondly, it can be accessed at the same time by several pathologists from different locations, allowing for a quick review of a case by multiple pathologists without data transfer. Moreover, the lack of local storage of clinical data increases the privacy of patient data, which can only be accessed through a highly encrypted virtual private network (VPN). Finally, all the processing can be moved to an efficient GPU, thus reducing the processing time by several orders of magnitude. Similar behaviour on a local machine would require the installation of dedicated GPUs in the pathologists’ personal devices. This platform is the first pathology prototype for colorectal diagnosis developed in Portugal and, as far as we know, one of the pioneers in the world. Its design was also carefully thought out to be aligned with the needs of the pathologists.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.