NCM 2021: Classification of pediatric acute leukaemia using full-length transcriptomics

Jeremy Wang, from the University of North Carolina, began the presentation by providing some background to his work. To set the scene, he highlighted the global discrepancy in childhood cancer 5-year survival rates, with a clear pattern of poorer prognosis in lower-income countries, including parts of Africa, Asia, and America. This disparity stems from the high capital cost of the machinery needed to characterise cellular phenotype, cytogenetics, and molecular genetics, which together are used to classify paediatric cancers. Put simply, capital costs and expertise are “prohibitive barriers” for low-income countries, particularly in genomics.

To address this, Jeremy saw an opportunity to take advantage of the low capital cost associated with nanopore sequencing to “leapfrog” the barrier set by existing technologies and to democratise the identification of factors that drive these paediatric tumours. Next, Jeremy discussed a significant advancement made in the field by Gu, who demonstrated that gene expression profiling alone can classify paediatric leukaemias and their lineages. Jeremy went on to introduce the three main acute leukaemia lineages: B-cell lymphoblast, T-cell lymphoblast, and myeloid (myeloblast).

In the past, gene expression profiling was carried out with microarrays, but their limited and fixed set of testable genes made this quite an inflexible approach. Short-read RNA sequencing is robust and captures variability well at the appropriate coverage, but it has the drawback of high capital costs.

The common classification methods for these assays use machine learning and are tractable for homogeneous data. Jeremy set out to design a comparable classification method for nanopore-based gene expression analysis that was practical in low-resource settings. Nanopore sequencing performed well, even with significantly degraded RNA: where many shorter reads exist, their sampling density across the transcriptome is comparable to or higher than that of traditional short-read methods.

Jeremy then moved on to share some of the machine learning and bioinformatics tools he has been working on to perform lineage and genomic subtype classification. To illustrate how heterogeneity and sparsity confound traditional classification methods, Jeremy showed a t-SNE plot in which samples with heterogeneous and sparse sequencing depth cluster together when they should be neatly separated from each other. Jeremy sought a more robust machine learning method that was tailored towards heterogeneous and sparse datasets and lacked the locality assumptions that some of the other models have. To that end, he landed on a composite approach using partial least-squares (PLS) regression and a support vector machine (SVM).

Delving a little deeper into the approach, to distinguish between B-ALL, T-ALL, and AML gene expression profiles, he performed PLS regression comparing each lineage against the other lineages in turn (one vs rest), followed by pairwise one-vs-one PLS comparisons. Each PLS regression converts a binary comparison into a low-dimensional set of transformed coordinates that distinguish between the two groups. The transformed coordinates derived from all of the various PLS regressions were then fed into a support vector machine, which segregated the cell lineages into their respective discrete classes.

He took 211 nanopore-based transcriptome profiles that represented the three major lineages, with a large degree of variation in sequencing depth. He then trained one of the composite models on lineage classification first and showed promising results, achieving almost 98% accuracy in aggregate. Importantly, the prediction probabilities from the SVM can be used to set a conservative cut-off of 0.8, at which 94% of the samples are classified accurately. What’s more, as the size and scope of the datasets increase, the results may improve further.
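As a rough illustration of how a probability cut-off of this kind works, the sketch below makes a lineage call only when the top class probability reaches the threshold and otherwise leaves the sample unclassified. The function name call_with_cutoff and the example probability vectors are hypothetical; only the 0.8 cut-off and the three lineage names come from the talk.

```python
LINEAGES = ["B-ALL", "T-ALL", "AML"]


def call_with_cutoff(probabilities, labels=LINEAGES, cutoff=0.8):
    """Return the most probable label, or None if it falls below the cut-off."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    if probabilities[best] >= cutoff:
        return labels[best]
    return None  # too uncertain: leave the sample unclassified


# Made-up class probabilities for three samples (illustrative only).
samples = [
    [0.93, 0.04, 0.03],  # confident call
    [0.55, 0.35, 0.10],  # ambiguous, falls below the cut-off
    [0.05, 0.10, 0.85],  # confident call
]
calls = [call_with_cutoff(p) for p in samples]
print(calls)
```

Raising the cut-off trades coverage for confidence: fewer samples receive a call, but the calls that are made are less likely to be wrong.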

For future work, Jeremy plans to improve the gene expression-based classification, which includes increasing the depth and breadth of samples across clinically relevant B-ALL and AML subtypes. He also plans to optimise the classification methods to improve discrimination of lineages and the overall accuracy of the method. Coming full circle, Jeremy is working with international collaborators to assess the feasibility of implementing the pipeline in low-resource settings such as Malawi, Pakistan, and El Salvador. Importantly, he notes that these areas, despite being low-resource settings, do have access to flow cytometry, which enables validation of the nanopore sequencing-based pipeline. Jeremy also wants to harness long nanopore reads to perform explicit transcript fusion detection; he notes that fusions are very easy to identify by split mapping. Finally, Jeremy plans to investigate how his pipelines could be adapted and implemented in the future for evaluating solid paediatric tumours in low-resource settings.

Authors: Jeremy Wang