Description
The advent of high-throughput sequencing has enabled researchers to systematically evaluate genetic variation in cancer, resulting in the identification of many cancer-associated genes. Although cancers of the same tissue are widely placed in the same group, their mutational profiles differ in many respects. Hence there is no “silver bullet” for the treatment of a cancer type. This reveals the importance of developing a pipeline that accurately identifies cancer-associated genes and re-classifies cancer patients with similar mutational profiles. Classifying patients by mutational profile may help discover subtypes of cancer patients who might benefit from specific types of treatment. In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in a significant portion of samples and uses them to identify cancer subtypes. We applied our pipeline to 12270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types, and identified 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates that these subtypes have distinguishable properties, including unique cancer-related signaling pathways, for most of which targeted treatment options are currently available. This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. We also comprehensively study the causes of mutations among the samples in each subtype by mining their mutational signatures, which provides important insight into the active molecular mechanisms. Some of the pathways that we identified in most subtypes, including the cell cycle and axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes.
In addition, our study on “gene-motif” suggests the importance of considering both the context of the mutations and mutational processes in identifying cancer-associated genes.
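The first step described above, selecting genes that are mutated in a significant portion of samples, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the input format (pairs of sample ID and gene), the `min_fraction` threshold, and the toy gene labels are all assumptions made for illustration, and only samples with at least one recorded mutation are counted.

```python
from collections import defaultdict


def frequently_mutated_genes(mutations, min_fraction=0.05):
    """Return {gene: fraction of samples mutated} for genes at or above min_fraction.

    `mutations` is an iterable of (sample_id, gene) pairs. The threshold is a
    hypothetical cut-off, not a value taken from the study.
    """
    samples = set()
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples.add(sample_id)
        # A set ensures a sample with several mutations in one gene counts once.
        samples_by_gene[gene].add(sample_id)
    if not samples:
        return {}
    total = len(samples)
    return {
        gene: len(hit) / total
        for gene, hit in samples_by_gene.items()
        if len(hit) / total >= min_fraction
    }


def mutation_profiles(mutations, genes):
    """Build a binary profile per sample: 1 if the gene is mutated, else 0."""
    mutated = defaultdict(set)
    for sample_id, gene in mutations:
        mutated[sample_id].add(gene)
    return {
        sample_id: tuple(int(gene in found) for gene in genes)
        for sample_id, found in mutated.items()
    }


# Toy records with placeholder gene labels (not real data).
records = [
    ("s1", "GENE_A"), ("s1", "GENE_A"), ("s2", "GENE_A"),
    ("s3", "GENE_B"), ("s4", "GENE_A"),
]
selected = frequently_mutated_genes(records, min_fraction=0.5)
profiles = mutation_profiles(records, sorted(selected))
```

The binary profiles produced this way are the kind of input that a clustering method could group into candidate subtypes; which method to use is left open here.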