

Naim Rashid spoke on “Cancer Data Science: From Code to Clinic,” a talk about translating data science models into benefits for cancer patients. Rashid shared a case study on pancreatic cancer and explained that median survival for this type of cancer is less than 11 months, largely because detection often occurs at a very late stage. While this heightens the need to place patients on their optimal therapies upfront, no molecular subtyping system previously existed for pancreatic cancer, making precision medicine approaches quite difficult. In 2015, Rashid and his collaborators were able to delineate two subtypes of pancreatic cancer by measuring patients' gene expression profiles across a sample of 20,000 genes and applying a tool called non-negative matrix factorization. The team found that patients with the basal subtype had a much poorer survival rate and were more likely to have their tumors grow after treatment than patients with the classical subtype. The study showed that knowing an individual's subtype could be extremely important in determining effective treatments for them, but it also introduced the challenge of translating these findings to a clinical setting while accounting for data normalization, sample purity, and profile clustering. To address these issues, the team developed PurIST, a prediction model that takes a patient's gene expression information and accurately outputs their pancreatic cancer subtype. The model closely replicated the results of the original study and gave researchers the ability to create a certified algorithm that could be used in clinical trials. The team was recently able to patent this tool and license it to diagnostic companies for inclusion in the diagnostic tool kits they provide to hospitals around the country. Click here to view the talk on YouTube.
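To make the subtype discovery step more concrete, here is a minimal sketch of non-negative matrix factorization on a made-up expression matrix. This is not the team's code or data: the function names (`nmf`, `assign_subtypes`), the 6-gene by 6-sample toy matrix, and the rule that labels each sample by its dominant factor are all assumptions chosen for illustration. The real analysis used far more genes and patients, along with steps that are not shown here.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, rank, n_iter=300, seed=0, eps=1e-9):
    """Approximate non-negative V (genes x samples) as W @ H.

    Uses the classic multiplicative update rules for squared error.
    W is genes x rank, H is rank x samples; both stay non-negative.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return W, H


def assign_subtypes(W, H):
    """Label each sample by the factor that explains most of its expression.

    Hypothetical assignment rule: weight H by the column sums of W so the
    comparison does not depend on how scale is split between W and H.
    """
    rank, n_cols = len(H), len(H[0])
    col_sums = [sum(row[k] for row in W) for k in range(rank)]
    return [max(range(rank), key=lambda k: H[k][j] * col_sums[k])
            for j in range(n_cols)]


# Toy data (invented): rows are genes, columns are samples.
# Genes 0-2 are high in samples 0-2; genes 3-5 are high in samples 3-5.
toy_expression = [
    [5.0, 6.0, 5.5, 0.2, 0.1, 0.3],
    [4.5, 5.0, 6.0, 0.1, 0.3, 0.2],
    [6.0, 5.5, 5.0, 0.3, 0.2, 0.1],
    [0.2, 0.1, 0.3, 5.0, 6.0, 5.5],
    [0.1, 0.3, 0.2, 4.5, 5.0, 6.0],
    [0.3, 0.2, 0.1, 6.0, 5.5, 5.0],
]

W, H = nmf(toy_expression, rank=2)
labels = assign_subtypes(W, H)
print(labels)  # two groups of samples; which group is called 0 or 1 is arbitrary
```

With two factors, each column of `W` acts like a gene signature and each row of `H` records how strongly every sample expresses that signature, which is why grouping samples by their dominant factor yields two candidate subtypes.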


Naim Rashid, Associate Professor

Department: Department of Biostatistics | Faculty Profile

Featured on: May 26, 2022 (Event Page)

Session Title: Improving Health Outcomes (Event Recap)

Tools, Information, and Resources:

  • PurIST GitHub Repo: This is the public GitHub repository that contains the R package for the Purity Independent Subtyping of Tumors (PurIST) algorithm.
  • runPURIST GitHub Repo: Check out the public repository that contains the GUI for technicians to run the PurIST algorithm with their data.
  • R for Statistical Computing: R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
  • RStudio: Inspired by innovators in science, education, government, and industry, RStudio develops free and open tools for R, and enterprise-ready professional products for teams who use both R and Python, to scale and share their work.
    • R Markdown: R Markdown provides an authoring framework for data science. 
  • GitHub: Millions of developers and companies build, ship, and maintain their software on GitHub—the largest and most advanced development platform in the world.
  • BIOS 735 - Introduction to Statistical Computing: This course teaches important concepts and skills for statistical software development using case studies. After this course, students will have an understanding of the process of statistical software development, knowledge of existing resources for software development, and the ability to produce reliable and efficient statistical software.