Ammon Washburn Abstracts

      Ammon Washburn
      Ph.D. Student
      Applied Mathematics GIDP

      Conference Summary
      Summer Institute in Statistics for Big Data
      Seattle, WA

Abstract

Module 3: Reproducible Research for Biomedical Big Data
Week 2, Session 3, Monday 8:30 AM - Wednesday 12:00 PM: Mon Jul 18 to Wed Jul 20
Instructor(s): Keith Baggerly and Roger Peng

The validity of conclusions from scientific investigations is typically strengthened by the replication of results by independent researchers. Full replication of a study’s results using independent methods, data, equipment, and protocols has long been, and will continue to be, the standard by which scientific claims are evaluated. However, in many fields of study, there are examples of scientific investigations which cannot be fully replicated, often because of a lack of time or resources. In such situations, there is a need for a minimum standard which can serve as an intermediate step between full replication and nothing. This minimum standard is reproducible research, which requires that datasets and computer code be made available to others for verifying published results and conducting alternate analyses. This standard is especially important in the context of biomedical research, where the results may determine patient care. Unfortunately, reviews of the current literature suggest that this “standard” is anything but. Examples of non-reproducible research resulting in improper treatment of patients have driven journals, funding agencies, and regulatory agencies to press for a greater standard of reproducibility. In this module, we will provide examples of systemic breakdowns demonstrating the need for reproducible research, and an introduction to tools for conducting reproducible research. Topics covered will include the types of breakdowns most commonly seen, current regulatory requests, literate statistical programming techniques, reproducible statistical computation, and techniques for making large-scale data analyses reproducible. We will focus on the R statistical computing language, and will discuss other tools that can be used for producing reproducible documents. We will assume some familiarity with R.
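As a small illustration of the literate, reproducible statistical programming the module describes, the R sketch below shows a self-contained analysis; the simulated data, variable names, and file name are invented for illustration and are not taken from the course materials.

set.seed(2016)                      # fix the random seed so the analysis reruns identically
n   <- 100
dat <- data.frame(
  treatment = rbinom(n, 1, 0.5),    # simulated treatment indicator
  age       = rnorm(n, 60, 10)      # simulated covariate
)
dat$outcome <- 1 + 0.5 * dat$treatment + 0.02 * dat$age + rnorm(n)

fit <- lm(outcome ~ treatment + age, data = dat)
summary(fit)

sessionInfo()                       # record the R and package versions used

## If this analysis lived in an R Markdown file (say, analysis.Rmd), the full
## report -- text, code, and figures -- could be regenerated with:
## rmarkdown::render("analysis.Rmd")

Keeping the raw data untouched and regenerating every number and figure from code like this is what makes published results checkable by other researchers.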

Module 4: Supervised Methods for Statistical Machine Learning
Week 2, Session 4, Wednesday 1:30 PM - Friday 5:00 PM: Wed Jul 20 to Fri Jul 22
Instructor(s): Noah Simon and Ali Shojaie

In this module, we will present a number of supervised learning techniques for the analysis of Biomedical Big Data. These techniques include penalized approaches for performing regression, classification, and survival analysis with Big Data. Support vector machines, decision trees, and random forests will also be covered. The main emphasis will be on the analysis of “high-dimensional” data sets from genomics, transcriptomics, metabolomics, proteomics, and other fields. These data are typically characterized by a huge number of molecular measurements (such as genes) and a relatively small number of samples (such as patients). We will also consider electronic health record data sets, which often contain many missing measurements. Throughout the course, we will focus on common pitfalls in the supervised analysis of Biomedical Big Data, and how to avoid them. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language.
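As one example of the penalized approaches mentioned above, the R sketch below fits a lasso regression with the glmnet package on simulated data with far more features than samples; the dimensions, simulated signal, and choice of the lasso are illustrative assumptions, not the module's specific examples.

library(glmnet)                          # penalized regression; install.packages("glmnet") if needed

set.seed(1)
n <- 50; p <- 1000                       # many more features (e.g. genes) than samples (patients)
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1:5] %*% rep(1, 5) + rnorm(n)   # only the first 5 features carry signal

## Cross-validation picks the penalty strength; the lasso penalty shrinks most
## coefficients to exactly zero, giving a sparse, interpretable model.
cvfit <- cv.glmnet(x, y, alpha = 1)
cf    <- as.numeric(coef(cvfit, s = "lambda.min"))
which(cf[-1] != 0)                       # features kept by the fitted model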

Module 5: Unsupervised Methods for Statistical Machine Learning
Week 3, Session 5, Monday 8:30 AM - Wednesday 12:00 PM: Mon Jul 25 to Wed Jul 27
Instructor(s): Genevera Allen and Yufeng Liu

In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Biomedical Big Data. These include dimension reduction techniques such as principal components analysis and non-negative matrix factorization, clustering analysis, and network analysis with graphical models. We will also discuss large-scale inference issues, such as multiple testing, that arise when mining for associations in Biomedical Big Data. As in Module 4 on supervised learning, the main emphasis will be on the analysis of real high-dimensional data sets from various scientific fields, including genomics and biomedical imaging. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to linear regression and statistical hypothesis testing, as well as some familiarity with R or another programming language.
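The R sketch below illustrates two of the unsupervised techniques named above, principal components analysis and clustering, on simulated data; the two hidden groups and the choice of two clusters are assumptions made for the example.

set.seed(1)
## 40 samples with 200 features, split into two hidden groups
x <- rbind(matrix(rnorm(20 * 200),             20, 200),
           matrix(rnorm(20 * 200, mean = 0.8), 20, 200))

## Dimension reduction: view the samples on the first two principal components
pc <- prcomp(x, scale. = TRUE)
plot(pc$x[, 1:2], col = rep(1:2, each = 20),
     xlab = "PC1", ylab = "PC2", main = "Samples in PCA space")

## Clustering: k-means recovers the two groups without using any labels
km <- kmeans(x, centers = 2, nstart = 20)
table(cluster = km$cluster, truth = rep(1:2, each = 20))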


Abstract for Lay Audience

Class/Module Information
Module 3: Reproducible Research for Biomedical Big Data

Replication is the idea that scientists other than yourself should be able to get the same results as you, preferably with their own independent experiment. For example, if you did an experiment that led you to believe you had found proof of cold fusion, other scientists should be able to run the experiment on their own and get similar results.

In the medical field this idea is very important but typically much harder to achieve. Expensive studies dealing with human subjects tend to be done once and only once. Even when experiments can be repeated, high variability between subjects makes the results hard to reproduce. Yet replication matters more in medicine than in most other fields, especially when study results dictate how individual patients are treated.

When full replication is impractical, there are standard practices that should be widely known (but aren't) which allow others to at least reproduce your results from your data and code, even if they can't run their own experiment. This class will teach several important ideas for making published research reproducible, to help doctors as they make important decisions about patient care.

Module 4: Supervised Methods for Statistical Machine Learning
Supervised methods learn from existing, labeled data in order to make predictions (such as classifications) about future data. These methods are complicated in their own right, but the biomedical field poses additional challenges.

This class will teach these methods and explain how to adapt them to problems like high-dimensional data (too many features) and missing or uncertain data. Addressing these problems is very important because it leads to smaller errors and less overfitting. In the biomedical context, overfitting could mean a treatment is predicted to work well but turns out to be harmful for patients treated later.

Module 5: Unsupervised Methods for Statistical Machine Learning
Unsupervised methods need only the data itself, without labels, in order to find structure. They tend to pick up hidden patterns that the modeler didn't know existed, whereas supervised methods can only pick up what the modeler was looking for.

These methods have their own special challenges (such as computational speed) in addition to the challenges described for the previous module. This class will teach the methods and show how to deal with those issues.