Applied Mathematics GIDP
Summer Institute in Statistics for Big Data, July 2016
I attended three modules (or classes) at the institute, covering three areas: reproducible research, supervised machine learning, and unsupervised machine learning.
Reproducible research is research that others can reproduce exactly, given your methods and data. This is important for ensuring the clarity and accuracy of the reported results. It is not the same as replicable research, which is research that produces the same results using the same methods but different data from different experiments. Replicability is harder to achieve and is the gold standard of scientific research, but for many interesting and important studies it isn’t possible because of limited resources, variability in the experiments, or other issues. So a lesser but still important standard is to make your data and methods publicly available so that others can follow your steps and get the same results.
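The distinction can be sketched in a few lines of Python. This is a hypothetical toy example, not material from the institute: the functions `simulate_experiment` and `run_analysis`, the seed values, and the simulated measurements are all assumptions chosen for illustration. Rerunning the same method on the same data reproduces the number exactly, while a fresh experiment only replicates it approximately.

```python
import random
import statistics

def simulate_experiment(seed, n=200):
    """Hypothetical experiment: n noisy measurements around a true value of 5.0."""
    rng = random.Random(seed)  # local generator, so the data depend only on the seed
    return [rng.gauss(5.0, 1.0) for _ in range(n)]

def run_analysis(data):
    """The 'method': estimate the mean of the measurements."""
    return statistics.mean(data)

# Reproducible: same data (same seed) and same method give the identical number.
original = run_analysis(simulate_experiment(seed=2016))
rerun = run_analysis(simulate_experiment(seed=2016))

# Replicable: a new experiment (different seed) analysed with the same method
# gives a different number that still supports the same conclusion.
replication = run_analysis(simulate_experiment(seed=7))

print(original == rerun)                   # exact agreement
print(abs(original - replication) < 0.5)   # close, but not identical
```

Sharing the data and the analysis script corresponds to the first comparison; the second comparison needs someone to collect new data, which is why it is the harder standard.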
In this same module we explored many ways to make our research more reproducible. We learned how to develop R packages and maintain them on GitHub. We also learned how to write explanations and vignettes using R Markdown, producing easy-to-follow documents that interweave R code, images, and graphs with text explaining each output.
The next two modules dealt with different methods and algorithms that “learn” from data. The supervised machine learning module covered algorithms that learn for the purpose of regression or classification. For example, we learned how to take genomic data and learn which genes predispose a patient to a malignant or benign tumor. The unsupervised machine learning module covered algorithms that find inherent structure in the data. We tested the techniques on breast cancer data and found, as many have before, that there are actually subtypes of cancer that have different prognoses and react differently to treatment.
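The difference between the two settings can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-dimensional synthetic points stand in for real genomic data, and the nearest-centroid classifier and k-means clustering are simple representatives chosen here, not necessarily the algorithms taught in the modules.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def make_blob(rng, center, n):
    """Synthetic samples scattered around a center (not real patient data)."""
    return [(rng.gauss(center[0], 0.5), rng.gauss(center[1], 0.5)) for _ in range(n)]

rng = random.Random(0)
points = make_blob(rng, (0.0, 0.0), 50) + make_blob(rng, (5.0, 5.0), 50)
labels = ["benign"] * 50 + ["malignant"] * 50

# Supervised: labels are known, so learn one centroid per class and
# classify a new point by its nearest class centroid.
class_centroids = {
    label: centroid([p for p, lab in zip(points, labels) if lab == label])
    for label in ("benign", "malignant")
}

def classify(point):
    return min(class_centroids, key=lambda label: dist2(point, class_centroids[label]))

# Unsupervised: labels are ignored; k-means searches for k groups on its own.
def kmeans(points, k, iterations=20, seed=1):
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    assignments = [min(range(k), key=lambda i: dist2(p, centers[i])) for p in points]
    return centers, assignments

centers, assignments = kmeans(points, k=2)
print(classify((4.5, 5.2)))                                      # label for a new point
print(len(set(assignments[:50])), len(set(assignments[50:])))    # cluster ids per group
```

The classifier needs the labels to learn anything, whereas k-means only receives the points and returns anonymous cluster numbers; deciding what those clusters mean, such as cancer subtypes, is left to the analyst.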
I am currently writing a paper about using machine learning on cancer data. Because of these modules, I can now ensure that the paper will be reproducible (and also replicable). I have already used some of these skills to generate visuals and check the basic assumptions of our data, and I have gained considerable insight from doing so.
This institute greatly expanded my knowledge of statistical research, and I have already been able to use much of what I learned to improve my research in the few weeks since I returned. I am grateful that the Herbert E. Carter Award gave me this great opportunity to fuel my professional growth.