Please submit code along with solutions.
Question 1 (10 points)
Using the KIRC TCGA data set we discussed in class:
- Perform some exploratory data analysis to clean and summarize the data (e.g., remove genes whose expression is 0 in every sample).
- Split the data into training and test sets, using both the SNV and gene expression data. Use set.seed to make your results reproducible.
- Build at least two Cox PH models on the training set. The models could differ in:
  - Different values of \(\alpha\) (balance between \(\ell_1\) and \(\ell_2\) regularization)
  - Prescreening versus no prescreening of predictors (use only training data for prescreening)
  - Different criteria for selecting tuning parameters, e.g. built-in deviance measure versus Concordance index.
  - Removing genes with low expression levels.
- Measure and report the performance of the models on the test set. In addition, comment on model complexity, as measured by the number of covariates used. Which model do you prefer, and for what purpose?
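
As a rough illustration of the workflow above, here is a minimal sketch in R. It is not a solution: the data are simulated placeholders rather than the KIRC TCGA set. The glmnet package (cv.glmnet with family = "cox", and its Cindex helper), the 70/30 split, the log2 transform, and all object names are assumptions that do not come from the assignment.

```r
# Sketch only: synthetic stand-in data, NOT the KIRC TCGA data set.
library(glmnet)  # assumed choice of package for elastic-net Cox models

set.seed(1)  # fixes the simulation, the split, and the CV folds

# --- Synthetic placeholders (hypothetical sizes and names) ---
n <- 200; p_expr <- 50; p_snv <- 20
expr <- matrix(rpois(n * p_expr, lambda = 5), n, p_expr,
               dimnames = list(NULL, paste0("gene", seq_len(p_expr))))
expr[, 1:5] <- 0  # plant a few genes with all-zero expression
snv <- matrix(rbinom(n * p_snv, 1, 0.1), n, p_snv,
              dimnames = list(NULL, paste0("snv", seq_len(p_snv))))

# Simulated survival: hazard depends on two genes, independent censoring
risk <- 0.3 * scale(expr[, 6])[, 1] - 0.3 * scale(expr[, 7])[, 1]
event_time  <- rexp(n, rate = exp(risk) / 10)
censor_time <- rexp(n, rate = 1 / 20)
y <- cbind(time   = pmin(event_time, censor_time),
           status = as.numeric(event_time <= censor_time))

# --- Cleaning: drop all-zero genes, log-transform, combine with SNVs ---
keep <- colSums(expr) > 0
x <- cbind(log2(expr[, keep] + 1), snv)

# --- Train/test split (70/30 is an arbitrary choice) ---
train <- sample(n, size = round(0.7 * n))
test  <- setdiff(seq_len(n), train)

# --- Two Cox PH models differing in alpha and tuning criterion ---
fit_lasso <- cv.glmnet(x[train, ], y[train, ], family = "cox",
                       alpha = 1, type.measure = "deviance")
fit_enet  <- cv.glmnet(x[train, ], y[train, ], family = "cox",
                       alpha = 0.5, type.measure = "C")

# --- Test-set C-index and number of nonzero coefficients ---
evaluate <- function(fit) {
  lp   <- as.numeric(predict(fit, newx = x[test, ], s = "lambda.min"))
  beta <- as.matrix(coef(fit, s = "lambda.min"))
  c(cindex       = as.numeric(Cindex(lp, y[test, ])),
    n_covariates = sum(beta != 0))
}
results <- rbind(lasso_deviance = evaluate(fit_lasso),
                 enet_cindex    = evaluate(fit_enet))
print(results)
```

With the real data, prescreening would go between the split and the model fits and would use only x[train, ] and y[train, ], so that the test set stays untouched until the final evaluation.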
The report should be about two written pages and include a few plots or tables. Begin the report with an introductory paragraph describing what you are trying to accomplish.