Project Dates
- March 5: Find a partner by this date. If you cannot find a partner, email me. Only groups of two are allowed (with perhaps one group of three).
- March 23: Submit a one-page project plan (physical copy in class). See below for what the project plan should contain.
- March 30: You will receive feedback on your project plan from me.
- April 16 – May 1: Oral presentations given in class. Each pair of students will present for 20 minutes. See the project expectations below for information on presentation content. Sign-up sheets for dates will be distributed later.
- May 1: Project report due. The project report is due May 1 regardless of your presentation date. See the project expectations below for information on report content.
Project Expectations and Grading Rubric
Expectations
The one-page project plan should summarize your goals and intended deliverables for the project. Since it can be difficult to estimate how long tasks will take, organize your goals from those you are certain to complete to more aspirational ones. The exact deliverables will vary significantly from project to project; however, I expect most projects will involve about 40 hours of work from each student and produce hundreds of lines of code and many figures. My feedback on your project plan will help ensure that you have targeted the right level of project difficulty.
Each group will give a 20-minute presentation using slides. The project report and code will be stored in a group GitHub repository. Email me (jlong@stat.tamu.edu) a link to the project repository on May 1, and make sure the repository is public. The report itself could be a compiled PDF, a Jupyter notebook, or some other file type that renders nicely on GitHub (e.g., .rst). If you have a strong preference for another format, contact me to discuss. See Example GitHub Repos below for some examples of nice-looking GitHub project repositories.
Grading Rubric
- Oral Presentation (30 points): Clarity of the presentation about the project topic, scientific / statistical questions, and results. Well-composed slides.
- Scope of Work (30 points): Have the basic goals / objectives in the one-page project plan been met? Is there progress toward some of the more aspirational goals?
- Code Quality and Reproducibility of Results (30 points): Are the conclusions in the project report supported by figures and code? Is the code easily accessible from the report, readable, and modifiable? Can an interested reader reproduce and/or extend your analysis?
- Style, Grammar, Spelling (10 points): The five-page report (excluding graphics) should be written in a professional style, similar to a journal article.
Project Suggestions
You are encouraged to come up with your own project topic. If nothing comes to mind, here are some ideas:
- Translate R Package into Python: Many models and algorithms developed by statisticians only have R implementations. Take an R package and rewrite it in Python. Make your code available to the Python community via GitHub. Specific examples:
- Circular distributions such as the von Mises–Fisher distribution; see lectures/vonmisesP.ipynb.
- Zero-inflated Poisson models (especially for regression) have little support in Python but substantial support in R. See here and here for some discussion of zero-inflated models in Python, and here for what is available in R. A minimal log-likelihood sketch for this model appears after this list.
- R versus Python Speed Tests: There are many debates about the relative speed of R and Python. Bring some solid numbers to these debates by performing speed comparisons on commonly used functions; a minimal timing sketch appears after this list. Your report should go beyond simply comparing numbers and discuss the ease of using the functions.
- Application of a Statistical Model to a Data Set: Kaggle has many data sets and machine learning competitions. Fit a model to a Kaggle data set. The fitting must have a significant computational component, e.g. involve writing code to optimize parameters or sample from a posterior.
- Improved NBA Scoring Model: The NBA scoring model (based on data described here) had a mean absolute error of about 9.2. Build a better model, possibly by making different assumptions or incorporating different information, e.g. the location of the game. Make sure your new model has some significant computational component.
- Mislabeled Training Data: Implement and test a model for fitting logistic regression (or some other classifier) to training data that is contaminated by mislabeled observations. See Logistic Regression with Mislabeled Data for some background on this model; one common formulation is sketched after this list.
- Sensor Network Self-Localization: In class we studied a sensor network that presented many challenges, including multimodal and banana-shaped likelihoods. There are several projects one could do based on this work.
- Test the RAM MCMC sampler on this data. One could construct more complicated sensor networks with dozens of sensors and examine computation time and convergence of chains.
- Analytically compute the gradient for the network model. Test the BFGS optimizer with analytically and numerically computed gradients for several network configurations, and compare the methods based on computation speed; a minimal sketch of the mechanics appears after this list.
- Transmission Tomography: Lange presents MM and EM algorithms for transmission tomography in Section 12.11, Section 13.8, and Example 14.8.1. Summarize transmission tomography for the class and fit some algorithms to simulated and/or real data. There are several real data sets available here.
- Netflix EM Mixture Model: We used a mixture model to identify reviewers with unusual tastes in a subset of the Netflix data. Project ideas:
- Fit this model to the entire Netflix data set (biggish data project).
- Extend / improve the model by making different assumptions and implement an EM (or other) algorithm to fit it.
- Use the model to identify (and potentially eliminate) unusual reviewers before running a matrix completion algorithm. Do you get better results by screening out these reviewers?
- Apply this model to some data set other than Netflix.
- Circadian Gene Expression: Analyze the Circadian Gene Expression data set. See the link for specific project ideas.
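The sketches below flesh out a few of the suggestions above. First, for the zero-inflated Poisson idea: a minimal negative log-likelihood and maximum likelihood fit. The parameterization (pi for the probability of a structural zero, lam for the Poisson rate) is the standard one; the simulated data and starting values are placeholders, not taken from any particular R package.

```python
# Minimal sketch: zero-inflated Poisson (ZIP) negative log-likelihood,
# for the "Translate R Package into Python" suggestion above.
# With probability pi the count is a structural zero; otherwise it is
# Poisson(lam). Data and starting values are placeholders.
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def zip_negloglik(params, y):
    pi, lam = params
    y = np.asarray(y)
    # log P(Y = 0) = log(pi + (1 - pi) * exp(-lam))
    logp0 = np.log(pi + (1.0 - pi) * np.exp(-lam))
    # log P(Y = k) = log(1 - pi) - lam + k log(lam) - log(k!) for k >= 1
    logpk = np.log1p(-pi) - lam + y * np.log(lam) - gammaln(y + 1)
    return -np.sum(np.where(y == 0, logp0, logpk))

# Simulate a toy sample and fit (pi, lam) by bounded optimization.
rng = np.random.default_rng(0)
y = np.where(rng.random(500) < 0.3, 0, rng.poisson(2.5, size=500))
fit = minimize(zip_negloglik, x0=[0.5, 1.0], args=(y,),
               bounds=[(1e-6, 1 - 1e-6), (1e-6, None)])
print(fit.x)  # estimates of (pi, lam)
```

A package translation would wrap this kind of likelihood in a regression interface, with pi and lam depending on covariates through link functions.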
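For the speed tests, here is a minimal sketch of the Python side of one benchmark, using the standard library's timeit module; the R side could use system.time() or the microbenchmark package. Matrix multiplication is just one example of a commonly used operation.

```python
# Minimal sketch: timing one commonly used operation in Python,
# for the "R versus Python Speed Tests" suggestion above.
import timeit
import numpy as np

x = np.random.rand(1000, 1000)

# Repeat the timing and keep the minimum, which reduces noise
# from other processes running on the machine.
times = timeit.repeat(lambda: x @ x, repeat=5, number=10)
print(f"best of 5: {min(times) / 10:.4f} s per matrix multiply")
```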
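For the mislabeled training data idea, here is a minimal sketch of one common formulation, in which each observed label is flipped independently with a fixed probability gamma; the details may differ from the model in the reference above.

```python
# Minimal sketch: logistic regression with labels flipped with
# probability gamma, for the "Mislabeled Training Data" suggestion.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def negloglik(params, X, y):
    beta, gamma = params[:-1], params[-1]
    p = expit(X @ beta)                    # P(true label = 1)
    q = (1 - gamma) * p + gamma * (1 - p)  # P(observed label = 1)
    return -np.sum(y * np.log(q) + (1 - y) * np.log1p(-q))

# Simulate contaminated data and fit beta and gamma jointly.
# gamma is bounded below 0.5 for identifiability.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
z = rng.random(1000) < expit(X @ np.array([1.5, -2.0]))    # true labels
y = np.where(rng.random(1000) < 0.1, ~z, z).astype(float)  # 10% flipped
fit = minimize(negloglik, x0=[0.0, 0.0, 0.05], args=(X, y),
               bounds=[(None, None), (None, None), (1e-4, 0.499)])
print(fit.x)  # estimates of (beta_1, beta_2, gamma)
```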
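Finally, for the sensor network gradient comparison, a minimal sketch of the scipy mechanics: the same BFGS run with and without an analytic gradient. The Rosenbrock function stands in for the actual network log-likelihood.

```python
# Minimal sketch: BFGS with an analytic versus a finite-difference
# gradient, for the "Sensor Network Self-Localization" suggestion.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(10)

# Analytic gradient: pass jac explicitly.
fit_analytic = minimize(rosen, x0, method="BFGS", jac=rosen_der)
# Numeric gradient: omit jac so BFGS falls back to finite differences.
fit_numeric = minimize(rosen, x0, method="BFGS")

# nfev counts objective evaluations; the finite-difference run needs
# many more, which is typically where the speed difference comes from.
print(fit_analytic.nfev, fit_numeric.nfev)
```

Wrapping each call in a timer and varying the problem dimension would turn this into the computation-speed comparison the suggestion asks for.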
Example GitHub Repos
You will submit your project as a GitHub repo containing the report. One option is to write your report in the README.rst or README.md file, similar to this project but with more text. That way the report will be the first thing I see in your repo.
Another option is for your README just to describe where the code and report are. For example, this project has a short README.md advertising what is in the repo, which refers users to the files demo_R.ipynb and demo_python.ipynb.