About the Course
You have probably heard that this is the era of "Big Data". Stories about companies or scientists using data to recommend movies, discover who is pregnant based on credit card receipts, or confirm the existence of the Higgs boson regularly appear in Forbes, The Economist, The Wall Street Journal, and The New York Times. But how does one turn data into this type of insight? The answer is data analysis and applied statistics. Data analysis is the process of finding the right data to answer your question, understanding the processes underlying the data, discovering the important patterns in the data, and then communicating your results to have the biggest possible impact. There is a critical shortage of people with these skills in the workforce, which is why Hal Varian (Chief Economist at Google) says that being a statistician will be the sexy job for the next 10 years.
This course is an applied statistics course focusing on data analysis. The course will begin with an overview of how to organize, perform, and write up data analyses. Then we will cover some of the most popular and widely used statistical methods and concepts, such as linear regression, principal components analysis, cross-validation, and p-values. Instead of focusing on mathematical details, the lectures will be designed to help you apply these techniques to real data using the R statistical programming language, interpret the results, and diagnose potential problems in your analysis. You will also have the opportunity to critique and assist your fellow classmates with their data analyses.
Recommended Background
Some familiarity with the R statistical programming language (http://www.r-project.org/) and proficiency in written English will be useful. At Johns Hopkins, this course is taken by first-year graduate students in Biostatistics.
Course Format
The course will consist of lecture videos broken into 8- to 10-minute segments. There will be two major data analysis projects, which will be peer-graded with instructor quality control. Course grades will be determined by the data analyses and peer reviews, with bonus points for answering questions on the course message board.
FAQ
- How is this course different from "Computing for Data Analysis"?
This course will focus on how to plan, carry out, and communicate analyses of real data sets. While we will cover the basics of how to use R to implement these analyses, the course will not cover specific programming skills. Computing for Data Analysis will cover some statistical programming topics that will be useful for this class, but it is not a prerequisite for the course.
- What resources will I need for this class?
You will need a computer with internet access on which the R software environment can be installed (a recent Mac, Windows, or Linux computer is sufficient).
- Do I need to buy a textbook?
There is no standard textbook for data analysis. The course lectures will include pointers to free resources about specific statistical methods, data sources, and other tools for data analysis.