An Introduction to Statistical and Data Sciences via R
Chester Ismay and Albert Y. Kim
1 Introduction
Help! I’m new to R and RStudio and I need to learn about them! However, I’m completely new to coding! What do I do? If you’re asking yourself this question, then you’ve come to the right place! Start with our Introduction for Students.
- Are you an instructor hoping to use this book in your courses? Then click here for more information on how to teach with this book.
- Are you looking to connect with and contribute to ModernDive? Then click here for information on how.
- Are you curious about the publishing of this book? Then click here for more information on the open-source technology, in particular R Markdown and the bookdown package.
This is version 0.2.0 of ModernDive published on August 02, 2017. For previous versions of ModernDive, see Section 1.4.
1.1 Introduction for students
This book assumes no prerequisites: no algebra, no calculus, and no prior programming/coding experience. This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.
1.1.1 What you will learn from this book
We hope that by the end of this book, you’ll have learned
- How to use R to explore data.
What do we mean by data stories? We mean any analysis involving data that engages the reader in answering questions with careful visuals and thoughtful discussion, such as How strong is the relationship between per capita income and crime in Chicago neighborhoods? and How many f**ks does Quentin Tarantino give (as measured by the amount of swearing in his films)?. Further discussions on data stories can be found in this Think With Google article.
For other examples of data stories constructed by students like yourselves, look at the final projects for two courses that have previously used ModernDive:
This book will help you develop your “data science toolbox”, including tools such as data visualization, data formatting, data wrangling, and data modeling using regression. With these tools, you’ll be able to perform the entirety of the “data/science pipeline” while building data communication skills (see Section 1.1.2 for more details).
In particular, this book will lean heavily on data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways are to convey relationships with data. You’ll also see the use of visualization to introduce concepts like mean, median, standard deviation, distributions, etc. In general, we’ll use visualization as a way of building almost all of the ideas in this book.
To impart the statistical lessons in this book, we have intentionally minimized the number of mathematical formulas used and instead have focused on developing a conceptual understanding via data visualization, statistical computing, and simulations. We hope this is a more intuitive experience than the way statistics has traditionally been taught in the past and how it is commonly perceived.
Finally, you’ll learn the importance of literate programming. By this we mean you’ll learn how to write code that is useful not just for a computer to execute but also for readers to understand exactly what your analysis is doing and how you did it. This is part of a greater effort to encourage reproducible research (see Section 1.1.3 for more details). Hal Abelson coined the phrase that we will follow throughout this book:
“Programs must be written for people to read, and only incidentally for machines to execute.”
We understand that there may be challenging moments as you learn to program. Both of us continue to struggle and find ourselves often using web searches to find answers and reaching out to colleagues for help. In the long run, however, we all can solve problems faster and more elegantly via programming. We wrote this book as our way to help you get started, and you should know that there is a huge community of R users that are always happy to help everyone along as well. This community exists in particular on the internet on various forums and websites such as stackoverflow.com.
1.1.2 Data/science pipeline
You may think of statistics as just being a bunch of numbers. We commonly hear the phrase “statistician” when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers like with baseball batting averages, plays a vital role in all of the sciences. You’ll commonly hear the phrase “statistically significant” thrown around in the media. You’ll see articles that say “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis. By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. Within data analysis are many sub-fields that we will discuss throughout this book (though not necessarily in this order):
- data collection
- data wrangling
- data visualization
- data modeling
- inference
- correlation and regression
- interpretation of results
- data communication/storytelling
These sub-fields are summarized in what Grolemund and Wickham term the “data/science pipeline” in Figure 1.1.
Figure 1.1: Data/Science Pipeline
We will begin by digging into the gray Understand portion of the cycle with data visualization, then with a discussion on what is meant by tidy data and data wrangling, and then conclude by talking about interpreting and discussing the results of our models via Communication. These steps are vital to any statistical analysis. But why should you care about statistics? “Why did they make me take this class?”
There’s a reason so many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be and, paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge.
1.1.3 Reproducible research
“The most important tool is the mindset, when starting, that the end product will be reproducible.” – Keith Baggerly
Another purpose of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and document it well to help yourself later and any potential collaborators as well.
Copying and pasting results from one program into a word processor is not the way that efficient and effective scientific research is conducted. It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a multitude of programs.
In a traditional analysis, if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy and paste all of the new plots and our statistical analysis into our document. This is error prone and a frustrating use of time. We’ll see how to use R Markdown to get away from this tedious activity so that we can spend more time doing science.
“We are talking about computational reproducibility.” – Yihui Xie
Reproducibility means a lot of things in terms of different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as computational reproducibility. This refers to being able to pass all of one’s data analysis, data-sets, and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent interpreting results and considering assumptions instead of the more error prone way of starting from scratch or following a list of steps that may be different from machine to machine.
1.1.4 Final note for students
At this point, if you are interested in instructor perspectives on this book, ways to contribute and collaborate, or the technical details of this book’s construction and publishing, then continue with the rest of the chapter below. Otherwise, let’s get started with R and RStudio in Chapter 2!
1.2 Introduction for instructors
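As a small illustration of computational reproducibility, here is a minimal sketch in base R. The data, the helper function analyze(), and the seed values are all hypothetical and chosen only for this example; they do not come from the book. The idea is that fixing the random number generator's seed makes an analysis involving randomness return the same result each time it is rerun.

```r
# Hypothetical example data: 20 simulated measurements (not from the book).
set.seed(2017)
measurements <- rnorm(20, mean = 50, sd = 10)

# Hypothetical helper: resample the data with replacement 1000 times
# and return the average of the resampled means.
analyze <- function(x, seed) {
  set.seed(seed)  # same seed, same sequence of random resamples
  resampled_means <- replicate(1000, mean(sample(x, replace = TRUE)))
  mean(resampled_means)
}

first_run  <- analyze(measurements, seed = 76)
second_run <- analyze(measurements, seed = 76)

# The two runs agree exactly because both started from the same seed.
identical(first_run, second_run)
```

Note that a seed only guarantees identical results under the same version of R and the same random number settings, which is one reason sharing the full code and environment details matters.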
This book is inspired by three books:
- “Mathematical Statistics with Resampling and R” (Chihara and Hesterberg 2011),
- “OpenIntro: Intro Stat with Randomization and Simulation” (Diez, Barr, and Çetinkaya-Rundel 2014), and
- “R for Data Science” (Grolemund and Wickham 2016).
The first book, while designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to impart statistical concepts like sampling distributions using computation instead of large-sample approximations and other mathematical formulas. The last two books are free options for learning introductory statistics and data science, providing an alternative to the many traditionally expensive introductory statistics textbooks.
When looking over the large number of introductory statistics textbooks that currently exist, we found that there wasn’t one that incorporated many newly developed R packages directly into the text, in particular the many packages included in the tidyverse collection of packages, such as ggplot2, dplyr, tidyr, and broom. Additionally, there wasn’t an open-source and easily reproducible textbook available that exposed new learners to all three of the learning goals listed at the outset of Section 1.1.1.
1.2.1 Who is this book for?
This book is intended for instructors of traditional introductory statistics classes using RStudio, either the desktop or server version, who would like to inject more data science topics into their syllabus. We assume that students taking the class will have no prior algebra, calculus, or programming/coding experience.
Here are some principles and beliefs we kept in mind while writing this text. If you agree with them, this might be the book for you.
- Blur the lines between lecture and lab
  - With increased availability and accessibility of laptops and open-source non-proprietary statistical software, the strict dichotomy between lab and lecture can be loosened.
  - It’s much harder for students to understand the importance of using software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the rules. Frequent reinforcement is key.
- Focus on the entire data/science research pipeline
  - We believe that the entirety of Grolemund and Wickham’s data/science pipeline should be taught.
  - We believe in “minimizing prerequisites to research”: students should be answering questions with data as soon as possible.
- It’s all about the data
  - We leverage R packages for rich, real, and realistic data-sets that at the same time are easy-to-load into R, such as the nycflights13 and fivethirtyeight packages.
  - We believe that data visualization is a gateway drug for statistics and that the Grammar of Graphics as implemented in the ggplot2 package is the best way to impart such lessons. However, we often hear: “You can’t teach ggplot2 for data visualization in intro stats!” We, like David Robinson, are much more optimistic.
  - dplyr has made data wrangling much more accessible to novices, and hence much more interesting data-sets can be explored.
- Use simulation/resampling to introduce statistical inference, not probability/mathematical formulas
  - Instead of using formulas, large-sample approximations, and probability tables, we teach statistical concepts using resampling-based inference.
  - This allows for a de-emphasis of traditional probability topics, freeing up room in the syllabus for other topics.
- Don’t fence off students from the computation pool, throw them in!
  - Computing skills are essential to working with data in the 21st century. Given this fact, we feel that to shield students from computing is to ultimately do them a disservice.
  - We are not teaching a course on coding/programming per se, but rather just enough of the computational and algorithmic thinking necessary for data analysis.
- Complete reproducibility and customizability
  - We are frustrated when textbooks give examples, but not the source code and the data itself. We give you the source code for all examples as well as the entire book!
  - Ultimately the best textbook is one you’ve written yourself. You know best your audience, their background, and their priorities. You know best your own style and the types of examples and problems you like best. Customization is the ultimate end. For more about how to make this book your own, see About this Book.
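To make the resampling-based inference principle from the list above concrete, here is a minimal base R sketch of a bootstrap percentile interval for a mean. The simulated sample, the seed, and the number of resamples are assumptions made purely for illustration; this is one common way to carry out resampling, not necessarily the exact approach used later in the book.

```r
set.seed(76)

# Hypothetical sample: 30 simulated values standing in for real data.
x <- rnorm(30, mean = 10, sd = 2)

# Bootstrap: resample the observed data with replacement many times,
# recording the mean of each resample.
boot_means <- replicate(
  2000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Percentile method: the middle 95% of the bootstrap means gives an
# approximate 95% confidence interval for the population mean.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

No formula for a standard error or a probability table is needed here; the spread of boot_means plays that role, and a histogram of boot_means would show the approximate sampling distribution directly.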
1.3 Connect and contribute
If you would like to connect with ModernDive, check out the following links:
- If you would like to receive periodic updates about ModernDive (roughly every three months), please sign up for our mailing list.
- Contact Albert at [email protected] and Chester at [email protected]
- We’re on Twitter at ModernDive.
If you would like to contribute to ModernDive, there are many ways! Let’s all work together to make this book as excellent as possible for as many students and instructors as possible!
- Please let us know if you find any errors, typos, or areas for improvement on our GitHub issues page.
- If you are familiar with GitHub and would like to contribute more, please see Section 1.4 below.
The authors would like to thank Nina Sonneborn, Kristin Bott, and the participants of our USCOTS 2017 workshop for their feedback and suggestions. A special thanks goes to Prof. Yana Weinstein, cognitive psychological scientist and co-founder of The Learning Scientists, for her extensive contributions.
1.4 About this book
This book was written using RStudio’s bookdown package by Yihui Xie (Xie 2017). This package simplifies the publishing of books by having all content written in R Markdown. The bookdown/R Markdown source code for all versions of ModernDive is available on GitHub:
- Latest published version: the most up-to-date release.
  - Version 0.2.0 released on August 02, 2017 (source code).
  - Available at ModernDive.com
Could this be a new paradigm for textbooks? Instead of the traditional model of textbook companies publishing updated editions of the textbook every few years, we apply a software-design-influenced model of publishing more easily updated versions. We can then leverage open-source communities of instructors and developers for ideas, tools, resources, and feedback. As such, we welcome your pull requests.