Probability: Introduction
When we do data science, we begin with a data set and work to gain insights about the process that generated the data. Crucial to this endeavor is a robust vocabulary for discussing the behavior of data-generating processes.
It is helpful to begin with data-generating processes whose randomness properties are specified completely and precisely. The study of such processes is called probability. For example, "What's the probability that I get at least 7 heads in 10 independent flips of a fair coin?" is a probability question, because the setup is fully specified: the coin lands heads with probability exactly 50% on each flip, and the different flips do not affect one another.
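Because the setup is fully specified, the coin-flip question can be answered exactly. The following is a minimal sketch, assuming the standard binomial counting argument: the number of heads in n independent flips equals j with probability C(n, j) p^j (1 - p)^(n - j), and we add these up for j from 7 to 10. The function name prob_at_least and its parameters are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from math import comb


def prob_at_least(k, n, p=Fraction(1, 2)):
    """Probability of at least k heads in n independent flips,
    where each flip lands heads with probability p."""
    # Sum the binomial probabilities of exactly j heads for j = k, ..., n.
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# At least 7 heads in 10 independent flips of a fair coin.
answer = prob_at_least(7, 10)
print(answer, float(answer))  # prints "11/64 0.171875"
```

Using Fraction keeps the arithmetic exact, so the result is 176/1024 = 11/64, or about 17.2%.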
The question of whether the coin is really fair or whether the flips are really independent will be deferred to our study of statistics. In statistics, we will have the outcome of a random experiment in hand and will be looking to draw inferences about the unknown setup. Once we are able to answer questions in the "setup → outcome" direction, we will be well positioned to approach the "outcome → setup" direction.
Exercise
Each of the questions below is either a probability question or a statistics question. Select the ones that are probability questions.
Solution. The first question is a statistics question. We don't know the probability of rain, and we are trying to draw an inference about it based on observed samples.
The second question is a probability question. We are given the setup and asked a question which assumes its validity.
The third question is also a probability question. We're told the dice are fair, and we're asked a question about the outcome of the rolls.
The fourth question is a statistics question, since the outcome of the rolls is known, and the probabilities are in question.