Wine is an alcoholic beverage made by fermenting grape juice, which some of you might try once you reach the age of 21. The journalism outlet Vox asked 19 staffers to taste three different wines. They asked them to identify the most expensive wine, and to rate each of the wines.

The video reports that “almost half” (say 9) of the staffers identified the most expensive wine correctly. If we want to know whether people can tell expensive wine from cheap wine, what is the null hypothesis?

Compute the p-value for the null hypothesis using `pbinom`, and draw a conclusion about whether we have evidence that Vox staffers can (sometimes) tell expensive wine from cheap wine.

*Just for fun*: the experiment here is reminiscent of the classical lady tasting tea experiment. (But you do not need to use Fisher’s exact test).

The null hypothesis is that everyone has a 1/3 chance of correctly guessing which wine is the most expensive.

`pbinom(q = 3, size = 19, prob = 0.33333) + (1 - pbinom(q = 8, size = 19, prob = 0.33333))`

`## [1] 0.2248114`

Nine correct guesses is about 3 away from what we'd expect on average (about 6), and deviations that large are common under the null (p ≈ 0.22), so we cannot reject the null hypothesis that everyone has a 1/3 chance of guessing correctly.

This is slightly inaccurate, since the probabilities of `q = 3` and `q = 8` are not the same. Here is how we would compute the probability of getting something as improbable as 9 or more correct answers:

`2*(1 - pbinom(q = 8, size = 19, prob = 0.33333))`

`## [1] 0.2922953`

(We do this by simply doubling the probability of getting 9 or more correct answers.)
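A more careful version of "as improbable as 9 or more correct" sums the probabilities of every outcome that is no more likely than observing exactly 9 correct guesses. This is a sketch of that idea, using the exact \(1/3\) rather than 0.33333:

```r
# Probability of each possible number of correct guesses under the null
probs <- dbinom(0:19, size = 19, prob = 1/3)
# Sum over all outcomes at most as probable as observing exactly 9 correct
p.exact <- sum(probs[probs <= dbinom(9, size = 19, prob = 1/3)])
p.exact
```

Here the set of outcomes at most as probable as 9 turns out to be exactly 0–3 and 9–19 correct, so this agrees with the first two-tailed calculation above (about 0.22).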

Either way, there is no evidence that the Vox staffers aren't just guessing.

The staffers also rated the wines on a 1–10 scale. The average ratings were:

- Wine A ($8): 4.8/10
- Wine B ($43): 4.8/10
- Wine C ($14): 5.4/10

Suppose that your null hypothesis was that Wine B and Wine C would on average be rated the same. Test that hypothesis using `rnorm`. Note that you need to make an assumption about \(\sigma\). Make a reasonable assumption, and explain why it is reasonable. Explore the effect of changing \(\sigma\) to other reasonable values on the p-value.

Hint: the rule of thumb is that 95% of measurements are within the interval \([\mu - 2\sigma, \mu + 2\sigma]\). That means that \(\sigma > 3\) is likely not reasonable, since that would imply a lot of ratings outside the 1..10 range.
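We can sanity-check this rule of thumb with `pnorm`. This sketch assumes a mean rating of 5 (the middle of the scale):

```r
# Fraction of normally distributed ratings falling outside the 1..10 scale
outside <- function(sigma) {
  pnorm(1, mean = 5, sd = sigma) + (1 - pnorm(10, mean = 5, sd = sigma))
}
outside(2)  # about 3% of ratings would fall off the scale
outside(3)  # about 14% -- implausibly many
```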

Let’s assume \(\sigma = 2\); this would give us ratings mostly within the range 1..10 (\(5 \pm 2 \cdot 2\)).

The null hypothesis then is that both wines B and C will on average be rated \(5.1\) (the mean of 4.8 and 5.4). Assuming all raters are interchangeable (somewhat implausible, since raters surely differ in taste), we can simulate:

```
B.ratings <- rnorm(n = 19, mean = 5.1, sd = 2)
C.ratings <- rnorm(n = 19, mean = 5.1, sd = 2)
```

We can now repeat this experiment and see how often we observe a difference of 0.6 or more.

```
fake.ratings <- function(sd){
  B.ratings <- rnorm(n = 19, mean = 5.1, sd = sd)
  C.ratings <- rnorm(n = 19, mean = 5.1, sd = sd)
  mean(B.ratings) - mean(C.ratings)
}
diffs <- replicate(n = 10000, fake.ratings(sd = 2))
mean(abs(diffs) > 0.6)
```

`## [1] 0.3603`

There is no evidence that a larger group of raters would rate the two wines differently, since under the null we observe differences larger than 0.6 quite often (about 36% of the time).
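As a cross-check, under the null the difference of the two sample means is itself approximately normal, with mean 0 and standard deviation \(\sigma\sqrt{2/19}\), so we can also compute this p-value in closed form (a sketch, under the same \(\sigma = 2\) assumption):

```r
# Standard deviation of the difference of two means of 19 ratings each
se.diff <- 2 * sqrt(2 / 19)
# Two-sided probability of a difference at least as large as 0.6
2 * (1 - pnorm(0.6, mean = 0, sd = se.diff))
```

This gives about 0.355, close to the simulated 0.36.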

What if the SD of ratings is 1? 0.5?

```
diffs <- replicate(n = 10000, fake.ratings(sd = 1))
mean(abs(diffs) > 0.6)
```

`## [1] 0.0668`

```
diffs <- replicate(n = 10000, fake.ratings(sd = 0.5))
mean(abs(diffs) > 0.6)
```

`## [1] 5e-04`

Note that we must have a null hypothesis *before* conducting the experiment. It is inappropriate to first look at the data and then test whether wine B is different from wine C. That is because, just due to random chance, *some* pair of wines will have different ratings if you are rating a lot of wines.

The appropriate null hypothesis before conducting any tests is that all wines are the same.

Use `rnorm` to test this hypothesis. That is, find the probability that the range of the average ratings for the three wines is greater than or equal to \(|5.4 - 4.8| = 0.6\).

```
fake.ratings.range <- function(sd){
  A.ratings <- rnorm(n = 19, mean = 5.0, sd = sd)
  B.ratings <- rnorm(n = 19, mean = 5.0, sd = sd)
  C.ratings <- rnorm(n = 19, mean = 5.0, sd = sd)
  means <- c(mean(A.ratings), mean(B.ratings), mean(C.ratings))
  max(means) - min(means)
}
diffs <- replicate(n = 10000, fake.ratings.range(sd = 2))
mean(abs(diffs) > 0.6)
```

`## [1] 0.6176`

Use `rnorm` to determine the number of tasters needed in order to (usually) find a significant difference between two of the wines, if the true difference between the ratings is 0.6.

First, we figure out what a significant difference would be for a sample size `n`:

```
fake.ratings.range <- function(sd, n){
  A.ratings <- rnorm(n = n, mean = 4.8, sd = sd)
  B.ratings <- rnorm(n = n, mean = 4.8, sd = sd)
  C.ratings <- rnorm(n = n, mean = 4.8, sd = sd)
  means <- c(mean(A.ratings), mean(B.ratings), mean(C.ratings))
  max(means) - min(means)
}
diffs <- replicate(n = 10000, fake.ratings.range(sd = 2, n = 50))
mean(diffs > 0.6)  # well above 0.05, so 0.6 is not yet significant at n = 50
```

```
diffs <- replicate(n = 10000, fake.ratings.range(sd = 2, n = 135))
mean(diffs > 0.6)
```

`## [1] 0.0404`
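Rather than trying values of `n` by hand, we could scan several candidate sample sizes with the same null simulation. This is a sketch; `sig.rate` is a hypothetical helper, not part of the original code:

```r
# For each candidate n, estimate how often the range of the three null
# means exceeds 0.6; the smallest n where this rate drops below 0.05 is
# roughly the sample size at which an observed difference of 0.6 is significant
sig.rate <- function(n, sd = 2, reps = 5000){
  ranges <- replicate(reps, {
    means <- c(mean(rnorm(n, mean = 4.8, sd = sd)),
               mean(rnorm(n, mean = 4.8, sd = sd)),
               mean(rnorm(n, mean = 4.8, sd = sd)))
    max(means) - min(means)
  })
  mean(ranges > 0.6)
}
sapply(c(50, 100, 135), sig.rate)
```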

So for \(n = 135\), a difference of 0.6 would be significant. How often would we get a difference that’s 0.6 or larger?

```
fake.ratings.range.actual.diff <- function(sd, n){
  A.ratings <- rnorm(n = n, mean = 4.8, sd = sd)
  B.ratings <- rnorm(n = n, mean = 4.8, sd = sd)
  C.ratings <- rnorm(n = n, mean = 5.6, sd = sd)  # note: 5.6 implies a true difference of 0.8, a bit larger than the observed 0.6
  means <- c(mean(A.ratings), mean(B.ratings), mean(C.ratings))
  max(means) - min(means)
}
diffs <- replicate(n = 10000, fake.ratings.range.actual.diff(sd = 2, n = 135))
mean(diffs > 0.6)
```

`## [1] 0.9132`

About 91% of the time.