Homework 2 Key

Guidelines: Homework should be clear and legible, with answers clearly indicated and work shown. Homework will be graded minus, check, or check plus based on completion and correctness. You are welcome to work with others, but please submit your own work. Your homework must be produced in an R Markdown (.rmd) file submitted via GitHub. If you are having trouble accomplishing this, please refer to the guide. This homework adapts materials from the work of Michael Lynch (//spia.uga.edu/faculty_pages/mlynch/), Tyler McCormick (http://www.stat.washington.edu/~tylermc/), and Open Intro (https://www.openintro.org/stat/textbook.php).

Topics

Topics covered in this homework include:

  • Probability
  • Random Variables
  • Inference and hypotheses

Problems

  1. Being hit by a chair is a common occurrence in WWE professional wrestling. The number of people hit upside the head with chairs can vary from program to program. Suppose that the number of chair-bashings per program is Normally distributed, with mean thirty-five (35) bashings and standard deviation of eight (8) bashings. What is the probability that one WWE program will contain less than nineteen (19) bashings or more than forty (40) bashings?

This is fairly straightforward; we can use the pnorm command to compute the probability of there being less than 19 bashings, the probability of there being more than 40 bashings, and then add them together:

#this is the lower tail (less than 19)
lower.pval <- pnorm(q=19,mean=35,sd=8,lower.tail=TRUE);lower.pval
## [1] 0.02275013
#this is the upper tail (more than 40)
upper.pval <- pnorm(q=40,mean=35,sd=8,lower.tail=FALSE);upper.pval
## [1] 0.2659855
total.pval <- lower.pval + upper.pval; total.pval
## [1] 0.2887357

The probability that a WWE program will have less than nineteen bashings OR more than forty bashings is 0.289.
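As a cross-check, the same answer can be reached by standardizing the two cutoffs and using the standard normal distribution. This is only a sketch of an alternative route; the variable names (mu, sigma, z.lower, z.upper, p.total) are illustrative and not part of the original key.

```r
# Parameters given in the problem
mu <- 35
sigma <- 8

# Convert each cutoff to a z-score: z = (x - mu) / sigma
z.lower <- (19 - mu) / sigma   # -2
z.upper <- (40 - mu) / sigma   # 0.625

# The two events are disjoint, so their probabilities add
p.total <- pnorm(z.lower) + pnorm(z.upper, lower.tail = FALSE)
round(p.total, 3)
```

The rounded result should agree with the 0.289 reported above.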

  1. A corporation has just received new machinery that must be installed and checked before it becomes operational. The following table shows a manager’s probability assessment for the number of days required before the machinery becomes operational:
| Days  | 5    | 6    | 7    | 8    | 9    | 10   |
|-------|------|------|------|------|------|------|
| Prob. | 0.06 | 0.21 | 0.37 | 0.20 | 0.13 | 0.03 |

Let A be the event “It will be more than 6 days before the machinery becomes operational,” and let B be the event “It will be less than 8 days before the machinery becomes available.”

  1. Find the probability of event A.

We can calculate this by summing the probabilities for each day more than 6:

event.A <- 0.37 + 0.20 + 0.13 + 0.03; print(event.A)
## [1] 0.73
  1. Find the probability of event B.
event.B <- 0.06 + 0.21 + 0.37; print(event.B)
## [1] 0.64
  1. Find the probability of the complement of event A.
1 - event.A
## [1] 0.27

The complement of event A is the set of outcomes not in event A, so we find the probability using 1 - p(A).

  1. Find the probability of the intersection of events A and B.

The intersection of events A and B is the set of potential outcomes that satisfy both event A and event B (i.e., where both are true). Since A = {7, 8, 9, 10} and B = {5, 6, 7}, the intersection is \(A \cap B = \{7\}\), so \(p(A \cap B) = 0.37\).

  1. Find the probability of the union of events A and B.

The union of events A and B is the set of potential outcomes that satisfy either event A or event B (i.e., where at least one is true). Since A = {7, 8, 9, 10} and B = {5, 6, 7}, the union is \(A \cup B = \{5, 6, 7, 8, 9, 10\}\), so \(p(A \cup B) = 1\) since this includes all possible outcomes.
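The event calculations in this problem can also be done with R's set functions. The sketch below stores the manager's table as a named vector; the helper p.event is a hypothetical convenience function introduced here for illustration, not something used in the original key.

```r
# Probability table from the problem, indexed by number of days
days <- 5:10
prob <- c(0.06, 0.21, 0.37, 0.20, 0.13, 0.03)
names(prob) <- days

# Events as sets of outcomes
A <- days[days > 6]   # more than 6 days
B <- days[days < 8]   # less than 8 days

# Hypothetical helper: probability of an event = sum over its outcomes
p.event <- function(outcomes) sum(prob[as.character(outcomes)])

p.event(A)                 # P(A)
p.event(B)                 # P(B)
1 - p.event(A)             # P(complement of A)
p.event(intersect(A, B))   # P(A and B)
p.event(union(A, B))       # P(A or B)
```

Each value should match the corresponding hand calculation above (0.73, 0.64, 0.27, 0.37, and 1).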

  1. An ultramarathon race official has found that 68% of first time entrants and 87% of repeat racers finish the race. Of all entrants, 64% are repeat racers and the remainders are first time entrants.
  1. What is the probability that a randomly chosen entrant is a repeat racer who will eventually finish the race?

The probability that a first-time entrant finishes the race is p(Finish | First-time) = 0.68, and the probability that a repeat racer finishes is p(Finish | Repeat) = 0.87. The probability of being a repeat racer is p(Repeat) = 0.64. Thus, the probability of being both a repeat racer AND a finisher is p(Repeat) * p(Finish | Repeat) = 0.64 * 0.87 = 0.5568, or about 0.56.

  1. Find the probability that a randomly chosen entrant will eventually finish the race.

The probability that a randomly chosen entrant finishes (regardless of whether it is their first race) is calculated by weighting each group's finishing rate by that group's share of all entrants and adding the two terms together. Since 1 - 0.64 = 0.36 of entrants are first-timers: 0.64 * 0.87 + 0.36 * 0.68 = 0.8016, or about 0.80.

  1. What is the probability that a randomly chosen entrant either is a repeat racer or will eventually finish (or both)?

To calculate this, we need to find the union of being a repeat racer and being a finisher (\(p(Repeat \cup Finisher)\)). Remember that we can’t double-count repeat racers who finish, so we add the probability of being a repeat racer to the probability of being a finisher and then subtract the probability of being both: p(Repeat) + p(Finisher) - \(p(Repeat \cap Finisher)\) = 0.64 + 0.80 - 0.56 = 0.88.
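The three parts of this problem chain together, so it may help to see them computed in sequence without intermediate rounding. This is a minimal sketch; all variable names are illustrative.

```r
# Quantities given in the problem
p.repeat <- 0.64
p.first <- 1 - p.repeat          # remaining entrants are first-timers
p.finish.given.repeat <- 0.87
p.finish.given.first <- 0.68

# (a) Multiplication rule: P(Repeat and Finish)
p.repeat.and.finish <- p.repeat * p.finish.given.repeat

# (b) Law of total probability: P(Finish)
p.finish <- p.repeat.and.finish + p.first * p.finish.given.first

# (c) Addition rule: P(Repeat or Finish)
p.repeat.or.finish <- p.repeat + p.finish - p.repeat.and.finish

round(c(a = p.repeat.and.finish, b = p.finish, c = p.repeat.or.finish), 4)
```

The unrounded values are 0.5568, 0.8016, and 0.8848, which round to the 0.56, 0.80, and 0.88 reported above.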

  1. A school received a grant to offer two extracurricular clubs, art and chess. Of the students 35% signed up for art club and 40% for chess club. Of those signing up for art club 20% signed up for chess club.
  1. What is the probability that a randomly selected student signed up for both clubs?

We know that 35% of students signed up for art club, and 20% of those students also signed up for chess club. So the probability of signing up for both is 0.35 * 0.20 = 0.07; that is, 7% of all students are in both clubs.

  1. What is the probability that a randomly selected student who signed up for chess club also signed up for art club?

We know that 7% of all students are in both clubs and that 40% of all students are in the chess club (which means 33% are ONLY in the chess club). Essentially, 7 out of every 40 chess club members are also in the art club. Dividing gives the conditional probability: p(Art | Chess) = p(Art and Chess) / p(Chess) = 0.07 / 0.40 = 0.175.

  1. What is the probability that a randomly chosen student signed up for at least one of these two clubs?

We know that 35% of students signed up for art, and 40% for chess, but we don’t want to double count the 7% of students who signed up for both. Thus, we add 0.35 + 0.40 - 0.07 = 0.68 to get the probability that a randomly chosen student signed up for at least one club.
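The same three answers can be reproduced in a few lines of R, which makes the direction of each conditional probability explicit. This is a sketch with illustrative variable names.

```r
# Quantities given in the problem
p.art <- 0.35
p.chess <- 0.40
p.chess.given.art <- 0.20

# (a) P(Art and Chess) via the multiplication rule
p.both <- p.art * p.chess.given.art

# (b) P(Art | Chess): condition on chess, so divide by P(Chess)
p.art.given.chess <- p.both / p.chess

# (c) P(Art or Chess) via the addition rule
p.either <- p.art + p.chess - p.both

c(both = p.both, art.given.chess = p.art.given.chess, either = p.either)
```

Note that P(Art | Chess) = 0.175 differs from the given P(Chess | Art) = 0.20 because the two clubs have different overall sign-up rates.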

  1. Records show that 442 students recently entered a Florida public school district. Of those 442, 84 have not received their vaccinations. The district’s physician is scheduled to go from school to school next Tuesday to give shots to those who need them. If we know that about 8% of students are absent on any given day, how many unvaccinated students are likely to miss the doctor visit?

The trick here is that being absent is independent of being unvaccinated (at least in this problem). That is, p(absent) = p(absent|unvaccinated) = 0.08. Thus, we can find the number of unvaccinated students likely to miss the doctor visit by multiplying the number of unvaccinated students by the probability of being absent: 0.08 * 84 = 6.72, or about 7 students.
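One way to frame this calculation is as the expected value of a binomial count. The sketch below assumes, beyond what the problem states, that each unvaccinated student is absent independently of the others, so the number absent is Binomial(84, 0.08); the standard deviation is included only to give a sense of how much the count could vary under that assumption.

```r
n.unvacc <- 84     # unvaccinated students (given)
p.absent <- 0.08   # daily absence rate (given)

# Expected number of unvaccinated students absent: n * p
expected.absent <- n.unvacc * p.absent

# ASSUMPTION: independent absences, so the count is binomial
# and its standard deviation is sqrt(n * p * (1 - p))
sd.absent <- sqrt(n.unvacc * p.absent * (1 - p.absent))

c(expected = expected.absent, sd = sd.absent)
```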

  1. For questions 1a - 1c, assume a variable is normally distributed with a mean of 180 and a standard deviation of 25. Please show all work (i.e., relevant code - hint: the pnorm/qnorm/dnorm family will be your friend here)
  1. What is the probability that a single draw from this distribution, labeled X, will be greater than 210?
pnorm(210,180,25,lower.tail=FALSE)
## [1] 0.1150697

The probability that a random draw from N(180,25^2) will be greater than 210 is p(X > 210) = 0.1150697.

  1. What is the probability that a single draw from this distribution, labeled Y, will be less than 182?
pnorm(182,180,25,lower.tail=TRUE)
## [1] 0.5318814

The probability that a random draw from N(180,25^2) will be less than 182 is p(Y < 182) = 0.5318814.

  1. What is the probability that a single draw from this distribution, labeled Z, will be between 160 and 192?
1- pnorm(160,180,25,lower.tail=TRUE) - pnorm(192,180,25,lower.tail=FALSE)
## [1] 0.4725309

The probability that a random draw from N(180,25^2) will be less than 192 and greater than 160 is p(160 < Z < 192) = 0.4725309. (Remember to set your tails correctly so you find 1 - area_left_of_160 - area_right_of_192.)
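An equivalent and arguably simpler way to get a "between" probability is to subtract two lower-tail areas: P(160 < Z < 192) = P(Z < 192) - P(Z < 160). The sketch below checks that this matches the expression used in the key; the names p.between and p.key are illustrative.

```r
mu <- 180
sigma <- 25

# Difference of two CDF values
p.between <- pnorm(192, mu, sigma) - pnorm(160, mu, sigma)

# The expression used in the key, for comparison
p.key <- 1 - pnorm(160, mu, sigma, lower.tail = TRUE) -
  pnorm(192, mu, sigma, lower.tail = FALSE)

all.equal(p.between, p.key)
round(p.between, 4)
```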

  1. The random variable x has the discrete probability distribution shown here:
| x    | -2   | -1   | 0    | 1    | 2    |
|------|------|------|------|------|------|
| p(x) | 0.15 | 0.15 | 0.35 | 0.25 | 0.10 |
  1. Find \(P(x \le 0)\)

The probability that X is less than or equal to 0 is \(P(x \le 0)\) = 0.15 + 0.15 + 0.35 = 0.65.

  1. Find \(P(x>-1)\)

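The probability that X is strictly greater than -1 is \(P(x>-1)\) = 0.35 + 0.25 + 0.10 = 0.70. (The inequality is strict, so x = -1 is excluded; including it would instead give \(P(x \ge -1)\) = 0.15 + 0.35 + 0.25 + 0.10 = 0.85.)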

  1. Find \(P(-1 \le x \le 1)\)

The probability that X is greater than or equal to -1 and less than or equal to 1 is: \(P(-1 \le x \le 1)\) = 0.15 + 0.35 + 0.25 = 0.75.

  1. Find \(P(x<2)\)

The probability that X is less than 2 is \(P(x<2)\) = 1 - 0.10 = 0.9.

  1. Find \(P(-1<x<2)\)

The probability that X is greater than -1 and less than 2 is: \(P(-1<x<2)\) = 0.35 + 0.25 = 0.6.
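Logical indexing in R makes it easy to check each of these sums and to see exactly which outcomes a strict or non-strict inequality includes. This is a sketch; the vector names x, p, and answers are illustrative.

```r
# Discrete distribution from the problem
x <- c(-2, -1, 0, 1, 2)
p <- c(0.15, 0.15, 0.35, 0.25, 0.10)

# Sanity check: a valid distribution sums to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

# Each probability is the sum of p over the outcomes satisfying the condition
answers <- c(
  "P(x <= 0)"       = sum(p[x <= 0]),
  "P(x > -1)"       = sum(p[x > -1]),
  "P(-1 <= x <= 1)" = sum(p[x >= -1 & x <= 1]),
  "P(x < 2)"        = sum(p[x < 2]),
  "P(-1 < x < 2)"   = sum(p[x > -1 & x < 2])
)
answers
```

The results are 0.65, 0.70, 0.75, 0.90, and 0.60, in that order.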

  1. Consider a sequence of independent coin flips, each of which has probability \(p\) of being heads. Define a random variable \(X\) as the length of the run (either heads or tails) started by the first trial. (For example, \(X=3\) if either TTTH or HHHT is observed.) Write the distribution of X and compute \(E(X)\).

SKIP THIS PROBLEM

  1. Consider flipping an unequally weighted coin with probability of success \(.55\). Say that \(X\) is the random variable associated with this coin and that \(X=1\) if heads (success), 0 otherwise. Simulate 20 coin flips and store them as a variable called “cflip.” In R, do this with cflip=rbinom(20,1,.55). Use ?rbinom to make sure you understand what each part of the command stands for.
cflip=rbinom(20,1,.55)
  1. Take 20 samples with replacement from cflip and compute the mean of the sample. In R: cflipsamp=sample(cflip,20,replace=T)
cflipsamp=sample(cflip,20,replace=TRUE)
mean(cflipsamp)
  1. Repeat step B (but not step A) 500 times, storing the mean of the sample each time. You could do this using a “for” loop to repeat the same step many times. An example of code for a loop is below. However, try to find a non-loop way of accomplishing the same thing (e.g., the replicate function). resamp.mean=rep(NA,500); for(i in 1:500){ resamp.mean[i]<-mean(sample(cflip,20,replace=T)) }
cflip.samp.500 <- replicate(500,mean(sample(cflip,20,replace=T)))
  1. Plot a histogram of the re-sampled means. Does it look like what we know from normal theory that the sampling distribution of the mean should look like in this case?
hist(cflip.samp.500)

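Yes, this is about what we would expect. Normal theory says the sampling distribution of the mean should be approximately normal, and the histogram is roughly bell-shaped. Strictly speaking, because we are re-sampling from cflip, the histogram is centered on the observed mean of cflip, which should itself be close to the true probability of 0.55.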
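To make the comparison with normal theory more concrete, one option is to overlay a normal density on the histogram and compare the spread of the re-sampled means with the plug-in standard error sqrt(p.hat * (1 - p.hat) / n). The sketch below is self-contained, so it regenerates cflip with an arbitrary seed (an assumption; the original key sets no seed), and the names boot.means, p.hat, and se.plugin are illustrative.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Step A: 20 weighted coin flips
cflip <- rbinom(20, 1, 0.55)

# Steps B and C: 500 re-sampled means
boot.means <- replicate(500, mean(sample(cflip, 20, replace = TRUE)))

# Center and spread implied by re-sampling from cflip
p.hat <- mean(cflip)
se.plugin <- sqrt(p.hat * (1 - p.hat) / 20)

c(center = mean(boot.means), p.hat = p.hat,
  boot.sd = sd(boot.means), se.plugin = se.plugin)

# Histogram on the density scale with the matching normal curve on top
hist(boot.means, freq = FALSE, main = "Re-sampled means", xlab = "")
curve(dnorm(x, mean = p.hat, sd = se.plugin), add = TRUE)
```

If the normal approximation is reasonable, the center should sit near p.hat and boot.sd should be close to se.plugin.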

  1. Experiment with increasing/decreasing the number of original samples (draws within one “sample”) and the number of re-samples (number of times you repeat the sampling process). Pay attention to the impact changing either has on the resulting distribution of sample means, and provide some form of visual demonstration (table and/or figure) to show how sample size changes your results.

I am going to run four different experiments: (1) 20 draws, repeated 20 times; (2) 20 draws, repeated 500 times; (3) 500 draws, repeated 20 times; and (4) 500 draws, repeated 500 times:

par(mfrow=c(2,2))
hist(replicate(20,mean(sample(cflip,20,replace=T))),main='20d, 20s',xlab='')
hist(replicate(500,mean(sample(cflip,20,replace=T))),main='20d, 500s',xlab='')
hist(replicate(20,mean(sample(cflip,500,replace=T))),main='500d, 20s',xlab='')
hist(replicate(500,mean(sample(cflip,500,replace=T))),main='500d, 500s',xlab='')

As shown, what makes the largest difference to the shape is the number of samples taken. Even when we take 500 draws, but only 20 samples, the histogram is too coarse to look very normal. The plots comparing 500 samples of 20 draws vs. 500 samples of 500 draws are fairly similar to one another in shape; nonetheless, there is a clear difference in spread. The 500 draw, 500 sample histogram is not only approximately normal but is tightly concentrated around its center, whereas the histogram for the 20 draw, 500 sample experiment has a much wider spread. (Because every experiment re-samples from the same 20 values in cflip, each histogram is centered on the mean of cflip rather than exactly on the true mean of 0.55.)
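A small table of summary statistics can complement the histograms by putting numbers on the difference in spread. The sketch below regenerates cflip with an arbitrary seed (an assumption, since the key sets none) and uses a hypothetical helper, resample.means, to run each of the four experiments.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible
cflip <- rbinom(20, 1, 0.55)

# Hypothetical helper: n.reps re-sampled means, each based on n.draws draws
resample.means <- function(n.draws, n.reps) {
  replicate(n.reps, mean(sample(cflip, n.draws, replace = TRUE)))
}

# All four combinations of draws and repetitions
grid <- expand.grid(draws = c(20, 500), reps = c(20, 500))
grid$mean <- NA_real_
grid$sd <- NA_real_

for (i in seq_len(nrow(grid))) {
  m <- resample.means(grid$draws[i], grid$reps[i])
  grid$mean[i] <- mean(m)
  grid$sd[i] <- sd(m)
}

grid
```

The sd column should be noticeably smaller in the rows with 500 draws, while the number of repetitions mainly affects how stable the estimates of mean and sd are.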

  1. Consider a population of donuts with an average survival time until becoming stale of \(\mu = 3\) days and a standard deviation of \(\sigma = 0.25\). If you take a sample of 24 donuts, what are the chances the mean failure time of your sample will be (use a t-distribution):

SKIP THIS PROBLEM

  1. Greater than 3 days?

  2. Less than 1.5 days?

  3. Between 2.25 and 3.75 days?

  1. (also problem 4.17/4.18 in Open Intro 3) Identify hypotheses. For each of the following situations, write the null and alternative hypotheses in words and then in symbols.
  1. New York is known as “the city that never sleeps”. A random sample of 25 New Yorkers were asked how much sleep they get per night. Do these data provide convincing evidence that New Yorkers on average sleep less than 8 hours a night?

Null: New Yorkers sleep 8 hours a night on average.

Alternative: New Yorkers sleep less than 8 hours a night on average.

\[H_0: \mu = 8\] \[H_A: \mu < 8\]

  1. Employers at a firm are worried about the effect of March Madness, a basketball championship held each spring in the US, on employee productivity. They estimate that on a regular business day employees spend on average 15 minutes of company time checking personal email, making personal phone calls, etc. They also collect data on how much company time employees spend on such non-business activities during March Madness. They want to determine if these data provide convincing evidence that employee productivity decreases during March Madness.

Null: Employees waste an average of 15 minutes of company time per day during March Madness.

Alternative: Employees waste more than 15 minutes of company time per day, on average, during March Madness.

\[H_0: \mu = 15\] \[H_A: \mu > 15\]

  1. Since 2008, chain restaurants in California have been required to display calorie counts of each menu item. Prior to menus displaying calorie counts, the average calorie intake of diners at a restaurant was 1100 calories. After calorie counts started to be displayed on menus, a nutritionist collected data on the number of calories consumed at this restaurant from a random sample of diners. Do these data provide convincing evidence of a difference in the average calorie intake of diners at this restaurant?

Null: Average calorie consumption after the new requirement is the same as before.

Alternative: Average calorie consumption after the new requirement is different than before.

\[H_0: \mu = 1100\] \[H_A: \mu \neq 1100\]

  1. Based on the performance of those who took the GRE exam between July 1, 2004 and June 30, 2007, the average Verbal Reasoning score was calculated to be 462. In 2011 the average verbal score was slightly higher. Do these data provide convincing evidence that the average GRE Verbal Reasoning score has changed since 2004?

Null: The average GRE verbal reasoning score has not changed since 2004.

Alternative: The average GRE verbal reasoning score has changed since 2004.

\[H_0: \mu = 462\] \[H_A: \mu \neq 462\]
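Although this problem only asks for the hypotheses, it may help to see how the direction of the alternative maps onto the alternative argument of R's t.test function. The sketch below uses the sleep example; the data are simulated and entirely hypothetical (the problem provides none), so the resulting p-values say nothing about actual New Yorkers.

```r
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

# HYPOTHETICAL data: 25 simulated sleep durations, not taken from the problem
sleep.hours <- rnorm(25, mean = 7.5, sd = 1)

# H_0: mu = 8 versus H_A: mu < 8 (one-sided, lower tail)
test.less <- t.test(sleep.hours, mu = 8, alternative = "less")

# H_0: mu = 8 versus H_A: mu != 8 (two-sided), shown for comparison
test.two.sided <- t.test(sleep.hours, mu = 8, alternative = "two.sided")

c(less = test.less$p.value, two.sided = test.two.sided$p.value)
```

The null value goes in mu, and "less", "greater", or "two.sided" corresponds to an alternative written with <, >, or "not equal to".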

  1. (also 4.30 in Open Intro 3) A food safety inspector is called upon to investigate a restaurant with a few customer reports of poor sanitation practices. The food safety inspector uses a hypothesis testing framework to evaluate whether regulations are not being met. If he decides the restaurant is in gross violation, its license to serve food will be revoked.
  1. Write the hypotheses in words.

Null: The restaurant is not in gross violation.

Alternative: The restaurant is in gross violation.

  1. What is a Type 1 Error in this context?

Concluding that the restaurant is in gross violation when it is not.

  1. What is a Type 2 Error in this context?

Failing to shut down the restaurant when it is in fact in gross violation.

  1. Which error is more problematic for the restaurant owner? Why?

For the restaurant owner, a Type I error is more problematic, since it means the restaurant will be shut down when it in fact should not be.

  1. Which error is more problematic for the diners? Why?

For diners, a Type II error is more problematic, since it means they could be exposed to an unsafe restaurant that should be shut down.

Report your process

You’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc. Give credit to your sources, whether it’s a blog post, a fellow student, an online tutorial, etc.

Rubric

Minus: Didn’t tackle at least 3 tasks. Didn’t interpret anything but left it all to the “reader”. Or more than one technical problem that is relatively easy to fix. It’s hard to find the report in our repo.

Check: Completed, but not fully accurate and/or readable. Requires a bit of detective work on my part to see what you did

Check plus: Hits all the elements. No obvious mistakes. Pleasant to read. No heroic detective work required. Solid.

The command below is helpful for debugging, please don’t change it

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    formatR_1.2     tools_3.2.2     htmltools_0.2.6
##  [5] yaml_2.1.13     stringi_0.5-5   rmarkdown_0.7   knitr_1.10.5   
##  [9] stringr_1.0.0   digest_0.6.8    evaluate_0.7