Lesson 11 December 2020

Practice with regression and statistical hypothesis II

Exercise 1

Read the dataset in the file rats.txt, which describes the survival time of rats that were each given one of 4 types of poison and one of 3 types of antidote.

  1. Create a variable mortality, equal to the inverse of the survival time.
  2. Fit a normal distribution to the mortality data and plot the resulting density superimposed on a histogram of the data. Comment on the result.
  3. Compute the mean and standard deviation of the mortality for each combination of poison and antidote.
  4. Study how the mean mortality depends on poison and antidote, using an appropriate linear model. Comment on the results.
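
A minimal R sketch of the four steps follows. Since rats.txt is not reproduced here, the data are simulated, and the column names (time, poison, antidote) are assumptions that may not match the real file. It is meant as an outline of one possible workflow, not as the reference solution.

```r
library(MASS)  # for fitdistr()

# Simulated stand-in for rats.txt; column names are hypothetical
set.seed(1)
rats <- expand.grid(poison = factor(1:4), antidote = factor(1:3), rep = 1:4)
rats$time <- rgamma(nrow(rats), shape = 8, scale = 0.06)

# 1. Mortality as the inverse of the survival time
rats$mortality <- 1 / rats$time

# 2. Normal fit superimposed on a density-scaled histogram
fit <- fitdistr(rats$mortality, "normal")
hist(rats$mortality, freq = FALSE, main = "Mortality", xlab = "1 / time")
curve(dnorm(x, fit$estimate["mean"], fit$estimate["sd"]), add = TRUE)

# 3. Mean and standard deviation for each poison/antidote combination
cell_stats <- aggregate(mortality ~ poison + antidote, data = rats,
                        FUN = function(v) c(mean = mean(v), sd = sd(v)))
print(cell_stats)

# 4. Two-way linear model with interaction, summarised by its ANOVA table
model <- lm(mortality ~ poison * antidote, data = rats)
print(anova(model))
```

With the real data, the simulated block would be replaced by a call such as read.table("rats.txt", header = TRUE), assuming the file has a header row.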

Exercise 2

Import the dataset containing information on the finishers of the Napa Valley Marathon in 2015. The dataset includes Age, Gender, hours to complete the marathon, mean speed in miles per hour and overall ranking.

  1. Compute the summary values of all variables, separately for Females and Males.
  2. Define a new variable kmph by converting the speed from miles per hour to km per hour. Hint: one mile is 1.60934 km.
  3. Visualize the distribution of the speed in kmph by gender using boxplots.
  4. Using the t-test, check whether the mean speeds of the two genders are statistically different. Describe all the lines of the output, discuss whether the test is appropriate and comment on the result.
  5. We suspect that the distribution of the speed in kmph follows a shifted gamma distribution. Write the log-likelihood for the shifted gamma distribution and compute the maximum likelihood estimates of its parameters (shift, shape and scale).
  6. Find the 95% confidence intervals for the three parameters (shift, shape and scale).
  7. Using an adequate plot, visually compare the estimated distribution with the data.
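
For items 5 to 7, one possible approach is sketched below in R. The shifted gamma density is the ordinary gamma density evaluated at x - shift, so the log-likelihood is a sum of dgamma(..., log = TRUE) terms. The data are simulated (the marathon file is not reproduced here), the starting values are a rough moment-based guess, and the confidence intervals are Wald intervals from the numerical Hessian; all of these are assumptions of the sketch, and the optimisation can be sensitive to the starting point.

```r
# Simulated stand-in for kmph (true values: shift 6, shape 4, scale 0.8)
set.seed(42)
kmph <- 6 + rgamma(1000, shape = 4, scale = 0.8)

# Negative log-likelihood of the shifted gamma; par = c(shift, shape, scale)
negloglik <- function(par, x) {
  shift <- par[1]; shape <- par[2]; scale <- par[3]
  # Outside the parameter space the likelihood is zero: return a large penalty
  if (shape <= 0 || scale <= 0 || shift >= min(x)) return(1e10)
  -sum(dgamma(x - shift, shape = shape, scale = scale, log = TRUE))
}

# Rough starting values: shift slightly below the minimum, then moments
s0 <- min(kmph) - 0.5
m0 <- mean(kmph - s0)
v0 <- var(kmph)
start <- c(s0, m0^2 / v0, v0 / m0)

fit <- optim(start, negloglik, x = kmph, hessian = TRUE,
             control = list(maxit = 5000))
est <- setNames(fit$par, c("shift", "shape", "scale"))

# Approximate 95% Wald intervals from the inverse Hessian
se <- sqrt(diag(solve(fit$hessian)))
ci <- cbind(lower = est - 1.96 * se, upper = est + 1.96 * se)
print(est)
print(ci)

# Visual check: fitted density over a density-scaled histogram
hist(kmph, freq = FALSE, breaks = 30, main = "Shifted gamma fit")
curve(dgamma(x - est["shift"], shape = est["shape"], scale = est["scale"]),
      add = TRUE)
```

The penalty value returned outside the parameter space keeps the default Nelder-Mead search inside the region where shift is below the smallest observation.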

Exercise 3

Load the library MASS and consider the dataset “Pima.tr”, which contains data on a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona. The subjects were tested for diabetes according to World Health Organization criteria.

  1. Check if the data are read correctly. Compute the mean, quartiles and median of the numerical variables.

  2. According to the WHO, a BMI of less than 18.5 is considered ‘underweight’ and may indicate malnutrition, an eating disorder, or other health problems, while a BMI equal to or greater than 25 is considered ‘overweight’ and above 30 is considered ‘obese’. Define a new categorical variable ‘Obese’, which divides the patients according to the WHO definition of obesity based on the BMI. The following analyses refer to this modified dataset.

  3. Using box plots, visualize how the variables ‘glu’ and ‘bp’ vary with ‘Obese’. Comment on the result.

  4. Using t.test, check whether the mean ‘bp’ of the ‘Obese’ group is statistically different from that of the other group. Comment on the result.
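
The steps above could be outlined in R as follows. Pima.tr ships with MASS, so the sketch runs as is; the labels "yes"/"no" for Obese and the strict cut-off bmi > 30 (following the wording "above 30") are choices made for this example.

```r
library(MASS)  # provides Pima.tr

pima <- Pima.tr

# 1. Check the structure, then mean, quartiles and median of each variable
str(pima)
print(summary(pima))

# 2. Obesity indicator; "above 30" is read here as bmi > 30
pima$Obese <- factor(ifelse(pima$bmi > 30, "yes", "no"))
print(table(pima$Obese))

# 3. Box plots of glu and bp by obesity status
par(mfrow = c(1, 2))
boxplot(glu ~ Obese, data = pima, ylab = "glu")
boxplot(bp ~ Obese, data = pima, ylab = "bp")

# 4. Two-sample t-test; R defaults to the Welch version (unequal variances)
bp_test <- t.test(bp ~ Obese, data = pima)
print(bp_test)
```

Passing var.equal = TRUE to t.test would give the classical pooled-variance test instead, which is only appropriate if the two groups have similar variances.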

Exercise 4

Consider the dataset defined and modified in the previous exercise.

  1. Define a linear model for the variable ‘glu’ as a function of ‘bp’ (predictor). Comment on the results and explain the meaning of each row of the summary. In addition, plot the regression line against the experimental data, analyze the residuals and compute the confidence intervals for the regression coefficients.

  2. Define a linear model for the variable ‘glu’ as a function of ‘bp’ and ‘Obese’ (predictors). Compare the results with the previous regression model and comment on them. Compare the AIC of both models and explain whether it is worth considering the latter.

  3. Perform the ANOVA test on the latter regression model and comment on it.

  4. Define a linear model for the variable ‘glu’ as a function of all the remaining variables (predictors). Then, use the step function to select a formula. Comment on the variables selected by this method and interpret the results.
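
A compact R outline of this exercise is given below. It rebuilds the Obese variable so that it runs on its own; the object names (m1, m2, full, selected) and the cut-off bmi > 30 are choices made for this sketch. The interpretation of the output is left to the exercise.

```r
library(MASS)  # provides Pima.tr

pima <- Pima.tr
pima$Obese <- factor(ifelse(pima$bmi > 30, "yes", "no"))

# 1. Simple regression of glu on bp, with confidence intervals and plots
m1 <- lm(glu ~ bp, data = pima)
print(summary(m1))
print(confint(m1))
plot(glu ~ bp, data = pima)
abline(m1)
par(mfrow = c(2, 2))
plot(m1)  # standard residual diagnostics

# 2. Add Obese as a second predictor and compare the AIC values
m2 <- lm(glu ~ bp + Obese, data = pima)
print(summary(m2))
print(AIC(m1, m2))

# 3. Sequential ANOVA table of the larger model
print(anova(m2))

# 4. Full model, then backward selection by AIC with step()
full <- lm(glu ~ ., data = pima)
selected <- step(full, trace = 0)
print(formula(selected))
```

Because m1 is nested in m2, anova(m1, m2) would also give an F-test for the contribution of Obese.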

More exercises

Exercise 5

Consider the dataset containing economic and demographic data on 154 nations, such as literacy rate (%), gross domestic product (billions of $) and military expenses ($). Also import the dataset containing the list of 242 countries and their continents.

  1. Merge the two datasets, removing the countries with no information on the continent.
  2. Replace the NAs of ‘gdp’ with the median computed on the other countries of the same continent.
  3. Using box plots, visualize how the variable ‘gdp’ varies with the continent. Comment on the result.
  4. Can we consider a transformation of the ‘gdp’ variable to better compare the data? Which one? Can we consider the logarithm?
  5. Using the likelihood ratio, discuss whether the mean of log(gdp) is statistically different between Europe and Oceania. Compare the result with that of t.test and comment on it.
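
The merge, the per-continent imputation and the likelihood ratio comparison might be sketched in R as below. The two data frames are tiny made-up examples with hypothetical column names (country, gdp, continent), and the likelihood ratio is computed by comparing two normal linear models with a common variance; its chi-square approximation is only reliable for reasonably large samples.

```r
# Toy data with hypothetical column names; the real files will differ
nations <- data.frame(
  country = c("A", "B", "C", "D", "E", "F", "G"),
  gdp     = c(10, NA, 30, 200, NA, 400, 5)
)
continents <- data.frame(
  country   = c("A", "B", "C", "D", "E", "F", "H"),
  continent = c("Europe", "Europe", "Europe",
                "Oceania", "Oceania", "Oceania", "Asia")
)

# 1. Inner join: countries without a continent entry (here G) are dropped
merged <- merge(nations, continents, by = "country")

# 2. Replace each NA with the median gdp of the same continent
med <- ave(merged$gdp, merged$continent,
           FUN = function(v) median(v, na.rm = TRUE))
na_rows <- is.na(merged$gdp)
merged$gdp[na_rows] <- med[na_rows]

# 3. Box plot of log(gdp) by continent
boxplot(log(gdp) ~ continent, data = merged)

# 5. Likelihood ratio: common mean versus one mean per continent
two <- merged[merged$continent %in% c("Europe", "Oceania"), ]
m_null <- lm(log(gdp) ~ 1, data = two)
m_alt  <- lm(log(gdp) ~ continent, data = two)
lr <- as.numeric(2 * (logLik(m_alt) - logLik(m_null)))
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)
print(c(statistic = lr, p.value = p_lr))

# Classical pooled-variance t-test for comparison
print(t.test(log(gdp) ~ continent, data = two, var.equal = TRUE))
```

The models differ by one parameter, hence the single degree of freedom in the chi-square reference distribution.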

Exercise 6

Consider the dataset, which includes brain and body weight, life span, gestation time, time sleeping, and predation and danger indices for 62 mammals (see the dataset description).

  1. Import the dataset and analyze the variables. After reading the description and considering only the variables ‘BodyWt’, ‘BrainWt’, ‘TotSleep’, ‘GestTime’ and ‘SpleepExp’, replace any missing values with the median of the corresponding variable.

  2. Using histograms, visualize the distribution of the variables ‘BodyWt’, ‘BrainWt’, ‘TotSleep’, ‘GestTime’.

  3. Can we use the ANOVA test to see whether the means of the variable ‘GestTime’ in the groups defined by ‘SpleepExp’ are statistically different? Is the homoscedasticity assumption satisfied? Perform the test.

  4. Apply a logarithmic transformation to the variable ‘BodyWt’. Define a linear model for the variable ‘GestTime’ as a function of ‘logBodyWt’ (predictor). Comment on the results and explain the meaning of each row of the summary. In addition, plot the regression line against the experimental data and analyze the residuals.
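
As a rough illustration of items 1, 3 and 4, the R sketch below uses simulated data in place of the mammals file, keeping only the variable names quoted in the exercise. Bartlett's test is used here as one possible check of equal variances; it is a choice made for this example, and it is itself sensitive to non-normal data.

```r
# Simulated stand-in for the mammals data (62 rows, 5 exposure groups)
set.seed(7)
n <- 62
sleep <- data.frame(
  BodyWt    = exp(rnorm(n, mean = 1, sd = 2.5)),
  SpleepExp = factor(rep(1:5, length.out = n))
)
sleep$GestTime <- exp(4 + 0.25 * log(sleep$BodyWt) + rnorm(n, sd = 0.4))
sleep$GestTime[c(3, 17)] <- NA  # artificial missing values

# 1. Replace missing values with the median of the same variable
sleep$GestTime[is.na(sleep$GestTime)] <- median(sleep$GestTime, na.rm = TRUE)

# 3. Check equal variances across groups, then one-way ANOVA
print(bartlett.test(GestTime ~ SpleepExp, data = sleep))
print(summary(aov(GestTime ~ SpleepExp, data = sleep)))

# 4. Regression of GestTime on the log of the body weight
sleep$logBodyWt <- log(sleep$BodyWt)
fit <- lm(GestTime ~ logBodyWt, data = sleep)
print(summary(fit))
plot(GestTime ~ logBodyWt, data = sleep)
abline(fit)
par(mfrow = c(1, 2))
plot(fit, which = 1:2)  # residuals vs fitted, normal Q-Q plot
```

If the equal-variance check fails, oneway.test(GestTime ~ SpleepExp, data = sleep) offers a Welch-type alternative that does not assume homoscedasticity.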

© 2017-2020 Federico Reali