Practice with regression and statistical hypothesis testing II

Exercise 1

Load the library MASS and consider the dataset “Pima.tr”, which contains data on a population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona. The subjects were tested for diabetes according to World Health Organization criteria.

Check if the data are read correctly. Compute the mean, quartiles and median of the numerical variables.
According to the WHO, a BMI of less than 18.5 is considered ‘underweight’ and may indicate malnutrition, an eating disorder, or other health problems, while a BMI equal to or greater than 25 is considered ‘overweight’ and a BMI of 30 or more is considered ‘obese’. Define a new categorical variable ‘Obese’ that divides the patients according to the WHO definition of obesity based on BMI. The analyses that follow refer to this modified dataset.
Using box plots, visualize the variables ‘glu’ and ‘bp’ across the levels of ‘Obese’. Comment on the result.
Using t.test, check whether the mean ‘bp’ of the ‘Obese’ group is statistically different from that of the other group. Comment on the result.
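
The steps of Exercise 1 could be sketched as follows. This is only one possible starting point, not a reference solution. It assumes that ‘Obese’ is coded as a two-level factor ("Yes"/"No") with obesity taken as a BMI of 30 or more; the level names and the name `pima` for the working copy are illustrative choices, not part of the exercise text.

```r
library(MASS)  # provides the Pima.tr data frame

pima <- Pima.tr   # working copy (the name 'pima' is an arbitrary choice)
str(pima)         # check that 'type' is a factor and the other columns are numeric
summary(pima)     # minimum, quartiles, median and mean of each numeric variable

# Assumption: obesity is taken as BMI >= 30; the labels "Yes"/"No" are illustrative
pima$Obese <- factor(ifelse(pima$bmi >= 30, "Yes", "No"))
table(pima$Obese)

# Box plots of 'glu' and 'bp' by 'Obese', side by side
par(mfrow = c(1, 2))
boxplot(glu ~ Obese, data = pima, ylab = "glu")
boxplot(bp ~ Obese, data = pima, ylab = "bp")
par(mfrow = c(1, 1))

# Two-sample t-test on 'bp'; by default t.test does not assume equal variances (Welch)
tt <- t.test(bp ~ Obese, data = pima)
print(tt)
```

The Welch version is used here simply because it is the default of t.test; passing var.equal = TRUE gives the classical pooled-variance test, and which one is appropriate is part of the comment the exercise asks for.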

Exercise 2

Consider the dataset defined and modified in the previous exercise.

Define a linear model for the variable ‘glu’ as a function of ‘bp’ (predictor). Comment on the results and explain the meaning of each row of the summary. In addition, plot the regression line against the experimental data, analyze the residuals and compute the confidence intervals for the regression coefficients.
Define a linear model for the variable ‘glu’ as a function of ‘bp’ and ‘Obese’ (predictors). Compare the results with the previous regression model and comment on them. Compare the AIC of the two models and explain whether it is worth considering the latter.
Perform the ANOVA test on the latter regression model and comment on it.
Define a linear model for the variable ‘glu’ as a function of all the remaining variables (predictors). Then, use the step function to determine a formula. Comment on the variables selected by this method and interpret the results.
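
A minimal sketch of the modelling steps of Exercise 2 is given below. It rebuilds ‘Obese’ under the same assumption as before (BMI of 30 or more, levels "Yes"/"No", both illustrative) so that the block runs on its own. The object names `m1`, `m2`, `full` and `sel` are arbitrary.

```r
library(MASS)

pima <- Pima.tr
# Assumption: same coding of 'Obese' as in Exercise 1 (BMI >= 30)
pima$Obese <- factor(ifelse(pima$bmi >= 30, "Yes", "No"))

# Simple regression of glu on bp
m1 <- lm(glu ~ bp, data = pima)
summary(m1)    # coefficients, residual standard error, R-squared, F statistic
confint(m1)    # 95% confidence intervals for intercept and slope

# Regression line over the observed data
plot(glu ~ bp, data = pima)
abline(m1, col = "red")

# Standard residual diagnostics (residuals vs fitted, Q-Q plot, ...)
par(mfrow = c(2, 2))
plot(m1)
par(mfrow = c(1, 1))

# Model with the additional categorical predictor
m2 <- lm(glu ~ bp + Obese, data = pima)
summary(m2)
AIC(m1, m2)    # the lower AIC indicates the preferred model
anova(m2)      # sequential (type I) ANOVA table for m2

# Full model and stepwise selection by AIC (backward, since no scope is given)
full <- lm(glu ~ ., data = pima)
sel <- step(full, trace = 0)
formula(sel)
summary(sel)
```

Note that `glu ~ .` includes both ‘bmi’ and the derived ‘Obese’; whether to keep both in the starting model is a judgment call that the comment on the selected variables should address.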

Exercise 3

Consider the dataset containing economic and demographic data about 154 nations, such as literacy rate (%), gross domestic product (billions of $) and military expenses ($). Also import the dataset containing the list of 242 countries and their continents.

Merge the two datasets, removing the countries with missing continent information.
Replace the NAs of ‘gdp’ with the median computed on the other countries of the same continent.
Using box plots, visualize the variable ‘gdp’ by continent. Comment on the result.
Can we consider a transformation of the ‘gdp’ variable to better compare the data? Which one? Can we consider the logarithm?
Using the likelihood ratio, discuss whether the mean of log(gdp) is statistically different between Europe and Oceania. Compare the results with those of t.test and comment on them.
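
The workflow of Exercise 3 might look like the sketch below. Since the files are not named here, it runs on two tiny synthetic data frames: the values, the key column `country` and the column name `continent` are all made up for illustration, and only ‘gdp’ comes from the text. The likelihood ratio test is implemented as a comparison of two nested Gaussian linear models (common mean versus one mean per continent), which is one common way to set it up and not necessarily the one intended in the course.

```r
# Synthetic stand-ins for the two datasets (all values and key names are hypothetical)
nations <- data.frame(
  country = paste0("C", 1:12),
  gdp = c(120, NA, 300, 45, 80, NA, 15, 22, 500, 60, 35, 10)
)
continents <- data.frame(
  country = paste0("C", 1:14),
  continent = c(rep("Europe", 6), rep("Oceania", 5), NA, "Asia", "Asia")
)

# Merge on the country key, then drop rows without a continent
merged <- merge(nations, continents, by = "country")
merged <- merged[!is.na(merged$continent), ]
merged$continent <- factor(merged$continent)

# Replace missing gdp with the median of the same continent
miss <- is.na(merged$gdp)
med <- ave(merged$gdp, merged$continent,
           FUN = function(x) median(x, na.rm = TRUE))
merged$gdp[miss] <- med[miss]

# Box plots on the raw and on the logarithmic scale
par(mfrow = c(1, 2))
boxplot(gdp ~ continent, data = merged, ylab = "gdp")
boxplot(log(gdp) ~ continent, data = merged, ylab = "log(gdp)")
par(mfrow = c(1, 1))

# Likelihood ratio test: common mean (m0) vs one mean per continent (m1)
sub <- droplevels(subset(merged, continent %in% c("Europe", "Oceania")))
sub$lgdp <- log(sub$gdp)
m0 <- lm(lgdp ~ 1, data = sub)
m1 <- lm(lgdp ~ continent, data = sub)
lr <- as.numeric(2 * (logLik(m1) - logLik(m0)))
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)  # asymptotic chi-square, 1 df

# Pooled-variance t-test for comparison (same equal-variance assumption as the models)
tt <- t.test(lgdp ~ continent, data = sub, var.equal = TRUE)
c(LR = lr, p_LR = p_lr, p_t = tt$p.value)
```

The chi-square reference distribution for the likelihood ratio is only an asymptotic approximation, so with small groups its p-value can differ noticeably from the exact t-test p-value; that difference is worth discussing in the comment.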

Exercise 4

Consider the dataset, which includes brain and body weight, life span, gestation time, time sleeping, and predation and danger indices for 62 mammals (description).

Import the dataset and analyze the variables. After reading the description and considering only the variables ‘BodyWt’, ‘BrainWt’, ‘TotSleep’, ‘GestTime’ and ‘SpleepExp’, replace any missing values with the median of the same variable.
Using histograms, visualize the distribution of the variables ‘BodyWt’, ‘BrainWt’, ‘TotSleep’, ‘GestTime’.
Can we use the ANOVA test to see if the means of the variable ‘GestTime’ in the groups defined by ‘SpleepExp’ are statistically different? Is the homoscedasticity assumption satisfied? Compute the test.
Apply a logarithmic transformation to the variable ‘BodyWt’. Define a linear model for the variable ‘GestTime’ as a function of ‘logBodyWt’ (predictor). Comment on the results and explain the meaning of each row of the summary. In addition, plot the regression line against the experimental data and analyze the residuals.
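
A possible outline for Exercise 4 is sketched below on a small synthetic data frame, because the file itself is not named here. The 15 rows are invented for illustration and are not the real measurements; only the column names are taken from the text. Bartlett's test is used as one option for checking homoscedasticity, which is an assumption of this sketch and not a requirement of the exercise.

```r
# Synthetic stand-in for the mammals data (values are made up, not the real ones)
mammals <- data.frame(
  BodyWt    = c(0.02, 0.1, 0.5, 1.0, 2.5, 3.5, 10, 27, 60, 85,
                160, 250, 520, 2500, 6600),
  BrainWt   = c(0.3, 1, 4, 6.6, 12, 25, 115, 180, 81, 325,
                169, 490, 655, 4600, 5700),
  TotSleep  = c(19.7, 14.5, 12.5, NA, 13, 10.8, 9.8, 9.6, 8.4, 3.8,
                NA, 3.9, 2.9, 3.9, 3.3),
  GestTime  = c(35, 21, 28, 42, NA, 63, 180, 170, 150, 151,
                336, 281, 400, 624, 645),
  SpleepExp = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5)
)

# Keep the variables of interest and replace NAs with the column median
vars <- c("BodyWt", "BrainWt", "TotSleep", "GestTime", "SpleepExp")
mammals <- mammals[vars]
for (v in vars) {
  miss <- is.na(mammals[[v]])
  mammals[[v]][miss] <- median(mammals[[v]], na.rm = TRUE)
}

# Histograms of the four continuous variables
par(mfrow = c(2, 2))
for (v in c("BodyWt", "BrainWt", "TotSleep", "GestTime")) {
  hist(mammals[[v]], main = v, xlab = v)
}
par(mfrow = c(1, 1))

# Homoscedasticity check (Bartlett's test) and one-way ANOVA
mammals$SpleepExp <- factor(mammals$SpleepExp)
bt <- bartlett.test(GestTime ~ SpleepExp, data = mammals)
print(bt)
fit_aov <- aov(GestTime ~ SpleepExp, data = mammals)
summary(fit_aov)

# Regression of GestTime on log body weight
mammals$logBodyWt <- log(mammals$BodyWt)
m <- lm(GestTime ~ logBodyWt, data = mammals)
summary(m)
plot(GestTime ~ logBodyWt, data = mammals)
abline(m, col = "red")

# Residual diagnostics
par(mfrow = c(2, 2))
plot(m)
par(mfrow = c(1, 1))
```

Bartlett's test is sensitive to departures from normality, so the histograms are worth consulting before relying on its p-value.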