October exercises

Data manipulation exercises

Originally posted here.

Exercise 1

Consider the dataset (description) that contains information about the hourly traffic volume on Interstate I-94.

  1. Import the dataset into R and verify the correctness of the content. Replace any erroneous values with the median of the same variable.
  2. Define an R function that converts a temperature from Kelvin to degrees Celsius. Using this function, create a new variable TempC that contains the temperatures in Celsius.
  3. Create a new variable rain_YN that is 0 if rain_1h is 0, and 1 otherwise.
  4. For each value of rain_YN, compute the minimum, maximum, average and median of the temperature (in Celsius) and of the traffic volume. Using box plots, visualize the distribution of the traffic volume for each value of rain_YN.
  5. Using the previous results, discuss the common belief that there is more traffic on rainy days.
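
The steps of Exercise 1 can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the values are made up, the column names temp and traffic_volume are hypothetical (only rain_1h is named above), and treating a temperature of 0 K as an error is one possible validity rule, not necessarily the intended one.

```r
# Toy data; temp and traffic_volume are assumed column names,
# and all values are invented for illustration.
traffic <- data.frame(
  temp           = c(288.15, 0, 273.15, 293.15),  # 0 K is treated as an error here
  rain_1h        = c(0, 0.25, 0, 1.5),
  traffic_volume = c(5000, 4200, 3100, 4800)
)

# Replace implausible temperatures with the median of the valid values
bad <- traffic$temp <= 0
traffic$temp[bad] <- median(traffic$temp[!bad])

# Convert a temperature from Kelvin to degrees Celsius
kelvin_to_celsius <- function(k) k - 273.15

traffic$TempC   <- kelvin_to_celsius(traffic$temp)
traffic$rain_YN <- as.integer(traffic$rain_1h != 0)

# Summary statistics of the traffic volume for each value of rain_YN
stats <- function(x) c(min = min(x), max = max(x), mean = mean(x), median = median(x))
aggregate(traffic_volume ~ rain_YN, data = traffic, FUN = stats)

# Distribution of the traffic volume for each value of rain_YN
boxplot(traffic_volume ~ rain_YN, data = traffic)
```

Here the median is computed on the valid values only, so that the erroneous entries do not influence their own replacement. The same aggregate() call can be repeated with TempC on the left-hand side of the formula.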

Exercise 2

Consider the dataset that contains data on the Florida Gators, the football team representing the University of Florida.

  1. Import the file into R. Suggestion: convert the file to .csv.
  2. From the original dataset, select the columns Year, YearGame, Date, Opponent, OppWins_1A and GatorOutcm. All the following analyses refer to this reduced dataset.
  3. Analyze the data structure and extract the average, standard deviation, quartiles, minimum and maximum (for all variables).
  4. Determine the team against which the Florida Gators competed most often. How many times have they won, lost and tied with that team?
  5. Regarding the games with this team, calculate the mean and standard deviation for the variable OppWins_1A.
  6. Using box plots, visualize the distribution of the variable OppWins_1A by game outcome.
  7. Based on the data, against which of “Alabama”, “Kentucky” and “Tennessee” do the Gators have the highest chance of winning?
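
The counting steps of Exercise 2 might look like the sketch below. It uses a tiny invented data frame with placeholder opponents ("Team A", "Team B"), and it assumes that GatorOutcm is coded as "W", "L" and "T"; the actual coding in the file may differ.

```r
# Invented games; opponent names and the W/L/T coding are assumptions
games <- data.frame(
  Opponent   = c("Team A", "Team B", "Team A", "Team A", "Team B"),
  GatorOutcm = c("W", "L", "L", "W", "T"),
  OppWins_1A = c(7, 9, 10, 5, 8),
  stringsAsFactors = FALSE
)

# Structure and basic summaries of every column
str(games)
summary(games)

# Opponent faced most often
most_played <- names(which.max(table(games$Opponent)))

# Outcomes, mean and standard deviation of OppWins_1A against that opponent
vs_top <- games[games$Opponent == most_played, ]
table(vs_top$GatorOutcm)
mean(vs_top$OppWins_1A)
sd(vs_top$OppWins_1A)

# Distribution of OppWins_1A by game outcome
boxplot(OppWins_1A ~ GatorOutcm, data = games)

# Empirical share of wins against each opponent
win_rate <- tapply(games$GatorOutcm == "W", games$Opponent, mean)
win_rate
```

The share of wins is only a rough estimate of the chance of winning, and it is less reliable for opponents with few recorded games.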

Exercise 3

  1. Import the dataset about the Roman emperors and check whether the variables are read correctly.
  2. Convert the columns birth, death, reign_start and reign_end to the appropriate type.
  3. Define a new column reign_length that contains the length of the reign of the emperor. Note that some dates are B.C.
  4. Define a new column that expresses reign_length in years, rounding the result. Which emperor had the longest reign? And the shortest?
  5. Visualize and compare the distribution of the variable reign_length in years using a plot, a histogram and a box plot. Which one is the most informative? Which is the least?
  6. For each dynasty, compute the mean reign length and the number of emperors of the dynasty.
  7. For each dynasty, compute the percentage of emperors killed by assassination.
  8. Suppose that the number of emperors of the Gordian dynasty who died by assassination follows a binomial distribution with success probability p. Compute and plot the likelihood as a function of p. Estimate the value of p that maximizes the likelihood.
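
For the last point, a minimal sketch of the likelihood computation is given below. The counts n and k are hypothetical placeholders, not values taken from the dataset; they should be replaced with the number of emperors in the dynasty and the number of those who were assassinated.

```r
n <- 5  # hypothetical number of emperors in the dynasty
k <- 2  # hypothetical number of emperors killed by assassination

# Binomial likelihood of p given k assassinations out of n emperors
likelihood <- function(p) dbinom(k, size = n, prob = p)

# Evaluate and plot the likelihood on a grid of values of p
p_grid <- seq(0, 1, by = 0.01)
lik <- likelihood(p_grid)
plot(p_grid, lik, type = "l", xlab = "p", ylab = "Likelihood")

# Maximizer on the grid and via numerical optimization
p_hat_grid <- p_grid[which.max(lik)]
p_hat_opt  <- optimize(likelihood, interval = c(0, 1), maximum = TRUE)$maximum
```

For a binomial model the maximum likelihood estimate has the closed form k / n, so both numerical estimates should be close to that ratio.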

© 2017-2019 Federico Reali