General comments:
.Rmd
, and the generated output file .html
or .pdf
on Canvas before the due date/time..Rmd
file compiles without any errors. The marker will not spend time fixing the bugs in your code.tidyverse
packages to answer the questions in this assignment. Please use dplyr
for data wrangling/manipulation, ggplot2
for data visualisation, and lubridate
for dates/times. Some parts of Problem 2 will use plots from the fpp3
packages.Due: Friday 15 March 2023 at 16:00 PM (NZ time)
The lakers
data set (in the lubridate
package) contains play-by-play statistics of each Los Angeles Lakers basketball game in the 2008-2009 regular season. It contains the following variables:
Variable | Description |
---|---|
date |
Date of the game |
opponent |
Name of the opposition team |
game_type |
Home or away game |
time |
Time remaining on the game clock in a given period (counting down from 12 minutes) |
period |
The period of play (most games have four quarters, each 12 minutes in duration, noting that some games go into a 5-minute duration overtime if tied at the end of regular play) |
etype |
The type of play made (e.g., shot, turnover, rebound) |
team |
Name of the NBA team the player who made the play belongs to |
player |
Name of the player that the play was made by |
result |
Whether they won or lost the game |
type |
A more detailed description of the type of play made |
x |
The \(x\)-coordinate on the field of play (in ft) |
y |
The \(y\)-coordinate on the field of play (in ft) |
- 6 Marks
- Read in the
lakers
data set and convert this into atibble
object.- Keep only the rows relating to Kobe Bryant. Name this object
kobe
.- Transform the
date
variable into alubridate
date format (noting that it is currently in integer format).- Shot location is given by
x
andy
. The center of the hoop is located at the coordinates \((25, 5.25)\). Center thex
andy
variables to \((0, 0)\); you will want to overwritex
andy
in yourkobe
data set.
lakers <- as_tibble(lakers)
kobe <- lakers %>%
filter(player == "Kobe Bryant")
kobe$date <- ymd(kobe$date)
kobe$x <- kobe$x - 25
kobe$y <- kobe$y - 5.25
- 6 Marks
- Subset the
kobe
data set by only considering plays that are shot attempts (i.e., whereetype
is equal toshot
). Name this new data setkobe.shot
.- Make a scatter plot of the centered shot location, colouring the points by
result
. You should use thegeom_point
layer.- Set the transparency of the points to
alpha = 0.5
.- Use the default colour scheme, but reverse the colour order so that shots made is green(ish) and shots missed is red(ish). Hint: You can use
scale_colour_discrete
with an additional argument.
kobe.shot <- kobe %>%
filter(etype == "shot")
kobe.shot %>%
ggplot(mapping = aes(x = x, y = y, colour = result)) +
geom_point(alpha = 0.5) +
scale_colour_discrete(direction = -1) +
theme_minimal() +
labs(x = "loc_x(ft)",
y = "loc_y(ft)",
colour = "result",
title = "Scatter Plot of the centered shot location")
- 6 Marks
- Using the
kobe.shot
data set, produce a 2-dimensional density plot (with contours) of Kobe Bryant’s shot locations. You will want to use bothgeom_density_2d_filled
andgeom_density_2d
. Do not colour byresult
.- Remove the legend using
legend.position
argument in thetheme
layer.
kobe.shot %>%
ggplot(mapping = aes(x = x, y = y)) +
geom_density_2d_filled() +
geom_density_2d() +
theme(legend.position = "none") +
labs(x = "loc_x(ft)",
y = "loc_y(ft)",
title = "2-dimensional density plot of the centered shot location")
- 9 Marks
- Within the
kobe.shot
data set, create a variable calleddistance
that calculates the distance a shot was taken from hoop. You will need to use Pythagoras’ theorem, i.e., \(\text{distance} = \sqrt{x^2 + y^2}\).- Then create another variable within your
kobe.shot
data set calledindicator
that concatenates the values ofresult
withgame_type
. Hint: You can use thepaste
function. You should end up with a variable in your data set that takes on the four values: “made home”, “made away”, “missed home”, “missed away”.- Plot histograms showing the distribution of distance using
geom_histogram
. Usefacet_wrap
to create seperate panels for all values ofindicator
. (You should end up with four panels on the same figure).- Fill the histograms by
indicator
such that the interior of the bars are different colours for the four different groups.
- Remove the legend.
kobe.shot %>%
mutate(distance = sqrt(x^2 + y^2),
indicator = paste(result, game_type)) %>%
ggplot(mapping = aes(x = distance, fill = indicator)) +
geom_histogram(bins = 30) +
theme(legend.position = "none") +
facet_wrap(~indicator) +
labs(x = "Distance(ft)",
y = "Frequency",
fill = "indicator",
title = "Histograms of distance")
- 11 Marks
- Subset the original
kobe
data set (not thekobe.shots
data set) by considering plays that are only free throws (i.e., whereetype
is equal tofree throw
). Call this new data setkobe.free
.- Within the
kobe.free
data set, calculate the total number of points from free throws per game as well as the free throw percentage per game. You will want to use thegroup_by
,summarise
,sum
, andn
functions.- Plot Kobe Bryant’s free throw percentage per game using
geom_segment
to create vertical line segments from 0 to the free throw percentage. Your \(x\)-axis should bedate
and your \(y\)-axis should be free throw percentage.- Add transparency proportional to the total number of points per game. (i.e., a larger number of points should have darker line segments).
kobe.free <- kobe %>%
filter(etype == "free throw")
kobe.free.points <- kobe.free %>%
group_by(date) %>%
summarise(total_points = sum(points),
percentage = total_points / n())
kobe.free.points %>%
ggplot(mapping = aes(x = date, y = percentage)) +
geom_segment(aes(xend = date,y = 0, yend = percentage, alpha = total_points)) +
theme_minimal() +
labs(title = "Histograms of Kobe Bryant's free throw percentage per game")
- 7 Marks
- Using the
kobe
data set, find the unique dates that Kobe Bryant played in the 2008-2009 regular season. Hint: You will want to use thegroup_by
,summarise
, andn_distinct
functions. You should end up with a data set with 78 rows. Name this data setkobe.week
.- Then create a variable that tells you the day of the week the game was played. You will need an appropriate
lubridate
function.- Plot a bar chart that shows the frequency of games played on each of the seven days of the week.
- Comment on the most common and least common game days.
The most common day is Tuesday, and the least common day is Monday and Saturday.
kobe.week <- kobe %>%
group_by(date) %>%
summarise(n_distinct(date))
kobe.week %>%
mutate(week = wday(date, label = TRUE)) %>%
ggplot(mapping = aes(x = week)) +
geom_bar() +
labs(x = "day",
y = "frequency",
title = "barplot")
Total possible marks for Problem 1: 45 Marks
The data set auckland_temps.csv
contains the monthly average temperatures in Auckland from July 1994 until January 2024. [Data source: https://cliflo.niwa.co.nz]
- 5 Marks
- Read in the data using
read_csv
(don’t useread.csv
).
- Convert the
Month
variable into the correct date format. Hint: You will need to use a function from thetsibble
package.- Coerce your
tibble
to atsibble
object withMonth
as the index.
data_csv <- read_csv("auckland_temps.csv")
## Rows: 355 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): Temperature
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_tsibble <- data_csv %>%
mutate(Month = yearmonth(Month)) %>%
as_tsibble(index = Month)
- 7 Marks
- Create a time plot, seasonal plot, and subseries plot of the data.
- Comment on the seasonality in the plots. Which month has the highest average temperatures, and which month has the lowest?
- Comment on whether there is a trend in the data and if so, in what direction.
Febrary has the highest average temperatures, July has the lowest. Yes. The temperature undergoes seasonal changes, being low in winter, gradually rising in spring, reaching its peak in summer, and then declining in autumn.
data_tsibble %>%
autoplot(Temperature) +
labs(x = "time",
y = "temperature",
title = "time plot")
data_tsibble %>%
gg_season(Temperature, labels = "none") +
labs(y = "temperature",
title = "seasonal plot")
data_tsibble %>%
gg_subseries(Temperature) +
labs(x = "time",
y = "temperature",
title = "subseries plot")
- 8 Marks
- Create a lag plot for the first 12 lags. Use the
point
geometry and setalpha = 0.5
.- Write a sentence or two explaining what autocorrelation is.
- Comment on the patterns you observe in the lag plot, explaining why we see these specific autocorrelation patterns for lags 1, 6, and 12.
Autocorrelation assesses how well the values of a variable correlate with their own previous values
The temperatures in adjacent months are similar, showing a significant positive correlation. In Figure 6, with a time difference of six months, the seasons are opposite; winter and summer temperatures are opposites, situated on the outermost sides, while autumn and spring temperatures are similar and closer. In Figure 12, with a time difference of 12 months, representing a complete cycle of the four seasons, the seasons are the same, and the temperatures are almost identical, displaying a strong positive correlation.
data_tsibble %>%
gg_lag(Temperature, lags = 1:12, geom = "point", alpha = 0.5) +
labs(title = "Lag plot")
- STATS 786 only 10 Marks
- In this question, you will recreate the lag plot from (3).
- Instead of using
gg_lag
, you will use functions from thedplyr
andggplot2
packages to create your own lag plot. You may also need functions fromlubridate
,forcats
, andtidyr
.- Try to get your plot as close to the
gg_lag
plot you produced in (3). You will get full marks if your plot is exactly the same as what you get withgg_lag
. Marks will be deducted for inconsistencies.
- Note: There are many ways to solve this problem, but here are some things you may want to consider: how to create lagged variables in your data set, how to name them, how to convert your data set into a long format, and how to facet your plot.
data_tsibble_lag <- data_tsibble %>%
mutate(month_name = month(Month,label = TRUE)) %>%
mutate(Value_lag1 = lag(Temperature, 1)) %>%
mutate(Value_lag2 = lag(Temperature, 2)) %>%
mutate(Value_lag3 = lag(Temperature, 3)) %>%
mutate(Value_lag4 = lag(Temperature, 4)) %>%
mutate(Value_lag5 = lag(Temperature, 5)) %>%
mutate(Value_lag6 = lag(Temperature, 6)) %>%
mutate(Value_lag7 = lag(Temperature, 7)) %>%
mutate(Value_lag8 = lag(Temperature, 8)) %>%
mutate(Value_lag9 = lag(Temperature, 9)) %>%
mutate(Value_lag10 = lag(Temperature, 10)) %>%
mutate(Value_lag11 = lag(Temperature, 11)) %>%
mutate(Value_lag12 = lag(Temperature, 12))
data_tsibble_lag <- pivot_longer(data_tsibble_lag,
cols = Value_lag1 :Value_lag12,
names_to = "Lag",
values_to = "Lag_value",
values_drop_na = TRUE)
data_tsibble_lag$Lag <- factor(data_tsibble_lag$Lag,
levels = c("Value_lag1","Value_lag2","Value_lag3", "Value_lag4","Value_lag5","Value_lag6","Value_lag7","Value_lag8","Value_lag9","Value_lag10","Value_lag11","Value_lag12"),
labels = c("lag 1","lag 2","lag 3", "lag 4","lag 5","lag 6","lag 7","lag 8","lag 9","lag 10","lag 11","lag 12"))
data_tsibble_lag %>%
ggplot(aes(x = Lag_value, y = Temperature, color = month_name)) +
geom_point(alpha = 0.5) +
facet_wrap(~Lag) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
labs(title = "Lag plot",
x = "lag(Temperature, n)",
y = "Temperature",
color = "season") +
theme(aspect.ratio = 1)
Total possible marks for Problem 2: 20 Marks for 326 30 Marks for 786
Total possible marks for Assignment 1: 65 Marks for 326 75 Marks for 786