# 1. 
time bzcat /course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_2023-01.csv.bz2 | wc -l
  18479032
  real  1m23.131s
  user  1m21.865s
  sys   0m11.371s
ls -l /course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_2020-*.csv.bz2 | wc -l
There are 12 monthly files for 2020, so the estimated time to decompress all of the 2020 data sequentially is 12 * 1m23.131s = 16m37.572s.
ls -l /course/data/nyctaxi/csv/fhvhv/*.csv.bz2 | wc -l
There are 48 .csv.bz2 files in total, so the estimated time to decompress all of the CSV data sequentially is 48 * 1m23.131s = 66m30.288s.
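As a quick sanity check on these estimates, a minimal R sketch that scales the single-file timing (it assumes the 2023-01 file is representative of every file, which is only roughly true since the file sizes vary):
per_file <- 60 + 23.131   # seconds measured to decompress and count one file
12 * per_file             # 2020 only: 997.572 s, about 16m38s
48 * per_file             # all files: 3990.288 s, about 66m30s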
# 2.
time ls /course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_2020-*.csv.bz2 | parallel -j20 'bzcat {} | wc -l' > sizes-2020.txt
  real  1m40.582s
  user  11m28.248s
  sys   1m14.739s

The actual run time with parallel was 1m40.582s, compared with the estimated sequential time of 16m37.572s, an improvement of approximately 9.92 times. Running the jobs in parallel significantly reduced the elapsed time.
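For reference, the quoted speed-up is just the ratio of the estimated sequential time to the measured parallel wall-clock time (plain R arithmetic on the timings above):
(12 * (60 + 23.131)) / (60 + 40.582)   # ~9.92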
# 3. 
# Record counts produced by the parallel run in step 2
# (this assumes sizes-2020.txt is in the same order as Sys.glob();
#  using `parallel -k` when generating the file guarantees input order)
sizes_2020 <- as.integer(readLines("sizes-2020.txt"))
# get file paths and sizes (in bytes)
file_paths <- Sys.glob("/course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_2020-*.csv.bz2")
file_sizes <- unname(sapply(file_paths, function(file) file.info(file)$size))
# combine path, size, and record count into one data frame
sizes_df <- data.frame(path = file_paths, size = file_sizes, records = sizes_2020)
# build model
model <- lm(records ~ size, data = sizes_df)
# Predictions
predictions <- predict(model, newdata = sizes_df)
# Compute RMSE
rmse <- sqrt(mean((sizes_df$records - predictions)^2))
print(paste("RMSE:", rmse))
## [1] "RMSE: 4544565.71589259"
### The model is not a good fit: the RMSE is about 4.5 million records, which is very large relative to the monthly record counts themselves.
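To put the RMSE in context, a quick sketch using only the objects defined above:
mean(sizes_df$records)          # average monthly record count
sd(sizes_df$records)            # spread of the monthly record counts
rmse / mean(sizes_df$records)   # RMSE as a fraction of the mean
summary(model)$r.squared        # variance explained by file size alone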

# Plot
plot(sizes_df$size, sizes_df$records, 
     xlab = "File Size (bytes)", ylab = "Number of Records", 
     main = "File Size vs. Number of Records", pch = 20, col = "blue")
# add fitted line
abline(model, col = "red", lwd = 2)

### Actual record count for 2023-01 (from step 1): 18479032
size_202301 <- file.info("/course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_2023-01.csv.bz2")$size
# Predict the 2023-01 record count from its file size
pre_records_202301 <- predict(model, newdata = data.frame(size = size_202301))
print(paste0("The estimated number of records for the 2023-01 file is ", pre_records_202301))
## [1] "The estimated number of records for the 2023-01 file is 8657357.91851824"
diff <- 18479032 - pre_records_202301
print(paste0("The actual number of records is 18479032. The difference is ", diff))
## [1] "The actual number of records is 18479032. The difference is 9821674.08148176"
# 4. 
months <- seq(as.Date("2020-01-01"), by = "month", length.out = length(file_sizes))
df2020 <- data.frame(month = months, size = file_sizes)
pre_records_2020 <- predict(model, newdata = df2020)
df2020$predictions <- pre_records_2020
### plot
plot(df2020$month, df2020$predictions, type = "o",
     xlab = "Month (2020)", ylab = "Predicted number of trips per month",
     main = "Predicted trips per month in 2020", pch = 20, col = "blue")

### According to the predictions, the number of trips was low at the beginning of 2020 and trending downward; after March it began to rise, peaked in April, and then declined significantly.
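The peak and trough months can also be read off programmatically, a small sketch using df2020 from above (per the description above, the maximum should fall in April):
df2020$month[which.max(df2020$predictions)]   # month with the largest predicted count
df2020$month[which.min(df2020$predictions)]   # month with the smallest predicted count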
# 5.
for fn in /course/data/nyctaxi/csv/fhvhv/fhvhv_tripdata_202[01]-03.csv.bz2; do time bzcat $fn | awk -F, '{print $1}' | sort -S200m | uniq -c | grep HV; done
9836781 HV0003
 336606 HV0004
3219541 HV0005

real    1m4.360s
user    1m17.882s
sys 0m4.624s

10173376 HV0003
 107314 HV0004
3946703 HV0005

real    1m9.314s
user    1m22.525s
sys 0m4.995s

Between March 2020 and March 2021, HV0003 increased by 336,595 trips, HV0004 decreased by 229,292, and HV0005 increased by 727,162.
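The same year-over-year differences, computed in R directly from the counts printed above:
mar2020 <- c(HV0003 = 9836781, HV0004 = 336606, HV0005 = 3219541)
mar2021 <- c(HV0003 = 10173376, HV0004 = 107314, HV0005 = 3946703)
mar2021 - mar2020   # HV0003 +336595, HV0004 -229292, HV0005 +727162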
# 6. Write a summary of your findings.
By using parallel processing, the time required to count the lines in the compressed CSV files was reduced substantially, with a speed-up of approximately 9.92 times over the sequential estimate.
We built a linear model to predict the number of records from file size. The model's accuracy was poor, with a very large RMSE, indicating that compressed file size alone cannot reliably predict the number of records.
The predicted number of trips in 2020 fluctuated significantly: it started low at the beginning of the year and decreased further, then rose after March, peaked in April, and declined noticeably afterwards.
Comparing March 2020 with March 2021 by licensee, HV0003 saw a slight increase in trips, HV0004 a significant decrease, and HV0005 substantial growth.