The raw day-by-day data files can be found online at Prof. Nicholas Horton’s website, in the form of .csv.gz files (i.e. compressed .csv format).
At the time of writing, there are 500 daily data files in total, covering all active ValleyBike days from 28 June 2018 to 5 October 2020 for which there is available, non-corrupted data.
The data for 6 days was corrupted at the source, and the process of fixing those files is under way. Until then, note that the data for the following days is not available on the website:
To import a day’s worth of data from the raw .csv.gz daily files on the website, the `import_day` utility function is provided. It takes 3 parameters:

- `day`: the day for which the data is desired, as a string of the format `"YYYY-MM-DD"`, e.g. `"2019-05-22"`. It can be any day between 28 June 2018 and 5 October 2020, although ValleyBike only operates from April to November. Supplying an invalid date, a date for which no data has been recorded, or a date that corresponds to a corrupted file will all yield an empty tibble.
- `return`: the type of data to be returned (one of: `"clean"`, `"anomalous"`, `"all"`). Defaults to `"clean"`.
- `future_cutoff`: the next-day cutoff (in hours) past which observations are categorized as “anomalous”. This cutoff is necessary since some rides last past midnight, which is not an anomaly unless the ride extends too far into the following day. This parameter defaults to 24.0 hours (i.e. only observations up to 24 hours after the given day are considered non-anomalous).

EXAMPLES:
import_day("2018-07-25", return = "anomalous") %>%
  head()
# A tibble: 6 x 6
route_id user_id bike time longitude latitude
<chr> <chr> <chr> <dttm> <dbl> <dbl>
1 route_07_2018@2b42… 97c1f2d5-942… 1131 NA -72.6 42.3
2 route_07_2018@8302… d4ff327d-749… 1322 NA -72.7 42.3
3 route_07_2018@8f2d… f02baa80-e88… 1112 NA -72.6 42.3
4 route_07_2018@eaa6… 20785e9f-15a… 1064 NA -72.6 42.2
5 route_07_2018@eaa6… 20785e9f-15a… 1064 2018-07-28 20:19:49 -72.7 42.3
6 route_07_2018@eaa6… 20785e9f-15a… 1064 2018-07-28 20:19:54 -72.7 42.3
import_day("2020-10-05", return = "clean") %>%
  head()
# A tibble: 6 x 6
# Rowwise:
route_id user_id bike time longitude latitude
<chr> <chr> <chr> <dttm> <dbl> <dbl>
1 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:06 -72.6 42.2
2 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:11 -72.6 42.2
3 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:16 -72.6 42.2
4 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:21 -72.6 42.2
5 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:26 -72.6 42.2
6 route_10_2020@5e46… bdb6e51c-f2e… 1399 2020-10-05 10:03:31 -72.6 42.2
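For instance, to keep observations that extend further past midnight, the `future_cutoff` parameter can be raised. This is an illustrative sketch, and the exact rows returned depend on the source data:

```r
library(dplyr)  # provides %>%

# Keep observations up to 48 hours past 2018-07-25 instead of the default 24,
# so that longer overnight rides are no longer categorized as anomalous.
import_day("2018-07-25", return = "all", future_cutoff = 48.0) %>%
  head()
```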
To import a month’s worth of data from the raw .csv.gz daily files on the website, the `import_month` utility function is provided. It calls `import_day` repeatedly behind the scenes to collate the data from all days of the desired month.
The function takes one required parameter, `month`, as a string of the format `"YYYY-MM"`, e.g. `"2019-05"`. It also accepts any optional parameters to forward to `import_day`, i.e. `return` and `future_cutoff`.
NOTE: The parameter `future_cutoff` is set to 24 for all by-month data files. If data beyond that cutoff is needed, one must re-import the data from scratch using a higher `future_cutoff` value.
EXAMPLE:
# DEFAULTS: return = "clean", future_cutoff = 24
april2019 <- import_month("2019-04")
# incorporate the monthly data file in the package
usethis::use_data(april2019, overwrite = TRUE)
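Should data beyond the 24-hour cutoff be needed for a given month, it can be re-imported from scratch with a higher cutoff, as sketched below (the object name is illustrative):

```r
# Re-import April 2019, keeping observations up to 48 hours past each day.
# return and future_cutoff are forwarded to import_day behind the scenes.
april2019_extended <- import_month("2019-04", return = "all", future_cutoff = 48.0)
```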
To access all of the raw .csv.gz daily files from the website as a single unified data frame (60+ million observations over 2018-2020), the `get_full_data` utility function is provided. Note that it takes quite a long time to run and yields a rather large object.
EXAMPLE:
full_data <- get_full_data()
To aggregate the full data into a one-row-per-trip dataset, the `aggregate_trips` utility function is provided. It computes a variety of additional metrics, such as the trip duration, the most likely start and end stations, the start and end times, etc. Please see the package manual, the function code, or the built-in documentation for more information on the variables.
EXAMPLE:
trips <- aggregate_trips(full_data)
# incorporate the newly-aggregated trips data in the package
usethis::use_data(trips, overwrite = TRUE)
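As a sketch of what the one-row-per-trip shape enables, the snippet below ranks stations by trip count. The column names `start_station` and `duration` are assumptions based on the metrics described above, so check `names(trips)` for the actual variables:

```r
library(dplyr)

# NOTE: start_station and duration are assumed column names --
# verify with names(trips) before running.
trips %>%
  group_by(start_station) %>%
  summarize(n_trips = n(),
            mean_duration = mean(duration, na.rm = TRUE)) %>%
  arrange(desc(n_trips))
```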
To aggregate the trips data into a one-row-per-user dataset, the `aggregate_users` utility function is provided. Like the trips aggregation, the users aggregation computes a variety of additional metrics, such as the number of trips per user, the top start and end stations, the average duration per trip, etc. Please see the package manual, the function code, or the built-in documentation for more information on the variables.
EXAMPLE:
users <- aggregate_users(trips)
# incorporate the newly-aggregated users data in the package
usethis::use_data(users, overwrite = TRUE)
If for some reason you want to download all of the raw files from the website onto your local machine, you can use the `download_files` utility function. It takes one required parameter, `path`, which specifies the directory into which the files will be downloaded, and an optional parameter, `overwrite`, which specifies whether to overwrite already-existing files at the given path (defaults to `FALSE`). You would only need to set `overwrite = TRUE` if the old files have changed and you want to replace them, which is unlikely.
EXAMPLE:
download_files(path = "~/Desktop/raw-files")
Once the `download_files` function finishes running, all daily trajectory data files will be available for use in the specified directory.
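Once downloaded, an individual day can be read directly with readr, which decompresses .csv.gz files transparently. The file name below is an assumption that the daily files follow a `YYYY-MM-DD.csv.gz` naming scheme; adjust the path if the downloaded files are named differently:

```r
library(readr)

# read_csv() handles gzip-compressed files transparently.
# The 2019-05-22.csv.gz file name is an assumed naming scheme.
one_day <- read_csv("~/Desktop/raw-files/2019-05-22.csv.gz")
```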