Documented ValleyBike Data Import Workflow

Data Location

The raw day-by-day data files can be found online on Prof. Nicholas Horton’s website, in the form of .csv.gz files (i.e. gzip-compressed .csv files).

At the time of writing, there are 500 daily data files in total, covering all active ValleyBike days from 28 June 2018 to 5 October 2020 for which there is available, non-corrupted data.
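
Since the files are gzip-compressed CSVs, a single one can be read directly with readr, which handles .gz decompression (including from URLs) automatically. A minimal sketch, where the base URL and file naming pattern are placeholders rather than the hosting site's actual layout:

library(readr)

# Placeholder URL and file name -- substitute the actual address of the
# hosting site and the real naming convention of the daily files
base_url <- "https://example.com/valleybike-data"
raw_day  <- read_csv(paste0(base_url, "/2019-05-22.csv.gz"))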

Corrupted Data

The data for six days has been corrupted at the source, and a fix is under way. Until then, note that the data for the following days is not available on the website (a sketch for skipping these days follows the list):

  • 2018-09-01
  • 2018-09-30
  • 2018-10-05
  • 2018-10-13
  • 2018-10-20
  • 2019-04-19
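
If you iterate over dates yourself (e.g. when importing day by day), it can be convenient to filter these days out up front. A minimal sketch, using only the dates listed above:

# The six corrupted days listed above, as Date objects
corrupted_days <- as.Date(c(
  "2018-09-01", "2018-09-30", "2018-10-05",
  "2018-10-13", "2018-10-20", "2019-04-19"
))

# Drop the corrupted days from a candidate date range
days <- seq(as.Date("2018-09-01"), as.Date("2018-10-31"), by = "day")
days <- days[!days %in% corrupted_days]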

Data Import (Day)

To import a day’s worth of data from the raw .csv.gz daily files on the website, the import_day utility function is provided. It takes 3 parameters:

  1. day: the day for which the data is desired, as a string of the format "YYYY-MM-DD", e.g. "2019-05-22". It can be any day between 28 June 2018 and 5 October 2020, although ValleyBike only operates from April to November. Supplying an invalid date, a date for which no data was recorded, or a date that corresponds to a corrupted file yields an empty tibble.
  2. return: the type of data to be returned (one of: "clean", "anomalous", "all"). Defaults to "clean".
  3. future_cutoff: the next-day cutoff (in hours) past which observations are categorized as “anomalous”. This cutoff is necessary since some rides last past midnight, which is not anomalous unless the ride extends too far into the following day. As such, this parameter defaults to 24.0 hours (i.e. only observations up to 24 hours after the given day are considered non-anomalous).

EXAMPLES:

import_day("2018-07-25", return = "anomalous") %>%
  head()
# A tibble: 6 x 6
  route_id            user_id       bike  time                longitude latitude
  <chr>               <chr>         <chr> <dttm>                  <dbl>    <dbl>
1 route_07_2018@2b42… 97c1f2d5-942… 1131  NA                      -72.6     42.3
2 route_07_2018@8302… d4ff327d-749… 1322  NA                      -72.7     42.3
3 route_07_2018@8f2d… f02baa80-e88… 1112  NA                      -72.6     42.3
4 route_07_2018@eaa6… 20785e9f-15a… 1064  NA                      -72.6     42.2
5 route_07_2018@eaa6… 20785e9f-15a… 1064  2018-07-28 20:19:49     -72.7     42.3
6 route_07_2018@eaa6… 20785e9f-15a… 1064  2018-07-28 20:19:54     -72.7     42.3
import_day("2020-10-05", return = "clean") %>%
  head()
# A tibble: 6 x 6
# Rowwise: 
  route_id            user_id       bike  time                longitude latitude
  <chr>               <chr>         <chr> <dttm>                  <dbl>    <dbl>
1 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:06     -72.6     42.2
2 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:11     -72.6     42.2
3 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:16     -72.6     42.2
4 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:21     -72.6     42.2
5 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:26     -72.6     42.2
6 route_10_2020@5e46… bdb6e51c-f2e… 1399  2020-10-05 10:03:31     -72.6     42.2
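
The examples above rely on the default future_cutoff. As an additional illustration (output not shown), the following call widens the cutoff to 48 hours, so rides extending up to two days past the given date are still treated as non-anomalous:

# Keep observations up to 48 hours past the given day as non-anomalous
import_day("2018-07-25", return = "all", future_cutoff = 48.0)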

Data Import (Month)

To import a month’s worth of data from the raw .csv.gz daily files on the website, the import_month utility function is provided. It calls import_day repeatedly behind the scenes to collate the data from all days of the desired month.

The function takes one required parameter, month, as a string of the format "YYYY-MM", e.g. "2019-05". It also takes any optional parameters to forward to import_day, i.e. return and future_cutoff.

NOTE: The parameter future_cutoff was set to 24 when generating all of the pre-built by-month data files. If data beyond that cutoff is needed, the month must be re-imported from scratch using a higher future_cutoff value (see the sketch after the example below).

EXAMPLE:

# DEFAULTS: return = "clean", future_cutoff = 24
april2019 <- import_month("2019-04")

# incorporate the monthly data file in the package
usethis::use_data(april2019, overwrite = TRUE)
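
As noted above, re-importing is the only way to recover data past the 24-hour cutoff baked into the pre-built monthly files. A minimal sketch (output not shown):

# Re-import April 2019 from the raw files, keeping observations up to
# 48 hours past each day instead of the default 24
april2019_wide <- import_month("2019-04", return = "all", future_cutoff = 48)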

Data Import (Full)

To access all of the raw .csv.gz daily files from the website as a single unified data frame (60+ million observations over 2018-2020), the get_full_data utility function is provided. Note that it takes quite a long time to run and yields a rather large object.

EXAMPLE:

full_data <- get_full_data()
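
Because get_full_data is slow and its result is large, it can be worth caching the object to disk after the first run. A minimal sketch using base R's saveRDS/readRDS and a local path of your choosing:

cache_path <- "full_data.rds"  # any local path works

if (file.exists(cache_path)) {
  # Reuse the cached copy instead of re-fetching everything
  full_data <- readRDS(cache_path)
} else {
  full_data <- get_full_data()
  saveRDS(full_data, cache_path)
}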

Data Aggregation (Trips)

To aggregate the full data into a one-row-per-trip dataset, the aggregate_trips utility function is provided. It computes a variety of additional metrics, such as the trip duration, the most likely start and end stations, the start and end times, etc. Please see the package manual, the function code, or the built-in documentation for more information on the variables.

EXAMPLE:

trips <- aggregate_trips(full_data)

# incorporate the newly-aggregated trips data in the package
usethis::use_data(trips, overwrite = TRUE)
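
To get a quick feel for the aggregated trips, a dplyr summary can help. The column names below (start_station, duration) are illustrative guesses rather than the package's actual variable names, so consult the documentation before running:

library(dplyr)

# Hypothetical column names -- check the package manual for the real ones
trips %>%
  group_by(start_station) %>%
  summarize(n_trips = n(),
            mean_duration = mean(duration, na.rm = TRUE)) %>%
  arrange(desc(n_trips))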

Data Aggregation (Users)

To aggregate the trips data into a one-row-per-user dataset, the aggregate_users utility function is provided. Like the trips aggregation, the users aggregation computes a variety of additional metrics, such as the number of trips per user, the top start and end stations, the average duration per trip, etc. Please see the package manual, the function code, or the built-in documentation for more information on the variables.

EXAMPLE:

users <- aggregate_users(trips)

# incorporate the newly-aggregated users data in the package
usethis::use_data(users, overwrite = TRUE)
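
As with the trips data, a quick sort can surface the heaviest users. The column name n_trips is again a hypothetical stand-in for whatever the package actually calls the per-user trip count:

library(dplyr)

# Hypothetical column name -- see the package manual for the real one
users %>%
  arrange(desc(n_trips)) %>%
  head(10)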

Data Download

If for some reason you want to download all of the raw files from the website onto your local machine, you can use the download_files utility function. It takes one required parameter, path, which specifies where the files will be downloaded, and one optional parameter, overwrite, which specifies whether to overwrite already-existing files at the given path (defaults to FALSE). You would only need to set overwrite = TRUE if the old files have changed and you want to replace them, which is unlikely to happen.

EXAMPLE:

download_files(path = "~/Desktop/raw-files")

Once the download_files function is done running, all daily trajectory data files will be available for use in the specified directory.
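
A quick sanity check at that point is to count the files in the target directory; the count should match the number of available daily files (500 at the time of writing):

# Count the downloaded daily files in the target directory
length(list.files("~/Desktop/raw-files", pattern = "\\.csv\\.gz$"))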