library(knitr)
## Warning: package 'knitr' was built under R version 4.4.3
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
This is a project guided by Google for the Google Data Analyst
Certificate. This is the bike-share project, based on a fictional
company: “Cyclist”. I will be guided and reminded of the steps of going
through a completed project.The steps follow the ASK, PREPARE,
PROCESS, ANALYZE, SHARE, ACT requirements, where the aim is to
investigate how annual and casual riders use the bikes differently. The
thinking that goes into executing the steps and analyzing the data will
be my own.
The key question to address is: How do annual and casual riders use Cyclist bikes differently?
In this instance “annual users” are those who pay for a year’s worth
of the service as subscribers. “Casual users” are those who buy
single-ride or full-day passes.
BUSINESS TASK
It is more profitable to have
annual memberships. Hence determining the differences between the two
user groups is a key task. Based on these differences recommendations
can be made to increase the number of annual (subscribed) users, by
converting casual riders into annual cyclists.
KEY
STAKEHOLDERS
In terms of the customer base, the
stakeholders involved are the casual and annual users respectively.
Within the company, Cyclist, Lily Moreno is the director of marketing
who requires the recommendations to improve the marketing strategy. She
will, in turn, report to the executive team to approve any recommended
marketing program.
Some sub-questions to drive the investigation include:
These sub-questions will enable the marketing team to design a campaign based on where the most casual users are and when they are most likely to use the Cyclist services.
The data has been made available by Motivate International Inc. The specific data sets considered are Divvy 2019 Q1 and Divvy 2020 Q2. The data are organised in a wide format, providing each ride event with the following attributes:
The Divvy 2019 Q1 data set also includes “bikeid” and
“tripduration” and “gender” attributes with “birthyear”.Usertype is
designated as “Subscriber” vs “Customer”.
Data Reliability
The data has been made public
under the licensed agreement between Lyft Bikes and Scooters and the
City of Chicago’s Divvy bicycle sharing service. The riders’ personally
identifiable information has been removed from the data.
Examining the two data sets on excel shows that for the data for the purposes of this investigation are reliable and comprehensive for the given periods of covered by the data set. It must be noted that focusing on the 2019 and 2020 Q1 periods, gives a limited view of reflecting on 2025 but will serve for the purposes of this investigation.
There are some small discrepancies such as different terms being used for ‘casual’ and ‘annual’ users as defined at the beginning of this report.
There is missing data and inconsistencies in the gender, user_type and and trip_duration of the 2019 data set but these data will not be used in this report.
Data Relevance
From the start and end times provided
the duration of casual and annual users’ trips can be determined. From
the provided start date of each trip it is possible to determine which
days of the week are more popular for the different types of users. From
the start_station_IDs it is possible to find which stations are used
more by casual and annual users respectively. Finally, from the provided
start times of rides we can determine the most popular use times for the
two different user types.
Problems For the purposes of this report the problems that can be resolved in cleaning are the columns, in both data sets, that contain a merged time and date and the user type classification as mentioned above.
Tools
Excel will be used to process and perform
the initial exploration of the data as the size of the dataset allows
for this. The analysis will be performed using Rstudio, for the
inconvenience of reporting and visualization purposes.
Cleaning actions
Irrelevant attributes for the purposes of the investigation were removed.
For both data sets the merged date and time for both the start and end of trips were split and changed into a yyyy/mm/dd and hr:min:sec format.
The 2020 dataset was filtered from oldest to newest date, in line with the 2019 dataset.
In both datasets usertype was changed to “casual” and “annual”, in line with the definition made above. In the 2020 dataset “member” was changes to annual. In the 2019 dataset “Subscriber” was changed to “annual” and “Customer” was changes to “casual”.
A discrepancy was noted in the 2019 data set where trip ID 21742463 was designated as “alos”. In the real world this would be queried. Since this was not an option the entry was changed to “annual”, as there are 341906 annual users and only 23162 casual users it is simply more probable that “alos” was an annual user.
All the attribute titles were standardised and put into the same order for the two data sets.
The two data sets are combined and an overview of the column attributes is given:
library(tidyverse)
Divvy_2019 <- read_csv("Divvy_2019_Q1.csv")
Divvy_2020 <- read_csv("Divvy_2020_Q1.csv")
combined_data <- bind_rows(Divvy_2019, Divvy_2020)
colnames(combined_data)
## [1] "start_date" "start_time" "end_date"
## [4] "end_time" "start_station_id" "start_station_name"
## [7] "end_station_id" "end_station_name" "usertype"
## [10] "ride_length" "Weekday"
In terms of overall numbers of causal and annual users, we have
count_casual <- combined_data %>%
count(usertype) %>%
filter(usertype == "casual") %>%
pull(n)
count_annual <- combined_data %>%
count(usertype) %>%
filter(usertype == "annual") %>%
pull(n)
print(paste("There are", count_casual, "casual users"), quote = FALSE)
## [1] There are 71525 casual users
print(paste("There are", count_annual, "annual users"), quote = FALSE)
## [1] There are 720314 annual users
difference <- count_annual - count_casual
print(paste("So there are", difference, "more annual users" ), quote = FALSE)
## [1] So there are 648789 more annual users
It was discovered that 234 ride_lengths were less than a second (probably an error in entry because this impossible), and recorded incorrectly. These data were removed to find the average ride length.
library(hms)
combined_data_clean <- combined_data %>%
mutate(ride_length = as.numeric(hms::as_hms(ride_length)))
The mean travel time for casual and annual users is determined. For casual users in hh:mm:ss we have:
mean_casual_sec <- combined_data_clean %>%
filter(usertype == "casual") %>%
summarise(mean_ride_casual = round(mean(ride_length, na.rm = TRUE))) %>% pull(mean_ride_casual)
mean_annual_sec <- combined_data_clean %>%
filter(usertype == "annual") %>%
summarise(mean_ride_annual = round(mean(ride_length, na.rm = TRUE))) %>% pull(mean_ride_annual)
mean_casual_hms <- hms::as_hms(mean_casual_sec)
mean_annual_hms <- hms::as_hms(mean_annual_sec)
print(mean_casual_hms)
## 00:39:12
For annual users in hh:mm:ss we have:
print(mean_annual_hms)
## 00:11:32
Considering the days of the week, the top 3 days of the week for counts of number of rides for casual users on that day is calculated:
count_days <- combined_data %>%
group_by(usertype) %>%
count(Weekday)
count_days <- count_days %>%
rename(count_of_rides = n)
top3_casual <- count_days %>%
filter(usertype=="casual") %>%
arrange(desc(count_of_rides)) %>%
slice_head(n=3)
print(top3_casual)
## # A tibble: 3 × 3
## # Groups: usertype [1]
## usertype Weekday count_of_rides
## <chr> <dbl> <int>
## 1 casual 1 18652
## 2 casual 7 13473
## 3 casual 6 8522
Thus we have Sunday with 18,652 total casual rides; Saturday with 13,473 total casual rides; and Wednesday with 8522 total rides.
As a proportion of the total casual rides these three days represent:
total_casual <- count_days %>%
filter(usertype=="casual") %>%
pull(count_of_rides) %>%
sum(na.rm=TRUE)
top3_sum<- top3_casual %>%
pull(count_of_rides) %>%
sum(na.rm = TRUE)
proportion <- round(top3_sum/total_casual, 2)
print(proportion)
## [1] 0.57
Now we consider which stations are more popular as starting point
Count_start <- combined_data %>%
group_by(start_station_name) %>%
summarise(
casual_count = sum(usertype=="casual"),
annual_count = sum(usertype == "annual"),
.groups = "drop") %>%
arrange(desc(casual_count))
Here we determine the proportion of casual riders who start their rides from the top 65 most popular stations. In total there are 607 start stations.
top65_proportion <- Count_start %>%
summarise(
sum_casualrides = sum(casual_count),
top_65 = sum(casual_count[1:65]),
proportion = round(top_65/sum_casualrides, 2)) %>%
pull(proportion)
print(paste("So the top 65 most popular starting stations represent", top65_proportion, "proportion of casual users." ), quote = FALSE)
## [1] So the top 65 most popular starting stations represent 0.51 proportion of casual users.