Author Archives: Kali McLennan

Final Project – Citibike Activity in 2024

For my final project in the Spring 2025 Data Visualization and Design class at the CUNY Graduate Center, I chose to work with the Citibike ridership dataset from 2024. I began this project with the desire to create a tool where anyone can explore this massive dataset and answer their own questions. This post combines some graphics from that tool with some more run-of-the-mill visualizations in an effort to make sense of this mountain of data.

The Data

Citibike distributes ridership data via an Amazon S3 bucket. For 2024 the data is packaged as monthly ZIP files, each containing a seemingly random number of CSV files with the actual ride data. There is not much of a data dictionary for this dataset, but the columns and data types are easy to infer. Each row in this dataset contains information about a unique ride, including:

  • Ride start and end date and time
  • The identification numbers and names for the start and end stations
  • GPS coordinates for the start and end stations
  • Bike type (classic or electric)
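As a rough illustration of the packaging described above, the sketch below reads every CSV inside one monthly ZIP into a single pandas DataFrame. The column names (`ride_id`, `started_at`, `ended_at`) and the in-memory demo archive are assumptions made for the example; they are not taken from the project code.

```python
import io
import zipfile

import pandas as pd


def read_month_zip(zip_source):
    """Concatenate every CSV found inside one monthly ZIP archive."""
    frames = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            # Ignore folders and anything that is not a CSV file.
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as handle:
                # Assumed timestamp columns; adjust to the real header.
                frames.append(
                    pd.read_csv(handle, parse_dates=["started_at", "ended_at"])
                )
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory archive standing in for a real monthly download.
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr(
        "rides_1.csv",
        "ride_id,started_at,ended_at\nA,2024-01-01 08:00:00,2024-01-01 08:10:00\n",
    )
    archive.writestr(
        "rides_2.csv",
        "ride_id,started_at,ended_at\nB,2024-01-02 09:00:00,2024-01-02 09:30:00\n",
    )
demo_zip.seek(0)

rides = read_month_zip(demo_zip)
```

Because the function only looks at file extensions, it does not matter how many CSV files a given month happens to contain.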

Data Processing

To handle the amount of data I used Python 3.12 along with several data-focused libraries such as pandas, GeoPandas, and the standard-library multiprocessing module, among others. The code for this project is available on GitHub. Data processing took place over several steps, beginning with extracting all stations along with their GPS locations and the first month in which a ride was observed in 2024 (around 80 stations were installed over the year). Ride data was then extracted and cleaned to minimize data redundancy. In the end the processed data (almost 30GB!) was written in JSON format so that a web front-end could be constructed to allow exploration of the data.
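The station-extraction step could be sketched along these lines. This is not the code from the GitHub repository: the column names and sample rows are hypothetical, and only ride starts are considered for brevity.

```python
import json

import pandas as pd

# Hypothetical sample rides; column names are assumptions for this sketch.
rides = pd.DataFrame({
    "start_station_id": ["S1", "S2", "S1", "S3"],
    "start_station_name": ["Station One", "Station Two", "Station One", "Station Three"],
    "start_lat": [40.73, 40.76, 40.73, 40.68],
    "start_lng": [-73.98, -73.98, -73.98, -73.97],
    "started_at": pd.to_datetime(
        ["2024-01-03 08:00", "2024-01-04 09:00", "2024-02-01 18:00", "2024-05-12 12:00"]
    ),
})


def build_station_table(rides):
    """One row per station with its location and first month seen in the data."""
    ordered = rides.sort_values("started_at")
    ordered = ordered.assign(month=ordered["started_at"].dt.month)
    return ordered.groupby("start_station_id", as_index=False).agg(
        name=("start_station_name", "first"),
        lat=("start_lat", "first"),
        lng=("start_lng", "first"),
        first_month=("month", "min"),
    )


stations = build_station_table(rides)

# Serialize as a list of JSON records that a web front-end could load.
stations_json = stations.to_json(orient="records")
```

A station whose `first_month` is later than 1 did not record a ride in January, which is one way to spot stations installed during the year.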

Data Exploration

I will be honest, the visualization of data via bars, pies, and charts is really of secondary interest to me in this project. The vast majority of my work with this data is on the exploration side, via a front-end that is available here. When loaded, all stations will have markers placed on the map. These station markers can be selected to view rides that were recorded as either starting or ending at that station. Various statistics are presented to provide more context, including an hourly histogram of activity, the net bike flux for the active month, and breakdown of inbound/outbound rides. The entire year of data is available in this tool and the Options window provides controls to select which month and what kinds of rides are displayed.
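The per-station statistics mentioned above can be approximated with a few pandas operations. The sketch below assumes hypothetical column names and defines net flux as inbound rides minus outbound rides, which may differ from the definition used in the actual tool.

```python
import pandas as pd

# Hypothetical rides between three stations.
rides = pd.DataFrame({
    "start_station_id": ["A", "A", "B", "C"],
    "end_station_id": ["B", "C", "A", "B"],
    "started_at": pd.to_datetime(
        ["2024-06-01 08:05", "2024-06-01 08:40", "2024-06-01 17:15", "2024-06-01 17:30"]
    ),
    "ended_at": pd.to_datetime(
        ["2024-06-01 08:20", "2024-06-01 09:00", "2024-06-01 17:35", "2024-06-01 17:50"]
    ),
})


def station_summary(rides, station_id):
    """Inbound/outbound counts, net flux, and an hourly activity histogram."""
    outbound = rides[rides["start_station_id"] == station_id]
    inbound = rides[rides["end_station_id"] == station_id]
    # Activity hour: departure hour for outbound rides, arrival hour for inbound.
    hours = pd.concat([outbound["started_at"].dt.hour, inbound["ended_at"].dt.hour])
    hourly = hours.value_counts().reindex(range(24), fill_value=0)
    return {
        "outbound": len(outbound),
        "inbound": len(inbound),
        "net_flux": len(inbound) - len(outbound),  # assumed sign convention
        "hourly": hourly,
    }


summary = station_summary(rides, "A")
```

A negative `net_flux` under this convention means more bikes left the station than arrived during the selected period.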

Visualizations

I wanted to investigate a series of questions that I felt could be addressed via this data. Tableau has limitations in both row count and dataset size that required some very creative aggregations to make 45 million rides available to visualize with the software, so I do apologize for any loss of detail involved.


“Who is riding Citibike, and when do they ride?”

In 2024 there were around 45 million Citibike rides. These bikes are active 24 hours a day, criss-crossing the city in an endless tapestry of wheels on pavement. Below is a heatmap of ride activity that can be filtered to highlight the different temporal patterns of casual riders compared to those of Citibike members. Casual riders are most active on the weekend and somewhat during the dinner hours throughout the week. Members, on the other hand, are more likely to be hitting the streets during commute hours for workers (7-9 AM and 5-7 PM).
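One way to shrink millions of rides into something a heatmap can use is to count rides per rider type, weekday, and hour. The sketch below is only an illustration of that kind of aggregation; the `member_casual` and `started_at` column names are assumptions, and the exact aggregation used for the Tableau workbook is not described in this post.

```python
import pandas as pd

# Hypothetical sample: two member commutes, one evening ride, two weekend rides.
rides = pd.DataFrame({
    "member_casual": ["member", "member", "casual", "member", "casual"],
    "started_at": pd.to_datetime([
        "2024-03-04 08:10",  # Monday
        "2024-03-04 08:45",  # Monday
        "2024-03-09 14:00",  # Saturday
        "2024-03-05 18:20",  # Tuesday
        "2024-03-09 14:30",  # Saturday
    ]),
})


def heatmap_counts(rides):
    """Count rides for every (rider type, weekday, hour) combination observed."""
    keyed = rides.assign(
        weekday=rides["started_at"].dt.day_name(),
        hour=rides["started_at"].dt.hour,
    )
    return (
        keyed.groupby(["member_casual", "weekday", "hour"])
        .size()
        .reset_index(name="rides")
    )


counts = heatmap_counts(rides)
```

However many rides go in, the output has at most 2 x 7 x 24 = 336 rows per month, which is far easier for a visualization tool to handle.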


“Who rides electric bikes and who rides classic bikes?”

This visual highlights the type of bikes that are chosen by Casual Riders as well as Citibike Members. Casual riders are most likely to be tourists, while members are likely to be commuters, exercise bikers, or micro-mobility focused individuals. While both groups ride mostly electric bikes, a much higher percentage of Citibike Members choose classic bikes instead.


“What influences the choice between Electric and Classic bike?”

I wanted to investigate some hunches about factors that may influence a rider's choice between riding a classic bike or an electric one. Citibike is present in all New York City boroughs except Staten Island, though only Manhattan has Citibike stations available throughout the entire borough. Between that and the population density of Manhattan, it is pretty obvious that Manhattan will have quite a bit more Citibike activity than the other boroughs. More importantly, it has the greatest density of stations, so navigating the borough is very convenient on bike compared to the outer boroughs.


“How far is the rider going?”

Thinking over the trends mentioned above, I set about investigating the relationship between ride distance and choice between electric and classic bike. I find this visual to be the most interesting, but also the most frustrating. It takes quite a while to load, and I cannot recommend playing with the bike type filter. It’s your life though, so feel welcome to click things.

For rides longer than about 3km there is a clear preference for electric bikes. This is highlighted in the next chart, which shows the bike types chosen for trips that go between boroughs vs trips that stay within one borough.
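The post does not say how ride distance was computed. One plausible approach, sketched here as an assumption, is the straight-line (haversine) distance between the start and end station coordinates. Column names are hypothetical, and the real route ridden will always be at least this long.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical rides: one returns to its start station, one crosses town.
rides = pd.DataFrame({
    "rideable_type": ["classic_bike", "electric_bike"],
    "start_lat": [40.75, 40.70],
    "start_lng": [-73.99, -74.00],
    "end_lat": [40.75, 40.75],
    "end_lng": [-73.99, -73.95],
})

rides["distance_km"] = haversine_km(
    rides["start_lat"], rides["start_lng"], rides["end_lat"], rides["end_lng"]
)
# Flag rides beyond the roughly 3 km threshold discussed above.
rides["long_ride"] = rides["distance_km"] > 3
```

Round trips that end where they started come out as 0 km under this measure, so they may need to be excluded or handled separately.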

Finally, connecting these two, we see below that the average length of a ride is significantly longer for trips between boroughs than for trips that are confined within one borough.

Conclusions

This project has presented some significant challenges as well as some very interesting insight into the behaviors of Citibike riders across the city! As one would expect, your average Citibiker appears to be a New York City resident who is using bikes to commute to and from work during the Monday to Friday grind. They may have some additional rides each week outside of that activity, maybe to meet friends for dinner or to go to an activity or store. The main research question that I was drawn to was “What factors influence a rider's choice between an electric or classic bike?”, and the answer seems to be quite complex. The factor with the most obvious impact is simply ride length, but this involves confounding factors like going between boroughs. The bridges connecting boroughs are infamously steep, and I can say from personal experience that I am usually willing to part with a few bucks to not arrive at work drenched in sweat from crossing the Queensboro Bridge!

Please visit the data exploration page and have a look at activity around the system for yourself! Though it is a small part of this post, it represents the majority of my work on this project.

Attribution: https://www.goodfon.com/city/wallpaper-badfon-new-york-city-manhattan-1814.html

Project 2: Walking Through Time and Space

Background

We all know that we should be getting more steps in our daily lives. Guidance from the medical field suggests that people aim for 10,000 steps per day. There have been many, many studies that support this recommendation. Walking is a low-impact form of exercise that has many dose-dependent health benefits. In particular, increased walking has been repeatedly linked to lower risks of adverse cardiovascular events, lower cholesterol, and overall higher satisfaction in life[1].

For many years, my partner (Sarah) and I lived in the suburban expanse of Norman, Oklahoma, a college town just south of the vast sprawl of Oklahoma City. On April 10, 2022 we moved to New York City. New Yorkers are well known to be walkers, and a quick Google search indicates that New Yorkers walk up to 3x as much as the average American.

For this project I wanted to analyze and visualize my own walking habits spanning a number of years, including time when we lived in Oklahoma and following our move to New York City to compare my daily walking habits over time and location.

Data Sources

Let’s be honest… we are a phone-addicted culture! For this project this is quite a good thing, as I have carried an iPhone with me essentially everywhere I have gone for the better part of a decade. Following the Apple instructions, I was able to export the entire history of data from the Health app. Using the Apple Health Parser Python utility from alxdrcirilo on GitHub, I was able to quickly parse this enormous amount of data into a usable format (comma-separated values).

Later in the project design I became interested in exploring patterns in my walking as they relate to temperature/season. I utilized the NOAA Climate Data Online platform to retrieve daily records of average temperature, maximum temperature, and precipitation amounts covering all dates from 1/1/2021 to 3/23/2025. For the dates from 1/1/2021 to 4/9/2022, I used the USW00013967 (Oklahoma City Will Rogers Airport) station, and for days from 4/10/2022 onward I used the USW00014732 (LaGuardia Airport) station. While temperature can vary by a small amount over the geographical area of a city, I believe that these two stations provide “accurate enough” data for this project.

Data Cleanup

Early on in this project I learned that the Apple HealthKit data structure stores steps in variable duration “walks”. Essentially, if the phone remains still for more than a few seconds then the next time the phone starts moving HealthKit starts a new “walk”. This results in highly variable time windows that may be as short as a few seconds or as long as half an hour. A few rows of raw data are shown below.

To deal with this I rolled up all the records for each day to arrive at a total number of steps for each calendar day. In doing so, I did lose much of the sub-day detail in the data. I hope to revisit this project in the future to add visualizations of how my walking trends change on the hourly level, but for now I have chosen to simply focus on daily trends.
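The daily roll-up can be expressed in a couple of lines of pandas. The sketch below assumes the parsed records have a start timestamp and a step count in columns named `start` and `value`; both names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical variable-length "walks" as they might appear after parsing.
records = pd.DataFrame({
    "start": pd.to_datetime([
        "2022-04-10 08:15:03",
        "2022-04-10 08:16:40",
        "2022-04-10 19:02:11",
        "2022-04-11 07:45:00",
    ]),
    "value": [120, 45, 900, 300],
})

# Truncate each timestamp to midnight and sum the steps per calendar day.
daily_steps = (
    records.groupby(records["start"].dt.normalize())["value"]
    .sum()
    .rename_axis("date")
    .rename("steps")
)
```

A walk that spans midnight is attributed entirely to the day on which it started, which seems acceptable given how short the individual records are.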

Data from NOAA was in a very simple structure (CSV) with each row having fields for the date, average temperature, maximum temperature, and precipitation in inches. This data covered each day of the desired time range and had no missing data for any day. Thus, it did not require any effort to clean up. Data from NOAA was merged with the daily step totals in the Tableau data source window using the date as the key between the two sources. A few rows of the final data source structure are below.
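The merge itself was done in Tableau, but an equivalent join in pandas might look like the sketch below. It keeps the Oklahoma City station before the move date and the LaGuardia station from the move date onward, then joins on the date. The `STATION`, `DATE`, and `TAVG` column names and all sample values are assumptions for illustration.

```python
import pandas as pd

MOVE_DATE = pd.Timestamp("2022-04-10")
OKC_STATION = "USW00013967"  # Oklahoma City Will Rogers Airport
LGA_STATION = "USW00014732"  # LaGuardia Airport

# Hypothetical weather rows from both stations around the move.
weather = pd.DataFrame({
    "STATION": [OKC_STATION, LGA_STATION, OKC_STATION, LGA_STATION],
    "DATE": pd.to_datetime(["2022-04-09", "2022-04-09", "2022-04-10", "2022-04-10"]),
    "TAVG": [68, 52, 70, 50],
})
steps = pd.DataFrame({
    "date": pd.to_datetime(["2022-04-09", "2022-04-10"]),
    "steps": [3100, 12400],
})

# Keep only the station that matches where the walker lived on each date.
in_new_york = weather["DATE"] >= MOVE_DATE
local_weather = weather[
    (in_new_york & (weather["STATION"] == LGA_STATION))
    | (~in_new_york & (weather["STATION"] == OKC_STATION))
]

# validate= raises an error if either side has duplicate dates.
merged = steps.merge(
    local_weather, left_on="date", right_on="DATE", how="left", validate="one_to_one"
)
```

Using a left join keeps every day that has a step total, so a missing weather record would show up as an empty temperature rather than a silently dropped day.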

General Trends

As an overview, the following chart presents the average daily number of steps for each year from 2021 through 2025. From this data alone, it is very clear that moving from a suburban life with very limited non-car transit options to a major city with multiple types of transit had a dramatic impact on the number of steps I took.

The following calendar represents daily steps from 1/1/2021 through 3/23/2025. Days are color coded into one of 6 buckets based on step count, with each bucket spanning 5,000 steps. Thus grey colored days represent between 0 and 5,000 steps, and the darkest green represents step counts >25,000. The selected year can be changed with the < and > buttons at the top right of the chart.
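For readers who want to reproduce the color buckets outside Tableau, a sketch with `pandas.cut` follows. How the calendar treats days that land exactly on a boundary (for example exactly 5,000 steps) is not stated, so the choice below to put them in the lower bucket is an assumption.

```python
import numpy as np
import pandas as pd

# Six buckets: 0-5,000, then 5,000-step increments, then everything above 25,000.
BIN_EDGES = [0, 5000, 10000, 15000, 20000, 25000, np.inf]


def step_bucket(daily_steps):
    """Return a bucket number from 0 (grey) to 5 (darkest green) for each day."""
    return pd.cut(daily_steps, bins=BIN_EDGES, labels=False, include_lowest=True)


daily_steps = pd.Series([0, 4200, 5000, 5001, 12500, 26000])
buckets = step_bucket(daily_steps)

# Simple yes/no flag for the 10,000-step goal.
hit_goal = daily_steps >= 10000
```

The same `hit_goal` flag, summed per year, gives the count of days with at least 10,000 steps.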

Even with this very high-level view of the data, it is very easy to identify the day that my partner and I moved to New York City (April 10, 2022). Prior to this date the overwhelming majority of days are grey colored or one of the two lightest shades of green (<10,000 steps).

Charting Days With at Least 10,000 Steps

While there is a dose-dependent benefit from walking (one article says there is an 8-11% reduction in premature death for every 2,000 steps per day[2]), the overwhelming guidance from health organizations is to aim for 10,000 steps per day. In a modification of the previous calendar, I have color coded each day based simply on whether or not I achieved 10,000 steps. This presentation makes the change beginning in April 2022 even more apparent!

Visualized a different way, the count of days per year with at least 10,000 steps trended upward over the entire data set. This means that with each passing year I am walking more, and hopefully gaining more benefits from all this walking.

Does Temperature Impact My Walking?

Finally, I combined the Apple HealthKit data with the NOAA daily temperature data to analyze any relationship between ambient temperature and the number of steps I take. While the data has a lot of variability and a non-linear relationship, it is clear that I seem to avoid much walking on days with temperatures under 40F or above 80F.
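Because the relationship is noisy and non-linear, one simple way to summarize it is to group days into 10°F temperature bands and look at the median step count in each band. The sketch below uses made-up numbers and assumed column names (`TAVG`, `steps`); it does not reproduce the chart in this post.

```python
import pandas as pd

# Hypothetical daily rows: average temperature (°F) and total steps.
daily = pd.DataFrame({
    "TAVG": [35, 38, 55, 58, 85],
    "steps": [4000, 6000, 12000, 14000, 5000],
})

# Bands of 10°F from 0 to 109, labelled by their lower and upper bound.
edges = list(range(0, 111, 10))
labels = [f"{low}-{low + 9}" for low in edges[:-1]]
bands = pd.cut(daily["TAVG"], bins=edges, labels=labels, right=False)

# observed=True drops temperature bands that contain no days.
median_steps = daily.groupby(bands, observed=True)["steps"].median()
```

The median is used instead of the mean so that a handful of unusually long walking days does not dominate a band.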

Conclusion

This project involved a vast amount of data. In total, there were 69,336 raw rows of step data from Apple HealthKit. I very much wish I had been able to retain the hourly data, but my attempts to parse the time format at the hour level did not go well. I plan to come back to this when I have spare time and add in this hourly detail.

Thanks to the data source being an object that I have to carry with me to record data, there is essentially no risk that any of the step totals have been exaggerated. If anything there would be steps that were uncounted for short trips (to the bodega for example), or movements around my house/office where I didn’t pick my phone up.

I have worked as hard as I could to avoid bias in this project, and while I cannot claim to be bias-free, I have not identified any areas where I see bias in the analysis or visualizations.

Expanding on this work would be somewhat easy due to the way HealthKit stores data. It would likely be relatively trivial to add in heart rate data, or blood oxygen saturation data, as these are recorded as well. I wanted to be careful with scope creep on this project and opted to not include any of that data, but it would almost certainly yield some interesting information!

Bibliography

[1] Wattanapisit, Apichai, and Sanhapan Thanamee. “Evidence behind 10,000 steps walking.” J Health Res 31.3 (2017). https://www.thaiscience.info/Journals/Article/JHRE/10985252.pdf
[2] https://www.kumc.edu/about/news/news-archive/jama-study-ten-thousand-steps.html

Project 1: 311 Complaint Dataset

Research Question

On January 5, 2025 the Central Business District Tolling Program (CBDTP), also known as Congestion Pricing or Congestion Relief, went into effect. This program charges drivers for most vehicular traffic south of 60th Street via the use of license plate scanners installed throughout midtown and lower Manhattan. This program has received a large amount of political and social opposition, both within New York and in neighboring states whose commuters travel into Manhattan for work and pleasure. The primary stated goal of this program is to reduce traffic in midtown and lower Manhattan by incentivizing commuters to make use of the abundant public transit options in the form of ferries, buses, commuter trains, and the subway.

My primary question is: Does the data collected by the New York City 311 complaint database provide any evidence that this program is having the intended effect of reducing the amount of traffic in the congestion relief zone?

Google Maps display of the Congestion Relief Zone.

Dataset and Selected Complaints

The New York City Open Data Portal has a freely accessible 311 Complaint Dataset that covers approximately 43 million complaints since 2010. Members of the public may lodge a complaint with the 311 service via phone, mobile device, or internet. The 311 Complaint Dataset is updated nightly and each row in the dataset represents one complaint. These complaints are assigned to appropriate departments or agencies of the city and are thoroughly documented. I began by reviewing the Complaint Type and Descriptor fields to build a list of complaint data which could serve as a proxy for measuring the amount of traffic in the Congestion Relief Zone. The combination of search criteria used for this project is:

 

Field: Selected Values

  • Created Date: >= January 1, 2022 12:00 AM * and <= February 28, 2025 11:45 PM
  • Complaint Type: Illegal Parking; Noise – Commercial; Noise – Vehicle; Traffic
  • Descriptor: Blocked Bike Lane; Blocked Crosswalk; Blocked Hydrant; Blocked Sidewalk; Commercial Overnight Parking; Double Parked Blocking Traffic; Double Parked Blocking Vehicle; Overnight Commercial Storage; Parking Permit Improper Use; Posted Parking Sign Violation; Car/Truck Horn; Car/Truck Music; Engine Idling; Congestion/Gridlock; Drag Racing
  • Incident Zip: 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10009, 10010, 10011, 10012, 10013, 10014, 10016, 10017, 10018, 10019, 10022, 10036, 10038, 10280, 10282

Note on Created Date: During the analysis it was determined that January and February 2022 had significantly reduced values as a result of the Omicron wave of the COVID-19 pandemic and were filtered out of the data prior to visualization.
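Applied to a local export of the dataset, the criteria above could be expressed as a pandas filter like the sketch below. The sample rows are invented, the zip codes are assumed to be read as strings, and the hyphen used in the noise complaint names is an assumption that should be checked against the actual dataset values.

```python
import pandas as pd

COMPLAINT_TYPES = {
    "Illegal Parking", "Noise - Commercial", "Noise - Vehicle", "Traffic",
}
DESCRIPTORS = {
    "Blocked Bike Lane", "Blocked Crosswalk", "Blocked Hydrant", "Blocked Sidewalk",
    "Commercial Overnight Parking", "Double Parked Blocking Traffic",
    "Double Parked Blocking Vehicle", "Overnight Commercial Storage",
    "Parking Permit Improper Use", "Posted Parking Sign Violation",
    "Car/Truck Horn", "Car/Truck Music", "Engine Idling",
    "Congestion/Gridlock", "Drag Racing",
}
ZIP_CODES = {
    "10001", "10002", "10003", "10004", "10005", "10006", "10007", "10009",
    "10010", "10011", "10012", "10013", "10014", "10016", "10017", "10018",
    "10019", "10022", "10036", "10038", "10280", "10282",
}
START = pd.Timestamp("2022-01-01 00:00")
END = pd.Timestamp("2025-02-28 23:45")


def select_complaints(df):
    """Keep rows that satisfy all four search criteria."""
    created = pd.to_datetime(df["Created Date"])
    mask = (
        created.between(START, END)
        & df["Complaint Type"].isin(COMPLAINT_TYPES)
        & df["Descriptor"].isin(DESCRIPTORS)
        & df["Incident Zip"].isin(ZIP_CODES)
    )
    return df[mask]


# Invented rows: two match, one is outside the zone, one is too old.
complaints = pd.DataFrame({
    "Created Date": ["2025-01-10 08:30", "2025-01-10 09:00", "2021-12-31 23:00", "2024-06-01 12:00"],
    "Complaint Type": ["Illegal Parking", "Illegal Parking", "Traffic", "Noise - Vehicle"],
    "Descriptor": ["Blocked Hydrant", "Blocked Hydrant", "Congestion/Gridlock", "Car/Truck Horn"],
    "Incident Zip": ["10001", "11201", "10001", "10036"],
})

selected = select_complaints(complaints)
```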

Visualizations

Distribution of Complaint Types

Complaints by Type & Descriptor

The first thing I did was to compare the number of complaints for January and February of each year (2023-2025) to get a baseline for what to expect from the data. Some categories, such as Blocked Bike Lane, have received far fewer complaints, while Parking Permit Improper Use and Posted Parking Sign Violation show increased complaints this year.

It is curious that Posted Parking Sign Violation and Parking Permit Improper Use have increased in the number of reports. I suspect there are many factors which affect a person's likelihood of lodging a complaint and that these complaints are not the ideal proxies for the question at hand, but I still believe them to be important to the discussion.

Complaints by Month

Charting the number of complaints each month since January 2023 shows that the start of 2025 has the lowest number of complaints related to vehicles in the Congestion Relief Zone. This gap grows to an impressive reduction in complaints when the Parking Sign and Parking Permit categories are removed. Either way, this is a positive signal that Congestion Pricing is having an impact on the number of complaints that are being opened!
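The monthly counts and the January-February comparison across years can both be derived from the complaint creation dates. The sketch below uses a handful of invented dates purely to show the grouping; it says nothing about the real totals.

```python
import pandas as pd

# Invented creation dates standing in for the filtered complaints.
created = pd.Series(pd.to_datetime([
    "2023-01-15", "2023-02-03", "2023-02-20",
    "2024-01-09", "2024-02-11", "2024-07-04",
    "2025-01-20",
]))

# Complaints per calendar month, in chronological order.
monthly = created.dt.to_period("M").value_counts().sort_index()

# January and February totals per year, for a like-for-like comparison.
early = created[created.dt.month.isin([1, 2])]
jan_feb_by_year = early.groupby(early.dt.year).size()
```

Comparing the same two months across years avoids mistaking a seasonal dip for an effect of the tolling program.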

Mapping Complaints

From these maps it is easy to see that the number of complaints in January and February 2025 is dramatically lower than the numbers for previous years.

Conclusion

The launch of the Congestion Pricing program in Manhattan in early 2025 has had a clear impact on the number of complaints related to vehicles that originate in the affected zip codes. Complaints are down across nearly all categories in comparison to 2024.

The future of this program is uncertain, with the current presidential administration applying pressure from the federal government to end this program. If the opposition succeeds we can all look forward to a more crowded, louder, and less convenient experience in the streets of Manhattan.

