
Background
We all know that we should be getting more steps in our daily lives. Guidance from the medical field suggests that people aim for 10,000 steps per day. There have been many, many studies that support this recommendation. Walking is a low-impact form of exercise that has many dose-dependent health benefits. In particular, increased walking has been repeatedly linked to lower risks of adverse cardiovascular events, lower cholesterol, and overall higher satisfaction in life[1].
For many years, my partner (Sarah) and I lived in the suburban expanse of Norman, Oklahoma, a college town just south of the vast sprawl of Oklahoma City. On April 10, 2022 we moved to New York City. New Yorkers are well known to be walkers, and a quick Google search indicates that New Yorkers walk up to 3x as much as the average American.
For this project I wanted to analyze and visualize my own walking habits over several years, covering both our time in Oklahoma and the period following our move to New York City.
Data Sources
Let’s be honest… we are a phone-addicted culture! For this project that is quite a good thing, as I have carried an iPhone with me essentially everywhere I have gone for the better part of a decade. Following the Apple instructions, I was able to export the entire history of data from the Health app. Using the Apple Health Parser Python utility from alxdrcirilo on GitHub, I was able to quickly parse this enormous amount of data into a usable format (comma-separated values).
Later in the project design I became interested in exploring how my walking relates to temperature and season. I used the NOAA Climate Data Online platform to retrieve daily records of average temperature, maximum temperature, and precipitation amounts covering all dates from 1/1/2021 to 3/23/2025. For the dates from 1/1/2021 to 4/9/2022, I used the USW00013967 (Oklahoma City Will Rogers Airport) station, and for days from 4/10/2022 onward I used the USW00014732 (LaGuardia Airport) station. While temperature can vary by a small amount over the geographical area of a city, I believe that these two stations provide “accurate enough” data for this project.
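The station choice comes down to a single date cutoff. Here is a minimal Python sketch of that rule; the station IDs and the move date come from the text above, while the function and constant names are my own placeholders for illustration.

```python
from datetime import date

# Station IDs and the move date are taken from the text above.
OKC_STATION = "USW00013967"  # Oklahoma City Will Rogers Airport
LGA_STATION = "USW00014732"  # LaGuardia Airport
MOVE_DATE = date(2022, 4, 10)  # first day counted as New York City


def station_for(day: date) -> str:
    """Return the NOAA station ID used for a given calendar day.

    Hypothetical helper: days before the move map to Oklahoma City,
    the move date and everything after map to LaGuardia.
    """
    return OKC_STATION if day < MOVE_DATE else LGA_STATION
```

Calling `station_for(date(2022, 4, 9))` picks the Oklahoma City station, and any date from 4/10/2022 onward picks LaGuardia.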
Data Cleanup
Early on in this project I learned that the Apple HealthKit data structure stores steps in variable duration “walks”. Essentially, if the phone remains still for more than a few seconds then the next time the phone starts moving HealthKit starts a new “walk”. This results in highly variable time windows that may be as short as a few seconds or as long as half an hour. A few rows of raw data are shown below.

To deal with this I rolled up all the records for each day to arrive at a total number of steps for each calendar day. In doing so, I did lose much of the sub-day detail in the data. I hope to revisit this project in the future to add visualizations of how my walking varies at the hourly level, but for now I have chosen to focus on daily trends.
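The daily rollup can be sketched in a few lines of Python. This is an illustration rather than the exact code I ran: it assumes each parsed record has been reduced to a (start timestamp, step count) pair, with the timestamp written like "2022-04-10 08:15:03", and it assigns a "walk" that crosses midnight to the day it started. The function name is a placeholder.

```python
from collections import defaultdict
from datetime import datetime


def daily_step_totals(records):
    """Sum step counts per calendar day.

    `records` is an iterable of (start_timestamp, steps) pairs, where
    start_timestamp is an ISO-style string (assumed format) and steps
    is a number or numeric string. Each record is attributed to the
    calendar day on which it started.
    """
    totals = defaultdict(int)
    for start, steps in records:
        day = datetime.fromisoformat(start).date()
        totals[day] += int(steps)
    return dict(totals)
```

The result is a dictionary keyed by `datetime.date`, so two short "walks" on the same morning collapse into a single daily total.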
Data from NOAA came in a very simple structure, with each row having fields for the date, average temperature, maximum temperature, and precipitation in inches. This data covered every day of the desired time range with no missing values, so it did not require any cleanup. The NOAA data was merged with the daily step totals using the date as the key between the two sources. A few rows of the final data source structure are below.

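For the merge itself, here is a rough Python sketch of joining the two sources on date. It assumes the step totals are a dictionary keyed by date and the weather rows are a dictionary from date to a small dictionary of fields; the field names (`tavg`, `tmax`, `prcp`) and the function name are placeholders I chose for illustration, not the actual column names in my files.

```python
def merge_steps_with_weather(step_totals, weather_by_day):
    """Join daily step totals with daily weather rows on the date key.

    step_totals:    {date: total_steps}
    weather_by_day: {date: {"tavg": ..., "tmax": ..., "prcp": ...}}
                    (field names are assumed, not from the real files)

    Returns a list of row dictionaries sorted by date. Days that have
    steps but no weather row are skipped (an inner join).
    """
    merged = []
    for day in sorted(step_totals):
        weather = weather_by_day.get(day)
        if weather is None:
            continue  # no matching weather record for this day
        merged.append({"date": day, "steps": step_totals[day], **weather})
    return merged
```

Because the NOAA data had a row for every day in the range, an inner join like this should not drop any days with recorded steps.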