It’s Too Damn Hot

Some back-of-the-napkin running analysis

Categories: R, code, running

Published: August 28, 2024

Because Every Good Post Needs an Intro

Since moving back to AZ full time, I’ve taken a keen interest in how temperature affects my running performance. I run in the neighborhood of 80 mi/week (~ 130 km/week) and am hoping to lay down 250 mi (~ 400 km) or so in this year’s Across the Years, so anything that makes my running life harder is worth considering. In case you aren’t aware, summers are really hot here. I believe our last daily high below 100F (37.8C for the rest of the world) was May 26th. Since then, we’ve seen temperatures as high as 115F (46.1C), and our daily low has been as hot as 95F (35C). I’m quite literally living in conditions you’ll see in heat acclimatization studies, which means I’m either running on the treadmill at work or starting my runs at 8 or 9 PM. And lemme tell ya, those runs still feel terrible lots of days.

But how terrible are they? Or said another way, how much is the temperature negatively affecting my performance? We can do a quick and dirty analysis to figure it out! I say dirty because we’re going to follow the KISS principle when it comes to model building–I’m only going to consider the effects of temperature and path gradient while absolutely ignoring things like humidity, elevation, wind speed, and air quality. Moreover, despite the obvious temporal aspect of the data (acclimatization says, “what?”), we’re going to take a page from Strava’s book and ignore it completely. :) Nobel prize-winning analysis this is not, but it helps give me a better idea of what I’m working with here in the Copper State. The data can be downloaded here, though, so feel free to run your own analyses!

Anywho, I’ll spare you how I set up my data collection and storage pipeline for the data we’ll be using today, but I think I’ll talk about aspects of that in some future blog posts. For now, we need four pieces of information from my data: pace, heart rate, temperature, and gradient. Why include gradient? Well, the roads around my neighborhood have long stretches of 1-2% inclines, which is annoying when it’s hotter than the fires of hell, and I have a sneaking suspicion that heat makes even modest climbs worse for me. Plus, it’s already exported by Strava’s API, so I don’t have to go searching for it. Let’s get to it.

Data Gathering

I store my running data in a DuckDB database. For all my sport performance homies out there working with mountains of data (especially raw data exports from STATSports, Vald, etc.), I would seriously recommend you learn some SQL and make DuckDB part of your pipeline #NotAnAd. At any rate, I have three tables that I’m gathering my data from: a run header table (basic information about the run), an hourly temperature table, and a data stream table. The temperature data come from Open-Meteo, while the run data are pulled from Strava’s API. Let’s show the query first, then talk about it.

SELECT
    rd.activity_id,
    -- Shift UTC timestamps to local time, then truncate to the calendar date
    time_bucket(
        INTERVAL '1 day',
        start_date + INTERVAL(utc_offset) SECOND
    )::DATE AS date,
    -- Integer division groups the 1Hz stream into 60-second blocks
    time // 60 AS time_block,
    AVG(velocity_smooth) AS avg_velocity,
    AVG(grade_smooth) AS avg_grade,
    AVG(heartrate) AS avg_heartrate,
    AVG(heartrate / velocity_smooth) AS avg_efficiency,
    AVG(temperature * 9 / 5 + 32) AS avg_temp  -- Celsius to Fahrenheit
FROM run_headers AS rh
INNER JOIN run_data AS rd
    ON rh.id = rd.activity_id
INNER JOIN daily_temperatures AS dt
    ON rh.id = dt.activity_id
    -- Attach the nearest hourly temperature reading to each sample
    AND measurement_time =
        time_bucket(
            INTERVAL '1 hour',
            start_date + INTERVAL(time) SECOND + INTERVAL 30 MINUTE
        )
WHERE type = 'Run'
AND start_lon IS NOT NULL               -- outdoor runs only
AND rh.distance BETWEEN 5000 AND 32000  -- 5k to 32k
AND time <= 7200                        -- truncate at 120 minutes
GROUP BY rd.activity_id, date, time_block
HAVING COUNT(time) = 60                 -- complete 60-second blocks only
AND avg_efficiency < 60
ORDER BY date, rd.activity_id, time_block;
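
(As an aside, if you want to follow along in R, something like the sketch below pulls the query results straight into a data frame. The {DBI} and {duckdb} calls are real; the database file name and the block_query string holding the SQL above are placeholders I made up.)

library(DBI)

# A sketch: connect to a local DuckDB file and run the query above,
# assuming it's stored as a string in block_query
con <- dbConnect(duckdb::duckdb(), dbdir = "running.duckdb", read_only = TRUE)
run_data <- dbGetQuery(con, block_query)
dbDisconnect(con, shutdown = TRUE)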

Alright, a lot just happened. We’ll take things in a bit of a circuitous order, but it should help you keep everything straight.

If you know zero SQL

SELECT and FROM are the inspirations for select(), mutate(), summarize(), and reframe() in {dplyr}. If we pretend we’ve already completed the inner joins across dataframes containing the same data as the database tables, the equivalent call in {dplyr} would look something like:

joined_data |>
    # We'll pretend the distance field from run_headers is named run_distance
    filter(
        between(run_distance, 5000, 32000),
        time <= 7200
    ) |>
    mutate(time_block = time %/% 60) |>
    # Drop blocks where I take breaks or have incomplete data for whatever reason
    filter(n() == 60, .by = c(activity_id, time_block)) |>
    reframe(
        avg_velocity = mean(velocity_smooth),
        sd_velocity = sd(velocity_smooth),
        ...,
        .by = c(activity_id, date, time_block)
    ) |>
    filter(avg_efficiency < 60)

If you know some SQL

DuckDB is what you’d get if modern SQL and modern data analysis packages had a love child. You’ll notice familiar statements and functions like WHERE and GROUP BY, but you may or may not have come across things like INTERVAL, HAVING, and inline casting (the ::DATE part, which is the same as CAST(... AS DATE)). It mostly comes down to the flavors of SQL you’ve used in the past because each DBMS is going to be a little different in terms of syntax. DuckDB reads similarly to PostgreSQL (Postgres for short), but you can find equivalent functions in most any modern DBMS.

To highlight a few specifics,

time_bucket(
    INTERVAL '1 day',
    start_date + INTERVAL(utc_offset) SECOND
)::DATE AS date

is converting the timestamps (start_date) to their local dates. Most of my runs have been in NJ, TN, and AZ (the former two being EST/EDT, the latter being MST year-round), while the timestamps are stored as UTC. Strava provides the UTC offset in seconds, though, so we can shift the timestamps to local time and then cast them to dates. Not particularly important for the current analysis, but worthwhile for examining seasonality in the data.
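
If it helps to see the same idea in R: the conversion is just adding the offset to the UTC timestamp and taking the date. A toy sketch with {lubridate} (the timestamp and offset are made-up values, not from my data):

library(lubridate)

# Hypothetical 9 PM run in Phoenix; AZ sits at UTC-7 year-round
start_date <- ymd_hms("2024-08-16 04:00:00", tz = "UTC")
utc_offset <- -7 * 3600

as_date(start_date + utc_offset)
#> [1] "2024-08-15"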

I do something similar in the INNER JOIN to the daily_temperatures table

INNER JOIN daily_temperatures AS dt
    ON rh.id = dt.activity_id
    AND measurement_time =
        time_bucket(
            INTERVAL '1 hour',
            start_date + INTERVAL(time) SECOND + INTERVAL 30 MINUTE
        )

because the time field is simply the number of seconds from the start of the run. The temperature data are reported hourly, so I want to attach the nearest hourly temperature to the data. Adding 30 minutes before truncating to the hour is just a cheap way of rounding to the nearest hour instead of flooring. I’m sure I could have done some interpolation nonsense to estimate the temperature at each time point, but that’s way more work for no payoff.
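
If you want to convince yourself the + INTERVAL 30 MINUTE bit rounds rather than floors, here’s the same trick in R with {lubridate} (toy timestamp, not from my data):

library(lubridate)

sample_time <- ymd_hms("2024-08-15 21:40:00", tz = "UTC")

# Truncating alone floors to 21:00...
floor_date(sample_time, "hour")
#> [1] "2024-08-15 21:00:00 UTC"

# ...while adding 30 minutes first lands us on the nearest hour, 22:00
floor_date(sample_time + minutes(30), "hour")
#> [1] "2024-08-15 22:00:00 UTC"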

Other housekeeping

  • Temperature is converted to freedom units, but speed and efficiency use metric (i.e., \(m/s\)). I don’t have a satisfying rationale for you other than I like it that way
  • Speaking of efficiency, it’s a horribly tortured set of units (\(\frac{beats/min}{m/s}\)), but outside dropping 40 grand for a K5 so I can know my \(\dot{V}O_{2}\) across all my runs, it’s about the closest we’re going to get in terms of understanding the relationship between my internal and external workload/intensity/whatever (there’s a quick worked example of the units after this list)
    • Papers may call this efficiency index or running effectiveness and often calculate it as \(\frac{m/s}{W/kg}\), but 1) I think running power is stupid and people who advocate for it should feel bad and 2) \(\frac{m/s}{W/kg}\) or \(\frac{m/s}{beats/min}\) produces really tiny units that make my tiny brain hurt
  • I group the data into 60-second blocks and calculate the mean of each block within each run, similar to Strava’s process for their grade-adjusted pace model
    • This helps reduce the variability in the raw 1Hz data (bad GPS signal, heart rate strap artifacts, etc.) while also giving me some filter conditions to remove obviously outlying data from the analysis
    • I do the filtering in R (except for the hard-coded efficiency cutoff), but it would be easy enough to use a CTE in the DuckDB query
    • I’m not outright removing intervals, recovery periods, etc. from the data because that’s way more involved than the simple analysis here, so the data won’t be as clean as they could be
  • I only include data from 2023 and 2024 in the example dataset because the sessions and volumes are much different compared to previous years
  • Finally, I limit the data to 1) outdoor runs, 2) between 5k and 32k in length, while 3) intentionally truncating longer sessions to 120’
    • My running has historically been geared toward half marathons, so your boy ain’t very efficient once the distance creeps toward 25k+
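
To put a rough number on those tortured efficiency units, here’s a quick worked example (the pace and heart rate are made up for illustration, not pulled from my data):

# A hypothetical easy minute: 6:00/km pace (~2.78 m/s) at 150 beats/min
velocity <- 1000 / 360   # m/s
heart_rate <- 150        # beats/min

heart_rate / velocity    # efficiency in (beats/min) / (m/s)
#> [1] 54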

[Figure: Running volume progression over time]

[Figure: Running pace vs heart rate over time]

One last filter

I still want to clean up the data a touch before getting to the meat of the analysis. Again, we’ll follow a similar approach to Strava in that I’ll remove 60-second segments that I would consider “abnormal” in the context of a given run: strides in an otherwise-easy run, short climbs and descents, heart rate spikes from nearly being hit by a car, y’know, normal running things. I also remove segments outside 30-100F (-1 to 38C) and segments with grades outside \(\pm3\%\) because I have severely limited data at those extremes and am not confident including them will accurately represent my running performance in those conditions.

filtered_data <-
    run_data |>
    mutate(
        # Within each run, compute absolute z-scores for every avg_* column
        # except temperature
        across(
            matches("avg") & !matches("temp"),
            list(abs_z = ~ abs(as.vector(scale(.x))))
        ),
        .by = activity_id
    ) |>
    filter(
        # Keep segments within the grade and temperature ranges I trust
        between(avg_grade, -3, 3),
        between(avg_temp, 30, 100),
        # Drop segments more than 2 SD from that run's typical values
        avg_grade_abs_z < 2,
        avg_velocity_abs_z < 2,
        avg_efficiency_abs_z < 2
    ) |>
    # The z-score helper columns have done their job
    select(-matches("_z"))