Leicester Parkrun Results Analysis
In this post, I explore the event results from 223 runnings of the Leicester Victoria parkrun, asking questions of interest and creating colourful plots! I do not aim to cover everything; instead, I’ve commented where I think further exploration could be most interesting.
For many runners waking up on a Saturday morning, parkrun is on their mind. Parkruns are free, weekly community events held in public spaces all around the world. Covid-19 has unfortunately led to the cancellation of many parkruns, which collectively see millions of people turn up to walk, jog or run the 5 kilometre courses. My local course, Leicester Victoria, is one of them. The event regularly has a few hundred attendees, a number which seems to be growing all the time. Since the first event in November 2015, there have been 223 parkrun events in the park: plenty of data for my analysis.
Sourcing Parkrun Data
Parkrun event data is freely accessible on the parkrun website. Runners are encouraged to register, and once they have done so, they can print or download their runner barcode. On completing their 5km each week, runners are given a finishing token. The token defines their finishing position, and is scanned by a volunteer, along with their runner barcode. This uniquely matches their details with their performance at the event, ready to be tabulated and posted on the website.
I’ve recently learnt how to use Selenium as a web-based automation tool, specifically to return content from a web page’s source code. ‘Scraping’ with ChromeDriver has allowed me to automate collation of the 223 sets of results, saving me from clicking on each separate URL and doing a copy/paste. What would have taken a couple of days now runs in the background in a couple of hours. With a bit of list manipulation and string formatting, I had a data frame – here are the first 4 rows:
I then used the Dark Sky API to source weather data for the latitude and longitude of Victoria Park for 9am on the dates of all events. I was impressed with the available information: I opted for rainfall, temperature, wind speed, pressure, and a description of the weather, e.g. clear, overcast, light rain.
With these merged together on event date, and a couple of features engineered for later analysis, I had a frame with 29 columns and around 60,000 rows of Leicester parkrun data.
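The merge step can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the notebook's actual code: the column names (`event_date`, `position`, `time`, `temperature_c`, `summary`) and the sample rows are made up for the example, and a data frame library would normally do the join in a single merge call.

```python
from datetime import date

# Made-up sample rows; the real frame has 29 columns and around 60,000 rows.
results = [
    {"event_date": date(2019, 1, 5), "position": 1, "time": "17:30"},
    {"event_date": date(2019, 1, 5), "position": 2, "time": "17:45"},
    {"event_date": date(2019, 1, 12), "position": 1, "time": "16:58"},
]
weather = [
    {"event_date": date(2019, 1, 5), "temperature_c": 4.0, "summary": "overcast"},
    {"event_date": date(2019, 1, 12), "temperature_c": 7.5, "summary": "light rain"},
]


def time_to_seconds(text):
    """Convert a 'MM:SS' or 'H:MM:SS' finishing time to whole seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def merge_on_date(results, weather):
    """Attach the weather for each event date to every result from that date."""
    weather_by_date = {row["event_date"]: row for row in weather}
    merged = []
    for row in results:
        combined = dict(row)
        # Results with no matching weather row are kept without weather columns.
        combined.update(weather_by_date.get(row["event_date"], {}))
        # One example of an engineered feature: the time as a number.
        combined["time_seconds"] = time_to_seconds(row["time"])
        merged.append(combined)
    return merged


merged = merge_on_date(results, weather)
```

Storing the time in seconds makes later averaging and sorting straightforward, since the scraped times arrive as strings.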
Analysis
Before I coded a single line, I had a few questions as motivation. Yet as I worked through these, more questions arose. I enjoyed the exploratory freedom. Here are the highlights of a very long Jupyter notebook!
My Results
I’ve run the event 8 times. I’ve won it once, though more recently I have been juggling and running at the same time, which has slowed me down significantly.
I’m the second fastest Scott out of 17. Strong running by Scott Green in 17:11.
Wreake Runners
I’ve been a member of Wreake Runners since moving to Syston in 2018. Parkrun already does a good job of collating club statistics on purpose-built club pages. As a club, we’ve had 441 outings on the course. How does this compare to other clubs?
We see that we rank 4th for total number of runs at the course, some way behind the leading 3.
To extend this, I could source the meeting places of each club, and find the distance from Victoria Park using the Google Distance Matrix API for example. I would expect clubs based closer to the park to have a larger number of runs, out of convenience alone.
I could also normalise the above by the number of distinct runners, which might give a different result and boost smaller clubs up the list.
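As a rough sketch of that normalisation, runs per distinct runner could be computed like this. The club and runner labels are invented placeholders, not real results.

```python
from collections import defaultdict

# Made-up (club, runner) pairs, one pair per recorded run.
runs = [
    ("Club A", "runner 1"), ("Club A", "runner 1"), ("Club A", "runner 2"),
    ("Club A", "runner 3"), ("Club B", "runner 4"), ("Club B", "runner 4"),
    ("Club B", "runner 4"),
]


def runs_per_runner(runs):
    """Return, for each club, total runs divided by distinct runners."""
    total_runs = defaultdict(int)
    distinct_runners = defaultdict(set)
    for club, runner in runs:
        total_runs[club] += 1
        distinct_runners[club].add(runner)
    return {
        club: total_runs[club] / len(distinct_runners[club])
        for club in total_runs
    }


normalised = runs_per_runner(runs)
# Clubs ordered from most to fewest runs per distinct runner.
ranking = sorted(normalised, key=normalised.get, reverse=True)
```

In this toy data Club A has more runs in total, but Club B's single regular runner puts it first once the totals are normalised.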
Runner Names
Initially, I wondered which names were fastest. The answer is a list of rare names, each belonging to a single fast runner at Leicester parkrun. That is not particularly interesting, and I found a similar list of unusual names for the slowest runners too.
Instead, turning our attention to the most common names, which first name has recorded the most runs at the event?
One challenge I ran into was the shortening of names: Andrew to Andy, Peter to Pete, Katie to Kate. I decided to replace these with the longer version. Additionally, I dealt with alternative spellings (Steven, Stephen). I aimed to adjust the most common names, and accept that the list of adjustments is not exhaustive.
Congratulations, Andrews and Sarahs!
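A minimal sketch of this kind of clean-up is a lookup table applied before counting. The mapping below covers only the examples mentioned above, the choice of Steven over Stephen as the canonical spelling is arbitrary, and the surnames are invented.

```python
from collections import Counter

# Partial, hand-built mapping from variants to one canonical first name.
# Picking "Steven" as the canonical spelling is an arbitrary assumption.
CANONICAL = {
    "andy": "Andrew",
    "pete": "Peter",
    "kate": "Katie",
    "stephen": "Steven",
}


def canonical_first_name(full_name):
    """Return the canonical first name for a 'First SURNAME' string."""
    parts = full_name.split()
    if not parts:
        return ""
    first = parts[0]
    return CANONICAL.get(first.lower(), first.capitalize())


# Made-up names, one entry per recorded run.
names = [
    "Andy SMITH", "Andrew JONES", "Pete BROWN", "Sarah WHITE",
    "Stephen GREY", "Steven BLACK", "Sarah HILL",
]
counts = Counter(canonical_first_name(name) for name in names)
```

Any variant missing from the mapping is counted under its own spelling, which is why an approach like this only ever tidies the most common names.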
Do Club Runners Run Faster?
The average club runner time is 3 minutes faster than the average unattached time. Does this answer the question? At some level yes, but there are other hidden factors. Male finishing times are on average faster than female times, and we might expect the average 30 year old to run quicker than the average 70 year old. To answer this in more detail, we might want to check the distribution of ages and genders across clubs. For now, I’ll stop at this top level result.
Running Speed vs. Position
When I encourage friends and colleagues to give parkrun a go, the most common reservation is that they will come last. Whilst unlikely, given that some people walk around each week, it could be reassuring to predict their finishing position (i.e. nowhere near last) given an expected running time. Cue the most colourful graph of my findings!
Looks neat, but what does it show? We see that the number of runners is generally increasing over time: an observation more clearly shown below. We also see that as running speed decreases, there is a larger variance of finishing position.
Taking the median time by finish position results in a surprisingly smooth curve up to around 350th place (40 minutes). A good 5km target for people new to running is 30 minutes, which would see them finish in the top 200 in 50% of events.
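The reassurance idea can be sketched by asking, for each past event, where a given target time would have placed. This is an illustration under assumptions: the finishing times are invented, and the helper names `predicted_position` and `share_within` are hypothetical.

```python
from bisect import bisect_left

# Made-up finishing times in seconds, sorted ascending, one list per event.
events = {
    "event 1": [1050, 1200, 1500, 1790, 1810, 2400],
    "event 2": [1100, 1300, 1795, 1900],
    "event 3": [1020, 1700, 1750, 1780, 1850],
}


def predicted_position(sorted_times, target_seconds):
    """Position for target_seconds: one more than the number of strictly faster finishers."""
    return bisect_left(sorted_times, target_seconds) + 1


def share_within(events, target_seconds, cutoff):
    """Fraction of events in which the target time finishes at or above the cutoff position."""
    hits = sum(
        1
        for times in events.values()
        if predicted_position(times, target_seconds) <= cutoff
    )
    return hits / len(events)


target = 30 * 60  # a 30 minute 5km, in seconds
positions = {
    name: predicted_position(times, target) for name, times in events.items()
}
top_four_share = share_within(events, target, cutoff=4)
```

On the real data, the same calculation with a cutoff of 200 would give the share of events in which a 30 minute runner makes the top 200.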
The increase in attendance has seen the average parkrun time increase:
I also found that the average time is faster in the winter, again probably a result of lower attendance. One cruel narrative is that beginners take up running in the summer, and give up by the winter. A kinder narrative is that these runners, after a few months of training, are themselves faster and help bring the average time down.
Winning Times
We are reminded in every event briefing that parkrun is not a race: people run for many different reasons. Taken from the parkrun website: ‘parkrun is a positive, welcoming and inclusive experience where there is no time limit and no one finishes last. Everyone is welcome to come along, whether you walk, jog, run, volunteer or spectate.’ All well and good, but there’s no denying that the front runners certainly have some ambition to win and improve their times. A natural question is to predict the finishing time of the winner. As a fairly strong club runner (17 minute 5km shape), I’m rarely too far from the front if I run well.
Here is the winning time by event date. It’s very erratic, and I’m not hopeful that a time series model would fit well.
A couple of ideas for further exploration at a later date:
- The Leicestershire Road Running League / Cross Country League. These events are usually on a Sunday. The fastest runners in the county may take the Saturday off parkrun or choose to run slower.
- Periodicity in winning time in relation to the mean/median. Let T be the median winning time at the Leicester parkrun, and let weeks be labelled F or S if the winning time is Faster or Slower than T respectively. If there is periodicity in the string of Fs and Ss, then the slower weeks (and event numbers) can be predicted.
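The F/S labelling idea from the list above can be sketched as follows. The winning times are invented, ties with the median are labelled S by assumption, and the lag-agreement measure is just one simple way to look for periodicity, not a formal test.

```python
from statistics import median

# Made-up winning times in seconds, in event order.
winning_times = [1000, 1060, 990, 1070, 1005, 1080, 995, 1065]


def label_weeks(times):
    """Label each week F if faster than the median winning time T, else S."""
    t = median(times)
    return "".join("F" if x < t else "S" for x in times)


def agreement_at_lag(labels, lag):
    """Fraction of weeks whose label matches the label `lag` weeks earlier."""
    if not 0 < lag < len(labels):
        raise ValueError("lag must be between 1 and len(labels) - 1")
    pairs = list(zip(labels, labels[lag:]))
    return sum(a == b for a, b in pairs) / len(pairs)


labels = label_weeks(winning_times)
agreement = {lag: agreement_at_lag(labels, lag) for lag in (1, 2, 3, 4)}
```

With labels split evenly around the median, agreement close to 0.5 at every lag would suggest no pattern, while values close to 1 at some lag would hint at a cycle of that length. The toy data alternates perfectly, so it scores 1.0 at lags 2 and 4.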
The proven strategy to win races is to run faster. A more tactical strategy is to pick your races!
Weather Impact
How does the weather play a part in the number of runners? Are people put off more by the rain, or by the wind? I produced a heatmap of the correlation coefficients between the weather variables and event attendance. I particularly like the coolwarm cmap.
Looks like visibility is a big factor followed by wind speed. However, much to my surprise, we find that attendance increases as wind speed increases – the Leicestershire runners are a hardy bunch!
I’ve coloured the above by event year, which appears to be a lurking variable. This looks like an occurrence of Simpson’s Paradox: overall there is positive correlation between wind speed and event attendance, but for each year, there is negative correlation between the two variables. This is explained by the growth in event attendance over time.
Further exploration could consider the change in attendance from week to week, overlaid with the weather data. After a couple of weeks of poor weather, attendance may bounce back as runners reset their weather expectations.
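To make the effect concrete, here is a small sketch using invented numbers, not the real Leicester data. Attendance falls with wind speed within each year, but both attendance and wind speed are higher in later years, so the pooled correlation comes out positive.

```python
from math import sqrt

# Invented (wind speed, attendance) pairs grouped by year, for illustration only.
data = {
    2016: [(4, 130), (8, 120), (12, 110)],
    2017: [(10, 230), (14, 220), (18, 210)],
    2018: [(16, 330), (20, 320), (24, 310)],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Correlation within each year.
by_year = {
    year: pearson([w for w, _ in rows], [a for _, a in rows])
    for year, rows in data.items()
}

# Correlation over all years pooled together.
pooled = [pair for rows in data.values() for pair in rows]
overall = pearson([w for w, _ in pooled], [a for _, a in pooled])
```

Here every per-year coefficient is negative while `overall` is positive, which is the reversal that Simpson’s Paradox describes.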
Summary
On the whole, this just scratches the surface of questions which could be asked. It’s a clean dataset which has been fun to collate and work with. There is also an opportunity to compare data from different parkruns. For instance, there is a second parkrun in Leicester at Braunstone Park, and some runners would visit both throughout the year. Tracking the attendance at a network of events could be a neat visualisation.
Another question often thrown around in running clubs is the difficulty of various parkrun courses: where should I go to run a personal best? Perhaps for another time.
Until next time,
Scott