Let's face it - sitting in traffic is about as much fun as poking yourself in the eye with a biro.
I was recently, and frustratingly, stuck in traffic whilst trying to park in Cambridge. So I started wondering whether the machine learning we are building into Stratiam could also help with parking space prediction, by suggesting when and where to park at a precise point in time in the future. This way it would be easier to park in town, which in turn could reduce congestion, charges, parking tickets, air pollution and poking-of-eyes.
So, I sat down with The API Guy, a.k.a. ‘Chris’, and we began to discuss what sort of relevant data might be good predictors for Cambridge parking.
We agreed that historical parking data, by car park, hour of day and day of week, might be a good starting point, since rush-hour times and weekends would likely have some impact. We also thought weather data might have some bearing but weren’t sure to what extent. Do warm sunny days draw more car drivers in? Or does the opposite occur, with rain forcing people to ditch their cycle or walk in favour of driving in? Do public holidays make certain car parks busier or quieter? What about graduation days? There were so many questions and variables we could explore. We sketched out how our basic machine learning model might look.
Planning our machine learning inputs and outputs for parking prediction.
The multivariate dependencies we wanted to understand were akin to other data sets we’d been transforming in Stratiam, so we decided to build a 'predictive analytics machine learning' model and started laying out the desired inputs and outputs.
After a few searches we found a basic feed from Cambridge City Council showing real-time spaces by car park. This is a fairly basic 'current live state' view that doesn't show any historical data, but we could set up a schedule to check it on an hourly basis and start building up the parking history ourselves.
There are also several weather data feeds available for both historical weather and weather forecasts. Relevant reference data sets (e.g. public holidays & graduation days) were also available online in different forms.
It was time for The API Guy to start tapping into those APIs. So, we loaded him up with coffee.
The API Guy wrote a load of Cloud Functions to automate the feeds we needed and load them into Google's BigQuery.
He set our parking feeds to 'append' new data (so we could grow the parking history). He also appended historical weather, but set the weather-forecast data to 'overwrite' whenever a new forecast became available. The resulting tables looked something like the following:
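To make the difference between the two load modes concrete, here is a minimal sketch in plain Python. It is not the Cloud Functions code itself: the `load` helper, the in-memory "tables" and every value in them are made up for illustration. BigQuery load jobs offer a similar choice through their write disposition setting, though the exact configuration used for these feeds isn't shown here.

```python
from datetime import datetime


def load(table, rows, mode):
    """Load rows into an in-memory 'table' (a list of dicts).

    mode="append" keeps existing rows; mode="overwrite" replaces them.
    This helper is hypothetical and only illustrates the two behaviours.
    """
    if mode == "append":
        table.extend(rows)
    elif mode == "overwrite":
        table[:] = rows
    else:
        raise ValueError(f"unknown load mode: {mode}")
    return table


parking_history = []   # grows with every hourly check
weather_forecast = []  # only ever holds the latest forecast

# Two hourly parking checks (illustrative values): history accumulates.
load(parking_history,
     [{"car_park": "Grand Arcade", "checked_at": datetime(2024, 1, 1, 9), "spaces_free": 120}],
     "append")
load(parking_history,
     [{"car_park": "Grand Arcade", "checked_at": datetime(2024, 1, 1, 10), "spaces_free": 85}],
     "append")

# Two forecast runs (illustrative values): the newer run replaces the older one.
load(weather_forecast, [{"hour": datetime(2024, 1, 2, 9), "temp_c": 6.0}], "overwrite")
load(weather_forecast, [{"hour": datetime(2024, 1, 2, 9), "temp_c": 7.5}], "overwrite")

print(len(parking_history), len(weather_forecast))
```

Appending suits the parking feed because the source only exposes the current state, so every check is new history. Overwriting suits the forecast because an older forecast for the same hour is superseded by the newer one.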
Before we could initiate any sort of machine learning on our data we had to get it into a sensible shape. The historical parking conditions needed to be split out by their component time parts (e.g. day of week, hour of day), historical weather and forecasted weather needed to be unioned together to create a continuous ‘weather’ stream, and we needed something of a ‘data scaffold’ to contain all past and future events. Any overlaps in the data also needed to be excluded so they weren’t double-counted, and all metrics needed to be converted into a consistent form.
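The shaping steps above can be sketched roughly as follows. This is an illustration in plain Python rather than the actual transformation: the function names, field names and sample values are all assumptions, and the rule that observed weather wins over forecast weather for the same hour is one plausible way to avoid double-counting, not necessarily the one used here.

```python
from datetime import datetime, timedelta


def build_scaffold(car_parks, start, end):
    """One row per car park per hour, from start (inclusive) to end (exclusive).

    The rows cover past and future hours alike, so actuals and predictions
    can later be attached to the same structure.
    """
    rows = []
    slot = start
    while slot < end:
        for car_park in car_parks:
            rows.append({
                "car_park": car_park,
                "slot": slot,
                "day_of_week": slot.weekday(),  # 0 = Monday ... 6 = Sunday
                "hour_of_day": slot.hour,
            })
        slot += timedelta(hours=1)
    return rows


def union_weather(historical, forecast):
    """Merge two {hour: reading} dicts into one continuous weather stream.

    Where both sources cover the same hour, the observed (historical)
    reading replaces the forecast so the hour is only counted once.
    """
    merged = dict(forecast)
    merged.update(historical)
    return merged


h0, h1, h2 = (datetime(2024, 1, 1, hour) for hour in range(3))
scaffold = build_scaffold(["Grand Arcade", "Grafton East"], h0, datetime(2024, 1, 1, 3))
weather = union_weather({h0: 5.0, h1: 6.0}, {h1: 7.0, h2: 8.0})

print(len(scaffold), sorted(weather.items()))
```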
Our ELT & 'data scaffold' mock-up.
There are several cloud-based machine learning platforms out there, but since our data collection tech and the resulting stored data all sit on the Google Cloud Platform, it made sense to use this as the base from which to build our machine-learning model. Another suitable machine-learning alternative would be Amazon’s SageMaker, which is part of AWS. Our visualisation front-end is powered by Looker.
More details for each can be found in the reference notes below.
We’re setting up our machine learning for predictive forecasting, for which we’ve used a linear regression model. We’ve also set our evaluation ‘split’ to 20% (this is the share of our known data that is held back from training, so the model's predictions can be compared against actual values to evaluate how well it is performing). There’s a whole science to best practices here, but the typical rule of thumb for what we’re trying to do would be somewhere in the range 20-30% for the evaluation share of the split. Once our model is up and running, there is a set of useful metrics we can output from it.
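As a rough illustration of what an evaluation split does, the sketch below holds back 20% of a tiny synthetic data set and fits a straight line to the rest. It is deliberately simplified: it uses a single input (hour of day) rather than the many inputs described in this post, the data is invented, and `train_eval_split` and `fit_line` are hypothetical helper names rather than anything from the platform we used.

```python
import random


def train_eval_split(rows, eval_share=0.2, seed=42):
    """Shuffle the rows and hold back eval_share of them for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_share))
    return shuffled[n_eval:], shuffled[:n_eval]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (hour, parked cars) pairs that follow an exact straight line.
data = [(hour, 50 + 10 * hour) for hour in range(20)]

train, evaluation = train_eval_split(data, eval_share=0.2)
slope, intercept = fit_line(train)

# Predictions for the held-back rows, to compare against their known values.
predictions = [(hour, slope * hour + intercept) for hour, _ in evaluation]

print(len(train), len(evaluation))
```

The held-back rows never influence the fitted line, which is what makes the later comparison of predicted against actual values a fair test.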
Linear-regression parking model evaluation metrics.
A good KPI for our model’s performance here is the r-squared value, which will typically sit between 0 (poor fit) and 1 (perfect fit). The closer the r-squared value is to 1, the more accurate our parking forecast. As r-squared is a significant KPI for us, we also apply some conditional formatting (and emojis) for good and bad performance thresholds.
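For readers who want to see how the number is derived, here is a small sketch of the standard r-squared calculation together with a threshold-based label. The thresholds (0.8 and 0.5), the labels and the sample numbers are assumptions chosen for illustration, not the ones used in our dashboard.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rate(score, good=0.8, poor=0.5):
    """Turn an r-squared score into a label for conditional formatting.

    The thresholds are illustrative assumptions.
    """
    if score >= good:
        return "good 👍"
    if score < poor:
        return "poor 👎"
    return "fair"


# Made-up actual and predicted parked-car counts for four hours.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]

score = r_squared(actual, predicted)
print(round(score, 3), rate(score))
```

A model that always predicts the average of the actual values scores 0, and one that matches every value scores 1.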
Once we are happy with how the model’s performance is evaluated we can set up our series of ‘prediction’ values. In our case we’re looking to ‘predict’ how many cars will park at a given point of time in the future, so this is our output. These future predictions are then joined back onto our original data scaffold array - so the corresponding future values can be accessed by day, hour of day, car park and so on...
Voila! Once everything is set up correctly we can build a viz of our known parking values alongside our ‘predicted parking’ future values. All being well, as in this case, the chart’s future values should take a similar shape to the historical ones. In addition, the historical ‘predicted’ values (if this can be described as a thing) are also shown on the chart to display how well our predictions would have worked against actuals, based on the evaluation split logic we’d defined earlier.
Total parked cars and forecasted parked cars.
Multi-weather and parking prediction.
Occasionally our data sets can receive null or zero values, as in the example image. These can occur when the data feed (or website in this case) has gone down. They break the historical data series, which would in turn impact our machine learning model. There are a few approaches for correcting this.
If we take any single weather 'predictor' metric, then we'll find it holds a poor correlation with the total number of parked cars (r-squared values less than 0.5). Where we see something of a weak correlation, this is usually because it is inherent in the data. For example, the sun comes out during the day, and so the temperature & UV index increase during the day - this coincides with more people parking in the city during the day. But we must remember correlation and causation are different things.
The beauty of our machine-learning model is that it takes into account multiple weather metrics, alongside time of day, day of week and other values, to give us our predictor outputs.
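One possible correction, sketched below, is to treat null and zero readings as missing and fill them by interpolating between the nearest good readings on either side. This is an assumption about what a correction could look like rather than a description of the approach we settled on; the `fill_gaps` name and the sample readings are invented. Treating every zero as an outage is also an assumption, and it would be wrong for a series where zero is a legitimate reading.

```python
def fill_gaps(series):
    """Replace None/0 readings with values interpolated between neighbours.

    Gaps at the very start or end copy the nearest good reading instead.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value]
    for i, value in enumerate(series):
        if value:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            filled[i] = series[lo] + weight * (series[hi] - series[lo])
        elif before:
            filled[i] = series[before[-1]]
        elif after:
            filled[i] = series[after[0]]
    return filled


# Hourly parked-car counts with a two-hour feed outage in the middle.
readings = [100, None, 0, 130]
repaired = fill_gaps(readings)
print(repaired)
```

Simply excluding the affected hours from training is another option, and may be safer when an outage lasts long enough that a straight line between its two ends would hide a real peak or trough.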
[dashboard 26]
At a city level there is a strong correlation between parking availability and traffic congestion - but we need to do further work to fully understand the relationship at a car park level. This is because traffic congestion is reported at a precise geo-point and/or road name - but you can't always correlate this to an individual car park, since the road may connect two or more car parks. Nevertheless, the traffic congestion reports at a street level are useful complements to the parking predictor dashboard for anyone looking to plan a trip into the city centre, so we've built these in alongside. We've also included historical trend data so that users can take previous traffic events as signals for future planning.
Cambridge parking charges vary by hour of day and day of week, and are typically more expensive during peak hours. Cost can be a factor when deciding where to park. Someone looking to park their car may be looking for the cheapest, least-full car park, so it is useful to compare car parks on both. At the time of writing 'Grafton East' is both cheaper and has more available spaces than the Grand Arcade car park.
Finding the cheapest, least-full car park by comparison on the real-time feed.
Crucially, the real value of our parking-prediction model is being able to interrogate the data to ask: "Where will be the cheapest and least-full car park one hour from now?" (or however long it takes you to drive to the city). This way you can plan your trip around what capacity will be like when you arrive, not at the point of leaving your house/workplace/etc. This removes the need to use your smartphone whilst driving (never recommended) and instead gives you a suggested destination before you leave.
Our suggested car park recommendations, one hour from now, would look a little more like this.
Hop, step and a jump into our predictions, one hour from now, and our parking recommendations are very different. In this case, and for our arrival time, the Queen Anne car park will be the cheapest and least-busy.
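The "cheapest and least-full" question can be expressed as a simple ranking over the predicted values for the arrival hour. The sketch below is illustrative only: the charges, capacities and predicted counts are invented numbers (only the car park names come from this post), and ranking by price first, then by predicted occupancy, is an assumption about how the two factors might be weighed.

```python
def recommend(car_parks):
    """Return the name of the cheapest car park, breaking ties on lowest occupancy.

    Car parks predicted to be full are skipped. Returns None if all are full.
    """
    def occupancy(car_park):
        return car_park["predicted_parked"] / car_park["capacity"]

    available = [cp for cp in car_parks if cp["predicted_parked"] < cp["capacity"]]
    if not available:
        return None
    best = min(available, key=lambda cp: (cp["hourly_charge"], occupancy(cp)))
    return best["name"]


# Hypothetical predictions for one hour from now (all numbers are made up).
one_hour_from_now = [
    {"name": "Grand Arcade", "hourly_charge": 3.20, "capacity": 900, "predicted_parked": 810},
    {"name": "Grafton East", "hourly_charge": 2.60, "capacity": 780, "predicted_parked": 700},
    {"name": "Queen Anne", "hourly_charge": 2.10, "capacity": 540, "predicted_parked": 270},
]

print(recommend(one_hour_from_now))
```

Running the same function over the real-time feed instead of the predicted values gives the "right now" recommendation, which is why the two answers can differ.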
We wanted some imagery to brighten up our dashboard. The design elements here are a little bit subjective, and there are no rights and wrongs. In our example we’ve used some emojis to represent weather conditions and modified a simple vector of King's College to represent current weather conditions in Cambridge. The visual elements you use are limited only by imagination, but here is our simple set to accommodate most weather conditions in Cambridge (it rarely snows here).
Example Cambridge Weather Icons
Dashboards should be designed for a particular end user's set of requirements, but in this case we didn't have a specific end user in mind. So the below is merely a collection of general elements from the above examples, to show how everything might fit together in a dashboard format.
[dashboard 23]