Glasgow’s Music Hotspot - Coursera IBM Capstone Project

Full notebook with code can be found here: https://github.com/alexbaylis41/IBM-CAPSTONE-PROJECT

A. Introduction

The aim of this study is to help a group of investors find a suitable location for a new music hub in Glasgow, Scotland.

Glasgow is Scotland’s largest metropolis, with nearly 600,000 residents located in its central region and up to 1.8 million living in the surrounding suburbs. It has long been considered a leading cultural hotspot, not just in Scotland but in the UK. This reputation is supported by Glasgow’s status as the UK’s first UNESCO City of Music and by its ranking as the UK’s top cultural and creative city (European Commission 2019).

From gigs in small venues and bars to incredible shows in arenas and city parks, Glasgow is a hotspot for live performance. Whether bagpipes or beats, traditional or techno, Glasgow is not afraid to make some noise.

For this reason, I’ve decided to investigate the ideal location to open a new music hub: a one-stop venue where musicians come to be educated, rehearse and perform their own music.

  • It’s important that this location is suitable to draw in the maximum number of people, therefore amenities such as public transport and restaurants/cafes/bars would ideally be available.
  • This venue will require a building with a large sqft area. This might rule out certain central locations due to size constraints.
  • Lastly, information about rental and property prices will be important. An up-and-coming location slightly outside the city centre may be more affordable and therefore preferable.

To help choose a location for the hub, I will analyse the various districts and neighbourhoods with these tools:

  • Beautiful Soup: Used to scrape website data for further processing
  • Pandas: Create data frames for easy integration and manipulation of data
  • Scikit-learn: KNN imputer class to fill in missing values based on nearest neighbours
  • Geopy Geocoders: Converts addresses into latitude and longitude values using Nominatim
  • Matplotlib: For plotting various graphs
  • Folium Map: Visualise your data on a Leaflet map
  • Choropleth Map: Visualise how rental prices vary across Glasgow neighbourhoods on a thematic map using colour variations to highlight differences
  • K-means Clustering: Unsupervised learning technique that creates clusters of similar neighbourhoods from the scraped data. These clusters will then be highlighted and displayed on the Folium map. It’s this technique that will allow me to make informed decisions on the best location candidates that fit the criteria set above.

B. Data

The data used to analyse neighbourhoods was sourced from several locations using web scraping or via an API.

The postcode district data was scraped from Wikipedia using the Beautiful Soup library and subsequently cleaned up so that the Folium map could be labelled with accurate district information.

The postcode boundary data was freely available from the National Records of Scotland website. I used the GeoJSON data for each postcode area to create a choropleth map, then superimposed a colour map showing the variation in rent prices per neighbourhood.

Map of postcode districts using geo json data

Unfortunately, I never found a data source for commercial property rental prices per postcode area. I did, however, find a dataset containing average rental prices of properties by number of bedrooms. Although not exactly what I was looking for, I believed it would nonetheless act as an accurate indicator of price variation between postcodes.

There were a few missing values in this dataset which I had to account for using imputation. I discuss this further in the methodology section.

The Foursquare API provides useful location data, including local amenities such as popular shops and cafes, and also transport links.

C. Methodology

To create my main data frame, I first scraped postcode data using the Beautiful Soup library. I then cleaned the data, removing any unnecessary characters and any postcodes outside the city limits.
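The scraping and cleaning step can be sketched roughly as follows. This is only an illustration: the inline HTML, the column names and the `parse_postcode_table` helper are all assumptions of mine, and the real page layout may well differ.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real table layout may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode district</th><th>Coverage</th></tr>
  <tr><td>G1[1]</td><td>Merchant City</td></tr>
  <tr><td>G2</td><td>Blythswood Hill </td></tr>
  <tr><td>G84</td><td>Helensburgh</td></tr>
</table>
"""


def parse_postcode_table(html, keep=None):
    """Parse the first HTML table into a data frame of postcode districts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 2:
            continue
        postcode = re.sub(r"\[\d+\]", "", cells[0])  # drop footnote markers
        rows.append({"Postcode": postcode, "Neighbourhood": cells[1]})
    df = pd.DataFrame(rows)
    if keep is not None:
        # Keep only the postcodes inside the area of interest.
        df = df[df["Postcode"].isin(keep)].reset_index(drop=True)
    return df


df = parse_postcode_table(SAMPLE_HTML, keep={"G1", "G2"})
```

Passing an explicit `keep` set is just one simple way to drop postcodes outside the city limits; any other filter would work as well.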

Initial data frame

Next, I used the postcodes from the table to look up the latitude and longitude of each district using Geopy’s Nominatim geocoder.

I had some minor issues initially when some postcodes returned incorrect locations, e.g. certain postcodes had coordinates in the Philippines!

After some experimenting with the address parameters, however, I finally had all the geo data and added it to the data frame.
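One way to avoid ambiguous matches like these is to make the query more specific. The sketch below is an assumption on my part rather than the exact parameters used in the notebook: `build_query`, `geocode_postcode` and `make_nominatim_geocode` are hypothetical helpers, and the last one needs Geopy installed and a network connection.

```python
from functools import partial


def build_query(postcode, city="Glasgow", country="United Kingdom"):
    """Build a more specific free-text address for a postcode district."""
    return f"{postcode}, {city}, {country}"


def geocode_postcode(postcode, geocode):
    """Return (latitude, longitude) for a postcode, or (None, None) if not found.

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` and `.longitude` attributes, or None.
    """
    location = geocode(build_query(postcode))
    if location is None:
        return (None, None)
    return (location.latitude, location.longitude)


def make_nominatim_geocode(user_agent="glasgow-music-hub-example"):
    """Create a geocode callable backed by Geopy's Nominatim (needs network)."""
    from geopy.geocoders import Nominatim  # imported lazily on purpose

    geolocator = Nominatim(user_agent=user_agent)
    # country_codes restricts the search to the UK.
    return partial(geolocator.geocode, country_codes="gb")
```

With Geopy available, `geocode_postcode("G1", make_nominatim_geocode())` would perform a live lookup. Keeping the geocoder as a parameter also makes the helper easy to test offline.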

Data frame with latitude and longitude data

As mentioned earlier, I couldn’t find a working dataset with typical commercial rental prices per postcode.

Instead, I settled on a dataset containing the 2019 average prices for renting a 2 bedroom property.

I finished off the data frame with the price data attached.

Data frame with rental prices included. Notice the NaN values.

K-nearest neighbor is an algorithm that is useful for matching a point with its closest (k) neighbours in a multi-dimensional space.

It can be used for data that are continuous, discrete, ordinal or categorical, which makes it particularly useful for dealing with all kinds of missing data.

Since there was a large number of missing values, I decided not to simply remove all the districts with missing values. Instead, I chose KNN imputation. Compared with other imputation techniques, such as replacing missing values with the mean or mode, I concluded that it would give the best rent price estimate for each neighbourhood, since each estimate is based on the prices of its surrounding postcode areas.

Scikit-learn’s KNNImputer class made this job simple. I used a smallish k of 4 and set the weights parameter to ‘distance’, which weights the closest neighbourhoods more heavily than those further away.
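As a rough sketch of this step, the snippet below runs the imputer with the same settings (k of 4, distance weighting) on a tiny made-up table. The coordinates and prices are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up coordinates and rents, for illustration only.
df = pd.DataFrame({
    "Postcode": ["A", "B", "C", "D", "E", "F"],
    "Latitude": [55.86, 55.87, 55.85, 55.88, 55.84, 55.86],
    "Longitude": [-4.25, -4.26, -4.24, -4.30, -4.20, -4.27],
    "Price": [900.0, 850.0, np.nan, 700.0, 650.0, np.nan],
})

features = ["Latitude", "Longitude", "Price"]

# Each missing price becomes a distance-weighted average of the prices of the
# 4 nearest rows that do have a price.
imputer = KNNImputer(n_neighbors=4, weights="distance")
imputed = imputer.fit_transform(df[features])

df_imputed = df.copy()
df_imputed[features] = imputed
```

For rows with a missing price, the distance is computed from latitude and longitude only, so the estimate is driven by geographic neighbours, as intended.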

This leaves us with a complete data frame that we can use for further analysis.

Complete data frame with added imputed price data

To further understand and visualise the Glasgow neighbourhoods I utilised the folium library to build an interactive map with markers of each postcode location.

Finally, using the Foursquare API, I searched for venues within a 1000m radius of each postcode district. I then created a data frame of all the venues in each area.

D. Data Analysis

To visualise the rent price variation per neighbourhood, I created a choropleth map. This is a great tool to quickly see which areas have the highest and lowest rent prices.

Choropleth map of Glasgow rental prices

For example, the neighbourhoods shown in dark red are the most expensive places to rent a property. These include the city centre (G1, G2), the west end (G3, G12) and some of the western suburbs (G61, G62).

To analyse the various amenities available in each postcode area, I grouped the venues by category type and counted the total categories per neighbourhood. There were 198 unique venues in the city.

Total count of unique venue categories per neighbourhood

The most amenity-rich areas can be seen with 100+ unique venues. These are mostly city centre areas, as expected.

One hot encoding is a technique used to encode categorical data (here the venue categories in each of the neighborhoods) to a binary format for use in ML modeling.

It creates a sparse data frame, taking every venue category variable and allocating it a 1 if present in that specific neighborhood and 0 if absent.

One hot encoding table

This was then further processed by grouping the rows by neighbourhood and calculating the mean frequency of occurrence for each venue category.
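The encoding and grouping steps can be sketched with pandas as below. The venue rows are invented for illustration and the column names are my own assumptions.

```python
import pandas as pd

# Invented venue rows, for illustration only.
venues = pd.DataFrame({
    "Neighbourhood": ["G1", "G1", "G1", "G1", "G40", "G40"],
    "Venue Category": ["Bar", "Café", "Bar", "Music Venue", "Park", "Café"],
})

# One row per venue, one 0/1 column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Averaging the 0/1 columns gives each category's share of the
# neighbourhood's venues.
grouped = onehot.groupby("Neighbourhood").mean()
```

Here two of the four G1 venues are bars, so `grouped.loc["G1", "Bar"]` is 0.5, and each row of `grouped` sums to 1.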

In short, this lets us see the most common venue types in each neighbourhood.

For any investors this is useful information to understand as it clearly illustrates the overall character of an area. For example, is it a cosmopolitan area with cafes and bars? An outskirts area with shopping centres, or an industrial area etc?

We processed this data and presented it in a data frame with the top 5 venue categories per neighborhood.
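A minimal sketch of how such a ranking could be produced from a table of category frequencies is shown below. The `top_categories` helper and the frequencies are hypothetical, not the notebook’s actual code or data.

```python
import pandas as pd

# Made-up category frequencies per neighbourhood, for illustration only.
freq = pd.DataFrame(
    {
        "Bar": [0.5, 0.05],
        "Café": [0.3, 0.2],
        "Park": [0.0, 0.6],
        "Gym": [0.2, 0.15],
    },
    index=pd.Index(["G1", "G40"], name="Neighbourhood"),
)


def top_categories(freq, n=5):
    """Return the n most frequent venue categories for each neighbourhood."""
    n = min(n, freq.shape[1])  # cannot rank more categories than exist
    rows = {}
    for name, row in freq.iterrows():
        ranked = row.sort_values(ascending=False, kind="stable").index[:n]
        rows[name] = list(ranked)
    columns = [f"Top {i + 1}" for i in range(n)]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


top = top_categories(freq, n=3)
```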

Top venue categories per neighbourhood

K-Means Clustering

K-means is an unsupervised learning technique, that is, it doesn’t require any labelling of the data before use. Its goal is to group similar instances into clusters.

I first had to scale the price data using Scikit-learn’s StandardScaler.

Standardization is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

Once this was done, I added it into the one hot encoded data frame with all the venue category data. We now have our data set ready for K-means clustering.
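A small sketch of this preparation step, using invented frequencies and prices (the column name `Price (scaled)` is my own choice):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented venue-category frequencies and rents, for illustration only.
grouped = pd.DataFrame(
    {"Bar": [0.5, 0.0, 0.2], "Park": [0.1, 0.6, 0.3]},
    index=pd.Index(["G1", "G40", "G42"], name="Neighbourhood"),
)
prices = pd.Series([950.0, 600.0, 700.0], index=grouped.index, name="Price")

# StandardScaler expects a 2D input, hence to_frame().
scaled = StandardScaler().fit_transform(prices.to_frame())

# Attach the scaled price next to the venue-category frequencies.
features = grouped.copy()
features["Price (scaled)"] = scaled.ravel()
```

Without scaling, prices in the hundreds would dominate the distance calculations over frequencies that lie between 0 and 1.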

It’s no easy task finding the optimum number of clusters to accurately represent your data, but there are some tools that help make a decision.

To help we use a metric called a silhouette score.

A silhouette score varies between -1 and 1. A value close to 1 indicates that instances are well inside their cluster, a value close to 0 means they are on a cluster boundary, and a value close to -1 means they are likely in the wrong cluster.

By iterating through different numbers of clusters (k) we can plot a graph to highlight the ‘elbow’ or drop off, where adding more clusters is unlikely to help.
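The loop over k can be sketched as follows. Synthetic blobs stand in for the neighbourhood features here, so the scores say nothing about the real data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the real feature matrix.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On clean data like this the best score is obvious. On real data the curve is often much flatter, which is why a second diagnostic can help.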

Silhouette scores. Notice the elbow at k = 3, where the score dips and begins to level out. This is a good indicator of the ideal number of clusters.

I felt the graph above wasn’t overly helpful in making a call on the number of clusters, so I decided to plot a silhouette diagram.

In the diagrams below, the height of each bar represents the number of instances in the cluster, and its width represents the sorted silhouette coefficients of the instances in that cluster (wider is better).

The dashed line is the silhouette score, or the mean silhouette coefficient.

We want every cluster to extend beyond this line, since that means its instances sit well away from the cluster boundary.
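The numbers behind such a diagram can be computed without plotting anything. The sketch below uses synthetic blobs again, and the `summary` dictionary is simply my own way of organising the result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)

coefficients = silhouette_samples(X, labels)  # one coefficient per instance
mean_score = silhouette_score(X, labels)      # the dashed line

summary = {}
for cluster in range(k):
    values = np.sort(coefficients[labels == cluster])
    summary[cluster] = {
        "size": len(values),                      # height of the bar
        "max_coefficient": float(values[-1]),     # widest point of the bar
        "reaches_mean": bool(values[-1] >= mean_score),
    }
```

A cluster whose `reaches_mean` flag is False would fall short of the dashed line in the diagram.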

We can see from the diagrams that k = 3, 4 and 5 look most promising.

I’ve decided to settle on k = 4, since the clusters are more evenly shaped and all reach beyond the dashed line.

After fitting the K-means algorithm to the data, I added the cluster column to the data frame and sorted it by cluster.

This is our final dataframe.

Final dataframe with clusters, prices and top venue categories

E. Results

To best visualise the neighbourhood allocation to each cluster, I added them to the choropleth map.

Together with the price colour map, this helps us better inform investors about potential venue candidates.

Unfortunately, I struggled to find any clear differences between the clusters, with many sharing similar features. This made it hard to draw any clear conclusions from the clustering alone.

To better choose candidates based on our initial criteria of high amenities and low rental costs, I plotted a graph to help spot the best matches.

Graph of Number of Venue Categories, Rental prices per Postcode District

From this plot we can clearly see a correlation between postcodes with a high number of venue categories (local amenities) and a high average rent price.

Our goal was to find locations that combine a high number of amenities with a low rent price.
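I picked my candidates by eye, but the same idea could be expressed as a simple filter. The sketch below is one possible rule and not the method actually used: it keeps districts at or above the median number of venue categories and at or below the median price. The `shortlist` helper, the placeholder postcodes and the values are all made up.

```python
import pandas as pd

# Placeholder districts with made-up values, for illustration only.
summary = pd.DataFrame({
    "Postcode": ["P1", "P2", "P3", "P4", "P5"],
    "Venue Categories": [120, 60, 55, 15, 70],
    "Price": [1100.0, 650.0, 900.0, 500.0, 700.0],
})


def shortlist(df, amenity_quantile=0.5, price_quantile=0.5):
    """Keep districts with many venue categories and a low rent price."""
    enough_amenities = (
        df["Venue Categories"] >= df["Venue Categories"].quantile(amenity_quantile)
    )
    affordable = df["Price"] <= df["Price"].quantile(price_quantile)
    picks = df[enough_amenities & affordable]
    return picks.sort_values("Venue Categories", ascending=False).reset_index(drop=True)


picks = shortlist(summary)
```

Here the medians are 60 venue categories and a price of 700, so only P5 and P2 pass both tests.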

From the plot, my top choices include:

G14: Whiteinch, Scotstoun

G40: Bridgeton, Calton, Dalmarnock

G42: Battlefield, Govanhill, Mount Florida

G67: Cumbernauld (south)

G53: Darnley, Pollok, Crookston

The venue category totals together with prices are displayed in the table ‘Top picks’ below.

Top locations based on high amenities and low rent prices

F. Discussion

The aim of this investigation was to highlight suitable areas to plan a music hub in Glasgow.

I hoped the K-means clustering algorithm would return groups of similar districts with common amenities and prices to help answer this question. Instead, no strong conclusions could be drawn from the generic clusters produced.

As with most machine learning algorithms, poor data in leads to poor results out. Limitations such as not finding proper commercial property rental prices, and having to impute missing values in the price data we did have, could have contributed to the poor modelling results.

Due to the lack of clear differences between the clusters, the only safe conclusions came from the plot of amenities against price and a manual search for the best candidates.

I guess in this case the simplest solution is the best.

Lastly, the choropleth maps proved to be a useful visualisation tool for comparing rental prices across neighbourhoods at a glance. If the clustering had yielded clearly distinct groups of similar neighbourhoods, plotting them on the map would have been a great way to quickly spot suitable venue locations.

G. Conclusion

This project attempted to find a solution to a hypothetical business problem utilising some of the current data science tools and techniques.

Although on this occasion a perfect solution was not found, these techniques, when used alongside good data, can produce accurate results that would be highly beneficial to anyone looking to gain insights into city districts and neighbourhoods in future projects.

Thanks for taking the time to read.