Glasgow’s Music Hotspot - Coursera IBM Capstone Project

Full notebook with code can be found here: https://github.com/alexbaylis41/IBM-CAPSTONE-PROJECT

A. Introduction

The aim of this study is to help a group of investors find a suitable location for a new music hub in Glasgow, Scotland.

Venue Location Criteria

  • The location should draw in as many people as possible, so amenities such as public transport and restaurants, cafes and bars should ideally be close by.
  • The venue will require a building with a large floor area. This might rule out certain central locations due to size constraints.
  • Lastly, rental and property price data will be important. An up-and-coming location slightly outside the city centre may be more affordable and therefore preferable.

Tools and Libraries for Analysis

To help choose a location for the hub, I will analyse the various districts and neighbourhoods with these tools:

  • Pandas: Create data frames for easy integration and manipulation of data
  • Scikit-learn: KNNImputer class to fill in missing values based on nearest neighbours
  • Geopy Geocoders: To convert addresses into latitude and longitude values using Nominatim
  • Matplotlib: For plotting various graphs
  • Folium Map: Visualise your data on a Leaflet map
  • Choropleth Map: Visualise how rental prices vary across Glasgow neighbourhoods on a thematic map using colour variations to highlight differences
  • K-means Clustering: Unsupervised learning technique that creates clusters of similar neighbourhoods from the scraped data. The clusters are then highlighted on the Folium map. It’s this technique that will allow me to make informed decisions on the best candidate locations that fit the criteria set above.

B. Data

The data used to analyse neighbourhoods was sourced from several locations using web scraping or via an API.

GLASGOW POSTAL CODES

GEOJSON DATA

Map of postcode districts using GeoJSON data

PROPERTY RENTAL DATA

FOURSQUARE LOCATION DATA

The Foursquare API service provides useful location data, including local amenities (e.g. popular shops and cafes) and transport links.

C. Methodology

Create initial data frame

To create my main data frame, I first scraped postcode data using the Beautiful Soup library. I then cleaned the data, removing unnecessary characters and any postcodes outside the city limits.

Initial data frame
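
A minimal sketch of the scrape-and-clean step, assuming the postcode districts sit in an HTML table with three columns. The inline HTML, the column names, and the rule of keeping only rows whose post town is "Glasgow" are illustrative assumptions, not the layout of the page that was actually scraped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page; the real source table may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode district</th><th>Post town</th><th>Coverage</th></tr>
  <tr><td>G1 </td><td>Glasgow</td><td>Merchant City</td></tr>
  <tr><td>G2[1]</td><td>Glasgow</td><td>Blythswood Hill</td></tr>
  <tr><td>G60</td><td>Clydebank</td><td>Bowling, Old Kilpatrick</td></tr>
</table>
"""


def parse_postcode_table(html: str) -> pd.DataFrame:
    """Turn the first HTML table into a cleaned postcode data frame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:
            rows.append(cells)
    df = pd.DataFrame(rows, columns=["Postcode", "PostTown", "Neighbourhood"])
    # Remove footnote markers such as "[1]" and any stray whitespace.
    df["Postcode"] = (
        df["Postcode"].str.replace(r"\[\d+\]", "", regex=True).str.strip()
    )
    # Keep only districts inside the city (assumed here: post town "Glasgow").
    return df[df["PostTown"] == "Glasgow"].reset_index(drop=True)


postcodes = parse_postcode_table(SAMPLE_HTML)
print(postcodes)
```

In practice the HTML string would come from an HTTP request to the source page rather than from a constant.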

Acquire longitude and latitude data with Geo locator

Next, I used the postcodes from the table to look up longitude and latitude values using Geopy’s Nominatim geocoder.

Data frame with latitude and longitude data
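
The lookup can be sketched roughly as below. The function name `add_coordinates`, the `user_agent` string, the query format and the one-second rate limit are assumptions for illustration. The geocoder is passed in as a parameter so the function can be tried with a stub before making real requests to Nominatim.

```python
import pandas as pd


def add_coordinates(df, geocode=None):
    """Return a copy of df with Latitude and Longitude columns added."""
    if geocode is None:
        # Imported lazily so the function also works with a stub geocoder.
        from geopy.extra.rate_limiter import RateLimiter
        from geopy.geocoders import Nominatim

        geolocator = Nominatim(user_agent="glasgow-music-hub")  # hypothetical name
        # Space the requests out to stay within Nominatim's usage policy.
        geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    lats, lons = [], []
    for postcode in df["Postcode"]:
        location = geocode(f"{postcode}, Glasgow, Scotland")
        # geocode returns None when nothing is found for the query.
        lats.append(location.latitude if location else None)
        lons.append(location.longitude if location else None)
    return df.assign(Latitude=lats, Longitude=lons)
```

Calling `add_coordinates(postcode_df)` without a second argument would query the live service, so rows that come back empty are worth checking by hand.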

Scrape Rent Prices

As mentioned earlier, I couldn’t find a working dataset with typical commercial rental prices per postcode.

Data frame with rental prices included. Notice the NaN values.

Impute Missing Price Values

K-nearest neighbours is an algorithm that matches a point with its k closest neighbours in a multi-dimensional space.

Complete data frame with added imputed price data
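
A small sketch of the imputation with Scikit-learn's `KNNImputer`. The choice of features (latitude, longitude and price), the value of `n_neighbors` and the numbers in the table are assumptions for illustration, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative values only, not the scraped rental data.
df = pd.DataFrame({
    "Postcode": ["G1", "G2", "G3", "G4"],
    "Latitude": [55.860, 55.864, 55.865, 55.871],
    "Longitude": [-4.247, -4.257, -4.276, -4.252],
    "Price": [20.0, 22.0, np.nan, 16.0],
})

features = ["Latitude", "Longitude", "Price"]
imputer = KNNImputer(n_neighbors=2)  # k = 2 is an assumed value

# Each missing price becomes the mean price of the k nearest rows,
# where distance is measured on the features both rows have present.
df[features] = imputer.fit_transform(df[features])
print(df)
```

For a row with a missing price, the distance is computed from the coordinates alone, so the imputed value comes from the geographically closest districts.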

Create Folium Map

To further understand and visualise the Glasgow neighbourhoods I utilised the folium library to build an interactive map with markers of each postcode location.

Foursquare API

Finally, using the Foursquare API, I searched for venues within a 1000 m radius of each postcode district. I then created a data frame of all the venues in each area.
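
The request and the flattening of its response might look roughly like this. The endpoint (Foursquare's legacy v2 `venues/explore`), the version date, the response layout and the helper names are all assumptions; the credentials are placeholders.

```python
from urllib.parse import urlencode

# Assumed legacy v2 endpoint; check the current Foursquare docs before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius=1000, limit=100, version="20200101"):
    """Build the query URL for venues within `radius` metres of a point."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,  # API version date (assumed value)
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def flatten_venues(postcode, payload):
    """Turn one (assumed) explore response into rows for a data frame."""
    items = payload["response"]["groups"][0]["items"]
    return [
        {
            "Postcode": postcode,
            "Venue": item["venue"]["name"],
            "Category": item["venue"]["categories"][0]["name"],
        }
        for item in items
    ]


url = build_explore_url(55.86, -4.25, "YOUR_ID", "YOUR_SECRET")
```

Collecting the rows from `flatten_venues` for every postcode and passing them to `pandas.DataFrame` would give one table of venues for the whole city.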

D. Data Analysis

Comparing Average Rent Prices For Each Neighbourhood

To visualise the rent price variation per neighbourhood, I created a choropleth map. This is a great tool to quickly see which areas have the highest and lowest rent prices.

Choropleth map of Glasgow rental prices

Explore Neighbourhood Venues

To analyse the amenities available in each postcode area, I grouped the venues by category type and counted the total categories per neighbourhood. There were 198 unique venues in the city.

Total count of unique venue categories per neighbourhood
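
The grouping can be sketched with pandas as follows. The venue rows and column names are made up for illustration.

```python
import pandas as pd

# Illustrative venue table: one row per venue returned for a district.
venues = pd.DataFrame({
    "Postcode": ["G1", "G1", "G1", "G2", "G2", "G2"],
    "Venue": ["A", "B", "C", "D", "E", "F"],
    "Category": ["Bar", "Café", "Bar", "Hotel", "Bar", "Music Venue"],
})

# Number of distinct venue categories in each postcode district.
category_counts = (
    venues.groupby("Postcode")["Category"]
    .nunique()
    .rename("UniqueCategories")
    .reset_index()
)

# Number of distinct categories across the whole table.
total_unique = venues["Category"].nunique()

print(category_counts)
print(total_unique)
```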

One Hot Encoding

One-hot encoding is a technique used to encode categorical data (here, the venue categories in each neighbourhood) in a binary format for use in ML modelling.

One hot encoding table
Top venue categories per neighbourhood
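
A short sketch of the encoding and of one way to derive the top categories per district. The venue rows, the use of the per-district mean as a frequency, and the helper `top_categories` are assumptions for illustration.

```python
import pandas as pd

# Illustrative venue table.
venues = pd.DataFrame({
    "Postcode": ["G1", "G1", "G1", "G2", "G2", "G2", "G2"],
    "Category": ["Bar", "Bar", "Café", "Hotel", "Hotel", "Hotel", "Bar"],
})

# One column per category, 1 where the venue belongs to it and 0 otherwise.
onehot = pd.get_dummies(venues["Category"], dtype=int)
onehot.insert(0, "Postcode", venues["Postcode"])

# The mean of each column per district is that category's share of venues.
grouped = onehot.groupby("Postcode").mean()


def top_categories(row, n=2):
    """Names of the n most frequent categories in one district."""
    return row.sort_values(ascending=False).index[:n].tolist()


top = {postcode: top_categories(row) for postcode, row in grouped.iterrows()}
print(grouped)
print(top)
```

The `grouped` table, one row per district and one column per category, is the kind of numeric input a clustering algorithm can work with.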

K-Means Clustering

K-means is an unsupervised learning technique; that is, it doesn’t require any labelling of the data before use. Its goal is to group similar instances into clusters.

Scaling Price Data

I first had to scale the price data using Scikit-learn’s StandardScaler.
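
K-means relies on Euclidean distance, so a feature measured in large units can dominate features that lie between 0 and 1. A minimal sketch of the scaling step, with made-up prices and an assumed column name `PriceScaled`:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative prices only.
df = pd.DataFrame({
    "Postcode": ["G1", "G2", "G3", "G4", "G5"],
    "Price": [16.0, 19.0, 20.0, 22.0, 23.0],
})

scaler = StandardScaler()
# fit_transform expects a 2-D input, hence the double brackets.
df["PriceScaled"] = scaler.fit_transform(df[["Price"]]).ravel()

print(df)
```

After scaling, the price column has zero mean and unit variance.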

Silhouette Score

It’s no easy task finding the optimum number of clusters to accurately represent your data, but there are some tools that help make a decision.

Silhouette scores. Notice the elbow at 3 clusters, where the score dips and begins to level out. This is a good indicator of the ideal number of clusters.
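
A sketch of how silhouette scores can be computed for a range of cluster counts. It uses synthetic blobs from `make_blobs` instead of the neighbourhood data, and the range of k and the random seeds are assumptions. The score lies between -1 and 1, and higher values indicate better-separated clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighbourhood feature matrix.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Mean silhouette coefficient over all samples for this k.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
```

Plotting `scores` against k with Matplotlib gives a chart like the one described above.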

Silhouette Diagram

I felt the graph above wasn’t overly helpful in making a call on the number of clusters, so I decided to plot a silhouette diagram.
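
A silhouette diagram shows the coefficient of every sample, sorted within each cluster, rather than a single average. A rough sketch on synthetic data follows; the blobs, k = 3 and the plot styling are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)

# One silhouette coefficient per sample.
coefficients = silhouette_samples(X, labels)

fig, ax = plt.subplots()
y_lower = 0
for cluster in range(k):
    values = np.sort(coefficients[labels == cluster])
    y_upper = y_lower + len(values)
    # One horizontal "knife" shape per cluster.
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, values)
    y_lower = y_upper + 5  # leave a gap between clusters

# Dashed line marks the mean coefficient across all samples.
ax.axvline(coefficients.mean(), linestyle="--")
ax.set_xlabel("Silhouette coefficient")
ax.set_ylabel("Samples grouped by cluster")
```

Clusters whose shapes are of similar width and extend past the dashed line suggest a reasonable choice of k.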

Fitting the Data

After fitting the K-means algorithm to the data, I added the cluster column to the data frame and sorted it by cluster.

Final dataframe with clusters, prices and top venue categories
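
The fit-and-label step can be sketched as below. The feature values, the column names and the random seed are illustrative assumptions; k = 3 follows the cluster count discussed above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative features: venue-category frequencies plus the scaled price.
features = pd.DataFrame(
    {
        "Bar": [0.6, 0.5, 0.1, 0.1, 0.3, 0.3],
        "Café": [0.3, 0.4, 0.1, 0.2, 0.6, 0.5],
        "PriceScaled": [1.2, 1.0, -1.1, -0.9, 0.0, 0.1],
    },
    index=pd.Index(["G1", "G2", "G3", "G4", "G5", "G6"], name="Postcode"),
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Attach the labels and order the table by cluster.
result = features.reset_index()
result.insert(1, "Cluster", kmeans.labels_)
result = result.sort_values("Cluster").reset_index(drop=True)

print(result)
```

The cluster numbers themselves are arbitrary; only the grouping of districts carries meaning.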

E. Results

To best visualise the neighbourhood allocation to each cluster, I added them to the choropleth map.

Graph of number of venue categories and rental prices per postcode district
Top locations based on high amenities and low rent prices

F. Discussion

The aim of this investigation was to highlight suitable areas to plan a music hub in Glasgow.

G. Conclusion

This project attempted to find a solution to a hypothetical business problem utilising some of the current data science tools and techniques.