Data Analytics in Football - Part 1: Project Formulation, Data Acquisition and Preparation

Using data analytics and machine learning to predict football match final result.

Qamarul 'Ashraf
2 min readMar 27, 2021

Project Formulation

This project is developed in order to identify the factors that affects football match final result using data analytics techniques. The findings from the analytics techniques is then used to develop a predictive model based on historical data. The project scope covers a total of 10 seasons of 5 football leagues, specifically, European top 5 leagues which are; English Premier League, German Bundesliga, Italian Calcio A, Spanish Primera Division/La Liga and French Ligue 1. The requirements for this study are:

  1. Jupyter notebook. Python is used to obtain data, perform analytics and train predictive model.
  2. Historical data. This study obtains dataset from football-data.co.uk

To have a better understanding of what you are reading about, here is the system architecture:

System Architecture

Data Acquisition

The data is obtained from the website stated in the previous chapter. Rather than scrapping, the dataset is simply read using pandas module and stored into the data frame

# ACQUIRE DATASETS FROM football-data.co.uk# Englandeng11 = pd.read_csv(“https://www.football-data.co.uk/mmz4281/1011/E0.csv")eng12 = pd.read_csv(“https://www.football-data.co.uk/mmz4281/1112/E0.csv")

The code is repeated multiple times to store different league seasons into different dataframe. However, if reader intends to store the datasets directly into one single dataframe, by all means, proceed. use “df.head()” to view the first few line of the datadrame, or “df.tail()” for the last few lines.

Data Preparation

The next sensible step following data acquisition is to check the dataset for any missing or inconsistent values, or transform the dataset into whatever dear readers need. The process carried out first is parsing the date into datetime object. Remember that in Python, indentation is important. Therefore, a single tab is required at the lines that returns the value of a function.

# Parse date from string to datetime objectdef parse_date(date):# Converts date from string to datetime object.return datetime.strptime(date, '%d/%m/%y').date()def parse_date_other(date):# Converts date when strptime layout is differentreturn datetime.strptime(date, '%d/%m/%Y').date()

Next, new attributes are derived based. Of course after replacing the missing values by removing the columns, all that is left is just feed the dataset into the predictive model and be done with it. However, since this is a historical data, and the fact that the home/away/draw result is in the same row as the full time result ‘FTR’ (side note, this is the project’s class label), the predictive model will return an accuracy of 1.0. This is due to the fact the full time result DETERMINES which team won the league match. If the explanation is unclear, we will revisit this during predictive model development.

--

--

No responses yet