FantasyFootball feature engineering
In this project, we will try to predict the performances of the players of Italian football championship Serie A. The project is meant in the context of Fantasy Football, a popular game in which a group of friends organize a virtual team of real serie A players. The aim of the game is to score more than the opponen (the score is the sum of the performance evaluation marks) of the players of your team. Predicting the marks of your players could definitely be an advantage for a fantasy football coach.
In this first notebook, we will create the dataset and do a lot of feature engineering.
#import
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
Let try to extract data from the sqlite dataset
engine = create_engine('sqlite:///2014')
# Open engine connection: con
con = engine.connect()
table_names = engine.table_names()
print(table_names)
[u'games', u'marks_corriere', u'marks_fantagazzetta', u'marks_gazzetta', u'players', u'sqlite_sequence']
let’s focus for the moment on the results of just one sport newspaper such as Gazzetta
with engine.connect() as con:
rs = con.execute('SELECT * from games')
# Save results of the query to DataFrame: df
games = pd.DataFrame(rs.fetchall())
games.columns = rs.keys()
rs = con.execute('SELECT * from marks_gazzetta')
marks = pd.DataFrame(rs.fetchall())
marks.columns = rs.keys()
rs = con.execute('SELECT * from players')
players = pd.DataFrame(rs.fetchall())
players.columns = rs.keys()
players.head()
| id | name | team | role | |
|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | P |
| 1 | 2248 | Benalouane | Atalanta | D |
| 2 | 2249 | Biava | Atalanta | D |
| 3 | 2250 | Drame' | Atalanta | D |
| 4 | 2251 | Zappacosta | Atalanta | D |
marks.head()
| id | day | player | team | mark | goal | gotgoal | penalty | spenalty | mpenalty | owngoal | assist | ycard | rcard | fmark | enter | exit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21175 | 1 | 2247 | Atalanta | 7.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.0 | 0 | 0 |
| 1 | 21176 | 1 | 2248 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0 | 0 | 0 |
| 2 | 21177 | 1 | 2249 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0 | 0 | 0 |
| 3 | 21178 | 1 | 2250 | Atalanta | 5.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5.0 | 0 | 0 |
| 4 | 21179 | 1 | 2251 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6.0 | 0 | 0 |
Let’s join this dataset in order to obtain the player names together with the id (it won’t affect the results of course, but it will be easier to check the process
marks_merged = pd.merge(marks,players, how = 'inner', left_on='player', right_on='id')
marks_merged.head()
| id_x | day | player | team_x | mark | goal | gotgoal | penalty | spenalty | mpenalty | ... | assist | ycard | rcard | fmark | enter | exit | id_y | name | team_y | role | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21175 | 1 | 2247 | Atalanta | 7.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 7.0 | 0 | 0 | 2247 | Sportiello | Atalanta | P |
| 1 | 21453 | 2 | 2247 | Atalanta | 7.5 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 6.5 | 0 | 0 | 2247 | Sportiello | Atalanta | P |
| 2 | 21733 | 3 | 2247 | Atalanta | 6.0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 5.0 | 0 | 0 | 2247 | Sportiello | Atalanta | P |
| 3 | 22012 | 4 | 2247 | Atalanta | 6.5 | 0 | 2 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 7.5 | 0 | 0 | 2247 | Sportiello | Atalanta | P |
| 4 | 22291 | 5 | 2247 | Atalanta | 5.0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2.0 | 0 | 0 | 2247 | Sportiello | Atalanta | P |
5 rows × 21 columns
marks_merged.sort_values(['day', 'team_x','name']).head()
| id_x | day | player | team_x | mark | goal | gotgoal | penalty | spenalty | mpenalty | ... | assist | ycard | rcard | fmark | enter | exit | id_y | name | team_y | role | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 21176 | 1 | 2248 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 6.0 | 0 | 0 | 2248 | Benalouane | Atalanta | D |
| 281 | 21186 | 1 | 2258 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 6.0 | 1 | 0 | 2258 | Bianchi | Atalanta | A |
| 64 | 21177 | 1 | 2249 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 6.0 | 0 | 0 | 2249 | Biava | Atalanta | D |
| 302 | 21187 | 1 | 2259 | Atalanta | 5.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 4.5 | 1 | 0 | 2259 | Boakye | Atalanta | A |
| 140 | 21180 | 1 | 2252 | Atalanta | 6.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 6.0 | 0 | 0 | 2252 | Bonaventura | Milan | C |
5 rows × 21 columns
For the moment let’s use just a subset of the features, we will have time to get them back.
marks_final = marks_merged[['player','name','team_x','day','goal','penalty','owngoal','mark']]
Don’t really like that “team_x”
marks_final=marks_final.rename(columns = {'team_x':'team'})
let us focus on the games.
games.head()
| day | home | away | |
|---|---|---|---|
| 0 | 1 | Chievo | Juventus |
| 1 | 1 | Atalanta | Verona |
| 2 | 1 | Torino | Inter |
| 3 | 1 | Sassuolo | Cagliari |
| 4 | 1 | Genoa | Napoli |
as suspected, there are no results for the matches. We should obtain those from the player statistics. What will influence the results?
- Of course, the Goals, so consider the feature gol of the dataset marks
- What about the penalties? Let’s consider them as well
- Do not forget the unfortunate owngoals
def whoWon (teamH,teamA,day):
"""simple functions aimed to deliver the winner of each game"""
marksSpecificH = marks_final[(marks_final['day'] == day) & (marks_final['team'] == teamH)]
marksSpecificA = marks_final[(marks_final['day'] == day) & (marks_final['team'] == teamA)]
#print(marksSpecificH)
goalH = marksSpecificH['goal'].sum() + marksSpecificH['penalty'].sum() + marksSpecificA['owngoal'].sum()
#print(marksSpecificA)
goalA = marksSpecificA['goal'].sum() + marksSpecificA['penalty'].sum() + marksSpecificH['owngoal'].sum()
winner = 0
if (goalH > goalA): winner = 1
if (goalH < goalA): winner = 2
if (goalH == goalA): winner = 0
return winner
games['winner'] = map(lambda teamH,teamA,day: whoWon(teamH,teamA,day), games['home'], games['away'], games['day'])
games[games['day']==1]
| day | home | away | winner | |
|---|---|---|---|---|
| 0 | 1 | Chievo | Juventus | 2 |
| 1 | 1 | Atalanta | Verona | 0 |
| 2 | 1 | Torino | Inter | 0 |
| 3 | 1 | Sassuolo | Cagliari | 0 |
| 4 | 1 | Genoa | Napoli | 2 |
| 5 | 1 | Palermo | Sampdoria | 0 |
| 6 | 1 | Cesena | Parma | 1 |
| 7 | 1 | Roma | Fiorentina | 1 |
| 8 | 1 | Udinese | Empoli | 1 |
| 9 | 1 | Milan | Lazio | 1 |
Very good! now we have the results. However, I am afraid that we need more feature engineering if we want other interesting informations. Indeed, the rank of the team is quite an important matter, for this we need to rebuild the rankings.
days = np.arange(0,39,1)
teams = games1['home']
teams = teams.drop_duplicates()
points = pd.DataFrame(columns=np.arange(0,39,1))
points['Team'] = teams
points[0] = 0
cols = points.columns.tolist()
cols = cols[-1:] + cols[:-1]
points = points[cols]
points
| Team | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chievo | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Atalanta | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Torino | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Sassuolo | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Genoa | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | Palermo | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | Cesena | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | Roma | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | Udinese | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | Milan | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | Napoli | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | Empoli | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | Fiorentina | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | Cagliari | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | Juventus | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | Verona | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16 | Sampdoria | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 17 | Lazio | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | Parma | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | Inter | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 rows × 40 columns
def updatePoints (teamH,teamA,day,winner):
if (winner == 1):
points.ix[points['Team'] == teamH, day] = 3
points.ix[points['Team'] == teamA, day] = 0
if (winner == 2):
points.ix[points['Team'] == teamA, day] = 3
points.ix[points['Team'] == teamH, day] = 0
if (winner == 0):
points.ix[points['Team'] == teamH, day] = 1
points.ix[points['Team'] == teamA, day] = 1
_ = map(lambda teamH,teamA,day,winner: updatePoints(teamH,teamA,day,winner), games['home'], games['away'], games['day'], games['winner'])
points
| Team | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chievo | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 3 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 0 |
| 1 | Atalanta | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | ... | 0 | 3 | 1 | 1 | 1 | 1 | 3 | 0 | 1 | 0 |
| 2 | Torino | 0 | 1 | 0 | 0 | 3 | 1 | 0 | 3 | 0 | ... | 3 | 1 | 1 | 3 | 1 | 0 | 0 | 3 | 0 | 3 |
| 3 | Sassuolo | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 3 | ... | 3 | 0 | 1 | 0 | 0 | 1 | 3 | 3 | 3 | 3 |
| 4 | Genoa | 0 | 0 | 1 | 3 | 1 | 0 | 3 | 1 | 3 | ... | 1 | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 3 | 0 |
| 5 | Palermo | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | ... | 0 | 3 | 3 | 0 | 1 | 1 | 0 | 3 | 0 | 3 |
| 6 | Cesena | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | Roma | 0 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 1 | ... | 3 | 1 | 1 | 0 | 3 | 3 | 0 | 3 | 3 | 0 |
| 8 | Udinese | 0 | 3 | 0 | 3 | 3 | 3 | 1 | 0 | 3 | ... | 1 | 0 | 1 | 3 | 0 | 3 | 0 | 0 | 0 | 0 |
| 9 | Milan | 0 | 3 | 3 | 0 | 1 | 1 | 3 | 3 | 1 | ... | 3 | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 3 | 3 |
| 10 | Napoli | 0 | 3 | 0 | 0 | 1 | 3 | 3 | 1 | 3 | ... | 0 | 3 | 3 | 3 | 0 | 3 | 1 | 3 | 0 | 0 |
| 11 | Empoli | 0 | 0 | 0 | 1 | 1 | 1 | 3 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 3 | 3 | 0 | 0 | 1 | 0 |
| 12 | Fiorentina | 0 | 0 | 1 | 3 | 1 | 1 | 3 | 0 | 1 | ... | 3 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 3 | 3 |
| 13 | Cagliari | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | ... | 0 | 0 | 0 | 3 | 0 | 3 | 1 | 0 | 3 | 3 |
| 14 | Juventus | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | 3 | ... | 3 | 0 | 3 | 0 | 3 | 3 | 1 | 3 | 3 | 1 |
| 15 | Verona | 0 | 1 | 3 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 1 | 0 | 3 | 3 | 1 | 0 | 1 | 3 | 1 | 1 |
| 16 | Sampdoria | 0 | 1 | 3 | 1 | 3 | 3 | 3 | 1 | 1 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 3 | 0 | 1 | 1 |
| 17 | Lazio | 0 | 0 | 3 | 0 | 0 | 3 | 3 | 3 | 3 | ... | 3 | 3 | 0 | 1 | 3 | 1 | 0 | 3 | 0 | 3 |
| 18 | Parma | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 1 | 3 | 0 | 0 | 1 | 0 | 1 | 1 |
| 19 | Inter | 0 | 1 | 3 | 1 | 3 | 0 | 0 | 1 | 3 | ... | 1 | 3 | 1 | 3 | 3 | 1 | 3 | 0 | 0 | 3 |
20 rows × 40 columns
#print(pointsTest)
points['total'] = 0
for i in range(39):
points['total'] = points['total'] + points[i]
points
| Team | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chievo | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 3 | 1 | 1 | 3 | 1 | 1 | 0 | 1 | 0 | 43 |
| 1 | Atalanta | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | ... | 3 | 1 | 1 | 1 | 1 | 3 | 0 | 1 | 0 | 37 |
| 2 | Torino | 0 | 1 | 0 | 0 | 3 | 1 | 0 | 3 | 0 | ... | 1 | 1 | 3 | 1 | 0 | 0 | 3 | 0 | 3 | 54 |
| 3 | Sassuolo | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 3 | ... | 0 | 1 | 0 | 0 | 1 | 3 | 3 | 3 | 3 | 49 |
| 4 | Genoa | 0 | 0 | 1 | 3 | 1 | 0 | 3 | 1 | 3 | ... | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 3 | 0 | 59 |
| 5 | Palermo | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 3 | 0 | ... | 3 | 3 | 0 | 1 | 1 | 0 | 3 | 0 | 3 | 49 |
| 6 | Cesena | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 24 |
| 7 | Roma | 0 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 1 | ... | 1 | 1 | 0 | 3 | 3 | 0 | 3 | 3 | 0 | 70 |
| 8 | Udinese | 0 | 3 | 0 | 3 | 3 | 3 | 1 | 0 | 3 | ... | 0 | 1 | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 41 |
| 9 | Milan | 0 | 3 | 3 | 0 | 1 | 1 | 3 | 3 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 3 | 0 | 3 | 3 | 52 |
| 10 | Napoli | 0 | 3 | 0 | 0 | 1 | 3 | 3 | 1 | 3 | ... | 3 | 3 | 3 | 0 | 3 | 1 | 3 | 0 | 0 | 63 |
| 11 | Empoli | 0 | 0 | 0 | 1 | 1 | 1 | 3 | 1 | 0 | ... | 0 | 1 | 1 | 3 | 3 | 0 | 0 | 1 | 0 | 42 |
| 12 | Fiorentina | 0 | 0 | 1 | 3 | 1 | 1 | 3 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 64 |
| 13 | Cagliari | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 3 | ... | 0 | 0 | 3 | 0 | 3 | 1 | 0 | 3 | 3 | 34 |
| 14 | Juventus | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | 3 | ... | 0 | 3 | 0 | 3 | 3 | 1 | 3 | 3 | 1 | 87 |
| 15 | Verona | 0 | 1 | 3 | 3 | 1 | 0 | 3 | 0 | 0 | ... | 0 | 3 | 3 | 1 | 0 | 1 | 3 | 1 | 1 | 46 |
| 16 | Sampdoria | 0 | 1 | 3 | 1 | 3 | 3 | 3 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 56 |
| 17 | Lazio | 0 | 0 | 3 | 0 | 0 | 3 | 3 | 3 | 3 | ... | 3 | 0 | 1 | 3 | 1 | 0 | 3 | 0 | 3 | 69 |
| 18 | Parma | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 1 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | 26 |
| 19 | Inter | 0 | 1 | 3 | 1 | 3 | 0 | 0 | 1 | 3 | ... | 3 | 1 | 3 | 3 | 1 | 3 | 0 | 0 | 3 | 55 |
20 rows × 41 columns
Cool, now let’s try to obtain the position in the rank for each day of the championship
pointsDay = pd.DataFrame(columns=np.arange(0,39,1))
pointsDay['Team'] = teams
pointsDay[0] = 0
for i in range(1,39):
pointsDay[i] = pointsDay[i-1] + points[i]
pointsDay
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | Team | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | ... | 35 | 36 | 37 | 40 | 41 | 42 | 42 | 43 | 43 | Chievo |
| 1 | 0 | 1 | 4 | 4 | 4 | 4 | 4 | 7 | 7 | 8 | ... | 29 | 30 | 31 | 32 | 33 | 36 | 36 | 37 | 37 | Atalanta |
| 2 | 0 | 1 | 1 | 1 | 4 | 5 | 5 | 8 | 8 | 11 | ... | 43 | 44 | 47 | 48 | 48 | 48 | 51 | 51 | 54 | Torino |
| 3 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 4 | 7 | 10 | ... | 35 | 36 | 36 | 36 | 37 | 40 | 43 | 46 | 49 | Sassuolo |
| 4 | 0 | 0 | 1 | 4 | 5 | 5 | 8 | 9 | 12 | 15 | ... | 44 | 44 | 47 | 50 | 50 | 53 | 56 | 59 | 59 | Genoa |
| 5 | 0 | 1 | 1 | 2 | 3 | 3 | 3 | 6 | 6 | 9 | ... | 38 | 41 | 41 | 42 | 43 | 43 | 46 | 46 | 49 | Palermo |
| 6 | 0 | 3 | 3 | 4 | 4 | 5 | 6 | 6 | 6 | 6 | ... | 22 | 23 | 23 | 24 | 24 | 24 | 24 | 24 | 24 | Cesena |
| 7 | 0 | 3 | 6 | 9 | 12 | 15 | 15 | 18 | 19 | 22 | ... | 57 | 58 | 58 | 61 | 64 | 64 | 67 | 70 | 70 | Roma |
| 8 | 0 | 3 | 3 | 6 | 9 | 12 | 13 | 13 | 16 | 16 | ... | 34 | 35 | 38 | 38 | 41 | 41 | 41 | 41 | 41 | Udinese |
| 9 | 0 | 3 | 6 | 6 | 7 | 8 | 11 | 14 | 15 | 16 | ... | 42 | 43 | 43 | 43 | 43 | 46 | 46 | 49 | 52 | Milan |
| 10 | 0 | 3 | 3 | 3 | 4 | 7 | 10 | 11 | 14 | 15 | ... | 50 | 53 | 56 | 56 | 59 | 60 | 63 | 63 | 63 | Napoli |
| 11 | 0 | 0 | 0 | 1 | 2 | 3 | 6 | 7 | 7 | 7 | ... | 33 | 34 | 35 | 38 | 41 | 41 | 41 | 42 | 42 | Empoli |
| 12 | 0 | 0 | 1 | 4 | 5 | 6 | 9 | 9 | 10 | 13 | ... | 49 | 49 | 49 | 49 | 52 | 55 | 58 | 61 | 64 | Fiorentina |
| 13 | 0 | 1 | 1 | 1 | 1 | 4 | 4 | 5 | 8 | 9 | ... | 21 | 21 | 24 | 24 | 27 | 28 | 28 | 31 | 34 | Cagliari |
| 14 | 0 | 3 | 6 | 9 | 12 | 15 | 18 | 19 | 22 | 22 | ... | 70 | 73 | 73 | 76 | 79 | 80 | 83 | 86 | 87 | Juventus |
| 15 | 0 | 1 | 4 | 7 | 8 | 8 | 11 | 11 | 11 | 12 | ... | 33 | 36 | 39 | 40 | 40 | 41 | 44 | 45 | 46 | Verona |
| 16 | 0 | 1 | 4 | 5 | 8 | 11 | 14 | 15 | 16 | 16 | ... | 49 | 50 | 50 | 51 | 51 | 54 | 54 | 55 | 56 | Sampdoria |
| 17 | 0 | 0 | 3 | 3 | 3 | 6 | 9 | 12 | 15 | 16 | ... | 58 | 58 | 59 | 62 | 63 | 63 | 66 | 66 | 69 | Lazio |
| 18 | 0 | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | ... | 19 | 20 | 23 | 23 | 23 | 24 | 24 | 25 | 26 | Parma |
| 19 | 0 | 1 | 4 | 5 | 8 | 8 | 8 | 9 | 12 | 15 | ... | 41 | 42 | 45 | 48 | 49 | 52 | 52 | 52 | 55 | Inter |
20 rows × 40 columns
rankDay = pd.DataFrame(columns=np.arange(0,39,1))
rankDay['Team'] = teams
for i in range(0,39):
rankDay[i] = pointsDay[i].rank(ascending=0)
rankDay[i] = map(lambda x: int(x), rankDay[i])
rankDay.sort_values(by=38)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | Team | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | 10 | 3 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Juventus |
| 7 | 10 | 3 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | ... | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | Roma |
| 17 | 10 | 17 | 10 | 13 | 16 | 9 | 8 | 6 | 5 | 4 | ... | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | Lazio |
| 12 | 10 | 17 | 15 | 9 | 8 | 9 | 8 | 10 | 11 | 10 | ... | 5 | 6 | 6 | 7 | 5 | 5 | 5 | 5 | 4 | Fiorentina |
| 10 | 10 | 3 | 10 | 13 | 11 | 8 | 7 | 7 | 7 | 8 | ... | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | Napoli |
| 4 | 10 | 17 | 15 | 9 | 8 | 12 | 10 | 10 | 8 | 8 | ... | 7 | 7 | 7 | 6 | 7 | 7 | 6 | 6 | 6 | Genoa |
| 16 | 10 | 10 | 5 | 6 | 5 | 4 | 3 | 3 | 3 | 4 | ... | 5 | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 7 | Sampdoria |
| 19 | 10 | 10 | 5 | 6 | 5 | 6 | 10 | 10 | 8 | 8 | ... | 10 | 10 | 9 | 8 | 8 | 8 | 8 | 8 | 8 | Inter |
| 2 | 10 | 10 | 15 | 19 | 11 | 12 | 14 | 12 | 12 | 12 | ... | 8 | 7 | 7 | 8 | 9 | 9 | 9 | 9 | 9 | Torino |
| 9 | 10 | 3 | 2 | 4 | 7 | 6 | 5 | 4 | 5 | 4 | ... | 9 | 9 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | Milan |
| 5 | 10 | 10 | 15 | 16 | 16 | 18 | 19 | 15 | 17 | 14 | ... | 11 | 11 | 11 | 11 | 10 | 11 | 10 | 11 | 11 | Palermo |
| 3 | 10 | 10 | 15 | 16 | 16 | 18 | 19 | 18 | 15 | 13 | ... | 12 | 13 | 15 | 16 | 16 | 16 | 13 | 11 | 11 | Sassuolo |
| 15 | 10 | 10 | 5 | 3 | 5 | 6 | 5 | 7 | 10 | 11 | ... | 15 | 13 | 12 | 12 | 15 | 14 | 12 | 13 | 13 | Verona |
| 0 | 10 | 17 | 10 | 13 | 16 | 15 | 16 | 18 | 19 | 19 | ... | 12 | 13 | 14 | 12 | 13 | 12 | 14 | 14 | 14 | Chievo |
| 11 | 10 | 17 | 19 | 19 | 19 | 18 | 12 | 13 | 15 | 17 | ... | 15 | 16 | 16 | 14 | 13 | 14 | 15 | 15 | 15 | Empoli |
| 8 | 10 | 3 | 10 | 4 | 3 | 3 | 4 | 5 | 3 | 4 | ... | 14 | 15 | 13 | 14 | 13 | 14 | 15 | 16 | 16 | Udinese |
| 1 | 10 | 10 | 5 | 9 | 11 | 15 | 16 | 13 | 15 | 16 | ... | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | Atalanta |
| 13 | 10 | 10 | 15 | 19 | 20 | 15 | 16 | 17 | 12 | 14 | ... | 19 | 19 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | Cagliari |
| 18 | 10 | 17 | 19 | 13 | 16 | 18 | 19 | 20 | 20 | 20 | ... | 20 | 20 | 19 | 20 | 20 | 19 | 19 | 19 | 19 | Parma |
| 6 | 10 | 3 | 10 | 9 | 11 | 12 | 12 | 15 | 17 | 18 | ... | 18 | 18 | 19 | 18 | 19 | 19 | 19 | 20 | 20 | Cesena |
20 rows × 40 columns
This is the rank day by day. The Juventus Team has been on the top from the very first days. Let’s add this information (the rank of the ownteam and the rank of the opponent) to the marks dataset.
def rankOwnTeam (OwnTeam, day):
rankOwnTeam = rankDay.ix[rankDay['Team'] == OwnTeam, day-1]
return int(rankOwnTeam)
marks_final['rankOwnTeam'] = map(lambda ownTeam, day: rankOwnTeam(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
| player | name | team | day | goal | penalty | owngoal | mark | rankOwnTeam | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | 1 | 0 | 0 | 0 | 7.0 | 10 |
| 1 | 2247 | Sportiello | Atalanta | 2 | 0 | 0 | 0 | 7.5 | 10 |
| 2 | 2247 | Sportiello | Atalanta | 3 | 0 | 0 | 0 | 6.0 | 5 |
| 3 | 2247 | Sportiello | Atalanta | 4 | 0 | 0 | 0 | 6.5 | 9 |
| 4 | 2247 | Sportiello | Atalanta | 5 | 0 | 0 | 0 | 5.0 | 11 |
Ok, done for the own team, let’s try to add the opponent, less easy because it is dependent on the calendar.
def playingHome (OwnTeam, day):
temp = games.ix[(games['day'] == day)]
return (len(temp[temp['home'] == OwnTeam]))
marks_final['playingHome'] = map(lambda ownTeam, day: playingHome(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
| player | name | team | day | goal | penalty | owngoal | mark | rankOwnTeam | playingHome | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | 1 | 0 | 0 | 0 | 7.0 | 10 | 1 |
| 1 | 2247 | Sportiello | Atalanta | 2 | 0 | 0 | 0 | 7.5 | 10 | 0 |
| 2 | 2247 | Sportiello | Atalanta | 3 | 0 | 0 | 0 | 6.0 | 5 | 1 |
| 3 | 2247 | Sportiello | Atalanta | 4 | 0 | 0 | 0 | 6.5 | 9 | 0 |
| 4 | 2247 | Sportiello | Atalanta | 5 | 0 | 0 | 0 | 5.0 | 11 | 1 |
We have added the info if it the player is playing home. it could be usefull!
def opponentTeam (OwnTeam, day):
temp = games.ix[(games['day'] == day)]
if (len(temp[temp['home'] == OwnTeam])):
opponent = games.ix[(games['day'] == day) & (games['home'] == OwnTeam), 'away']
else:
opponent = games.ix[(games['day'] == day) & (games['away'] == OwnTeam), 'home']
return str(opponent.values[0])
marks_final['opponentTeam'] = map(lambda ownTeam, day: opponentTeam(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
| player | name | team | day | goal | penalty | owngoal | mark | rankOwnTeam | playingHome | opponentTeam | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | 1 | 0 | 0 | 0 | 7.0 | 10 | 1 | Verona |
| 1 | 2247 | Sportiello | Atalanta | 2 | 0 | 0 | 0 | 7.5 | 10 | 0 | Cagliari |
| 2 | 2247 | Sportiello | Atalanta | 3 | 0 | 0 | 0 | 6.0 | 5 | 1 | Fiorentina |
| 3 | 2247 | Sportiello | Atalanta | 4 | 0 | 0 | 0 | 6.5 | 9 | 0 | Inter |
| 4 | 2247 | Sportiello | Atalanta | 5 | 0 | 0 | 0 | 5.0 | 11 | 1 | Juventus |
marks_final['rankOpponentTeam'] = map(lambda oppTeam, day: rankOwnTeam(oppTeam, day), marks_final['opponentTeam'], marks_final['day'])
marks_final.head()
| player | name | team | day | goal | penalty | owngoal | mark | rankOwnTeam | playingHome | opponentTeam | rankOpponentTeam | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | 1 | 0 | 0 | 0 | 7.0 | 10 | 1 | Verona | 10 |
| 1 | 2247 | Sportiello | Atalanta | 2 | 0 | 0 | 0 | 7.5 | 10 | 0 | Cagliari | 10 |
| 2 | 2247 | Sportiello | Atalanta | 3 | 0 | 0 | 0 | 6.0 | 5 | 1 | Fiorentina | 15 |
| 3 | 2247 | Sportiello | Atalanta | 4 | 0 | 0 | 0 | 6.5 | 9 | 0 | Inter | 6 |
| 4 | 2247 | Sportiello | Atalanta | 5 | 0 | 0 | 0 | 5.0 | 11 | 1 | Juventus | 1 |
let’s add some statistics (probably we will remove them because they will be strongly correlated)
def avgPrevious5 (playerName, day):
marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] >= (day-5)) & (marks_final['day'] < day), 'mark']
return marksPlayer.mean()
def avgSoFar (playerName, day):
marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] < day), 'mark']
return marksPlayer.mean()
def lastMark (playerName, day):
marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] == day-1), 'mark']
return marksPlayer.mean()
marks_final['avgPrevious5'] = map(lambda playerName, day: avgPrevious5 (playerName, day), marks_final['name'], marks_final['day'])
marks_final['avgSoFar'] = map(lambda playerName, day: avgSoFar (playerName, day), marks_final['name'], marks_final['day'])
marks_final['LastMark'] = map(lambda playerName, day: lastMark (playerName, day), marks_final['name'], marks_final['day'])
For the moment, let’s aim low, trying to predict if the player’s mark will be over 6 (sufficient in italian culture)
def isSuff (mark):
if mark>=6: return 1
else: return 0
marks_final['isSuff'] = map(lambda mark: isSuff (mark), marks_final['mark'])
marks_final.head()
| player | name | team | day | goal | penalty | owngoal | mark | rankOwnTeam | playingHome | opponentTeam | rankOpponentTeam | avgPrevious5 | avgSoFar | LastMark | isSuff | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2247 | Sportiello | Atalanta | 1 | 0 | 0 | 0 | 7.0 | 10 | 1 | Verona | 10 | NaN | NaN | NaN | 1 |
| 1 | 2247 | Sportiello | Atalanta | 2 | 0 | 0 | 0 | 7.5 | 10 | 0 | Cagliari | 10 | 7.000000 | 7.000000 | 7.0 | 1 |
| 2 | 2247 | Sportiello | Atalanta | 3 | 0 | 0 | 0 | 6.0 | 5 | 1 | Fiorentina | 15 | 7.250000 | 7.250000 | 7.5 | 1 |
| 3 | 2247 | Sportiello | Atalanta | 4 | 0 | 0 | 0 | 6.5 | 9 | 0 | Inter | 6 | 6.833333 | 6.833333 | 6.0 | 1 |
| 4 | 2247 | Sportiello | Atalanta | 5 | 0 | 0 | 0 | 5.0 | 11 | 1 | Juventus | 1 | 6.750000 | 6.750000 | 6.5 | 0 |
Uhh, we don’t have any mean for the first day of the championship. For the moment, let us feed that value with the average of the championship. Alternatively, we could add a “political correct” 6.
To be continued!
