FantasyFootball feature engineering

In this project, we will try to predict the performances of the players of Italian football championship Serie A. The project is meant in the context of Fantasy Football, a popular game in which a group of friends organize a virtual team of real serie A players. The aim of the game is to score more than the opponen (the score is the sum of the performance evaluation marks) of the players of your team. Predicting the marks of your players could definitely be an advantage for a fantasy football coach.

In this first notebook, we will create the dataset and do a lot of feature engineering.

#import
from sqlalchemy import create_engine
import pandas as pd
import numpy as np

Let try to extract data from the sqlite dataset

engine = create_engine('sqlite:///2014')
# Open engine connection: con
con = engine.connect()
table_names = engine.table_names()
print(table_names)
[u'games', u'marks_corriere', u'marks_fantagazzetta', u'marks_gazzetta', u'players', u'sqlite_sequence']

let’s focus for the moment on the results of just one sport newspaper such as Gazzetta

with engine.connect() as con:
    rs = con.execute('SELECT * from games')
    # Save results of the query to DataFrame: df
    games = pd.DataFrame(rs.fetchall())
    games.columns = rs.keys()
    
    rs = con.execute('SELECT * from marks_gazzetta')
    marks = pd.DataFrame(rs.fetchall())
    marks.columns = rs.keys()
    
    rs = con.execute('SELECT * from players')
    players = pd.DataFrame(rs.fetchall())
    players.columns = rs.keys()
players.head()
id name team role
0 2247 Sportiello Atalanta P
1 2248 Benalouane Atalanta D
2 2249 Biava Atalanta D
3 2250 Drame' Atalanta D
4 2251 Zappacosta Atalanta D
marks.head()
id day player team mark goal gotgoal penalty spenalty mpenalty owngoal assist ycard rcard fmark enter exit
0 21175 1 2247 Atalanta 7.0 0 0 0 0 0 0 0 0 0 7.0 0 0
1 21176 1 2248 Atalanta 6.0 0 0 0 0 0 0 0 0 0 6.0 0 0
2 21177 1 2249 Atalanta 6.0 0 0 0 0 0 0 0 0 0 6.0 0 0
3 21178 1 2250 Atalanta 5.5 0 0 0 0 0 0 0 1 0 5.0 0 0
4 21179 1 2251 Atalanta 6.0 0 0 0 0 0 0 0 0 0 6.0 0 0

Let’s join this dataset in order to obtain the player names together with the id (it won’t affect the results of course, but it will be easier to check the process

marks_merged = pd.merge(marks,players, how = 'inner', left_on='player', right_on='id')
marks_merged.head()
id_x day player team_x mark goal gotgoal penalty spenalty mpenalty ... assist ycard rcard fmark enter exit id_y name team_y role
0 21175 1 2247 Atalanta 7.0 0 0 0 0 0 ... 0 0 0 7.0 0 0 2247 Sportiello Atalanta P
1 21453 2 2247 Atalanta 7.5 0 1 0 0 0 ... 0 0 0 6.5 0 0 2247 Sportiello Atalanta P
2 21733 3 2247 Atalanta 6.0 0 1 0 0 0 ... 0 0 0 5.0 0 0 2247 Sportiello Atalanta P
3 22012 4 2247 Atalanta 6.5 0 2 0 1 0 ... 0 0 0 7.5 0 0 2247 Sportiello Atalanta P
4 22291 5 2247 Atalanta 5.0 0 3 0 0 0 ... 0 0 0 2.0 0 0 2247 Sportiello Atalanta P

5 rows × 21 columns

marks_merged.sort_values(['day', 'team_x','name']).head()
id_x day player team_x mark goal gotgoal penalty spenalty mpenalty ... assist ycard rcard fmark enter exit id_y name team_y role
37 21176 1 2248 Atalanta 6.0 0 0 0 0 0 ... 0 0 0 6.0 0 0 2248 Benalouane Atalanta D
281 21186 1 2258 Atalanta 6.0 0 0 0 0 0 ... 0 0 0 6.0 1 0 2258 Bianchi Atalanta A
64 21177 1 2249 Atalanta 6.0 0 0 0 0 0 ... 0 0 0 6.0 0 0 2249 Biava Atalanta D
302 21187 1 2259 Atalanta 5.0 0 0 0 0 0 ... 0 1 0 4.5 1 0 2259 Boakye Atalanta A
140 21180 1 2252 Atalanta 6.0 0 0 0 0 0 ... 0 0 0 6.0 0 0 2252 Bonaventura Milan C

5 rows × 21 columns

For the moment let’s use just a subset of the features, we will have time to get them back.

marks_final = marks_merged[['player','name','team_x','day','goal','penalty','owngoal','mark']]

Don’t really like that “team_x”

marks_final=marks_final.rename(columns = {'team_x':'team'})

let us focus on the games.

games.head()
day home away
0 1 Chievo Juventus
1 1 Atalanta Verona
2 1 Torino Inter
3 1 Sassuolo Cagliari
4 1 Genoa Napoli

as suspected, there are no results for the matches. We should obtain those from the player statistics. What will influence the results?

  • Of course, the Goals, so consider the feature gol of the dataset marks
  • What about the penalties? Let’s consider them as well
  • Do not forget the unfortunate owngoals
def whoWon (teamH,teamA,day):
    """simple functions aimed to deliver the winner of each game"""
    marksSpecificH = marks_final[(marks_final['day'] == day) & (marks_final['team'] == teamH)]
    marksSpecificA = marks_final[(marks_final['day'] == day) & (marks_final['team'] == teamA)]
    
    #print(marksSpecificH)
    goalH = marksSpecificH['goal'].sum() + marksSpecificH['penalty'].sum() + marksSpecificA['owngoal'].sum()

    
    #print(marksSpecificA)
    goalA = marksSpecificA['goal'].sum() + marksSpecificA['penalty'].sum() + marksSpecificH['owngoal'].sum()
    
    winner = 0
    if (goalH > goalA): winner = 1
    if (goalH < goalA): winner = 2
    if (goalH == goalA): winner = 0
    return winner
games['winner'] = map(lambda teamH,teamA,day: whoWon(teamH,teamA,day), games['home'], games['away'], games['day'])
games[games['day']==1]
day home away winner
0 1 Chievo Juventus 2
1 1 Atalanta Verona 0
2 1 Torino Inter 0
3 1 Sassuolo Cagliari 0
4 1 Genoa Napoli 2
5 1 Palermo Sampdoria 0
6 1 Cesena Parma 1
7 1 Roma Fiorentina 1
8 1 Udinese Empoli 1
9 1 Milan Lazio 1

Very good! now we have the results. However, I am afraid that we need more feature engineering if we want other interesting informations. Indeed, the rank of the team is quite an important matter, for this we need to rebuild the rankings.

days = np.arange(0,39,1)
teams = games1['home']
teams = teams.drop_duplicates()
points = pd.DataFrame(columns=np.arange(0,39,1))
points['Team'] = teams
points[0] = 0
cols = points.columns.tolist()
cols = cols[-1:] + cols[:-1]
points = points[cols]
points
Team 0 1 2 3 4 5 6 7 8 ... 29 30 31 32 33 34 35 36 37 38
0 Chievo 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Atalanta 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Torino 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Sassuolo 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Genoa 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 Palermo 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Cesena 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 Roma 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 Udinese 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 Milan 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 Napoli 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Empoli 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Fiorentina 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 Cagliari 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 Juventus 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 Verona 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Sampdoria 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 Lazio 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 Parma 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 Inter 0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

20 rows × 40 columns

def updatePoints (teamH,teamA,day,winner):
    if (winner == 1):
        points.ix[points['Team'] == teamH, day] = 3
        points.ix[points['Team'] == teamA, day] = 0
    if (winner == 2):
        points.ix[points['Team'] == teamA, day] = 3
        points.ix[points['Team'] == teamH, day] = 0
    if (winner == 0):
        points.ix[points['Team'] == teamH, day] = 1
        points.ix[points['Team'] == teamA, day] = 1
_ = map(lambda teamH,teamA,day,winner: updatePoints(teamH,teamA,day,winner), games['home'], games['away'], games['day'], games['winner'])
points
Team 0 1 2 3 4 5 6 7 8 ... 29 30 31 32 33 34 35 36 37 38
0 Chievo 0 0 3 0 0 1 0 0 0 ... 0 3 1 1 3 1 1 0 1 0
1 Atalanta 0 1 3 0 0 0 0 3 0 ... 0 3 1 1 1 1 3 0 1 0
2 Torino 0 1 0 0 3 1 0 3 0 ... 3 1 1 3 1 0 0 3 0 3
3 Sassuolo 0 1 0 1 1 0 0 1 3 ... 3 0 1 0 0 1 3 3 3 3
4 Genoa 0 0 1 3 1 0 3 1 3 ... 1 3 0 3 3 0 3 3 3 0
5 Palermo 0 1 0 1 1 0 0 3 0 ... 0 3 3 0 1 1 0 3 0 3
6 Cesena 0 3 0 1 0 1 1 0 0 ... 1 0 1 0 1 0 0 0 0 0
7 Roma 0 3 3 3 3 3 0 3 1 ... 3 1 1 0 3 3 0 3 3 0
8 Udinese 0 3 0 3 3 3 1 0 3 ... 1 0 1 3 0 3 0 0 0 0
9 Milan 0 3 3 0 1 1 3 3 1 ... 3 1 1 0 0 0 3 0 3 3
10 Napoli 0 3 0 0 1 3 3 1 3 ... 0 3 3 3 0 3 1 3 0 0
11 Empoli 0 0 0 1 1 1 3 1 0 ... 0 0 1 1 3 3 0 0 1 0
12 Fiorentina 0 0 1 3 1 1 3 0 1 ... 3 0 0 0 0 3 3 3 3 3
13 Cagliari 0 1 0 0 0 3 0 1 3 ... 0 0 0 3 0 3 1 0 3 3
14 Juventus 0 3 3 3 3 3 3 1 3 ... 3 0 3 0 3 3 1 3 3 1
15 Verona 0 1 3 3 1 0 3 0 0 ... 1 0 3 3 1 0 1 3 1 1
16 Sampdoria 0 1 3 1 3 3 3 1 1 ... 0 1 1 0 1 0 3 0 1 1
17 Lazio 0 0 3 0 0 3 3 3 3 ... 3 3 0 1 3 1 0 3 0 3
18 Parma 0 0 0 3 0 0 0 0 0 ... 1 3 1 3 0 0 1 0 1 1
19 Inter 0 1 3 1 3 0 0 1 3 ... 1 3 1 3 3 1 3 0 0 3

20 rows × 40 columns

#print(pointsTest)
points['total'] = 0
for i in range(39):
    points['total'] = points['total'] + points[i]    
points
Team 0 1 2 3 4 5 6 7 8 ... 30 31 32 33 34 35 36 37 38 total
0 Chievo 0 0 3 0 0 1 0 0 0 ... 3 1 1 3 1 1 0 1 0 43
1 Atalanta 0 1 3 0 0 0 0 3 0 ... 3 1 1 1 1 3 0 1 0 37
2 Torino 0 1 0 0 3 1 0 3 0 ... 1 1 3 1 0 0 3 0 3 54
3 Sassuolo 0 1 0 1 1 0 0 1 3 ... 0 1 0 0 1 3 3 3 3 49
4 Genoa 0 0 1 3 1 0 3 1 3 ... 3 0 3 3 0 3 3 3 0 59
5 Palermo 0 1 0 1 1 0 0 3 0 ... 3 3 0 1 1 0 3 0 3 49
6 Cesena 0 3 0 1 0 1 1 0 0 ... 0 1 0 1 0 0 0 0 0 24
7 Roma 0 3 3 3 3 3 0 3 1 ... 1 1 0 3 3 0 3 3 0 70
8 Udinese 0 3 0 3 3 3 1 0 3 ... 0 1 3 0 3 0 0 0 0 41
9 Milan 0 3 3 0 1 1 3 3 1 ... 1 1 0 0 0 3 0 3 3 52
10 Napoli 0 3 0 0 1 3 3 1 3 ... 3 3 3 0 3 1 3 0 0 63
11 Empoli 0 0 0 1 1 1 3 1 0 ... 0 1 1 3 3 0 0 1 0 42
12 Fiorentina 0 0 1 3 1 1 3 0 1 ... 0 0 0 0 3 3 3 3 3 64
13 Cagliari 0 1 0 0 0 3 0 1 3 ... 0 0 3 0 3 1 0 3 3 34
14 Juventus 0 3 3 3 3 3 3 1 3 ... 0 3 0 3 3 1 3 3 1 87
15 Verona 0 1 3 3 1 0 3 0 0 ... 0 3 3 1 0 1 3 1 1 46
16 Sampdoria 0 1 3 1 3 3 3 1 1 ... 1 1 0 1 0 3 0 1 1 56
17 Lazio 0 0 3 0 0 3 3 3 3 ... 3 0 1 3 1 0 3 0 3 69
18 Parma 0 0 0 3 0 0 0 0 0 ... 3 1 3 0 0 1 0 1 1 26
19 Inter 0 1 3 1 3 0 0 1 3 ... 3 1 3 3 1 3 0 0 3 55

20 rows × 41 columns

Cool, now let’s try to obtain the position in the rank for each day of the championship

pointsDay = pd.DataFrame(columns=np.arange(0,39,1))
pointsDay['Team'] = teams
pointsDay[0] = 0

for i in range(1,39):
    pointsDay[i] = pointsDay[i-1] + points[i]
pointsDay
0 1 2 3 4 5 6 7 8 9 ... 30 31 32 33 34 35 36 37 38 Team
0 0 0 3 3 3 4 4 4 4 4 ... 35 36 37 40 41 42 42 43 43 Chievo
1 0 1 4 4 4 4 4 7 7 8 ... 29 30 31 32 33 36 36 37 37 Atalanta
2 0 1 1 1 4 5 5 8 8 11 ... 43 44 47 48 48 48 51 51 54 Torino
3 0 1 1 2 3 3 3 4 7 10 ... 35 36 36 36 37 40 43 46 49 Sassuolo
4 0 0 1 4 5 5 8 9 12 15 ... 44 44 47 50 50 53 56 59 59 Genoa
5 0 1 1 2 3 3 3 6 6 9 ... 38 41 41 42 43 43 46 46 49 Palermo
6 0 3 3 4 4 5 6 6 6 6 ... 22 23 23 24 24 24 24 24 24 Cesena
7 0 3 6 9 12 15 15 18 19 22 ... 57 58 58 61 64 64 67 70 70 Roma
8 0 3 3 6 9 12 13 13 16 16 ... 34 35 38 38 41 41 41 41 41 Udinese
9 0 3 6 6 7 8 11 14 15 16 ... 42 43 43 43 43 46 46 49 52 Milan
10 0 3 3 3 4 7 10 11 14 15 ... 50 53 56 56 59 60 63 63 63 Napoli
11 0 0 0 1 2 3 6 7 7 7 ... 33 34 35 38 41 41 41 42 42 Empoli
12 0 0 1 4 5 6 9 9 10 13 ... 49 49 49 49 52 55 58 61 64 Fiorentina
13 0 1 1 1 1 4 4 5 8 9 ... 21 21 24 24 27 28 28 31 34 Cagliari
14 0 3 6 9 12 15 18 19 22 22 ... 70 73 73 76 79 80 83 86 87 Juventus
15 0 1 4 7 8 8 11 11 11 12 ... 33 36 39 40 40 41 44 45 46 Verona
16 0 1 4 5 8 11 14 15 16 16 ... 49 50 50 51 51 54 54 55 56 Sampdoria
17 0 0 3 3 3 6 9 12 15 16 ... 58 58 59 62 63 63 66 66 69 Lazio
18 0 0 0 3 3 3 3 3 3 3 ... 19 20 23 23 23 24 24 25 26 Parma
19 0 1 4 5 8 8 8 9 12 15 ... 41 42 45 48 49 52 52 52 55 Inter

20 rows × 40 columns

rankDay = pd.DataFrame(columns=np.arange(0,39,1))
rankDay['Team'] = teams
for i in range(0,39): 
    rankDay[i] = pointsDay[i].rank(ascending=0)
    rankDay[i] = map(lambda x: int(x), rankDay[i])
rankDay.sort_values(by=38) 

0 1 2 3 4 5 6 7 8 9 ... 30 31 32 33 34 35 36 37 38 Team
14 10 3 2 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 Juventus
7 10 3 2 1 1 1 2 2 2 1 ... 3 2 3 3 2 2 2 2 2 Roma
17 10 17 10 13 16 9 8 6 5 4 ... 2 2 2 2 3 3 3 3 3 Lazio
12 10 17 15 9 8 9 8 10 11 10 ... 5 6 6 7 5 5 5 5 4 Fiorentina
10 10 3 10 13 11 8 7 7 7 8 ... 4 4 4 4 4 4 4 4 5 Napoli
4 10 17 15 9 8 12 10 10 8 8 ... 7 7 7 6 7 7 6 6 6 Genoa
16 10 10 5 6 5 4 3 3 3 4 ... 5 5 5 5 6 6 7 7 7 Sampdoria
19 10 10 5 6 5 6 10 10 8 8 ... 10 10 9 8 8 8 8 8 8 Inter
2 10 10 15 19 11 12 14 12 12 12 ... 8 7 7 8 9 9 9 9 9 Torino
9 10 3 2 4 7 6 5 4 5 4 ... 9 9 10 10 10 10 10 10 10 Milan
5 10 10 15 16 16 18 19 15 17 14 ... 11 11 11 11 10 11 10 11 11 Palermo
3 10 10 15 16 16 18 19 18 15 13 ... 12 13 15 16 16 16 13 11 11 Sassuolo
15 10 10 5 3 5 6 5 7 10 11 ... 15 13 12 12 15 14 12 13 13 Verona
0 10 17 10 13 16 15 16 18 19 19 ... 12 13 14 12 13 12 14 14 14 Chievo
11 10 17 19 19 19 18 12 13 15 17 ... 15 16 16 14 13 14 15 15 15 Empoli
8 10 3 10 4 3 3 4 5 3 4 ... 14 15 13 14 13 14 15 16 16 Udinese
1 10 10 5 9 11 15 16 13 15 16 ... 17 17 17 17 17 17 17 17 17 Atalanta
13 10 10 15 19 20 15 16 17 12 14 ... 19 19 18 18 18 18 18 18 18 Cagliari
18 10 17 19 13 16 18 19 20 20 20 ... 20 20 19 20 20 19 19 19 19 Parma
6 10 3 10 9 11 12 12 15 17 18 ... 18 18 19 18 19 19 19 20 20 Cesena

20 rows × 40 columns

This is the rank day by day. The Juventus Team has been on the top from the very first days. Let’s add this information (the rank of the ownteam and the rank of the opponent) to the marks dataset.

def rankOwnTeam (OwnTeam, day):
    rankOwnTeam = rankDay.ix[rankDay['Team'] == OwnTeam, day-1]
    return int(rankOwnTeam)

marks_final['rankOwnTeam'] = map(lambda ownTeam, day: rankOwnTeam(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
player name team day goal penalty owngoal mark rankOwnTeam
0 2247 Sportiello Atalanta 1 0 0 0 7.0 10
1 2247 Sportiello Atalanta 2 0 0 0 7.5 10
2 2247 Sportiello Atalanta 3 0 0 0 6.0 5
3 2247 Sportiello Atalanta 4 0 0 0 6.5 9
4 2247 Sportiello Atalanta 5 0 0 0 5.0 11

Ok, done for the own team, let’s try to add the opponent, less easy because it is dependent on the calendar.

def playingHome (OwnTeam, day):
    temp = games.ix[(games['day'] == day)] 
    return (len(temp[temp['home'] == OwnTeam]))
marks_final['playingHome'] = map(lambda ownTeam, day: playingHome(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
player name team day goal penalty owngoal mark rankOwnTeam playingHome
0 2247 Sportiello Atalanta 1 0 0 0 7.0 10 1
1 2247 Sportiello Atalanta 2 0 0 0 7.5 10 0
2 2247 Sportiello Atalanta 3 0 0 0 6.0 5 1
3 2247 Sportiello Atalanta 4 0 0 0 6.5 9 0
4 2247 Sportiello Atalanta 5 0 0 0 5.0 11 1

We have added the info if it the player is playing home. it could be usefull!

def opponentTeam (OwnTeam, day):
    temp = games.ix[(games['day'] == day)] 
    if (len(temp[temp['home'] == OwnTeam])):
        opponent = games.ix[(games['day'] == day) & (games['home'] == OwnTeam), 'away'] 
    else:
        opponent = games.ix[(games['day'] == day) & (games['away'] == OwnTeam), 'home'] 
    return str(opponent.values[0])



marks_final['opponentTeam'] = map(lambda ownTeam, day: opponentTeam(ownTeam, day), marks_final['team'], marks_final['day'])
marks_final.head()
player name team day goal penalty owngoal mark rankOwnTeam playingHome opponentTeam
0 2247 Sportiello Atalanta 1 0 0 0 7.0 10 1 Verona
1 2247 Sportiello Atalanta 2 0 0 0 7.5 10 0 Cagliari
2 2247 Sportiello Atalanta 3 0 0 0 6.0 5 1 Fiorentina
3 2247 Sportiello Atalanta 4 0 0 0 6.5 9 0 Inter
4 2247 Sportiello Atalanta 5 0 0 0 5.0 11 1 Juventus
marks_final['rankOpponentTeam'] = map(lambda oppTeam, day: rankOwnTeam(oppTeam, day), marks_final['opponentTeam'], marks_final['day'])
marks_final.head()
player name team day goal penalty owngoal mark rankOwnTeam playingHome opponentTeam rankOpponentTeam
0 2247 Sportiello Atalanta 1 0 0 0 7.0 10 1 Verona 10
1 2247 Sportiello Atalanta 2 0 0 0 7.5 10 0 Cagliari 10
2 2247 Sportiello Atalanta 3 0 0 0 6.0 5 1 Fiorentina 15
3 2247 Sportiello Atalanta 4 0 0 0 6.5 9 0 Inter 6
4 2247 Sportiello Atalanta 5 0 0 0 5.0 11 1 Juventus 1

let’s add some statistics (probably we will remove them because they will be strongly correlated)

def avgPrevious5 (playerName, day):
    marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] >= (day-5)) & (marks_final['day'] < day), 'mark']
    return marksPlayer.mean()

def avgSoFar (playerName, day):
    marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] < day), 'mark']
    return marksPlayer.mean()

def lastMark (playerName, day):
    marksPlayer = marks_final.ix[(marks_final['name'] == playerName) & (marks_final['day'] == day-1), 'mark']
    return marksPlayer.mean()
marks_final['avgPrevious5'] = map(lambda playerName, day: avgPrevious5 (playerName, day), marks_final['name'], marks_final['day'])
marks_final['avgSoFar'] = map(lambda playerName, day: avgSoFar (playerName, day), marks_final['name'], marks_final['day'])
marks_final['LastMark'] = map(lambda playerName, day: lastMark (playerName, day), marks_final['name'], marks_final['day'])

For the moment, let’s aim low, trying to predict if the player’s mark will be over 6 (sufficient in italian culture)

def isSuff (mark):
    if mark>=6: return 1
    else: return 0
    
marks_final['isSuff'] = map(lambda mark: isSuff (mark), marks_final['mark'])
marks_final.head()
player name team day goal penalty owngoal mark rankOwnTeam playingHome opponentTeam rankOpponentTeam avgPrevious5 avgSoFar LastMark isSuff
0 2247 Sportiello Atalanta 1 0 0 0 7.0 10 1 Verona 10 NaN NaN NaN 1
1 2247 Sportiello Atalanta 2 0 0 0 7.5 10 0 Cagliari 10 7.000000 7.000000 7.0 1
2 2247 Sportiello Atalanta 3 0 0 0 6.0 5 1 Fiorentina 15 7.250000 7.250000 7.5 1
3 2247 Sportiello Atalanta 4 0 0 0 6.5 9 0 Inter 6 6.833333 6.833333 6.0 1
4 2247 Sportiello Atalanta 5 0 0 0 5.0 11 1 Juventus 1 6.750000 6.750000 6.5 0

Uhh, we don’t have any mean for the first day of the championship. For the moment, let us feed that value with the average of the championship. Alternatively, we could add a “political correct” 6.

To be continued!
Written on March 7, 2017