Cartola is a fantasy football league following the Brazilian Championship A Series.
Cartola offers a public API to access data for the current round. A couple of years ago, I created a script to automate data retrieval to a repository, which now hosts comprehensive historical data since 2022.
In this post, I will delve into the data for the 2022 season, formulate a mixed integer linear program to draft the optimal team, and present initial concepts for forecasting player scores using mixed effects linear models.
The game
We begin the season with a budget of C$ 100, the game’s paper currency.
Each round is preceded by a market session, where players are assigned a value. We are tasked with forming a team of 11 players plus a coach, all within our budget and adhering to a valid formation. A captain must be chosen from among the players, excluding the coach.
The market is available until the round starts. Players then earn scores based on their real-life match performances. Our team’s score is the aggregate of our players’ scores, with our captain’s score doubled in the 2022 season.
Following the conclusion of the round, player values are recalibrated based on performance: increases for above-average scores, decreases for below-average ones. Our budget for the next round is our previous budget plus the sum of our players’ value variations.
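To make the budget rule concrete, here is a tiny worked example with made-up numbers: the next round’s budget is the previous budget plus the picked players’ value variations.

```python
# Hypothetical budget update: starting budget plus the sum of the
# picked players' value variations (appreciations) after the round.
budget = 100.0
appreciations = [0.5, -0.3, 1.2]  # made-up value variation per picked player
budget = budget + sum(appreciations)
# budget is now approximately 101.4
```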
Data wrangling
Let’s talk about data structures: each round has a market, and each market is a list of players. A player is a structure like this:
Player(round=0, player=42234, team=264, position=1, games=0, average=0.0, value=10.0, score=0.0, appreciation=0.0, minimum=4.53)
Let’s get the list of markets for 2022 and flatten it into a single DataFrame:
+-------------------------------------------------------------------------------+
| round player team position … value score appreciation minimum |
+===============================================================================+
| 1 37424 1371 6 … 3.0 0.0 0.0 0.0 |
| 1 37646 314 3 … 5.0 0.0 0.0 2.3 |
| 1 37656 266 1 … 9.0 0.0 0.0 4.08 |
| … … … … … … … … … |
| 38 121398 354 4 … 1.0 0.0 0.0 0.0 |
| 38 121399 354 4 … 1.0 0.0 0.0 0.0 |
| 38 121400 354 5 … 1.0 0.0 0.0 0.0 |
+-------------------------------------------------------------------------------+
shape: (30_063, 10)
Now, let’s focus on a specific player to illustrate our data while we wrangle it:
+-------------------------------------------------------------------------------+
| round player team position … value score appreciation minimum |
+===============================================================================+
| 1 42234 264 1 … 10.0 0.0 0.0 4.53 |
| 2 42234 264 1 … 7.93 2.0 -2.07 5.52 |
| 3 42234 264 1 … 10.44 11.0 2.51 4.75 |
| … … … … … … … … … |
| 36 42234 264 1 … 11.51 0.0 0.03 3.63 |
| 37 42234 264 1 … 12.68 0.0 1.17 9.29 |
| 38 42234 264 1 … 11.06 0.0 -1.62 1.37 |
+-------------------------------------------------------------------------------+
shape: (38, 10)
Filtering participation
Players will show up in the market for many rounds that they do not participate in. However, for our analysis, we are only interested in players that actually played a game in the round.
Each player has a status field intended to indicate their participation in the round. However, this field is often inaccurate, likely because the API data is updated before the round. One solution is to keep only rows where there is an increase in the number of games the player has played:
+------------------------+
| round player games |
+========================+
| 1 42234 0 |
| 2 42234 1 |
| 3 42234 2 |
| … … … |
| 36 42234 28 |
| 37 42234 29 |
| 38 42234 30 |
+------------------------+
shape: (31, 3)
Imputing scores
Similarly, the player score field is often inaccurate, likely for the same reasons as the status field. Fortunately, the average field is reliable, allowing us to recover the score:
\[ \begin{align*} \mathrm{Average}(\mathbf{s}_{1:t}) &= \frac{\mathrm{Average}(\mathbf{s}_{1:(t-1)}) + s_t}{2} \\ s_t &= 2\mathrm{Average}(\mathbf{s}_{1:t}) - \mathrm{Average}(\mathbf{s}_{1:(t-1)}), \end{align*} \]
where \(\mathbf{s}\) is the vector of scores for a given player across all rounds.
+----------------------------------+
| round player score average |
+==================================+
| 1 42234 2.0 2.0 |
| 2 42234 11.0 6.5 |
| 3 42234 9.5 8.0 |
| … … … … |
| 36 42234 5.1 4.96 |
| 37 42234 4.62 4.79 |
| 38 42234 4.79 4.79 |
+----------------------------------+
shape: (31, 4)
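Applying the recurrence to the example player’s first three averages from the table recovers the scores directly; for the first game the average is the score itself:

```python
# Averages for player 42234's first three played rounds (from the table above).
averages = [2.0, 6.5, 8.0]

scores, prev = [], None
for avg in averages:
    # First game: the average *is* the score; afterwards s_t = 2*avg_t - avg_{t-1}.
    scores.append(avg if prev is None else 2 * avg - prev)
    prev = avg
# scores == [2.0, 11.0, 9.5], matching the imputed table
```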
Adding fixtures
Let’s fetch the list of fixtures to enrich our dataset. A fixture is an object like:
Fixture(round=1, home=282, away=285)
Let’s consolidate these fixtures into a single DataFrame and then pivot them into a long format:
+------------------------------+
| round team versus home |
+==============================+
| 1 282 285 1 |
| 1 266 277 1 |
| 1 276 293 1 |
| … … … … |
| 38 276 290 0 |
| 38 294 1371 0 |
| 38 263 293 0 |
+------------------------------+
shape: (760, 4)
Finally, let’s join this data to our dataset:
+---------------------------------------+
| round player team versus home |
+=======================================+
| 1 42234 264 263 0 |
| 2 42234 264 314 1 |
| 3 42234 264 275 0 |
| … … … … … |
| 36 42234 264 354 1 |
| 37 42234 264 294 0 |
| 38 42234 264 282 1 |
+---------------------------------------+
shape: (31, 5)
Aligning variables
In our subsequent analysis, the average field will exclude the score from the given round. Additionally, the appreciation field will be calculated in relation to the round’s score.
+---------------------------------------------------------+
| round player average value score appreciation |
+=========================================================+
| 1 42234 0.0 10.0 2.0 -2.07 |
| 2 42234 2.0 7.93 11.0 2.51 |
| 3 42234 6.5 10.44 9.5 1.25 |
| … … … … … … |
| 36 42234 4.82 11.51 5.1 1.17 |
| 37 42234 4.96 12.68 4.62 -1.62 |
| 38 42234 4.79 11.06 4.79 0.0 |
+---------------------------------------------------------+
shape: (31, 6)
Team picking
Now let’s solve the problem of picking the best team for a given market. Let \(\mathcal{F}\) be the set of valid formations. Then, for each formation \(f \in \mathcal{F}\), solve:
\[ \begin{equation*} \begin{array}{ll@{}ll} \text{maximize} & \displaystyle \hat{\mathbf{s}}^T \mathbf{x}, & \mathbf{x} \in \{0, 1\}^n \\ \text{subject to} & \displaystyle \mathbf{v}^T \mathbf{x} \leq b \\ & \displaystyle \mathbf{P}^T \mathbf{x} = f, \\ \end{array} \end{equation*} \]
where
\(\mathbf{x}\) is a variable vector of player picks in the market; \(\hat{\mathbf{s}}\) is the vector of predicted player scores in the market; \(b\) is our available budget for that round; \(\mathbf{P}\) is the matrix of dummy-encoded player formations in the market.
Finally, take the solution with the highest objective.
from typing import List

import numpy as np
import pulp
from pydantic import BaseModel, Field


class Formation(BaseModel):
    goalkeeper: int = Field(alias="gol")
    defender: int = Field(alias="zag")
    winger: int = Field(alias="lat")
    midfielder: int = Field(alias="mei")
    forward: int = Field(alias="ata")
    coach: int = Field(alias="tec")


class Problem(BaseModel):
    scores: List[float]
    values: List[float]
    budget: float
    positions: List[List[int]]
    formations: List[Formation]

    def solve(self) -> np.ndarray:
        # Solve one problem per formation and keep the best objective.
        formations = [list(f.model_dump().values()) for f in self.formations]
        problems = [self.construct(f) for f in formations]
        [p.solve(pulp.COIN(msg=False)) for p in problems]
        objectives = [p.objective.value() for p in problems]
        best = np.argmax(np.array(objectives))
        solution = problems[best]
        variables = [v.value() for v in solution.variables()]
        picks = np.array(variables)
        return picks

    def construct(self, formation: List[int]) -> pulp.LpProblem:
        n = len(self.scores)
        m = len(formation)
        problem = pulp.LpProblem("team_picking", pulp.LpMaximize)
        indexes = ["pick_" + str(i).zfill(len(str(n))) for i in range(n)]
        picks = [pulp.LpVariable(i, cat=pulp.const.LpBinary) for i in indexes]
        # Objective: total predicted score of the picked players.
        problem += pulp.lpDot(picks, self.scores)
        # Budget constraint.
        problem += pulp.lpDot(picks, self.values) <= self.budget
        # One constraint per position, matching the formation.
        for i in range(m):
            problem += pulp.lpDot(picks, self.positions[i]) == formation[i]
        return problem
Backtesting
By solving the team picking problem for all rounds, we can backtest our performance in the season. Before backtesting, let’s get the set of valid formations \(\mathcal{F}\):
[Formation(goalkeeper=1, defender=3, winger=0, midfielder=4, forward=3, coach=1),
Formation(goalkeeper=1, defender=3, winger=0, midfielder=5, forward=2, coach=1),
Formation(goalkeeper=1, defender=2, winger=2, midfielder=3, forward=3, coach=1),
Formation(goalkeeper=1, defender=2, winger=2, midfielder=4, forward=2, coach=1),
Formation(goalkeeper=1, defender=2, winger=2, midfielder=5, forward=1, coach=1),
Formation(goalkeeper=1, defender=3, winger=2, midfielder=3, forward=2, coach=1),
Formation(goalkeeper=1, defender=3, winger=2, midfielder=4, forward=1, coach=1)]
Knowing our formation constraints, we’re ready to backtest. Starting with a budget of C$ 100, for each round let’s:
- Predict each player’s score based on their performance on previous rounds;
- Pick the team with the best total score;
- Add the sum of the team players’ appreciation to our budget.
from typing import Callable

import polars as pl


def backtest(
    players: pl.DataFrame, predict: Callable, initial_budget: float = 100.0
) -> pl.DataFrame:
    rounds = players.get_column("round").max()
    budget = [None] * rounds
    teams = [None] * rounds
    budget[0] = initial_budget
    appreciation = 0
    for round in range(rounds):
        if round > 0:
            budget[round] = budget[round - 1] + appreciation
        data = players.filter(pl.col("round") < round + 1)
        candidates = players.filter(pl.col("round") == round + 1)
        candidates = predict(data, candidates)
        problem = Problem(
            scores=candidates.get_column("prediction"),
            values=candidates.get_column("value"),
            positions=candidates.get_column("position").to_dummies(),
            budget=budget[round],
            formations=formations,
        )
        picks = problem.solve()
        team = candidates.filter(picks == 1)
        teams[round] = team
        appreciation = team.get_column("appreciation").sum()
    teams = pl.concat(teams)
    return teams
Before exploring predictions, we’ll begin with a few hypothetical backtests that use the actual observed scores for team selection. Backtesting this strategy, here is our team in the first round:
+-----------------------------------------------------------------------------+
| round player team position … minimum versus home prediction |
+=============================================================================+
| 1 71571 356 1 … 3.19 1371 1 11.0 |
| 1 42145 294 2 … 2.75 290 1 15.8 |
| 1 105584 264 2 … 2.75 263 0 10.5 |
| … … … … … … … … … |
| 1 89840 276 5 … 5.42 293 1 27.1 |
| 1 104530 294 5 … 2.3 290 1 11.0 |
| 1 97341 276 6 … 0.0 293 1 9.52 |
+-----------------------------------------------------------------------------+
shape: (12, 13)
And we can plot our cumulative performance during the season:
This might seem like a perfect campaign at first, but it’s possible that, early in the season, we didn’t have enough budget to pick the best scoring teams. To test this hypothesis, we backtest the same strategy with unlimited budget from the start:
Both runs are nearly identical, which is evidence that focusing on appreciation is not so important if we have accurate predictions for the scores. If we predict scores perfectly, we get a near perfect run.
To put our backtests into perspective, the 2022 season champion had a total score of 3434.37. This is very impressive and not very far from the near perfect run.
Score prediction
For each round, we must predict \(\hat{s}\), the vector of score predictions, using data from previous rounds.
However, during the first round, we don’t have any previous data to train our model. In this case, we need to include prior information. One way to do that would be to use data from previous seasons. However, we know a variable where this information is already encoded: the player value. Each season starts with players valued according to their past performance. Knowing this, all our models start with \(\hat{s} = v\) in the first round.
Let’s use Bambi (Capretto et al. 2022) and its default priors to fit our models. We won’t delve into convergence diagnostics, since we are more interested in the average of the posterior predictive distributions, and the backtest itself is a measure of prediction quality.
One question that arises here is: why not use non-parametric models such as gradient boosted trees or neural nets? After some experimentation, I concluded they are not a good fit for this problem: either because they assume independence between observations, or because they are too data hungry. Also, tuning these models for backtests might lead us into a rabbit hole (Bailey et al. 2013).
Player average
\[ \begin{align*} \mathbf{\hat{s}} = \mathbf{Z} \mathbf{\beta} \\ \mathbf{s} \sim N(\mathbf{\hat{s}}, \sigma), \end{align*} \]
where \(\mathbf{Z}\) is a dummy-encoded matrix of players; \(\mathbf{\beta}\) is a vector of parameters for each player.
In this model, \(\mathbf{\beta}\) is simply a vector of player averages. Let’s also consider that players that show up in the middle of the season have an average of zero before their first round. This will be our baseline model.
Player random effects
\[ \begin{align*} \mathbf{\hat{s}} = \alpha + \mathbf{Z} \mathbf{b} \\ \mathbf{b} \sim N(0, \sigma_b), \end{align*} \]
where \(\alpha\) is an intercept and \(\mathbf{b}\) is a vector of player random effects.
This model performs significantly better than the average model, possibly because of the partial pooling between the random effects, that pulls large effects towards the overall mean (Clark 2019). In our dataset, it’s common for players that played one or two games to have large averages by chance.
Fixture mixed effects
\[ \mathbf{\hat{s}} = \alpha + \mathbf{X} \mathbf{\beta} + \mathbf{Z} \mathbf{b}, \]
where \(\mathbf{X}\) is a matrix of dummy-encoded fixture variables: the player’s team, whether they are playing at home, and the adversary team; \(\mathbf{\beta}\) is a vector of fixed effects.
This model brings more context to our predictions. It also provides a reasonable way to predict a new player, by setting their \(b = 0\) (the mean of the random effects). However, it does not improve significantly over our random effects model.
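For reference, the three models can be written side by side as Bambi model formulas; this is a sketch, with column names following the dataset built earlier (Bambi expects a pandas DataFrame, e.g. `bmb.Model(formula, data.to_pandas())`):

```python
# Player average: one dummy mean per player.
average_model = "score ~ 0 + player"
# Player random effects: intercept plus partially pooled player effects.
random_model = "score ~ 1 + (1|player)"
# Fixture mixed effects: fixture fixed effects plus player random effects.
fixture_model = "score ~ team + home + versus + (1|player)"
```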
Conclusion
We developed a comprehensive framework for the fantasy football team picking problem. There are more ideas we could explore to improve our chances of winning:
- enriching our data and models with player scouts;
- including more information in our priors;
- testing strategies that balance predicted score and appreciation;
- further model diagnostics.
However, I suspect expert human predictions retain an edge over those of hobbyist statistical models in fantasy leagues, because all sorts of relevant data are unavailable in public datasets.
At least, this seems to be the case for Brazilian soccer, also known as “a little box of surprises”.
Citation
@online{assunção2023,
author = {Assunção, Luís},
title = {Drafting a Fantasy Football Team},
date = {2023-09-21},
url = {https://assuncaolfi.github.io/site/blog/fantasy-football},
langid = {en}
}