Football is a very exciting sport. Until now, this is the most popular game on the entire Earth planet. Sorry, not sorry about other games.
I want to review data collected since 1872 trying to understand how matches between countries have evolved up to this moment. So, we are calling to R
and a few libraries to help us visualizing data:
library(tidyverse)
library(plotly)
The first thing is to read files. I downloaded this project at 2021-07-22 from Kaggle.
results <- read.csv("results.csv", encoding = "UTF-8")
This dataset contains data about \(42k+\) football matches in the history of international encounters between national teams. So, let’s take a little taste of the data:
head(results)
One interesting thing is to take a look at the context of the matches, some of them could be not relevant at all, however, there is also World cup matches, continental tournaments, and so on:
levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
## [1] "AFF Championship qualification"
## [2] "Nations Cup"
## [3] "Jordan International Tournament"
## [4] "World Unity Cup"
## [5] "Copa Oswaldo Cruz"
## [6] "Malta International Tournament"
## [7] "South Pacific Games"
## [8] "Copa Lipton"
## [9] "UAFA Cup qualification"
## [10] "Intercontinental Cup"
## [11] "Rous Cup"
## [12] "Prime Minister's Cup"
## [13] "Copa Ramón Castilla"
## [14] "Island Games"
## [15] "African Cup of Nations qualification"
## [16] "African Nations Championship qualification"
## [17] "NAFU Championship"
## [18] "COSAFA Cup"
## [19] "Copa Chevallier Boutell"
## [20] "Copa del Pacífico"
Filtering by tournaments with at least 100 matches played in the history:
results %>%
group_by(tournament) %>%
summarise(count=n()) %>%
filter(count > 100) %>%
select(tournament) -> popularCups
results %>%
filter(tournament %in% popularCups$tournament) %>%
ggplot(aes(x=tournament, fill=tournament)) +
geom_bar() +
coord_flip() +
labs(title="Matches in tournaments") -> p
ggplotly(p)
Now we need to process a little bit of the data to assign a standard way to provide points based on the outcome of every match:
Points | Outcome |
---|---|
\(3\) | Victory |
\(1\) | Tie |
\(0\) | Defeat |
In FIFA scores, 2 points can be achieved by winning a shootout after a tied match, however, I ignored that for the following analysis
Let’s take a look on how it looks now:
results %>%
mutate(tied=ifelse(home_score == away_score,TRUE,FALSE)) %>%
mutate(home_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,3,0))) %>%
mutate(away_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,0,3))) -> results
results %>%
filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
head(worldCupResults)
After this step we also need to transform a little bit the structure of this dataset in order to measure the performance of each National Team in this way:
Then we can see how it looks (for tournaments that contain "FIFA World Cup"
in its name).
results %>%
pivot_longer(c(home_team,away_team),names_to = "homeaway", values_to = "team") %>%
mutate(points=ifelse(grepl("home",homeaway),home_points,away_points),
goals=ifelse(grepl("home",homeaway),home_score,away_score),
receivedGoals=ifelse(grepl("home",homeaway),away_score,home_score)) %>%
select(date,tournament,country,team,points,goals,receivedGoals) -> results
results %>%
filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
worldCupResults %>%
group_by(team) %>%
summarise( p=sum(points),
goals=sum(goals),
against=sum(receivedGoals),
matches=n()) %>%
mutate(performance=p/matches,
ofensive=goals/matches,
defense=against/matches) %>%
arrange(desc(performance)) %>%
head()
The most interesting matches occur at FIFA World Cup. So we can focus on what happens in this tournament:
worldCupResults %>%
group_by(team) %>%
summarise( p=sum(points),
goals=sum(goals),
against=sum(receivedGoals),
matches=n()) %>%
mutate( performance=p/matches,
ofensive=goals/matches,
defense=against/matches) %>%
ggplot(aes(x=performance,y=ofensive,size=defense,color=matches,text=team)) +
geom_point() +
labs(title="Performance vs offensiveness of National Teams in matches") -> p
ggplotly(p)
Germany emerges as the best in performance over all the matches related to the World Cup. Is not a surprise at all, remember all of the “goleadas” that has produced, in the qualifiers as well as in the knock-out matches in the final stages of the tournament.
Now we can take a look at what happens if we focus only on the final stage, I mean filtering out the qualifiers:
worldCupResults %>%
filter(!grepl("qualifi",tournament)) %>%
group_by(team) %>%
summarise( p=sum(points),
goals=sum(goals),
against=sum(receivedGoals),
matches=n()) %>%
mutate( performance=p/matches,
ofensive=goals/matches,
defense=against/matches) %>%
ggplot(aes(x=performance, y=ofensive, size=defense, color=matches, text=team)) +
geom_point() +
labs(title="Performance vs offensiveness of National Teams in matches of FIFA World Cup") -> p
ggplotly(p)
Now Brazil is the highlight. And it is not a surprise, they are also a winning machine in football. Germany got second place (this also is not a surprise).
Another interesting thing to appreciate is the offensive of Hungary! For all of the young people, you must know that in other times they were a true fear for every other nation in football.
And yes, Italy is another highlight, they invented the legendary Catenaccio. This is a very defensive way to play with a great performance as you can see!
You can also look to the Netherlands (the eternal runner-up), and my Mexico that is the “Otra vez nos quedamos sin el quinto partido”.
It is also interesting to take a more general view of the history of international football. (To forget about the unsurprising performance of my team).
results %>%
group_by(team) %>%
summarise( p=sum(points),
goals=sum(goals),
against=sum(receivedGoals),
matches=n()) %>%
mutate( performance=p/matches,
ofensive=goals/matches,
defense=against/matches) %>%
ggplot(aes( x=performance, y=ofensive, size=defense, color=matches, text=team)) +
geom_point() +
labs(title="Performance vs offensive in all matches") -> p
ggplotly(p)
And there are a lot of surprises! Maybe they are due to the lack of competitiveness, however, it is noticeable that in the center are the most popular teams, also with more played matches, and it seems that competency tends to center the behavior of the parameters selected.
So, we can take a look at how defense relates to offensive:
results %>%
group_by(team) %>%
summarise( p=sum(points),
goals=sum(goals),
against=sum(receivedGoals),
matches=n()) %>%
mutate( performance=p/matches,
ofensive=goals/matches,
defense=against/matches) %>%
ggplot(aes( color=performance, x=ofensive, y=defense, text=paste(team,matches,sep="\n"))) +
geom_point() +
labs(title="Defense vs offensive in all matches") -> p
ggplotly(p)
Maybe we can construct an ML tool to classify teams according to their performance. There is a pattern, however, there is also a lot of noise due to not-so-professional matches in this dataset.