In this review I compare Pokemon stats to gain insight into how Pokemon can be classified by their capabilities in competitive battles.
The analysis is done in R, so we first need to load some libraries:
library(tidyverse)
library(plotly)
library(factoextra)
library(heatmaply)
library(knitr)
library(caret)
I will use a Pokemon dataset available on Kaggle.
pstats <- read.csv("../pokemon.csv")
This dataset contains the following battle stats for each Pokemon:
HP
The health (hit points) of the Pokemon; it can be thought of as its stamina.
Attack
The power of physical attacks.
Defence
The resistance to physical attacks.
Sp_attack
The power of non-physical (special / energy) attacks.
Sp_defence
The resistance to non-physical attacks.
Speed
How fast the Pokemon is to perform an attack.
Total
The sum of all the other variables except Name.
Name
The name of the Pokemon.
kable(head(pstats))
Name | Total | HP | Attack | Defence | Sp_attack | Sp_defence | Speed |
---|---|---|---|---|---|---|---|
Bulbasaur | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
Ivysaur | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
Venusaur | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
Mega Venusaur | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
Charmander | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
Charmeleon | 405 | 58 | 64 | 58 | 80 | 65 | 80 |
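The claim that Total is the sum of the six battle stats can be checked directly. The sketch below rebuilds the six rows shown in the table as a small data frame (`toy`, `stat_cols` and `mismatch` are names I made up for illustration) and looks for rows where the sum does not match; on the full dataset the same check would run on `pstats`.

```r
# Toy copy of the six rows shown in the table above
toy <- data.frame(
  Name       = c("Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur",
                 "Charmander", "Charmeleon"),
  Total      = c(318, 405, 525, 625, 309, 405),
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
)
stat_cols <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

# Names of the rows whose stats do not add up to Total
mismatch <- toy$Name[rowSums(toy[, stat_cols]) != toy$Total]
print(mismatch)
```

For these six rows the result is empty, which is consistent with the description of Total.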
The first step is an exploratory analysis of the variables in this dataset. It is always useful to start by checking whether there are identifiable differences in the distributions of the data:
pstats %>%
pivot_longer(.,c(HP,Attack,Defence,Sp_attack,Sp_defence,Speed),names_to = "stat") -> pivstats
pivstats %>%
ggplot(aes(x=stat,y=value, fill=stat)) +
geom_boxplot() +
labs(title = "Distribution of stats") -> p
ggplotly(p)
As you can see, the variables are on comparable scales, so we can use the raw data as it is.
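What "comparable" means here can also be checked numerically rather than by eye. This is a minimal sketch on the six rows from the table above (the `stats` and `spread` names are mine); if the ranges differed by orders of magnitude, the usual alternative would be to standardize, for example with `prcomp(..., scale. = TRUE)`.

```r
# Six battle stats for the six rows shown in the table above
stats <- data.frame(
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
)

# Minimum, maximum and standard deviation of each stat, one row per stat
spread <- t(sapply(stats, function(v) c(min = min(v), max = max(v), sd = sd(v))))
print(round(spread, 1))
```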
I generate an interactive plot to see how the Pokemon rank on each stat:
pivstats %>%
ggplot(aes(x=stat,y=reorder(value,value),fill=Total,text=Name)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Pokes ordered by stat", y="value") -> p
ggplotly(p)
An obvious next plot compares the Total variable to each of its component variables, to look for visible patterns in the data:
pivstats %>%
ggplot(aes(color=stat,y=value,x=Total,text=Name)) +
geom_point() +
labs(title="Plotting stats vs Total") -> p
ggplotly(p)
The next step is dimensionality reduction: the idea is to check how the primary variables contribute to the dispersion of the Pokemon's battle capabilities.
I decided to use PCA to investigate how the variables relate in this dataset. The first plot shows the first and second principal components, with the third displayed as a color scale.
PC1 appears to summarize the overall battle capability of each Pokemon. One Pokemon stands out: "Mega Eternatus", which is very different from the rest because of its very high stats. Other strong Pokemon, such as "Mega Rayquaza", "Mega Groudon" and "Mega Kyogre", are its nearest neighbors, and they follow the same trend along PC1.
pstats %>% select(HP,Attack,Defence,Sp_attack,Sp_defence,Speed) %>% prcomp() -> pca_poke
pcpoke <- pca_poke$x
pcpoke <- cbind(as.data.frame(pcpoke),nombre=pstats$Name)
pcpoke %>% ggplot(aes(x=PC1,y=PC2, color=PC3, text=nombre)) + geom_point() + labs(title = "Pokes in principal components") -> p
ggplotly(p)
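How much of the spread each component captures can be read from the standard deviations that `prcomp()` returns. The sketch below runs on the six rows from the table at the top, so the numbers it prints say nothing about the full dataset; `stats` and `var_explained` are illustrative names, and on the real data the same two lines would apply to `pca_poke`.

```r
# Six battle stats for the six rows shown in the table at the top
stats <- as.matrix(data.frame(
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
))

pca <- prcomp(stats)  # centered, not scaled, as in the original call

# Share of the total variance carried by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- colnames(pca$x)
print(round(cumsum(var_explained), 3))
```

If PC1 really acts as an overall summary, its share should be clearly larger than that of the other components.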
The second plot projects PC1, PC2 and PC3 in a single 3D view, with PC4 as the color scale. The main outlier is clearly "Mega Eternatus"; however, another Pokemon is highlighted (in yellow): "Shuckle", the bug with the highest defense in the game (because of its shell).
plot_ly(pcpoke, x= ~PC1, y= ~PC2, z= ~PC3, color = ~PC4, text= ~nombre)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
In that visualization, the Pokemon fill a cone-shaped region over PC1, PC2 and PC3. That could be the main space from which Nintendo and Game Freak pick the stats of new Pokemon in every generation.
The next step is to get a better view of how the original variables contribute to the first two principal components. In this plot, the first impression of PC1 as an overall summary is supported, since every variable points in a similar direction along it. An interesting detail is that PC2 seems to separate aggressive stats (Speed, Sp_attack) on the positive side from defensive stats on the other.
fviz_pca_var(pca_poke, col.var = "contrib", gradient.cols=c("#00AFBB","#E7B800","#FC4E07"), repel=TRUE) -> p
p
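The same reading can be done numerically from the loadings stored in the `rotation` element of a `prcomp` result. This is a sketch on the six rows from the table at the top (names such as `loadings` are mine), so the signs it prints are only illustrative. The sign of a principal component is arbitrary, so what matters is which variables share a sign, not which side is positive.

```r
# Six battle stats for the six rows shown in the table at the top
stats <- as.matrix(data.frame(
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
))
pca <- prcomp(stats)

# Loadings: contribution of each original variable to PC1 and PC2
loadings <- pca$rotation[, 1:2]
print(round(loadings, 2))

# Variables grouped by the sign of their PC2 loading
print(split(rownames(loadings), loadings[, "PC2"] > 0))
```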
The PCA suggests that there is a region commonly used to create new Pokemon across all generations. It also shows two outliers, while the rest of the Pokemon could still be grouped into different clusters. In this section I want to show a classification that can be obtained from these stats alone.
I will use \(k=6\) for this analysis, looking for classes roughly similar to this scheme:
Shuckle?
Mega Eternatus?
ptstats<-as.matrix(pstats)
rownames(ptstats)<-pstats$Name
d<-dist(ptstats)
## Warning in dist(ptstats): NAs introduced by coercion
h<-hclust(d)
fviz_dend(x=h,k=6)
And projected in a heatmap:
heatmaply(apply(ptstats[,3:8],c(1,2),as.numeric))
## Warning in fix_not_all_unique(rownames(x)): Not all the values are unique -
## manually added prefix numbers
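The coercion warning from `dist()` most likely comes from converting the whole data frame, including the Name column, into a character matrix. A possible variant, sketched here on the six rows from the table at the top, builds the matrix from the six stat columns only, so no coercion is needed. This is an assumption about what was intended, not the code used for the plots above, and it would also leave Total out of the distance; `num`, `groups` and the choice of `k = 2` are for illustration only.

```r
# Toy copy of the six rows shown in the table at the top
toy <- data.frame(
  Name       = c("Bulbasaur", "Ivysaur", "Venusaur", "Mega Venusaur",
                 "Charmander", "Charmeleon"),
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
)
stat_cols <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

# Numeric matrix of stats only; names go into the row names
num <- as.matrix(toy[, stat_cols])
rownames(num) <- toy$Name

d <- dist(num)               # Euclidean distances, no coercion warning
h <- hclust(d)               # complete linkage, the hclust() default
groups <- cutree(h, k = 2)   # k = 2 only because the toy data is tiny
print(groups)
```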
The next step is to use the class labels obtained by hierarchical clustering in the PCA projection.
cluspoke<-cutree(h,k=6)
cbind(pcpoke,cluspoke) %>%
ggplot(aes(x=PC1,y=PC2, color=as.factor(cluspoke), text=nombre)) +
geom_point() +
labs(title = "Classes of Pokemon") -> p
ggplotly(p)
And the same class labels on a plot of Defence against Attack, to reveal hidden patterns (if they exist).
cbind(pstats,cluspoke) %>%
ggplot(aes(x=Defence,y=Attack, color=as.factor(cluspoke), size=Total, text=Name)) +
geom_point() +
labs(title = "Comparing stats on classes of Pokemon") -> p
ggplotly(p)
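Beyond the plots, a quick way to describe what each class looks like is to average the stats within each cluster. The sketch below does this with `aggregate()` on the six rows from the table at the top; `stats`, `groups` and `profile` are illustrative names, and on the full data the grouping vector would be `cluspoke`.

```r
# Six battle stats for the six rows shown in the table at the top
stats <- data.frame(
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
)

# Cluster labels from hierarchical clustering (k = 2 for the toy data)
groups <- cutree(hclust(dist(stats)), k = 2)

# Mean of every stat within each cluster, one row per cluster
profile <- aggregate(stats, by = list(cluster = groups), FUN = mean)
print(profile)
```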
poketype<- read.csv("../pokedex_(Update_05.20).csv", row.names = 1)
pstats %>% inner_join(poketype, by = c("Name" = "name")) %>% select(pokedex_number,Name,HP,Attack,Defence,Sp_attack,Sp_defence,Speed,type_1) -> pstats_wtype
index <- createDataPartition(pstats_wtype$type_1, p=0.65, list=FALSE)
pwtype.training <- pstats_wtype[index,]
pwtype.test <- pstats_wtype[-index,]
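As far as I know, `createDataPartition()` samples within each class, so every type should keep roughly the same share in training and test. The idea can be sketched in base R as below, with toy labels and made-up names (`type`, `by_class`, `train_idx`); the exact rounding caret applies may differ. Setting a seed first, which the split above does not do, makes the partition reproducible.

```r
set.seed(42)  # any fixed seed makes the split reproducible

# Toy class labels standing in for type_1
type <- rep(c("Water", "Fire", "Grass"), times = c(10, 6, 4))
p <- 0.65     # share of each class that goes to training

# Row indices grouped by class, then sampled class by class
by_class <- split(seq_along(type), type)
train_idx <- unlist(lapply(by_class, function(i) {
  i[sample.int(length(i), ceiling(p * length(i)))]
}), use.names = FALSE)
train_idx <- sort(train_idx)
test_idx <- setdiff(seq_along(type), train_idx)

print(table(type[train_idx]))
print(table(type[test_idx]))
```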
model_type_knn <- train(pwtype.training[,3:8], pwtype.training[,9], method="knn", preProcess = c("center","scale"))
## Registered S3 methods overwritten by 'proxy':
## method from
## print.registry_field registry
## print.registry_entry registry
predictions <- predict(object=model_type_knn,pwtype.test[,3:8])
table(predictions)
## predictions
## Bug Dark Dragon Electric Fairy Fighting Fire Flying
## 29 3 11 22 3 6 29 0
## Ghost Grass Ground Ice Normal Poison Psychic Rock
## 3 27 17 2 51 17 15 13
## Steel Water
## 13 50
testLabels <- pwtype.test[,9]
confusionMatrix(predictions,as.factor(testLabels))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bug Dark Dragon Electric Fairy Fighting Fire Flying Ghost Grass
## Bug 6 2 0 4 0 0 1 0 0 5
## Dark 0 0 0 0 0 2 0 0 0 0
## Dragon 0 1 2 0 0 0 0 0 1 1
## Electric 1 1 1 2 0 0 2 0 2 3
## Fairy 1 0 0 0 0 0 0 0 0 0
## Fighting 0 0 0 0 2 0 0 0 0 0
## Fire 2 2 2 3 0 2 6 0 1 1
## Flying 0 0 0 0 0 0 0 0 0 0
## Ghost 0 0 0 0 0 0 0 0 1 0
## Grass 3 0 0 0 1 0 4 0 0 5
## Ground 1 0 0 1 0 4 1 0 0 0
## Ice 0 0 0 0 0 0 0 0 0 1
## Normal 2 2 0 2 0 3 2 2 0 4
## Poison 3 2 1 2 0 0 1 0 0 2
## Psychic 0 0 0 0 1 0 0 0 1 2
## Rock 4 0 1 0 0 0 0 0 0 0
## Steel 1 0 1 0 0 0 0 0 2 2
## Water 3 3 4 2 3 1 3 0 2 4
## Reference
## Prediction Ground Ice Normal Poison Psychic Rock Steel Water
## Bug 0 0 3 0 3 1 1 3
## Dark 0 0 0 0 0 1 0 0
## Dragon 0 0 0 0 2 0 1 3
## Electric 0 1 2 1 2 1 0 3
## Fairy 0 0 0 0 2 0 0 0
## Fighting 2 0 1 1 0 0 0 0
## Fire 0 2 1 0 1 2 0 4
## Flying 0 0 0 0 0 0 0 0
## Ghost 0 0 1 1 0 0 0 0
## Grass 1 0 3 2 1 0 1 6
## Ground 2 1 0 0 0 4 2 1
## Ice 0 0 0 0 0 0 0 1
## Normal 3 1 19 4 0 0 0 7
## Poison 0 2 2 0 0 0 0 2
## Psychic 0 0 2 0 4 0 1 4
## Rock 1 0 1 0 0 3 0 3
## Steel 1 0 0 1 0 2 2 1
## Water 1 2 4 2 5 3 3 5
##
## Overall Statistics
##
## Accuracy : 0.1833
## 95% CI : (0.1419, 0.2308)
## No Information Rate : 0.1383
## P-Value [Acc > NIR] : 0.01576
##
## Kappa : 0.1093
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Bug Class: Dark Class: Dragon Class: Electric
## Sensitivity 0.22222 0.000000 0.166667 0.125000
## Specificity 0.91901 0.989933 0.969900 0.932203
## Pos Pred Value 0.20690 0.000000 0.181818 0.090909
## Neg Pred Value 0.92553 0.957792 0.966667 0.951557
## Prevalence 0.08682 0.041801 0.038585 0.051447
## Detection Rate 0.01929 0.000000 0.006431 0.006431
## Detection Prevalence 0.09325 0.009646 0.035370 0.070740
## Balanced Accuracy 0.57062 0.494966 0.568283 0.528602
## Class: Fairy Class: Fighting Class: Fire Class: Flying
## Sensitivity 0.000000 0.00000 0.30000 0.000000
## Specificity 0.990132 0.97993 0.92096 1.000000
## Pos Pred Value 0.000000 0.00000 0.20690 NaN
## Neg Pred Value 0.977273 0.96066 0.95035 0.993569
## Prevalence 0.022508 0.03859 0.06431 0.006431
## Detection Rate 0.000000 0.00000 0.01929 0.000000
## Detection Prevalence 0.009646 0.01929 0.09325 0.000000
## Balanced Accuracy 0.495066 0.48997 0.61048 0.500000
## Class: Ghost Class: Grass Class: Ground Class: Ice
## Sensitivity 0.100000 0.16667 0.181818 0.000000
## Specificity 0.993355 0.92171 0.950000 0.993377
## Pos Pred Value 0.333333 0.18519 0.117647 0.000000
## Neg Pred Value 0.970779 0.91197 0.969388 0.970874
## Prevalence 0.032154 0.09646 0.035370 0.028939
## Detection Rate 0.003215 0.01608 0.006431 0.000000
## Detection Prevalence 0.009646 0.08682 0.054662 0.006431
## Balanced Accuracy 0.546678 0.54419 0.565909 0.496689
## Class: Normal Class: Poison Class: Psychic Class: Rock
## Sensitivity 0.48718 0.00000 0.20000 0.176471
## Specificity 0.88235 0.94314 0.96220 0.965986
## Pos Pred Value 0.37255 0.00000 0.26667 0.230769
## Neg Pred Value 0.92308 0.95918 0.94595 0.953020
## Prevalence 0.12540 0.03859 0.06431 0.054662
## Detection Rate 0.06109 0.00000 0.01286 0.009646
## Detection Prevalence 0.16399 0.05466 0.04823 0.041801
## Balanced Accuracy 0.68477 0.47157 0.58110 0.571228
## Class: Steel Class: Water
## Sensitivity 0.181818 0.11628
## Specificity 0.963333 0.83209
## Pos Pred Value 0.153846 0.10000
## Neg Pred Value 0.969799 0.85441
## Prevalence 0.035370 0.13826
## Detection Rate 0.006431 0.01608
## Detection Prevalence 0.041801 0.16077
## Balanced Accuracy 0.572576 0.47418
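Two numbers in this output are worth reading together: the accuracy (0.1833) and the no-information rate (0.1383), which matches the prevalence of Water, the most common type in the test set. The no-information rate is the accuracy of always predicting the most frequent class. The sketch below shows how both can be computed from a confusion table, using made-up toy labels (`pred`, `ref`), not the real predictions.

```r
lv <- c("Fire", "Grass", "Water")

# Toy predicted and reference labels (illustrative only)
pred <- factor(c("Water", "Water", "Fire", "Grass",
                 "Water", "Fire", "Water", "Grass"), levels = lv)
ref  <- factor(c("Water", "Fire", "Fire", "Grass",
                 "Grass", "Water", "Water", "Water"), levels = lv)

cm <- table(Prediction = pred, Reference = ref)

# Accuracy: share of cases on the diagonal of the confusion table
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the most frequent reference class
nir <- max(colSums(cm)) / sum(cm)

print(c(accuracy = accuracy, nir = nir))
```

In the real output the gap between the two is small, so the KNN model is only slightly better than always guessing Water.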
predictions2 <- predict(object = model_type_knn, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title="Prediction using K Nearest Neighbours") -> p
ggplotly(p)
#model_type_dnn <- train(pwtype.training[,3:8], pwtype.training[,9], method="dnn", preProcess = c("center","scale"))
model_type_dnn <- readRDS("model_type_dnn.rds")
predictions <- predict(object=model_type_dnn,pwtype.test[,3:8])
table(predictions)
## predictions
## Bug Dark Dragon Electric Fairy Fighting Fire Flying
## 0 0 0 0 0 0 0 0
## Ghost Grass Ground Ice Normal Poison Psychic Rock
## 0 0 0 0 0 0 0 0
## Steel Water
## 0 311
predictions2 <- predict(object = model_type_dnn, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title = "Prediction using Deep Neural Network") -> p
ggplotly(p)
model_type_rf <- train(pwtype.training[,3:8], pwtype.training[,9], method="rf", preProcess = c("center","scale"))
## Warning: model fit failed for Resample06: mtry=2 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning: model fit failed for Resample06: mtry=4 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning: model fit failed for Resample06: mtry=6 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=2 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=4 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=6 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) :
## Can't have empty classes in y.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
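The "empty classes" warnings probably mean that some bootstrap resamples contained no examples of a rare type (Flying, for instance, has a prevalence of about 0.006 in the test set above), while the factor level was still present. One possible way to reduce this, sketched with toy labels, is to drop very rare classes before training. The threshold `min_n` and all names here are assumptions for illustration, and this does not guarantee that every resample contains every class.

```r
# Toy labels with one very rare class
type <- factor(c(rep("Water", 8), rep("Normal", 6), "Flying"))

min_n <- 5  # arbitrary minimum number of examples per class
counts <- table(type)

# Keep only rows whose class has at least min_n examples,
# then drop the now unused factor levels
keep <- type %in% names(counts)[counts >= min_n]
type_kept <- droplevels(type[keep])

print(table(type_kept))
```

The same logical vector `keep` would be used to subset the predictor columns, so that rows and labels stay aligned.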
predictions <- predict(object=model_type_rf,pwtype.test[,3:8])
table(predictions)
## predictions
## Bug Dark Dragon Electric Fairy Fighting Fire Flying
## 31 6 8 15 3 9 26 0
## Ghost Grass Ground Ice Normal Poison Psychic Rock
## 5 26 15 1 65 13 22 11
## Steel Water
## 8 47
predictions2 <- predict(object = model_type_rf, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title = "Predictions using Random Forest") -> p
ggplotly(p)