In this review I compare Pokemon stats to gain insight into how Pokemon can be classified by their capabilities in competitive battles.

Libraries

The analysis is done in R, so we first need to load some libraries:

library(tidyverse)
library(plotly)
library(factoextra)
library(heatmaply)
library(knitr)
library(caret)

Dataset

I use a Pokemon dataset available on Kaggle.

pstats <- read.csv("../pokemon.csv")

This dataset contains the battle stats of each Pokemon:

kable(head(pstats))
Name           Total  HP  Attack  Defence  Sp_attack  Sp_defence  Speed
Bulbasaur        318  45      49       49         65          65     45
Ivysaur          405  60      62       63         80          80     60
Venusaur         525  80      82       83        100         100     80
Mega Venusaur    625  80     100      123        122         120     80
Charmander       309  39      52       43         60          50     65
Charmeleon       405  58      64       58         80          65     80

Exploratory analysis

The first step is an exploratory analysis of the variables in this dataset. It is useful to start by checking whether the distributions of the stats differ noticeably:

pstats %>% 
  pivot_longer(.,c(HP,Attack,Defence,Sp_attack,Sp_defence,Speed),names_to = "stat") -> pivstats
pivstats %>%
  ggplot(aes(x=stat,y=value, fill=stat)) +
  geom_boxplot() +
  labs(title = "Distribution of stats") -> p
ggplotly(p)

As you can see, the stats have similar ranges and spreads, so we can use the raw data as it is.
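
A quick numeric check can complement the boxplot. The sketch below is only illustrative: it uses the six rows shown in the table above rather than the full dataset, and `sample_stats` and `spread` are names introduced here. It reports the minimum, maximum, and standard deviation of each stat. If the spreads differed by orders of magnitude, standardizing before PCA (for example with `prcomp(..., scale. = TRUE)`) would be worth considering.

```r
# First six rows of the dataset, copied from the preview table (not the full data)
sample_stats <- data.frame(
  Name       = c("Bulbasaur", "Ivysaur", "Venusaur",
                 "Mega Venusaur", "Charmander", "Charmeleon"),
  HP         = c(45, 60, 80, 80, 39, 58),
  Attack     = c(49, 62, 82, 100, 52, 64),
  Defence    = c(49, 63, 83, 123, 43, 58),
  Sp_attack  = c(65, 80, 100, 122, 60, 80),
  Sp_defence = c(65, 80, 100, 120, 50, 65),
  Speed      = c(45, 60, 80, 80, 65, 80)
)
stat_cols <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

# One row per stat: range and standard deviation
spread <- data.frame(
  min = sapply(sample_stats[stat_cols], min),
  max = sapply(sample_stats[stat_cols], max),
  sd  = sapply(sample_stats[stat_cols], sd)
)
print(round(spread, 1))
```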

Next, I generate an interactive plot to see how the Pokemon rank on each stat:

pivstats %>%
  ggplot(aes(x=stat,y=reorder(value,value),fill=Total,text=Name)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title="Pokes ordered by stat", y="value") -> p
ggplotly(p)

Another obvious plot compares the Total variable against each of its component stats to look for visible patterns in the data:

pivstats %>%
  ggplot(aes(color=stat,y=value,x=Total,text=Name)) +
  geom_point() +
  labs(title="Plotting stats vs Total") -> p
ggplotly(p)

Principal component analysis

The next step is dimensionality reduction. The idea is to check how the primary variables contribute to the dispersion of Pokemon battle capabilities.

I use PCA to investigate how the variables in this dataset relate to each other. The first image plots the first and second principal components, with the third shown as a color scale.

PC1 appears to capture an overall summary of each Pokemon's battle capabilities. One Pokemon stands out: "Mega Eternatus", which is very different from the rest because of its very high stats. Other strong Pokemon, such as "Mega Rayquaza", "Mega Groudon", and "Mega Kyogre", are its nearest neighbors, and they follow the same trend along PC1.

pstats %>% select(HP,Attack,Defence,Sp_attack,Sp_defence,Speed) %>% prcomp() -> pca_poke
pcpoke <- pca_poke$x
pcpoke <- cbind(as.data.frame(pcpoke),nombre=pstats$Name)
pcpoke %>% ggplot(aes(x=PC1,y=PC2, color=PC3, text=nombre)) + geom_point() + labs(title = "Pokes in principal components") -> p
ggplotly(p)
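
To judge how much of the dispersion each component carries, one can look at the proportion of variance explained. The following is a minimal sketch under stated assumptions: it runs `prcomp()` with the same default settings as above, but only on the six rows shown in the dataset preview, so the numbers will differ from those of the full dataset. `stats6` and `var_explained` are names introduced for this example.

```r
# Six preview rows only (assumption: stands in for the full dataset)
stats6 <- rbind(
  Bulbasaur       = c(45, 49, 49, 65, 65, 45),
  Ivysaur         = c(60, 62, 63, 80, 80, 60),
  Venusaur        = c(80, 82, 83, 100, 100, 80),
  "Mega Venusaur" = c(80, 100, 123, 122, 120, 80),
  Charmander      = c(39, 52, 43, 60, 50, 65),
  Charmeleon      = c(58, 64, 58, 80, 65, 80)
)
colnames(stats6) <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

# prcomp() centers the columns by default and does not rescale them
pca6 <- prcomp(stats6)

# Share of total variance carried by each component, and the running total
var_explained <- pca6$sdev^2 / sum(pca6$sdev^2)
print(round(var_explained, 3))
print(round(cumsum(var_explained), 3))
```

If PC1 dominates this vector, that supports reading it as an overall strength axis.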

The second image projects PC1, PC2, PC3, and PC4 in a single plot. The main outlier is again clearly "Mega Eternatus", but another Pokemon is highlighted (in yellow): "Shuckle", the Bug-type Pokemon with the highest Defence in the game (because of its shell).

plot_ly(pcpoke, x= ~PC1, y=  ~PC2, z= ~PC3, color = ~PC4, text= ~nombre)
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

In that visualization, the Pokemon fill a cone-shaped region in the space of PC1, PC2, and PC3. That region could be the main design space from which Nintendo and Game Freak pick the stats of new Pokemon in every generation.

The next step is to get a better view of how the original variables contribute to the first two principal components. The plot supports the first impression that PC1 is an overall summary, since every variable points in a broadly similar direction. Interestingly, PC2 seems to separate aggressive stats (Speed, Sp_attack), which take positive values, from defensive stats, which fall on the other side.

fviz_pca_var(pca_poke, col.var = "contrib", gradient.cols=c("#00AFBB","#E7B800","#FC4E07"), repel=TRUE) -> p
p
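
The directions shown in this plot come from the PCA loadings, which can also be inspected directly in the `rotation` element of the `prcomp` result. A small sketch, restricted to the six rows of the dataset preview (an assumption made to keep it self-contained; `stats6` and `loadings` are illustrative names). Note that the sign of a principal component is arbitrary, so what matters on PC1 is whether the loadings share a sign, not whether they are positive.

```r
# Six preview rows only (assumption: stands in for the full dataset)
stats6 <- rbind(
  Bulbasaur       = c(45, 49, 49, 65, 65, 45),
  Ivysaur         = c(60, 62, 63, 80, 80, 60),
  Venusaur        = c(80, 82, 83, 100, 100, 80),
  "Mega Venusaur" = c(80, 100, 123, 122, 120, 80),
  Charmander      = c(39, 52, 43, 60, 50, 65),
  Charmeleon      = c(58, 64, 58, 80, 65, 80)
)
colnames(stats6) <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

pca6 <- prcomp(stats6)

# Rows are the original stats, columns are the principal components
loadings <- pca6$rotation
print(round(loadings[, c("PC1", "PC2")], 2))

# TRUE when every stat loads on PC1 with the same sign
same_sign <- length(unique(sign(loadings[, "PC1"]))) == 1
print(same_sign)
```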

Clustering analysis

The PCA shows that there is a region commonly used to create new Pokemon across all generations. It also shows two outliers, although the rest of the Pokemon could also be clustered into different groups. In this section I show a classification that can be obtained from these stats alone.

I use \(k=6\) for this analysis, looking for classes roughly similar to this scheme:

  1. Common Pokemon
  2. Strong Pokemon
  3. Competitive Pokemon
  4. Prohibited Pokemon
  5. Shuckle?
  6. Mega Eternatus?
ptstats<-as.matrix(pstats)
rownames(ptstats)<-pstats$Name
d<-dist(ptstats)
## Warning in dist(ptstats): NAs introduced by coercion
h<-hclust(d)
fviz_dend(x=h,k=6)
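
The call above passes the whole table, including the `Name` column, to `dist()`, which is what triggers the coercion warning. A possible variation, sketched below as an assumption rather than as the method used in this analysis, restricts the distance matrix to the six stat columns. It uses only the six rows of the dataset preview and `k = 2`, so it shows the mechanics (`dist()`, `hclust()`, `cutree()`), not the actual six classes.

```r
# Six preview rows only (assumption: stands in for the full dataset)
stats6 <- rbind(
  Bulbasaur       = c(45, 49, 49, 65, 65, 45),
  Ivysaur         = c(60, 62, 63, 80, 80, 60),
  Venusaur        = c(80, 82, 83, 100, 100, 80),
  "Mega Venusaur" = c(80, 100, 123, 122, 120, 80),
  Charmander      = c(39, 52, 43, 60, 50, 65),
  Charmeleon      = c(58, 64, 58, 80, 65, 80)
)
colnames(stats6) <- c("HP", "Attack", "Defence", "Sp_attack", "Sp_defence", "Speed")

d6 <- dist(stats6)    # Euclidean distances on numeric stats only, so no coercion
h6 <- hclust(d6)      # complete linkage is the default
clusters <- cutree(h6, k = 2)   # cut the tree into two groups
print(clusters)
```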

And projected onto a heatmap:

heatmaply(apply(ptstats[,3:8],c(1,2),as.numeric))
## Warning in fix_not_all_unique(rownames(x)): Not all the values are unique -
## manually added prefix numbers

The next step is to apply the class labels obtained by hierarchical clustering to the PCA projection.

cluspoke<-cutree(h,k=6)
cbind(pcpoke,cluspoke) %>%
  ggplot(aes(x=PC1,y=PC2, color=as.factor(cluspoke), text=nombre)) +
  geom_point() +
  labs(title = "Classes of Pokemon") -> p
ggplotly(p)

Finally, I plot the relationship between the Defence and Attack variables, using the class labels to reveal hidden patterns (if they exist).

cbind(pstats,cluspoke) %>%
  ggplot(aes(x=Defence,y=Attack, color=as.factor(cluspoke), size=Total, text=Name)) +
  geom_point() +
  labs(title = "Comparing stats on classes of Pokemon") -> p
ggplotly(p)

Classification using Machine Learning

poketype<- read.csv("../pokedex_(Update_05.20).csv", row.names = 1)
pstats %>% inner_join(poketype, by = c("Name" = "name")) %>% select(pokedex_number,Name,HP,Attack,Defence,Sp_attack,Sp_defence,Speed,type_1) -> pstats_wtype
index <- createDataPartition(pstats_wtype$type_1, p=0.65, list=FALSE)
pwtype.training <- pstats_wtype[index,]
pwtype.test <- pstats_wtype[-index,]
model_type_knn <- train(pwtype.training[,3:8], pwtype.training[,9], method="knn", preProcess = c("center","scale"))
## Registered S3 methods overwritten by 'proxy':
##   method               from    
##   print.registry_field registry
##   print.registry_entry registry
predictions <- predict(object=model_type_knn,pwtype.test[,3:8])
table(predictions)
## predictions
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       29        3       11       22        3        6       29        0 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##        3       27       17        2       51       17       15       13 
##    Steel    Water 
##       13       50
testLabels <- pwtype.test[,9]
confusionMatrix(predictions,as.factor(testLabels))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bug Dark Dragon Electric Fairy Fighting Fire Flying Ghost Grass
##   Bug        6    2      0        4     0        0    1      0     0     5
##   Dark       0    0      0        0     0        2    0      0     0     0
##   Dragon     0    1      2        0     0        0    0      0     1     1
##   Electric   1    1      1        2     0        0    2      0     2     3
##   Fairy      1    0      0        0     0        0    0      0     0     0
##   Fighting   0    0      0        0     2        0    0      0     0     0
##   Fire       2    2      2        3     0        2    6      0     1     1
##   Flying     0    0      0        0     0        0    0      0     0     0
##   Ghost      0    0      0        0     0        0    0      0     1     0
##   Grass      3    0      0        0     1        0    4      0     0     5
##   Ground     1    0      0        1     0        4    1      0     0     0
##   Ice        0    0      0        0     0        0    0      0     0     1
##   Normal     2    2      0        2     0        3    2      2     0     4
##   Poison     3    2      1        2     0        0    1      0     0     2
##   Psychic    0    0      0        0     1        0    0      0     1     2
##   Rock       4    0      1        0     0        0    0      0     0     0
##   Steel      1    0      1        0     0        0    0      0     2     2
##   Water      3    3      4        2     3        1    3      0     2     4
##           Reference
## Prediction Ground Ice Normal Poison Psychic Rock Steel Water
##   Bug           0   0      3      0       3    1     1     3
##   Dark          0   0      0      0       0    1     0     0
##   Dragon        0   0      0      0       2    0     1     3
##   Electric      0   1      2      1       2    1     0     3
##   Fairy         0   0      0      0       2    0     0     0
##   Fighting      2   0      1      1       0    0     0     0
##   Fire          0   2      1      0       1    2     0     4
##   Flying        0   0      0      0       0    0     0     0
##   Ghost         0   0      1      1       0    0     0     0
##   Grass         1   0      3      2       1    0     1     6
##   Ground        2   1      0      0       0    4     2     1
##   Ice           0   0      0      0       0    0     0     1
##   Normal        3   1     19      4       0    0     0     7
##   Poison        0   2      2      0       0    0     0     2
##   Psychic       0   0      2      0       4    0     1     4
##   Rock          1   0      1      0       0    3     0     3
##   Steel         1   0      0      1       0    2     2     1
##   Water         1   2      4      2       5    3     3     5
## 
## Overall Statistics
##                                           
##                Accuracy : 0.1833          
##                  95% CI : (0.1419, 0.2308)
##     No Information Rate : 0.1383          
##     P-Value [Acc > NIR] : 0.01576         
##                                           
##                   Kappa : 0.1093          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Bug Class: Dark Class: Dragon Class: Electric
## Sensitivity             0.22222    0.000000      0.166667        0.125000
## Specificity             0.91901    0.989933      0.969900        0.932203
## Pos Pred Value          0.20690    0.000000      0.181818        0.090909
## Neg Pred Value          0.92553    0.957792      0.966667        0.951557
## Prevalence              0.08682    0.041801      0.038585        0.051447
## Detection Rate          0.01929    0.000000      0.006431        0.006431
## Detection Prevalence    0.09325    0.009646      0.035370        0.070740
## Balanced Accuracy       0.57062    0.494966      0.568283        0.528602
##                      Class: Fairy Class: Fighting Class: Fire Class: Flying
## Sensitivity              0.000000         0.00000     0.30000      0.000000
## Specificity              0.990132         0.97993     0.92096      1.000000
## Pos Pred Value           0.000000         0.00000     0.20690           NaN
## Neg Pred Value           0.977273         0.96066     0.95035      0.993569
## Prevalence               0.022508         0.03859     0.06431      0.006431
## Detection Rate           0.000000         0.00000     0.01929      0.000000
## Detection Prevalence     0.009646         0.01929     0.09325      0.000000
## Balanced Accuracy        0.495066         0.48997     0.61048      0.500000
##                      Class: Ghost Class: Grass Class: Ground Class: Ice
## Sensitivity              0.100000      0.16667      0.181818   0.000000
## Specificity              0.993355      0.92171      0.950000   0.993377
## Pos Pred Value           0.333333      0.18519      0.117647   0.000000
## Neg Pred Value           0.970779      0.91197      0.969388   0.970874
## Prevalence               0.032154      0.09646      0.035370   0.028939
## Detection Rate           0.003215      0.01608      0.006431   0.000000
## Detection Prevalence     0.009646      0.08682      0.054662   0.006431
## Balanced Accuracy        0.546678      0.54419      0.565909   0.496689
##                      Class: Normal Class: Poison Class: Psychic Class: Rock
## Sensitivity                0.48718       0.00000        0.20000    0.176471
## Specificity                0.88235       0.94314        0.96220    0.965986
## Pos Pred Value             0.37255       0.00000        0.26667    0.230769
## Neg Pred Value             0.92308       0.95918        0.94595    0.953020
## Prevalence                 0.12540       0.03859        0.06431    0.054662
## Detection Rate             0.06109       0.00000        0.01286    0.009646
## Detection Prevalence       0.16399       0.05466        0.04823    0.041801
## Balanced Accuracy          0.68477       0.47157        0.58110    0.571228
##                      Class: Steel Class: Water
## Sensitivity              0.181818      0.11628
## Specificity              0.963333      0.83209
## Pos Pred Value           0.153846      0.10000
## Neg Pred Value           0.969799      0.85441
## Prevalence               0.035370      0.13826
## Detection Rate           0.006431      0.01608
## Detection Prevalence     0.041801      0.16077
## Balanced Accuracy        0.572576      0.47418
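
The headline figures in this output can be reproduced by hand from the confusion matrix. The sketch below uses counts read off the tables above: 311 test rows, 57 on the diagonal, and 43 true Water rows (the largest class). The `binom.test()` call is an assumption about how the `P-Value [Acc > NIR]` line is obtained, namely a one-sided test of the accuracy against the no-information rate.

```r
n_test     <- 311  # total number of test rows (sum of all cells)
n_correct  <- 57   # sum of the diagonal of the confusion matrix
n_majority <- 43   # rows whose true type is Water, the most frequent class

accuracy <- n_correct / n_test
nir      <- n_majority / n_test   # no-information rate

cat(sprintf("accuracy = %.4f, NIR = %.4f\n", accuracy, nir))
# accuracy = 0.1833, NIR = 0.1383

# Assumed reconstruction of the reported test: is accuracy greater than the NIR?
p_value <- binom.test(n_correct, n_test, p = nir, alternative = "greater")$p.value
print(p_value)
```

An accuracy only a few points above the no-information rate suggests that the six stats carry little information about primary type.
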
predictions2 <- predict(object = model_type_knn, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title="Prediction using K Nearest Neighbours") -> p
ggplotly(p)
#model_type_dnn <- train(pwtype.training[,3:8], pwtype.training[,9], method="dnn", preProcess = c("center","scale"))
model_type_dnn <- readRDS("model_type_dnn.rds")
predictions <- predict(object=model_type_dnn,pwtype.test[,3:8])
table(predictions)
## predictions
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##        0        0        0        0        0        0        0        0 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##        0        0        0        0        0        0        0        0 
##    Steel    Water 
##        0      311
predictions2 <- predict(object = model_type_dnn, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title = "Prediction using Deep Neural Network") -> p
ggplotly(p)
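
The prediction table shows that this model assigns every one of the 311 test rows to Water. A classifier that always predicts the most frequent class scores exactly the no-information rate, which the sketch below illustrates. The label vector is a stand-in built from the counts in the confusion matrix (43 Water rows out of 311); the 268 remaining rows are lumped into a placeholder class "Other" purely for illustration.

```r
# Stand-in for the true test labels: 43 Water rows, everything else as "Other"
truth <- c(rep("Water", 43), rep("Other", 268))

# A degenerate model that answers "Water" for every row
constant_pred <- rep("Water", length(truth))

# Its accuracy is simply the share of Water rows in the test set
constant_accuracy <- mean(constant_pred == truth)
cat(sprintf("constant-prediction accuracy = %.4f\n", constant_accuracy))
# constant-prediction accuracy = 0.1383
```
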
model_type_rf <- train(pwtype.training[,3:8], pwtype.training[,9], method="rf", preProcess = c("center","scale"))
## Warning: model fit failed for Resample06: mtry=2 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning: model fit failed for Resample06: mtry=4 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning: model fit failed for Resample06: mtry=6 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=2 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=4 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning: model fit failed for Resample22: mtry=6 Error in randomForest.default(x, y, mtry = min(param$mtry, ncol(x)), ...) : 
##   Can't have empty classes in y.
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
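
One plausible reading of these warnings is that some bootstrap resamples contain no row of a rare type, leaving that factor level empty. The toy sketch below uses made-up class counts (not the real training set) to show how often a resample with replacement loses a class that has only two rows.

```r
set.seed(42)

# Hypothetical labels: two common classes and one rare class with two rows
y <- factor(c(rep("Water", 40), rep("Normal", 30), rep("Flying", 2)))
n <- length(y)

# Chance that a bootstrap sample of size n misses both Flying rows
p_drop <- (1 - 2 / n)^n

# Empirical check: table() on a factor keeps empty levels, so a zero count
# means that class vanished from the resample
lost_class <- replicate(2000, {
  resample <- y[sample.int(n, replace = TRUE)]
  any(table(resample) == 0)
})

cat(sprintf("analytic = %.3f, simulated = %.3f\n", p_drop, mean(lost_class)))
```

With a class this small, a noticeable share of resamples loses it, which would be consistent with only a few of the resamples failing.
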
predictions <- predict(object=model_type_rf,pwtype.test[,3:8])
table(predictions)
## predictions
##      Bug     Dark   Dragon Electric    Fairy Fighting     Fire   Flying 
##       31        6        8       15        3        9       26        0 
##    Ghost    Grass   Ground      Ice   Normal   Poison  Psychic     Rock 
##        5       26       15        1       65       13       22       11 
##    Steel    Water 
##        8       47
predictions2 <- predict(object = model_type_rf, pstats[,3:8])
cbind(pcpoke,predictions2) %>% ggplot(aes(x=PC1,y=PC2,color=predictions2, text=nombre)) + geom_point() + labs(title = "Predictions using Random Forest") -> p
ggplotly(p)