Muchos Modelos- Clase 1

library(modelr)
library(broom)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)

gapminder

Los datos de gapminder resumen el progreso de los países a lo largo del tiempo, mirando estadísticas como la esperanza de vida y el PBI. Los datos son de fácil acceso en R, gracias a Jenny Bryan, quien creó el paquete gapminder:

#install.packages("gapminder")
library(gapminder)
gapminder
#> # A tibble: 1,704 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ... with 1,694 more rows

En este estudio de caso, vamos a centrarnos solo en tres variables para responder a la pregunta

“¿Cómo cambia la esperanza de vida (lifeExp) en el tiempo (year) para cada país (country)?”

Un buen lugar para comenzar es con una gráfico:

gapminder %>% 
  ggplot(aes(year, lifeExp, group = country)) +
    geom_line(alpha = 1/3)

Se puede ver que en general hay una tendencia a que suba la esperanza de vida en el tiempo, pero también algunos países parecen tener un comportamiento diferente.

Si quisieramos calcular el modelo lineal para alguno de estos países


nz <- filter(gapminder, country == "New Zealand")
nz %>% 
  ggplot(aes(year, lifeExp)) + 
  geom_line() + 
  ggtitle("Y = ")

nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>% 
  add_predictions(nz_mod) %>%
  ggplot(aes(year, pred)) + 
  geom_line() + 
  ggtitle(expression(beta[0] + beta[1]*x))

nz %>% 
  add_residuals(nz_mod) %>% 
  ggplot(aes(year, resid)) + 
  geom_hline(yintercept = 0, colour = "white", size = 3) + 
  geom_line() + 
  ggtitle(expression(+ epsilon))

Como hacemos para ajustar un modelo por país?

Nested data

Si queremos replicar lo de arriba para todos los países, podríamos:

copiar y pegar el código muchas veces 😡
hacer un loop para que itere sobre el código 😞
crear una función que haga el loop y nos devuelva las cosas de forma prolija 😒
usar un dataframe con nested data y map() 😃

Ahora, en lugar de repetir una acción para cada variable, queremos repetir una acción para cada país, un subconjunto de filas. Para hacer eso, necesitamos el nested dataframe.

by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest()

by_country
#> # A tibble: 142 x 3
#>    country     continent data             
#>    <fct>       <fct>     <list>           
#>  1 Afghanistan Asia      <tibble [12 × 4]>
#>  2 Albania     Europe    <tibble [12 × 4]>
#>  3 Algeria     Africa    <tibble [12 × 4]>
#>  4 Angola      Africa    <tibble [12 × 4]>
#>  5 Argentina   Americas  <tibble [12 × 4]>
#>  6 Australia   Oceania   <tibble [12 × 4]>
#>  7 Austria     Europe    <tibble [12 × 4]>
#>  8 Bahrain     Asia      <tibble [12 × 4]>
#>  9 Bangladesh  Asia      <tibble [12 × 4]>
#> 10 Belgium     Europe    <tibble [12 × 4]>
#> # ... with 132 more rows

Esto crea un dataframe que tiene una fila por grupo (por país) y una columna bastante inusual: data es una lista de dataframes

by_country$data[[1]]
#> # A tibble: 12 x 4
#>     year lifeExp      pop gdpPercap
#>    <int>   <dbl>    <int>     <dbl>
#>  1  1952    28.8  8425333      779.
#>  2  1957    30.3  9240934      821.
#>  3  1962    32.0 10267083      853.
#>  4  1967    34.0 11537966      836.
#>  5  1972    36.1 13079460      740.
#>  6  1977    38.4 14880372      786.
#>  7  1982    39.9 12881816      978.
#>  8  1987    40.8 13867957      852.
#>  9  1992    41.7 16317921      649.
#> 10  1997    41.8 22227415      635.
#> 11  2002    42.1 25268405      727.
#> 12  2007    43.8 31889923      975.

Tengan en cuenta la diferencia entre DF agrupado estándar y un DF anidado:

en un DF agrupado, cada fila es una observación
en un DF anidado, cada fila es un grupo

Otra forma de pensar sobre un conjunto de datos anidados es que ahora tenemos una metaobservación: una fila que representa el curso de tiempo completo para un país, en lugar de un único punto en el tiempo.

List-columns

Con este DF podemos utilizar la función para ajustar los modelos

country_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

Queremos aplicarlo a cada uno de los dataframes anidados. Podemos usar purrr::map() para aplicar la función que definimos a cada elemento

models <- map(by_country$data, country_model)

En lugar de crear un nuevo objeto en el entorno global, vamos a crear una nueva variable en el DF by_country. Ese es un trabajo para dplyr :: mutate ():

by_country <- by_country %>% 
  mutate(model = map(data, country_model))
by_country
#> # A tibble: 142 x 4
#>    country     continent data              model   
#>    <fct>       <fct>     <list>            <list>  
#>  1 Afghanistan Asia      <tibble [12 × 4]> <S3: lm>
#>  2 Albania     Europe    <tibble [12 × 4]> <S3: lm>
#>  3 Algeria     Africa    <tibble [12 × 4]> <S3: lm>
#>  4 Angola      Africa    <tibble [12 × 4]> <S3: lm>
#>  5 Argentina   Americas  <tibble [12 × 4]> <S3: lm>
#>  6 Australia   Oceania   <tibble [12 × 4]> <S3: lm>
#>  7 Austria     Europe    <tibble [12 × 4]> <S3: lm>
#>  8 Bahrain     Asia      <tibble [12 × 4]> <S3: lm>
#>  9 Bangladesh  Asia      <tibble [12 × 4]> <S3: lm>
#> 10 Belgium     Europe    <tibble [12 × 4]> <S3: lm>
#> # ... with 132 more rows

Esto tiene una gran ventaja: debido a que todos los objetos relacionados se almacenan juntos, no es necesario que los mantengas sincronizados manualmente cuando los filtres o organices. La semántica del DF nos ayuda a tener todo organizado:

by_country %>% 
  filter(continent == "Europe")
#> # A tibble: 30 x 4
#>    country                continent data              model   
#>    <fct>                  <fct>     <list>            <list>  
#>  1 Albania                Europe    <tibble [12 × 4]> <S3: lm>
#>  2 Austria                Europe    <tibble [12 × 4]> <S3: lm>
#>  3 Belgium                Europe    <tibble [12 × 4]> <S3: lm>
#>  4 Bosnia and Herzegovina Europe    <tibble [12 × 4]> <S3: lm>
#>  5 Bulgaria               Europe    <tibble [12 × 4]> <S3: lm>
#>  6 Croatia                Europe    <tibble [12 × 4]> <S3: lm>
#>  7 Czech Republic         Europe    <tibble [12 × 4]> <S3: lm>
#>  8 Denmark                Europe    <tibble [12 × 4]> <S3: lm>
#>  9 Finland                Europe    <tibble [12 × 4]> <S3: lm>
#> 10 France                 Europe    <tibble [12 × 4]> <S3: lm>
#> # ... with 20 more rows
by_country %>% 
  arrange(continent, country)
#> # A tibble: 142 x 4
#>    country                  continent data              model   
#>    <fct>                    <fct>     <list>            <list>  
#>  1 Algeria                  Africa    <tibble [12 × 4]> <S3: lm>
#>  2 Angola                   Africa    <tibble [12 × 4]> <S3: lm>
#>  3 Benin                    Africa    <tibble [12 × 4]> <S3: lm>
#>  4 Botswana                 Africa    <tibble [12 × 4]> <S3: lm>
#>  5 Burkina Faso             Africa    <tibble [12 × 4]> <S3: lm>
#>  6 Burundi                  Africa    <tibble [12 × 4]> <S3: lm>
#>  7 Cameroon                 Africa    <tibble [12 × 4]> <S3: lm>
#>  8 Central African Republic Africa    <tibble [12 × 4]> <S3: lm>
#>  9 Chad                     Africa    <tibble [12 × 4]> <S3: lm>
#> 10 Comoros                  Africa    <tibble [12 × 4]> <S3: lm>
#> # ... with 132 more rows

Si la lista de DF y la lista de modelos eran objetos separados, cada vez que se reordena o filtra un elemento, necesitas volver a ordenar o filtrar todos los demás para mantenerlos sincronizados. Sino el código continuará funcionando, ¡pero dará una respuesta incorrecta!s

Unnesting

Antes agregamos los residuos a un modelo, utilizando add_residuals(). Ahora necesitamos hacerlo sobre 142 modelos y 142 datasets: Necesitamos utilizar la función map2()

by_country <- by_country %>% 
  mutate(
    resids = map2(data, model, add_residuals)
  )
by_country
#> # A tibble: 142 x 5
#>    country     continent data              model    resids           
#>    <fct>       <fct>     <list>            <list>   <list>           
#>  1 Afghanistan Asia      <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  2 Albania     Europe    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  3 Algeria     Africa    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  4 Angola      Africa    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  5 Argentina   Americas  <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  6 Australia   Oceania   <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  7 Austria     Europe    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  8 Bahrain     Asia      <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#>  9 Bangladesh  Asia      <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> 10 Belgium     Europe    <tibble [12 × 4]> <S3: lm> <tibble [12 × 5]>
#> # ... with 132 more rows

Para analizar los resultados, volvemos la lista de DF a un DF normal. Previamente utilizamos nest() para convertir un DF normal en un DF anidado, y ahora hacemos lo contrario con unnest():

resids <- unnest(by_country, resids)
resids
#> # A tibble: 1,704 x 7
#>    country     continent  year lifeExp      pop gdpPercap   resid
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779. -1.11  
#>  2 Afghanistan Asia       1957    30.3  9240934      821. -0.952 
#>  3 Afghanistan Asia       1962    32.0 10267083      853. -0.664 
#>  4 Afghanistan Asia       1967    34.0 11537966      836. -0.0172
#>  5 Afghanistan Asia       1972    36.1 13079460      740.  0.674 
#>  6 Afghanistan Asia       1977    38.4 14880372      786.  1.65  
#>  7 Afghanistan Asia       1982    39.9 12881816      978.  1.69  
#>  8 Afghanistan Asia       1987    40.8 13867957      852.  1.28  
#>  9 Afghanistan Asia       1992    41.7 16317921      649.  0.754 
#> 10 Afghanistan Asia       1997    41.8 22227415      635. -0.534 
#> # ... with 1,694 more rows

Noten que unnest me deja las variables agrupadoras, junto con el DF.

Ahora sí podemos graficar los residuos

resids %>% 
  ggplot(aes(year, resid)) +
    geom_line(aes(group = country), alpha = 1 / 3) + 
    geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

podemos facetear por continente

resids %>% 
  ggplot(aes(year, resid, group = country)) +
    geom_line(alpha = 1 / 3) + 
    facet_wrap(~continent)

Parece que hay cierto patrón en los residuos que estamos perdiendonos con el modelo lineal

Muchos Modelos- Clase 2

Diagnóstico del modelo

vamos a utilizar la librería broom para calcular métricas de calidad del modelo. Este paquete forma parte del tidyverse, y nos va a permitir estandarizar el output de muchos modelos, lo que va a faciltar la comparación

Empezamos por usar glance() Para calcular algunas métricas típicas. Si lo aplicamos sobre un modelo, obtenemos un DF deuna fila y una columna para cada métrica

glance(nz_mod)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#> *     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.954         0.949 0.804      205. 5.41e-8     2  -13.3  32.6  34.1
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>

Podemos usar mutate() y unnest() para crear un DF con una fila por país (modelo):

by_country %>% 
  mutate(glnc = map(model, glance)) %>% 
  unnest(glnc)
#> # A tibble: 142 x 16
#>    country continent data  model resids r.squared adj.r.squared sigma
#>    <fct>   <fct>     <lis> <lis> <list>     <dbl>         <dbl> <dbl>
#>  1 Afghan… Asia      <tib… <S3:… <tibb…     0.948         0.942 1.22 
#>  2 Albania Europe    <tib… <S3:… <tibb…     0.911         0.902 1.98 
#>  3 Algeria Africa    <tib… <S3:… <tibb…     0.985         0.984 1.32 
#>  4 Angola  Africa    <tib… <S3:… <tibb…     0.888         0.877 1.41 
#>  5 Argent… Americas  <tib… <S3:… <tibb…     0.996         0.995 0.292
#>  6 Austra… Oceania   <tib… <S3:… <tibb…     0.980         0.978 0.621
#>  7 Austria Europe    <tib… <S3:… <tibb…     0.992         0.991 0.407
#>  8 Bahrain Asia      <tib… <S3:… <tibb…     0.967         0.963 1.64 
#>  9 Bangla… Asia      <tib… <S3:… <tibb…     0.989         0.988 0.977
#> 10 Belgium Europe    <tib… <S3:… <tibb…     0.995         0.994 0.293
#> # ... with 132 more rows, and 8 more variables: statistic <dbl>,
#> #   p.value <dbl>, df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

Si queremos sacarnos de encima las columnas de listas (data,model,resids), necesitamos agregar .drop = TRUE:

glnc <- by_country %>% 
  mutate(glnc = map(model, glance)) %>% 
  unnest(glnc, .drop = TRUE)
glnc
#> # A tibble: 142 x 13
#>    country continent r.squared adj.r.squared sigma statistic  p.value    df
#>    <fct>   <fct>         <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>
#>  1 Afghan… Asia          0.948         0.942 1.22      181.  9.84e- 8     2
#>  2 Albania Europe        0.911         0.902 1.98      102.  1.46e- 6     2
#>  3 Algeria Africa        0.985         0.984 1.32      662.  1.81e-10     2
#>  4 Angola  Africa        0.888         0.877 1.41       79.1 4.59e- 6     2
#>  5 Argent… Americas      0.996         0.995 0.292    2246.  4.22e-13     2
#>  6 Austra… Oceania       0.980         0.978 0.621     481.  8.67e-10     2
#>  7 Austria Europe        0.992         0.991 0.407    1261.  7.44e-12     2
#>  8 Bahrain Asia          0.967         0.963 1.64      291.  1.02e- 8     2
#>  9 Bangla… Asia          0.989         0.988 0.977     930.  3.37e-11     2
#> 10 Belgium Europe        0.995         0.994 0.293    1822.  1.20e-12     2
#> # ... with 132 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>

Podemos empezar a buscar los peores modelos ordenando por \(R^2\)

glnc %>% 
  arrange(r.squared)
#> # A tibble: 142 x 13
#>    country continent r.squared adj.r.squared sigma statistic p.value    df
#>    <fct>   <fct>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>
#>  1 Rwanda  Africa       0.0172      -0.0811   6.56     0.175  0.685      2
#>  2 Botswa… Africa       0.0340      -0.0626   6.11     0.352  0.566      2
#>  3 Zimbab… Africa       0.0562      -0.0381   7.21     0.596  0.458      2
#>  4 Zambia  Africa       0.0598      -0.0342   4.53     0.636  0.444      2
#>  5 Swazil… Africa       0.0682      -0.0250   6.64     0.732  0.412      2
#>  6 Lesotho Africa       0.0849      -0.00666  5.93     0.927  0.358      2
#>  7 Cote d… Africa       0.283        0.212    3.93     3.95   0.0748     2
#>  8 South … Africa       0.312        0.244    4.74     4.54   0.0588     2
#>  9 Uganda  Africa       0.342        0.276    3.19     5.20   0.0457     2
#> 10 Congo,… Africa       0.348        0.283    2.43     5.34   0.0434     2
#> # ... with 132 more rows, and 5 more variables: logLik <dbl>, AIC <dbl>,
#> #   BIC <dbl>, deviance <dbl>, df.residual <int>

Podemos confirmar la intuición de que el modelo lineal no esta funcionando bien para Africa.

Graficamente:

glnc %>% 
  ggplot(aes(continent, r.squared, fill = continent)) + 
  geom_boxplot()+
  theme(legend.position = "none")

glnc %>% 
  ggplot(aes(continent, r.squared)) + 
    geom_jitter(width = 0.5)

Podemos seleccionar los 5 peores modelos y graficarlos

#bad_fit <- filter(glnc, r.squared < 0.25)
bad_fit <- filter(glnc, rank(r.squared) < 5)

gapminder %>% 
  semi_join(bad_fit, by = "country") %>% 
  ggplot(aes(year, lifeExp, colour = country)) +
    geom_line()

Aquí vemos dos efectos principales: las tragedias de la epidemia de VIH / SIDA y el genocidio de Ruanda.

The broom package provides three general tools for turning models into tidy data frames:

Tidy

El comando tidy(model) nos permite obtener la salida del modelo lineal de forma prolija.

En el caso particular de un modelo. El output tradicional se vería de esta forma

nz_mod <- lm(lifeExp ~ year, data = nz)

summary(nz_mod)
#> 
#> Call:
#> lm(formula = lifeExp ~ year, data = nz)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.28745 -0.63700  0.06345  0.64442  0.91192 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -307.69963   26.63039  -11.55 4.17e-07 ***
#> year           0.19282    0.01345   14.33 5.41e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8043 on 10 degrees of freedom
#> Multiple R-squared:  0.9536, Adjusted R-squared:  0.9489 
#> F-statistic: 205.4 on 1 and 10 DF,  p-value: 5.407e-08

Tenemos para intercept y year un valor estimado, error estandar y el p-value para ver su significatividad.

A la mandera tidy

tidy(nz_mod)
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept) -308.      26.6        -11.6 0.000000417 
#> 2 year           0.193    0.0135      14.3 0.0000000541

De esta forma recuperamos las métricas que refieren a los coeficientes en un formato consistente. Esto es sumamente útil si queremos trabajar con varios modelos a la vez, o automatizar un pipline de trabajo con una salida de un modelo.

Por ejemplo, obtengamos estos valores para todas las rergesiones. Para ello, necesitamos mapear los modelos con la función tidy

tdy <- by_country %>% 
  mutate(tdy = map(model, tidy)) %>% 
  unnest(tdy, .drop = TRUE)
tdy
#> # A tibble: 284 x 7
#>    country     continent term         estimate std.error statistic  p.value
#>    <fct>       <fct>     <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 Afghanistan Asia      (Intercept)  -508.     40.5        -12.5  1.93e- 7
#>  2 Afghanistan Asia      year            0.275   0.0205      13.5  9.84e- 8
#>  3 Albania     Europe    (Intercept)  -594.     65.7         -9.05 3.94e- 6
#>  4 Albania     Europe    year            0.335   0.0332      10.1  1.46e- 6
#>  5 Algeria     Africa    (Intercept) -1068.     43.8        -24.4  3.07e-10
#>  6 Algeria     Africa    year            0.569   0.0221      25.7  1.81e-10
#>  7 Angola      Africa    (Intercept)  -377.     46.6         -8.08 1.08e- 5
#>  8 Angola      Africa    year            0.209   0.0235       8.90 4.59e- 6
#>  9 Argentina   Americas  (Intercept)  -390.      9.68       -40.3  2.14e-12
#> 10 Argentina   Americas  year            0.232   0.00489     47.4  4.22e-13
#> # ... with 274 more rows

Noten que para cada país ahora tenemos dos filas, una para cada coeficiente del modelo.

Con esta tabla, ahora podríamos graficar los coeficientes del modelo.

Para un sólo modelo (nz_mod):

tidy(nz_mod) %>% mutate(low = estimate - std.error,
                          high = estimate + std.error) %>% 
ggplot(., aes(estimate, term, xmin = low, xmax = high, height = 0)) +
     geom_point() +
     geom_vline(xintercept = 0) +
     geom_errorbarh()

En este gráfico tenemos graficado cada uno de los parámetros del modelo, con +-1 desvío estandar. si quisieramos comparar todos los modelos.

tdy %>% mutate(low = estimate - std.error,
                          high = estimate + std.error) %>% 
ggplot(., aes(estimate, term, xmin = low, xmax = high, height = 0, color = country)) +
     geom_point() +
     geom_vline(xintercept = 0) +
     geom_errorbarh()+
  theme(legend.position = "none")

¿cómo interpretan estos coeficientes? ¿qué representa el intercepto de cada modelo?

augment

la función lm() también tiene una serie de gráficos para evaluar la calidad del modelo que pueden ser muy útiles para encontrar si un modelo esta mal calibrado.

plot(nz_mod)

Estos gráficos son:

Residuos versus el modelo ajustado: El objetivo es que no veamos una estructura clara. Si así la hubiera, esto implicaría que hay una parte sistemática del fenómeno que se esta perdiendo.
Normal QQ plot: sirve para ver si los datos siguen una distribución teórica, en este caso, la \(\sim N(0,1)\). Los residuos estandarizados, si el modelo esta bien definido, deberían seguir esta distribución
Scale-location plot: Similar al primer gráfico, pero utilizando la raiz cuadrada de los residuos estandarizados. De la misma forma que el anterior, buscamos que no haya una estructura en los residuos.
scale vs leverage: El leverage mide cuan influyentes son los puntos en el modelo. Si en este gráfico algunas observacione aparecen muy separadas, con mucho leverage y residuo, significa que cambian mucho al modelo.

La función augment(model, data) nos permite recuperar la información necesaria para estos gráficos de forma sistemática.

au <- augment(nz_mod,nz)
au
#> # A tibble: 12 x 13
#>    country continent  year lifeExp    pop gdpPercap .fitted .se.fit  .resid
#>  * <fct>   <fct>     <int>   <dbl>  <int>     <dbl>   <dbl>   <dbl>   <dbl>
#>  1 New Ze… Oceania    1952    69.4 1.99e6    10557.    68.7   0.437  0.703 
#>  2 New Ze… Oceania    1957    70.3 2.23e6    12247.    69.7   0.381  0.609 
#>  3 New Ze… Oceania    1962    71.2 2.49e6    13176.    70.6   0.331  0.625 
#>  4 New Ze… Oceania    1967    71.5 2.73e6    14464.    71.6   0.287 -0.0592
#>  5 New Ze… Oceania    1972    71.9 2.93e6    16046.    72.5   0.253 -0.653 
#>  6 New Ze… Oceania    1977    72.2 3.16e6    16234.    73.5   0.235 -1.29  
#>  7 New Ze… Oceania    1982    73.8 3.21e6    17632.    74.5   0.235 -0.632 
#>  8 New Ze… Oceania    1987    74.3 3.32e6    19007.    75.4   0.253 -1.12  
#>  9 New Ze… Oceania    1992    76.3 3.44e6    18363.    76.4   0.287 -0.0698
#> 10 New Ze… Oceania    1997    77.6 3.68e6    21050.    77.4   0.331  0.186 
#> 11 New Ze… Oceania    2002    79.1 3.91e6    23190.    78.3   0.381  0.782 
#> 12 New Ze… Oceania    2007    80.2 4.12e6    25185.    79.3   0.437  0.912 
#> # ... with 4 more variables: .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#> #   .std.resid <dbl>


ggplot(au, aes(.fitted, .resid)) +
  geom_point()+
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(au, aes(sample= .std.resid))+
  stat_qq()+
  geom_abline()

ggplot(au, aes(.fitted, sqrt(abs(.std.resid))))+
  geom_point()+
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(au, aes(.hat, .std.resid)) +
  geom_vline(size = 2, colour = "white", xintercept = 0) +
  geom_hline(size = 2, colour = "white", yintercept = 0) +
  geom_point() + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

De la misma forma que con las otras funciones, podemos utilizar map2(model, data, augment) para calcular esto para todos nuestros modelos en simultaneo.

augment <- by_country %>%
  mutate(augment = map2( model,data,augment)) %>% 
  unnest(augment, .drop = TRUE)
augment
#> # A tibble: 1,704 x 13
#>    country continent  year lifeExp    pop gdpPercap .fitted .se.fit  .resid
#>    <fct>   <fct>     <int>   <dbl>  <int>     <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Afghan… Asia       1952    28.8 8.43e6      779.    29.9   0.664 -1.11  
#>  2 Afghan… Asia       1957    30.3 9.24e6      821.    31.3   0.580 -0.952 
#>  3 Afghan… Asia       1962    32.0 1.03e7      853.    32.7   0.503 -0.664 
#>  4 Afghan… Asia       1967    34.0 1.15e7      836.    34.0   0.436 -0.0172
#>  5 Afghan… Asia       1972    36.1 1.31e7      740.    35.4   0.385  0.674 
#>  6 Afghan… Asia       1977    38.4 1.49e7      786.    36.8   0.357  1.65  
#>  7 Afghan… Asia       1982    39.9 1.29e7      978.    38.2   0.357  1.69  
#>  8 Afghan… Asia       1987    40.8 1.39e7      852.    39.5   0.385  1.28  
#>  9 Afghan… Asia       1992    41.7 1.63e7      649.    40.9   0.436  0.754 
#> 10 Afghan… Asia       1997    41.8 2.22e7      635.    42.3   0.503 -0.534 
#> # ... with 1,694 more rows, and 4 more variables: .hat <dbl>,
#> #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>

Estas notas estan basadas en Hadley Wickham and Garrett Grolemund. 2017. R for Data Science Import, Tidy, Transform, Visualize, and Model Data (1st ed.). O’Reilly Media, Inc. http://r4ds.had.co.nz.↩

Comparación de modelos1