library(dplyr)
library(ggplot2)
library(prophet)
library(lubridate)
El dataset consiste en información real de 8 meses de visitantes de un local de ropa localizado en un shopping de gran tamaño
shopping <- read.csv('local-shopping_prophet.csv') %>%
rename(.,visitantes=Total.de.Visitantes) %>%
mutate(Dia = ymd(Dia))
glimpse(shopping)
Observations: 228
Variables: 3
$ Dia [3m[38;5;246m<date>[39m[23m 2017-09-14, 2017-09-15, 2017-09-16, 2017-09-17, 2017-09-18, 2017-09-19, 2017-09-20, 2017-09-21, 2017-09-22…
$ visitantes [3m[38;5;246m<int>[39m[23m 589, 696, 1034, 940, 540, 526, 312, 553, 639, 793, 745, 8, 333, 568, 352, 393, 497, 578, 563, 520, 413, 8, …
$ Clima [3m[38;5;246m<fct>[39m[23m Parcialmente Nublado, Soleado, Soleado, Parcialmente Nublado, Soleado, Soleado, Soleado, Nublado, Soleado, …
Se observa que la serie de tiempo presenta un comportamiento cíclico con algunos eventos atípicos:
ggplot(shopping, aes(Dia, visitantes)) + geom_line() + theme_bw() + labs(title='Visitantes por dia')
ggplot(shopping, aes(Dia, visitantes)) + geom_line() + geom_point(color='forestgreen') + theme_bw() + labs(title='Visitantes por dia')
Una primera aproximación puede ser tratar de utilizar técnicas de suavizado.
En este caso estamos utilizando loess como una herramienta gráfica con el comando geom_smooth()
de ggplot.
ggplot(shopping, aes(Dia, visitantes)) + geom_point(color='forestgreen') + geom_smooth() + theme_bw() + labs(title='Visitantes por dia: Suavizado')
Si bien está suavizando, el resultado es bastante pobre ya que el modelo no está logrando representar bien la variabilidad de los datos.
Ahora realizamos 4 modelos de LOESS modificando la ventana (span) para cada uno de ellos.
loess=stats::loess(visitantes~as.numeric(Dia), data = shopping, na.action = 'na.exclude', model = T, span=0.01)
shopping['loess']=predict(loess,shopping)
loess_1 = ggplot(shopping, aes(Dia,visitantes)) + geom_point() + geom_line(aes(y=loess), color='firebrick', size=1) +
labs(title= "LOESS span:0.01") + theme_bw()
loess=stats::loess(visitantes~as.numeric(Dia), data = shopping, na.action = 'na.exclude', model = T, span=0.25)
shopping['loess']=predict(loess,shopping)
loess_2 = ggplot(shopping, aes(Dia,visitantes)) + geom_point() + geom_line(aes(y=loess), color='forestgreen', size=1) +
labs(title= "LOESS span:0.25") + theme_bw()
loess=stats::loess(visitantes~as.numeric(Dia), data = shopping, na.action = 'na.exclude', model = T, span=0.5)
shopping['loess']=predict(loess,shopping)
loess_3 = ggplot(shopping, aes(Dia,visitantes)) + geom_point() + geom_line(aes(y=loess), color='steelblue', size=1) +
labs(title= "LOESS span:0.50") + theme_bw()
loess=stats::loess(visitantes~as.numeric(Dia), data = shopping, na.action = 'na.exclude', model = T, span=0.75)
shopping['loess']=predict(loess,shopping)
loess_4 = ggplot(shopping, aes(Dia,visitantes)) + geom_point() + geom_line(aes(y=loess), color='purple', size=1) +
labs(title= "LOESS span:0.75") + theme_bw()
cowplot::plot_grid(loess_1, loess_2, loess_3,loess_4)
Se nota que para un span muy pequeño nuestro modelo va a tener un problema de OVERFITTING: no va a poder generalizar. Mientras que para spans grandes tenemos un problema de UNDERFITTING: nuestro modelo no logra captar bien la variabilidad del fenómeno.
Vamos a trabajar sobre la implementación de Prophet en R.
Prophet requiere que le pasemos el dataset con:
ds: la variable temporal
y: la variable a predecir
# Eliminamos observaciones con menos de 250 visitantes (cuestion de negocio)
shopping[shopping['visitantes']<250,'visitantes'] = NA
# Creamos el dataset
prophet_df = shopping %>% select(Dia, visitantes) %>% rename(., ds=Dia, y=visitantes)
Recordemos que el modelo subyacente es el propuesto por Harvey & Peters (1990):
y(t)=g(t)+s(t)+h(t)+ϵt
Hay muchos parametros para tener en cuenta. Veremos algunos:
df: dataframe
growth: tipo de tendencia: lineal o logistica
yearly.seasonality: hay estacionalidad anual?
yearly.seasonality: hay estacionalidad diaria?
holidays: dataframe con fechas de vacaciones/eventos especiales
La funcion prophet
crea el modelo, podemos pasarle o no el dataframe.
La funcion prophet.fit
aplica un modelo creado a un dataframe
# Llamamos solo al modelo
prophet_base=prophet()
# Le pasamos el dataset
prophet_base = fit.prophet(m = prophet_base, prophet_df)
Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
Initial log joint probability = -4.68957
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Notemos que el modelo automaticamente deshabilita la estacionalidad anual y diaria.
Llamando a plot
obtenemos el valor predicho del modelo y el valor original.
Es dificil de notar pero el modelo realiza predicciones aun para los dias en los cuales no hay datos.
plot(prophet_base,fcst=predict(prophet_base, prophet_df)) +theme_bw()
La funcion prophet_plot_components
nos devuelve los efectos de los componentes en nuestra variable a predecir
prophet_plot_components(prophet_base, fcst=predict(prophet_base, prophet_df)) +theme_bw()
NULL
En este caso tenemos la tendencia de los visitantes y la estacionalidad semanal. Esta última parece tener bastante sentido ya que nos muestra que hay mayor cantidad de visitantes durante el fin de semana.
La funcion cross_validation
permite realizar pronosticos realizando un esquema de cross-validation temporal y, a partir de ellos, obtener ciertas metricas de performance.
horizon: horizonte del pronostico. Cuánto tiempo deseo predecir
period: periodo entre fechas de analisis. Cuánto voy corriendo mi ventana
initial: periodo inicial de entrenamiento
cv_base = cross_validation(prophet_base, initial = 45, period = 7, horizon = 15, units = 'days')
Making 24 forecasts with cutoffs between 2017-11-04 and 2018-04-14
Initial log joint probability = -3.06265
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -2.89179
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.42994
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -2.99569
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.58187
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.40797
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -6.14203
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -6.67306
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.38137
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.21596
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.43862
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.36873
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.43537
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.5117
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.60968
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.69348
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.77909
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.8066
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -4.02179
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -4.02246
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -3.99051
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -4.04036
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
Initial log joint probability = -4.27091
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Initial log joint probability = -4.33551
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
cv_base
ds <S3: POSIXct> | y <dbl> | yhat <dbl> | yhat_lower <dbl> | yhat_upper <dbl> | cutoff <S3: POSIXct> |
---|---|---|---|---|---|
2017-11-05 | 710 | 828.8292 | 668.65151 | 992.5507 | 2017-11-04 |
2017-11-06 | 468 | 583.6559 | 415.75243 | 739.8342 | 2017-11-04 |
2017-11-07 | 439 | 502.5868 | 343.06530 | 673.8228 | 2017-11-04 |
2017-11-08 | 502 | 490.4625 | 306.77759 | 667.1090 | 2017-11-04 |
2017-11-09 | 500 | 591.0459 | 418.96071 | 744.3992 | 2017-11-04 |
2017-11-10 | 564 | 639.5798 | 470.66999 | 822.2510 | 2017-11-04 |
2017-11-11 | 771 | 863.4232 | 702.26621 | 1038.8670 | 2017-11-04 |
2017-11-12 | 727 | 835.4909 | 668.37954 | 1003.9611 | 2017-11-04 |
2017-11-13 | 540 | 590.3177 | 417.70471 | 767.3952 | 2017-11-04 |
2017-11-14 | 378 | 509.2485 | 341.35449 | 676.0495 | 2017-11-04 |
La funcion performance_metrics
computa varias metricas de performance a partir de un dataframe de cross validation de prophet
performance_metrics(cv_base, rolling_window = 0.5)
horizon <time> | mse <dbl> | rmse <dbl> | mae <dbl> | mape <dbl> | coverage <dbl> | |
---|---|---|---|---|---|---|
166 | 8 days | 44693.53 | 211.4084 | 149.1865 | 0.2336534 | 0.7869822 |
181 | 8 days | 44728.63 | 211.4914 | 149.3212 | 0.2337456 | 0.7869822 |
196 | 8 days | 44999.90 | 212.1318 | 150.2154 | 0.2354206 | 0.7869822 |
211 | 8 days | 44992.96 | 212.1154 | 150.1856 | 0.2356882 | 0.7869822 |
226 | 8 days | 44978.56 | 212.0815 | 150.0730 | 0.2355448 | 0.7869822 |
241 | 8 days | 44426.89 | 210.7769 | 148.3514 | 0.2340338 | 0.7928994 |
256 | 8 days | 44661.22 | 211.3320 | 149.5046 | 0.2354395 | 0.7928994 |
271 | 8 days | 42546.82 | 206.2688 | 146.5958 | 0.2339764 | 0.7988166 |
286 | 8 days | 41723.58 | 204.2635 | 144.6489 | 0.2317360 | 0.8047337 |
301 | 8 days | 41406.61 | 203.4861 | 143.6888 | 0.2306376 | 0.8047337 |
La funcion add_seasonality
nos permite agregar nuevas estacionalidades. Las estacionalidades se modelan utilizando series de Fourier que nosotros debemos definir.
Definimos:
m: modelo
name: nombre de la estacionalidad
period: cantidad de dias del periodo
fourier.order: orden de la serie de fourier para modelar la estacionalidad
# Llamamos solo al modelo
prophet_mensual=prophet()
# Agregamos la estacionalidad mensual
prophet_mensual=add_seasonality(prophet_mensual, name='monthly', period=365/12, fourier.order = 4)
# Le pasamos el dataset
prophet_mensual = fit.prophet(m = prophet_mensual, prophet_df)
Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
Initial log joint probability = -4.68957
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Notemos que sigue deshabilitando la estacionalidad anual y diaria.
plot(prophet_mensual,fcst=predict(prophet_mensual, prophet_df)) +theme_bw()
La funcion prophet_plot_components
nos devuelve los efectos de los componentes en nuestra variable a predecir
prophet_plot_components(prophet_mensual, fcst=predict(prophet_mensual, prophet_df)) +theme_bw()
NULL
Notemos que la tendencia y estacionalidad semanal se mantiene y ahora se agrega la tendencia mensual que habíamos incorporado. Nos indica que el ciclo mensual se caracteriza por un crecimiento en la mitad de mes.
ds <S3: POSIXct> | y <dbl> | yhat <dbl> | yhat_lower <dbl> | yhat_upper <dbl> | cutoff <S3: POSIXct> |
---|---|---|---|---|---|
2017-11-05 | 710 | 825.3621 | 695.33891 | 952.6816 | 2017-11-04 |
2017-11-06 | 468 | 550.9673 | 426.19967 | 681.3129 | 2017-11-04 |
2017-11-07 | 439 | 467.0597 | 326.26020 | 596.3100 | 2017-11-04 |
2017-11-08 | 502 | 507.5143 | 375.78788 | 628.5371 | 2017-11-04 |
2017-11-09 | 500 | 678.0770 | 546.52351 | 804.6335 | 2017-11-04 |
2017-11-10 | 564 | 822.4006 | 681.94426 | 947.1849 | 2017-11-04 |
2017-11-11 | 771 | 1099.4084 | 971.63202 | 1230.5643 | 2017-11-04 |
2017-11-12 | 727 | 1071.5801 | 950.47095 | 1197.8239 | 2017-11-04 |
2017-11-13 | 540 | 789.4764 | 668.19689 | 912.1604 | 2017-11-04 |
2017-11-14 | 378 | 650.3806 | 518.90610 | 790.6199 | 2017-11-04 |
La funcion performance_metrics
computa varias metricas de performance a partir de un dataframe de cross validation de prophet
performance_metrics(cv_mensual, rolling_window = 0.5)
horizon <time> | mse <dbl> | rmse <dbl> | mae <dbl> | mape <dbl> | coverage <dbl> | |
---|---|---|---|---|---|---|
166 | 8 days | 51269.27 | 226.4272 | 161.6261 | 0.2585572 | 0.6982249 |
181 | 8 days | 51218.07 | 226.3141 | 161.3472 | 0.2581174 | 0.6982249 |
196 | 8 days | 51505.86 | 226.9490 | 161.8643 | 0.2594592 | 0.6982249 |
211 | 8 days | 51379.93 | 226.6714 | 161.3474 | 0.2590916 | 0.7041420 |
226 | 8 days | 51398.46 | 226.7123 | 161.4917 | 0.2592958 | 0.7041420 |
241 | 8 days | 50989.65 | 225.8089 | 160.2577 | 0.2583991 | 0.7100592 |
256 | 8 days | 51085.61 | 226.0213 | 160.7386 | 0.2589998 | 0.7100592 |
271 | 8 days | 49484.62 | 222.4514 | 158.3411 | 0.2579350 | 0.7159763 |
286 | 8 days | 48842.41 | 221.0032 | 156.4219 | 0.2555808 | 0.7218935 |
301 | 8 days | 48474.58 | 220.1694 | 155.5300 | 0.2546266 | 0.7278107 |
Como último paso vamos a agregar las ventas de navidad y ciertos dias de promociones como eventos especiales
Creamos el dataframe de eventos con: nombre del evento, fechas y una “ventana” para definir si el evento se estira a ciertos dias.
# Navidad
christmas = data.frame(holiday= 'christmas',
ds=ymd(c('2017-12-16','2017-12-17','2017-12-18',
'2017-12-19','2017-12-20','2017-12-21',
'2017-12-22','2017-12-23')),
lower_window= 0,
upper_window= 0)
# Promociones
big_sales = data.frame(
holiday= 'big_sales',
ds= ymd(c('2017-09-16','2017-10-08','2017-10-14',
'2017-11-20','2017-12-03','2017-12-30')),
lower_window= 0,
upper_window= 0)
holidays= bind_rows(christmas, big_sales)
glimpse(holidays)
Observations: 14
Variables: 4
$ holiday [3m[38;5;246m<chr>[39m[23m "christmas", "christmas", "christmas", "christmas", "christmas", "christmas", "christmas", "christmas", "…
$ ds [3m[38;5;246m<date>[39m[23m 2017-12-16, 2017-12-17, 2017-12-18, 2017-12-19, 2017-12-20, 2017-12-21, 2017-12-22, 2017-12-23, 2017-09-…
$ lower_window [3m[38;5;246m<dbl>[39m[23m 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ upper_window [3m[38;5;246m<dbl>[39m[23m 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
# Llamamos al modelo con el dataset de eventos
prophet_full=prophet(holidays = holidays)
# Agregamos la estacionalidad mensual
prophet_full=add_seasonality(prophet_full, name='monthly', period=30.5, fourier.order = 4)
# Le pasamos el dataset
prophet_full = fit.prophet(m = prophet_full, prophet_df)
Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
Initial log joint probability = -4.68957
Optimization terminated normally:
Convergence detected: absolute parameter change was below tolerance
Fijense que el modelo automaticamente deshabilita la estacionalidad anual y diaria.
plot(prophet_full,fcst=predict(prophet_full, prophet_df)) +theme_bw()
La funcion prophet_plot_components
nos devuelve los efectos de los componentes en nuestra variable a predecir
prophet_plot_components(prophet_full, fcst=predict(prophet_full, prophet_df)) +theme_bw()
NULL
La tendencia y estacionalidad semanal se mantienen aproximadamente igual. La estacionalidad mensual cambia bastante y se observa que existen tres picos en el ciclo mensual. Por su parte, los eventos se modelan como pequeños saltos o picos.
ds <S3: POSIXct> | y <dbl> | yhat <dbl> | yhat_lower <dbl> | yhat_upper <dbl> | cutoff <S3: POSIXct> |
---|---|---|---|---|---|
2017-11-05 | 710 | 783.1627 | 671.3485 | 897.8747 | 2017-11-04 |
2017-11-06 | 468 | 521.4809 | 406.4339 | 626.5506 | 2017-11-04 |
2017-11-07 | 439 | 402.6993 | 293.5229 | 513.6699 | 2017-11-04 |
2017-11-08 | 502 | 430.6764 | 316.7475 | 540.7213 | 2017-11-04 |
2017-11-09 | 500 | 619.2535 | 513.2782 | 734.1938 | 2017-11-04 |
2017-11-10 | 564 | 792.2517 | 676.5184 | 905.0437 | 2017-11-04 |
2017-11-11 | 771 | 1018.5607 | 912.6347 | 1127.5350 | 2017-11-04 |
2017-11-12 | 727 | 1030.4046 | 927.1455 | 1144.5564 | 2017-11-04 |
2017-11-13 | 540 | 777.6664 | 662.4209 | 874.5959 | 2017-11-04 |
2017-11-14 | 378 | 606.8106 | 495.0995 | 709.8922 | 2017-11-04 |
performance_metrics(cv_full, rolling_window = 0.5)
horizon <time> | mse <dbl> | rmse <dbl> | mae <dbl> | mape <dbl> | coverage <dbl> | |
---|---|---|---|---|---|---|
166 | 8 days | 22962.02 | 151.5322 | 113.6740 | 0.1935432 | 0.6863905 |
181 | 8 days | 22948.59 | 151.4879 | 113.5696 | 0.1933579 | 0.6863905 |
196 | 8 days | 22896.71 | 151.3166 | 113.4371 | 0.1935369 | 0.6863905 |
211 | 8 days | 22897.76 | 151.3201 | 113.4409 | 0.1939022 | 0.6863905 |
226 | 8 days | 22878.26 | 151.2556 | 113.2004 | 0.1935826 | 0.6863905 |
241 | 8 days | 22798.06 | 150.9903 | 112.5861 | 0.1930732 | 0.6863905 |
256 | 8 days | 23041.29 | 151.7936 | 113.7577 | 0.1945015 | 0.6804734 |
271 | 8 days | 22888.22 | 151.2885 | 113.3264 | 0.1947925 | 0.6804734 |
286 | 8 days | 22888.03 | 151.2879 | 113.3228 | 0.1948820 | 0.6804734 |
301 | 8 days | 22903.57 | 151.3393 | 113.4189 | 0.1951136 | 0.6804734 |
Prophet también nos permite realizar gráficos interactivos que suelen ser muy útiles para presentar los resultados.
dyplot.prophet(prophet_full, fcst=predict(prophet_full, prophet_df))