{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "f6e52e61e4ba3ddcbd3d72403493e9553617c06f"
   },
   "source": [
    "# Time Series Modeling\n",
    "\n",
    "Many quantities are time dependent: today's value often has a meaningful relationship to values observed in the past.\n",
    "\n",
    "Examples include product demand over a given period, commodity harvests, stock prices and, of course, what we will try to predict here: the temperature in Rio De Janeiro.\n",
    "\n",
    "There are several families of time series forecasting models. In this notebook I will use [Seasonal ARIMA models](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average).\n",
    "\n",
    "First we need to import the essential libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "import numpy as np \n",
    "import pandas as pd \n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.tsa.stattools import adfuller\n",
    "from statsmodels.graphics.tsaplots import plot_acf, plot_pacf\n",
    "# from sklearn.metrics import mean_squared_error\n",
    "from math import sqrt\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a"
   },
   "outputs": [],
   "source": [
    "# Reading and transforming the data file\n",
    "url=\"https://github.com/jekyll-one/nbinteract-notebooks/raw/main/data/GlobalLandTemperaturesByMajorCity.csv.zip\"\n",
    "cities = pd.read_csv(url)\n",
    "#cities = pd.read_csv('../input/earth-surface-temperature-data/GlobalLandTemperaturesByCity.csv.zip')\n",
    "rio = cities.loc[cities['City'] == 'Rio De Janeiro', ['dt','AverageTemperature']]\n",
    "rio.columns = ['Date','Temp']\n",
    "rio['Date'] = pd.to_datetime(rio['Date'])\n",
    "rio.reset_index(drop=True, inplace=True)\n",
    "rio.set_index('Date', inplace=True)\n",
    "\n",
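    "# Consider the temperatures from 1900 until the end of 2012\n",
    "rio = rio.loc['1900':'2013-01-01']\n",
    "# The source dates are month starts, so asfreq('M') relabels them to month ends and back-fills\n",
    "# each one from the next observation (newer pandas versions spell this alias 'ME')\n",
    "rio = rio.asfreq('M', method='bfill')\n",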
    "rio.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7a0b92284f4165be470bb19c024caa647070aa15"
   },
   "source": [
    "Below is a brief explanation of ARIMA models:\n",
    "\n",
    "# <font color=green>SARIMA Model (p, d, q)(P, D, Q, S)</font>:\n",
    "SARIMA stands for Seasonal Autoregressive Integrated Moving Average. The name is intimidating, but the model is not as scary as it seems.\n",
    "\n",
    "## <font color=green>Non seasonal ARIMA</font>:\n",
    "\n",
    "We can split the ARIMA term into three parts, AR, I and MA:\n",
    "\n",
    " * **AR(p)** stands for *autoregressive model*. The `p` parameter is an integer that specifies how many lagged values of the series are used to forecast the periods ahead. Example:\n",
    "     * Yesterday's average temperature is highly correlated with today's, so we would use an AR(1) term to forecast future temperatures.\n",
    "     * The formula for the AR(p) model is: $\\hat{y}_{t} = \\mu + \\theta_{1}Y_{t-1} + ... + \\theta_{p}Y_{t-p}$ Where $\\mu$ is the constant term, **p** is the number of lagged periods used in the regression and $\\theta$ are the parameters fitted to the data.\n",
    "     \n",
    " * **I(d)** is the differencing part. The `d` parameter specifies how many orders of differencing are applied in an attempt to make the series stationary. Example:\n",
    " \n",
    "     * Yesterday I sold 10 items of a product and today I sold 14; the \"I\" in this case is just the first difference, which is +4. If you work with the logarithm of the series, this difference approximates the percentage change. \n",
    "     * If d = 1: $y_{t} = Y_{t} - Y_{t-1}$ where $y_{t}$ is the differenced series and $Y_{t-1}$ is the original series lagged by one period\n",
    "     * If d = 2: $y_{t} = (Y_{t} - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_{t} - 2Y_{t-1} + Y_{t-2}$\n",
    "     * Note that the second difference is a change-in-change, which is a measure of the local \"acceleration\" rather than trend.\n",
    "\n",
    "* **MA(q)** stands for *moving average model*. The `q` parameter is the number of lagged forecast error terms in the prediction equation. Example:\n",
    "     * It may seem strange, but the MA term uses a fraction of the past errors (the differences between the real and the predicted values) to correct the forecast. It assumes that past errors carry information about upcoming values.\n",
    "     * The formula for the MA(q) model is: $\\hat{y}_{t} = \\mu - \\Theta_{1}e_{t-1} - ... - \\Theta_{q}e_{t-q}$ Where $\\mu$ is the constant term, **q** is the number of lagged error terms $e$ and $\\Theta$ are the parameters fitted to the errors\n",
    "     * The error equation is $e_{t} = Y_{t} - \\hat{y}_{t}$\n",
    "     \n",
    "## <font color=green>Seasonal ARIMA</font>:\n",
    "\n",
    "The **P, D, Q** parameters are capitalized to distinguish them from the non-seasonal parameters.\n",
    "\n",
    "* **SAR(P)** is the seasonal autoregression of the series.\n",
    "    * The formula for the SAR(P) model with P = 1 is: $\\hat{y}_{t} = \\mu + \\theta_{1}Y_{t-s}$ Where P is the number of seasonal autoregressive terms (usually no more than 1), **s** is the seasonal lag, i.e. how many periods back the reference value lies, and $\\theta$ is the parameter fitted to the data.\n",
    "    * In weather forecasting, the value from 12 months ago usually carries information about the current period.\n",
    "    * Setting P=1 (i.e., SAR(1)) adds a multiple of $Y_{t-s}$ to the forecast for $y_{t}$\n",
    "    \n",
    "* **I(D)** is the seasonal difference, which MUST be used when you have a strong and stable seasonal pattern.\n",
    "     * If d = 0 and D = 1: $y_{t} = Y_{t} - Y_{t-s}$ where $y_{t}$ is the differenced series and $Y_{t-s}$ is the original series lagged by one seasonal period.\n",
    "     * If d = 1 and D = 1: $y_{t} = (Y_{t} - Y_{t-1}) - (Y_{t-s} - Y_{t-s-1}) = Y_{t} - Y_{t-1} - Y_{t-s} + Y_{t-s-1}$\n",
    "     * D should never be more than 1, and d+D should never be more than 2. Also, if d+D = 2, the constant term should be suppressed.\n",
    "     \n",
    "* **SMA(Q)** is the seasonal moving average of the series.\n",
    "     * Setting Q=1 (i.e., SMA(1)) adds a multiple of the error $e_{t-s}$ to the forecast for $y_{t}$\n",
    "\n",
    "\n",
    "* **S** is the seasonal period over which the P, D and Q terms are calculated. If there is a 52-week seasonal correlation, 52 is the number to use for the 'S' parameter.\n",
    "\n",
    "## <font color=green>Trend</font>:\n",
    "\n",
    "We will use [SARIMAX](https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html) to create a forecast. Its `trend` parameter accepts the following values:\n",
    "\n",
    " * `'n'` when there is no trend to be used (default).\n",
    " * `'c'` indicates a constant (i.e. a degree zero component of the trend polynomial)\n",
    " * `'t'` indicates a linear trend with time\n",
    " * `'ct'` is both trend and constant. \n",
    " * Can also be specified as an iterable defining the polynomial as in numpy.poly1d, where `[1,1,0,1]` would denote $a + bt + ct^3$\n"
   ]
  },
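  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the differencing formulas above concrete, here is a minimal sketch on a made-up toy series. The series, its period `s = 4` and all variable names are assumptions for illustration and are not part of the Rio data. It checks that a seasonal difference followed by a first difference matches the expanded formula for d = 1 and D = 1:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical toy series: a linear trend plus a repeating pattern of period s = 4\n",
    "s = 4\n",
    "t = np.arange(12)\n",
    "Y = 0.5 * t + np.tile([2.0, 0.0, -2.0, 0.0], 3)\n",
    "\n",
    "# d = 1: y_t = Y_t - Y_{t-1}\n",
    "first_diff = Y[1:] - Y[:-1]\n",
    "\n",
    "# D = 1: y_t = Y_t - Y_{t-s}\n",
    "seasonal_diff = Y[s:] - Y[:-s]\n",
    "\n",
    "# d = 1 and D = 1: first difference of the seasonal difference\n",
    "both_diff = seasonal_diff[1:] - seasonal_diff[:-1]\n",
    "\n",
    "# Same quantity from the expanded formula Y_t - Y_{t-1} - Y_{t-s} + Y_{t-s-1}\n",
    "expanded = Y[s + 1:] - Y[s:-1] - Y[1:-s] + Y[:-s - 1]\n",
    "\n",
    "print(seasonal_diff)\n",
    "print(np.allclose(both_diff, expanded))\n",
    "```\n",
    "\n",
    "In this toy case the seasonal difference removes the repeating pattern and leaves only the constant effect of the trend. Note that each first difference drops one observation and each seasonal difference drops `s` observations."
   ]
  },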
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "695ca46d08a15e74721534761261fa8ced799313"
   },
   "source": [
    "Now, let's plot the series and check how it behaves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "53d39c22f9162eb0954797b852ad7010f1e00e96"
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(22,6))\n",
    "sns.lineplot(x=rio.index, y=rio['Temp'])\n",
    "plt.title('Temperature Variation in Rio De Janeiro from 1900 until 2012')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "5fb45d99d4c4ad715133575234d0bbf5dcd00d76"
   },
   "outputs": [],
   "source": [
    "# I'm going to create a pivot table to plot the monthly temperatures through the years\n",
    "rio['month'] = rio.index.month\n",
    "rio['year'] = rio.index.year\n",
    "pivot = pd.pivot_table(rio, values='Temp', index='month', columns='year', aggfunc='mean')\n",
    "pivot.plot(figsize=(20,6))\n",
    "plt.title('Yearly Rio temperatures')\n",
    "plt.xlabel('Months')\n",
    "plt.ylabel('Temperatures')\n",
    "plt.xticks([x for x in range(1,13)])\n",
    "plt.legend().remove()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2e7a2b3788a809533c519f485e417e860de74b93"
   },
   "source": [
    "The series clearly has some seasonality: the higher temperatures occur between November and February and the lower ones between July and September. To make things clearer, I'll merge these lines into a single line by averaging the monthly levels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f7627e89b1fc27ae6327dab1e75dd885bcc244b6"
   },
   "outputs": [],
   "source": [
    "monthly_seasonality = pivot.mean(axis=1)\n",
    "monthly_seasonality.plot(figsize=(20,6))\n",
    "plt.title('Monthly Temperatures in Rio De Janeiro')\n",
    "plt.xlabel('Months')\n",
    "plt.ylabel('Temperature')\n",
    "plt.xticks([x for x in range(1,13)])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "d8954d6f71a4d8c4300fc4eb81eb08f363d88667"
   },
   "source": [
    "Now I'm going to check whether there is a trend through the years in this series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "58718ef51d2dd7fbc7f99484bf32e9f1abbd3a5b"
   },
   "outputs": [],
   "source": [
    "year_avg = pd.pivot_table(rio, values='Temp', index='year', aggfunc='mean')\n",
    "year_avg['10 Years MA'] = year_avg['Temp'].rolling(10).mean()\n",
    "year_avg[['Temp','10 Years MA']].plot(figsize=(20,6))\n",
    "plt.title('Yearly AVG Temperatures in Rio De Janeiro')\n",
    "plt.xlabel('Years')\n",
    "plt.ylabel('Temperature')\n",
    "plt.xticks([x for x in range(1900,2012,3)])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2a710e740a131546b07f7159e0d898e2fe623806"
   },
   "source": [
    "We can confirm that there is a steady increasing trend: the average temperature rose from about 23.5º to 24.5º, roughly 4.25% in a little over 100 years.\n",
    "\n",
    "Before we go on, I'm going to split the data into training, validation and test sets. After training the model, I will use the last 5 years for validation and testing: 48 months for a month-by-month validation (walk forward) and 12 months to extrapolate into the future and compare against the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "74afa879672a923d4657a1faeb706885967f3498"
   },
   "outputs": [],
   "source": [
    "train = rio[:-60].copy()\n",
    "val = rio[-60:-12].copy()\n",
    "test = rio[-12:].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "381912fc2850191c1c3dc0a4f7516348a83274f8"
   },
   "source": [
    "Before creating the forecasts, we will create a baseline forecast on the validation set; our model should achieve a smaller error than this one.\n",
    "\n",
    "The baseline uses the previous month's value as the forecast for the next month:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "698b4603160233cb777fcfccf7f5030a61379547"
   },
   "outputs": [],
   "source": [
    "# Excluding the first line, as it has NaN values\n",
    "baseline = val['Temp'].shift()\n",
    "baseline.dropna(inplace=True)\n",
    "baseline.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "464f004a1dad131efcf93a08697c24bafc7605b3"
   },
   "source": [
    "I'm also going to create a function that uses the [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation) to measure the error, but you are free to use another metric:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "1be0502eac073a3ec31bac0a7e9f2d73a48034ed"
   },
   "outputs": [],
   "source": [
    "def measure_rmse(y_true, y_pred):\n",
    "    # Root of the mean squared error, equivalent to sqrt(mean_squared_error(y_true, y_pred))\n",
    "    return np.sqrt(np.square(np.subtract(y_true, y_pred)).mean())\n",
    "\n",
    "# Using the function with the baseline values\n",
    "rmse_base = measure_rmse(val.iloc[1:,0],baseline)\n",
    "print(f'The RMSE of the baseline that we will try to beat is {round(rmse_base,4)} degrees Celsius')"
   ]
  },
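  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the RMSE over $n$ forecasts can be written in the notation used earlier ($Y_{t}$ for the real value and $\\hat{y}_{t}$ for the forecast) as:\n",
    "\n",
    "$$\\mathrm{RMSE} = \\sqrt{\\frac{1}{n}\\sum_{t=1}^{n}\\left(Y_{t} - \\hat{y}_{t}\\right)^{2}}$$\n",
    "\n",
    "As a small made-up example, forecast errors of $+1$, $-1$ and $+2$ degrees give $\\sqrt{(1 + 1 + 4)/3} = \\sqrt{2} \\approx 1.41$ degrees. Because the errors are squared before averaging, large misses weigh more than small ones, and the result is in the same unit as the series."
   ]
  },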
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "eb41cd1807bbd76bf576fc8210ed09a09e2dfd3b"
   },
   "source": [
    "As we can see, the series has a small uptrend and it appears that there is some seasonality, with higher temperatures at the beginning and end of the year and lower temperatures around the middle of the year.\n",
    "\n",
    "To fit an ARIMA-type forecast model, the series must be stationary (constant mean, variance and autocorrelation over time).\n",
    "\n",
    "One way to check whether the series is stationary is the **adfuller function** (Augmented Dickey-Fuller test): if the p-value is lower than 5% (the usual threshold for this kind of study), the series can be treated as stationary and you can start creating your model. \n",
    "\n",
    "If the series isn't stationary you can apply a data transformation such as the natural logarithm, deflation, differencing, etc.\n",
    "\n",
    "Below is the function that I used to check the stationarity. It plots: \n",
    "\n",
    " * The series itself;\n",
    " * The autocorrelation function **(ACF)**:\n",
    "      * It shows the correlation between the current temperatures and lagged versions of the series.\n",
    " * The partial autocorrelation function **(PACF)**:\n",
    "     * It shows the correlation between the current temperatures and a lagged version of the series, excluding the effects of the shorter lags. For example, it shows the direct influence of lag 3 on the current temperatures, excluding the effects of lags 1 and 2.\n",
    " * The distribution of the values.\n",
    "\n",
    "For further reading, see the materials on this amazing website made by Mr. Robert Nau: [Duke University](http://people.duke.edu/~rnau/411home.htm). You can also check [Jason Brownlee's](https://machinelearningmastery.com) website, which has a lot of time series content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "e13ffe3696e6bc9f2e739ff0ba1db6b40aca2c58"
   },
   "outputs": [],
   "source": [
    "def check_stationarity(y, lags_plots=48, figsize=(22,8)):\n",
    "    \"Use Series as parameter\"\n",
    "    \n",
    "    # Creating plots of the DF\n",
    "    y = pd.Series(y)\n",
    "    fig = plt.figure()\n",
    "\n",
    "    ax1 = plt.subplot2grid((3, 3), (0, 0), colspan=2)\n",
    "    ax2 = plt.subplot2grid((3, 3), (1, 0))\n",
    "    ax3 = plt.subplot2grid((3, 3), (1, 1))\n",
    "    ax4 = plt.subplot2grid((3, 3), (2, 0), colspan=2)\n",
    "\n",
    "    y.plot(ax=ax1, figsize=figsize)\n",
    "    ax1.set_title('Rio De Janeiro Temperature Variation')\n",
    "    plot_acf(y, lags=lags_plots, zero=False, ax=ax2);\n",
    "    plot_pacf(y, lags=lags_plots, zero=False, ax=ax3);\n",
    "    # histplot replaces the deprecated distplot; kde=True keeps the density curve\n",
    "    sns.histplot(y, bins=int(sqrt(len(y))), kde=True, ax=ax4)\n",
    "    ax4.set_title('Distribution Chart')\n",
    "\n",
    "    plt.tight_layout()\n",
    "    \n",
    "    print('Results of Dickey-Fuller Test:')\n",
    "    adfinput = adfuller(y)\n",
    "    adftest = pd.Series(adfinput[0:4], index=['Test Statistic','p-value','Lags Used','Number of Observations Used'])\n",
    "    adftest = round(adftest,4)\n",
    "    \n",
    "    for key, value in adfinput[4].items():\n",
    "        adftest[\"Critical Value (%s)\"%key] = value.round(4)\n",
    "        \n",
    "    print(adftest)\n",
    "    \n",
    "    # Look the values up by label instead of by position\n",
    "    if round(adftest['Test Statistic'], 2) < round(adftest['Critical Value (5%)'], 2):\n",
    "        print('\\nThe Test Statistic is lower than the Critical Value of 5%.\\nThe series seems to be stationary')\n",
    "    else:\n",
    "        print(\"\\nThe Test Statistic is higher than the Critical Value of 5%.\\nThe series isn't stationary\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "a679135a4bc49576647892edd1129091c44e3a97"
   },
   "outputs": [],
   "source": [
    "# The first approach is to check the series without any transformation\n",
    "check_stationarity(train['Temp'])"
   ]
  },
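  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some intuition for how a yearly cycle shows up in an ACF plot, here is a minimal sketch on a synthetic series. The helper `sample_acf` and the cosine series are assumptions made for illustration; they are not the Rio data and not the estimator that `plot_acf` uses internally:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def sample_acf(x, lag):\n",
    "    # Sample autocorrelation of x at a positive lag\n",
    "    x = np.asarray(x, dtype=float)\n",
    "    centered = x - x.mean()\n",
    "    return np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2)\n",
    "\n",
    "# Hypothetical monthly series: a pure 12-month cycle over 20 years\n",
    "months = np.arange(240)\n",
    "cycle = np.cos(2 * np.pi * months / 12)\n",
    "\n",
    "for lag in (6, 12):\n",
    "    print(lag, round(sample_acf(cycle, lag), 3))\n",
    "```\n",
    "\n",
    "For a clean yearly cycle the autocorrelation is strongly negative at lag 6 (opposite seasons) and strongly positive at lag 12 (same season one year later). Real temperature data is noisier, so the values are smaller in magnitude."
   ]
  },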
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2ca000e74194e548482d619ef98d20416a381427"
   },
   "source": [
    "The series has an interesting behavior: there is a significant negative autocorrelation starting at lag 6 and repeating every 12 months. It is caused by the difference between the seasons: if today is winter with cold temperatures, in 6 months we will have the higher temperatures of summer. These temperatures usually move in opposite directions.\n",
    "\n",
    "Also, at lag 12 and at every multiple of 12 there is a significant positive autocorrelation. The **PACF** shows a positive spike at the first lag and drops to negative values in the following lags.\n",
    "\n",
    "This behavior of the **ACF** and **PACF** plots suggests an AR(1) model and also a first seasonal difference ($Y_{t} - Y_{t-12}$). I'll run the stationarity function again on the first seasonal difference to see whether we will need some SAR(P) or SMA(Q) parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "9d8c35f82870fcc46d4186a1d8f98201d8b0e6b8"
   },
   "outputs": [],
   "source": [
    "check_stationarity(train['Temp'].diff(12).dropna())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "8d499eb7033fcd976dad639fd08033c5cef57c37"
   },
   "source": [
    "As the plots above show, the first **ACF** lags decay gradually, while the **PACF** drops inside the confidence interval after the third lag. This is an **AR** signature with a parameter of 3, so this is an **AR(3)** model.\n",
    "\n",
    "After the first seasonal difference, the **ACF** and **PACF** show a significant drop at the 12th lag, which is an **SMA** signature with a parameter of 1 lag. In summary, this is an **SMA(1) with a first seasonal difference**.\n",
    "\n",
    "Initially I'm going to work with the (p, d, q) orders (3, 0, 0) and the seasonal (P, D, Q, S) orders (0, 1, 1, 12). As the series has a clear uptrend, I'm going to include a constant in the model (trend `'c'`). \n",
    "\n",
    "To start forecasting the validation set, I'm going to create a function that makes a one-step forecast for each month of the validation set, so that we can measure the error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "00fffafae7a58bfbb937164765047a756c11f401"
   },
   "outputs": [],
   "source": [
    "def walk_forward(training_set, validation_set, params):\n",
    "    '''\n",
    "    params: a tuple with the SARIMA parameters in the form ((p, d, q), (P, D, Q, S), trend)\n",
    "    '''\n",
    "    history = [x for x in training_set.values]\n",
    "    prediction = list()\n",
    "    \n",
    "    # Unpacking the SARIMA parameters\n",
    "    pdq, PDQS, trend = params\n",
    "\n",
    "    # Forecasting one period ahead in the validation set\n",
    "    for month in range(len(validation_set)):\n",
    "        # Refit the model on all the values observed so far\n",
    "        model = sm.tsa.statespace.SARIMAX(history, order=pdq, seasonal_order=PDQS, trend=trend)\n",
    "        result = model.fit(disp=False)\n",
    "        yhat = result.predict(start=len(history), end=len(history))\n",
    "        prediction.append(yhat[0])\n",
    "        # Reveal the real value (by position) before forecasting the next month\n",
    "        history.append(validation_set.iloc[month])\n",
    "        \n",
    "    return prediction"
   ]
  },
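  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the orders chosen above, (3, 0, 0)(0, 1, 1, 12) with a constant, can be written out in the notation of the earlier sections. This is a sketch under some assumptions: $B$ denotes the backshift operator ($B\\,Y_{t} = Y_{t-1}$), which the text above does not use, and the minus sign on the moving average term follows the MA formula given earlier, while software packages may use the opposite sign convention:\n",
    "\n",
    "$$(1 - \\theta_{1}B - \\theta_{2}B^{2} - \\theta_{3}B^{3})(1 - B^{12})\\,Y_{t} = \\mu + (1 - \\Theta_{1}B^{12})\\,e_{t}$$\n",
    "\n",
    "Writing $w_{t} = Y_{t} - Y_{t-12}$ for the seasonally differenced series, this expands to:\n",
    "\n",
    "$$w_{t} = \\mu + \\theta_{1}w_{t-1} + \\theta_{2}w_{t-2} + \\theta_{3}w_{t-3} + e_{t} - \\Theta_{1}e_{t-12}$$\n",
    "\n",
    "In words: the year-over-year change is explained by a constant, by the year-over-year changes of the three previous months, and by a fraction of the forecast error made 12 months ago."
   ]
  },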
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f38e7aac69c42217c77e54c84c86a7396a420281"
   },
   "outputs": [],
   "source": [
    "# Let's test it in the validation set\n",
    "train = rio[:-60].copy()\n",
    "val['Pred'] = walk_forward(train['Temp'], val['Temp'], ((3,0,0),(0,1,1,12),'c'))"
   ]
  },
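  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The walk-forward pattern itself does not depend on SARIMA. As a minimal sketch, the same loop can be written around any one-step forecaster. The names `walk_forward_generic` and `seasonal_naive` and the synthetic series are assumptions for illustration and are not used elsewhere in this notebook:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def walk_forward_generic(train, validation, forecast_one_step):\n",
    "    # Forecast one step at a time, then reveal the real value\n",
    "    history = list(train)\n",
    "    predictions = []\n",
    "    for actual in validation:\n",
    "        predictions.append(forecast_one_step(history))\n",
    "        history.append(actual)\n",
    "    return np.array(predictions)\n",
    "\n",
    "def seasonal_naive(history, season=12):\n",
    "    # Stand-in forecaster: repeat the value observed one season ago\n",
    "    return history[-season]\n",
    "\n",
    "# Hypothetical data: three identical 12-month cycles\n",
    "cycle = np.cos(2 * np.pi * np.arange(36) / 12)\n",
    "preds = walk_forward_generic(cycle[:24], cycle[24:], seasonal_naive)\n",
    "print(np.allclose(preds, cycle[24:]))\n",
    "```\n",
    "\n",
    "The important design choice is that each forecast only sees values that would have been available at that point in time, which mimics how the model would be used in practice."
   ]
  },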
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "5f6ec1dd3cafb453c4f753fd6c2e7b4a0a041cb2"
   },
   "outputs": [],
   "source": [
    "# Measuring the error of the prediction\n",
    "rmse_pred = measure_rmse(val['Temp'], val['Pred'])\n",
    "\n",
    "print(f\"The RMSE of the SARIMA(3,0,0),(0,1,1,12),'c' model was {round(rmse_pred,4)} degrees Celsius\")\n",
    "print(f\"It's a decrease of {round((1-rmse_pred/rmse_base)*100,2)}% in the RMSE\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "43aaaffb83105fb20a86c3244d7d672c96b33d20"
   },
   "outputs": [],
   "source": [
    "# Creating the error column\n",
    "val['Error'] = val['Temp'] - val['Pred']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "b0d4d3117ebb29320a9e47b5c1e8e7cbcaf8a6d2"
   },
   "source": [
    "It's always important to check the residuals, so I'm going to create a function that plots some charts to help us visualize them.\n",
    "\n",
    "I'm going to plot the following charts:\n",
    "* Current and predicted values through time.\n",
    "* Residuals vs predicted values in a scatterplot.\n",
    "* QQ plot comparing the distribution of the errors with its ideal (normal) distribution.\n",
    "* Autocorrelation plot of the residuals to see if there is some correlation left."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "8eb5e920af6c5823340c00503c82c71575f6f005"
   },
   "outputs": [],
   "source": [
    "def plot_error(data, figsize=(20,8)):\n",
    "    '''\n",
    "    The data must have 3 columns in this order: Temperature, Prediction, Error\n",
    "    '''\n",
    "    plt.figure(figsize=figsize)\n",
    "    ax1 = plt.subplot2grid((2,2), (0,0))\n",
    "    ax2 = plt.subplot2grid((2,2), (0,1))\n",
    "    ax3 = plt.subplot2grid((2,2), (1,0))\n",
    "    ax4 = plt.subplot2grid((2,2), (1,1))\n",
    "    \n",
    "    #Plotting the Current and Predicted values\n",
    "    ax1.plot(data.iloc[:,0:2])\n",
    "    ax1.legend(['Real','Pred'])\n",
    "    ax1.set_title('Current and Predicted Values')\n",
    "    \n",
    "    # Residual vs Predicted values\n",
    "    ax2.scatter(data.iloc[:,1], data.iloc[:,2])\n",
    "    ax2.set_xlabel('Predicted Values')\n",
    "    ax2.set_ylabel('Errors')\n",
    "    ax2.set_title('Errors versus Predicted Values')\n",
    "    \n",
    "    ## QQ Plot of the residual\n",
    "    sm.graphics.qqplot(data.iloc[:,2], line='r', ax=ax3)\n",
    "    \n",
    "    # Autocorrelation plot of the residual\n",
    "    plot_acf(data.iloc[:,2], lags=(len(data.iloc[:,2])-1),zero=False, ax=ax4)\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "94e855ba19e3fb8f6e0f91e55c3dfb2ff26f86bc"
   },
   "outputs": [],
   "source": [
    "# We need to remove some columns to plot the charts\n",
    "val.drop(['month','year'], axis=1, inplace=True)\n",
    "val.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "879bb8c9d859badf80f269516280e92f6af5d928"
   },
   "outputs": [],
   "source": [
    "plot_error(val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "da3c399aa031ae871b8196b590e19d2aac985c0a"
   },
   "source": [
    "Analyzing the plots above, we can see that the predictions fit the current values very well.\n",
    "\n",
    "The **Error vs Predicted values** plot shows the errors spread between -1.5 and +1.5 across the whole range of predicted temperatures, with no visible pattern as the temperature increases.\n",
    "\n",
    "The QQ plot shows a normal pattern with a few small outliers.\n",
    "\n",
    "The autocorrelation plot shows a positive spike just outside the confidence interval at the second lag, but I believe that there is no need for more changes.\n",
    "\n",
    "Finally, it's time to extrapolate the prediction to the **test set** for the last 12 months:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "58d266f5dde23bc097bf21db82df6ac2d47d02a4"
   },
   "outputs": [],
   "source": [
    "# Concatenating the training and validation sets to create the new training series:\n",
    "future = pd.concat([train['Temp'], val['Temp']])\n",
    "future.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "dc6560a6f4a31563eaa89f3c13571a96712b0b5c"
   },
   "outputs": [],
   "source": [
    "# Using the same parameters of the fitted model\n",
    "model = sm.tsa.statespace.SARIMAX(future, order=(3,0,0), seasonal_order=(0,1,1,12), trend='c')\n",
    "result = model.fit(disp=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "67db2b62b528d188329093fff95be328c7921f14"
   },
   "source": [
    "Now I'm going to create a new column in the test set with the predicted values and compare them against the real values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6d014d105bef271120431b54ad0d6871b3e3cc0d"
   },
   "outputs": [],
   "source": [
    "# Forecast exactly one value per row of the test set\n",
    "test['Pred'] = result.predict(start=len(future), end=len(future) + len(test) - 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0eeb5bd5b7160f3b3fb1bac4abbe5f7cb3983bf3"
   },
   "outputs": [],
   "source": [
    "test[['Temp', 'Pred']].plot(figsize=(22,6))\n",
    "plt.title('Current Values compared to the Extrapolated Ones')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "e716e94396ea93c068c4bfcbdb46920e0ec8edcb"
   },
   "source": [
    "It seems that the SARIMA parameters were well fitted: the predicted values follow the real values and also the seasonal pattern.\n",
    "\n",
    "Finally, I'll evaluate the model with the RMSE on the test set (baseline against the extrapolation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "07c72fc8637d620f00a95497ea48d9f1a1b213e1"
   },
   "outputs": [],
   "source": [
    "# Naive baseline: the previous month's temperature is the forecast for the next month\n",
    "test_baseline = test['Temp'].shift()\n",
    "\n",
    "# The first test month has no predecessor in the test set, so use the last validation month\n",
    "test_baseline.iloc[0] = val['Temp'].iloc[-1]\n",
    "\n",
    "rmse_test_base = measure_rmse(test['Temp'],test_baseline)\n",
    "rmse_test_extrap = measure_rmse(test['Temp'], test['Pred'])\n",
    "\n",
    "print(f'The RMSE for the test baseline was {round(rmse_test_base,2)} degrees Celsius')\n",
    "print(f'The RMSE for the test extrapolation was {round(rmse_test_extrap,2)} degrees Celsius')\n",
    "print(f'That is an improvement of {-round((rmse_test_extrap/rmse_test_base-1)*100,2)}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a686604a1c62c947d237d9215be021f6a0c06b43"
   },
   "source": [
    "I hope you liked this analysis. If you have any questions or comments, feel free to talk with me. You can also find me on [LinkedIn](https://www.linkedin.com/in/leandro-rabelo-08722824) or on [Twitter](https://twitter.com/leandrovrabelo)\n",
    "\n",
    "Thanks!!!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}