# Impact of Trends

Time-series data fluctuates over time, and on top of those fluctuations its overall level may gradually rise or fall. Such gradual, non-periodic changes are referred to as trends. A trend shifts the mean, variance, and other statistics of the data over time, which makes the data harder to predict. On this page, we remove the trend component from time-series data using Python.
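To make the effect of a trend on the statistics concrete, here is a minimal sketch that compares the mean of the first and second half of a series, with and without a slow drift. The series, the seed, and the helper `half_means` are all made up for illustration and are not the sample data used later on this page.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
t = np.arange(400)

# Hypothetical series: constant level + 14-step cycle + small noise
stationary = 10 + np.cos(2 * np.pi * t / 14) + rng.normal(scale=0.1, size=t.size)
# Same series with a slow linear drift added on top
trended = stationary + 0.01 * t


def half_means(series):
    """Return the mean of the first half and of the second half."""
    half = len(series) // 2
    return series[:half].mean(), series[half:].mean()


print("no trend  :", half_means(stationary))
print("with trend:", half_means(trended))
```

Without the drift the two halves have nearly the same mean, while with the drift the second half sits clearly higher. A model trained on the first half would therefore see a different target distribution than the one it is asked to predict.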

import japanize_matplotlib
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


## Generate sample data

date_list = pd.date_range("2021-01-01", periods=720, freq="D")
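value_list = [
    10
    + np.cos(np.pi * i / 28.0) * (i % 3 > 0)  # 56-day cycle, zeroed when i is a multiple of 3
    + np.cos(np.pi * i / 14.0) * (i % 5 > 0)  # 28-day cycle, zeroed when i is a multiple of 5
    + np.cos(np.pi * i / 7.0)  # 14-day cycle
    + (i / 10) ** 1.1 / 20  # slowly increasing trend
    for i, di in enumerate(date_list)
]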

df = pd.DataFrame(
    {
        "日付": date_list,
        "y": value_list,
    }
)

df.head(10)


|   | 日付 | y |
|---|---|---|
| 0 | 2021-01-01 | 11.000000 |
| 1 | 2021-01-02 | 12.873581 |
| 2 | 2021-01-03 | 12.507900 |
| 3 | 2021-01-04 | 11.017651 |
| 4 | 2021-01-05 | 11.320187 |
| 5 | 2021-01-06 | 10.246560 |
| 6 | 2021-01-07 | 9.350058 |
| 7 | 2021-01-08 | 9.740880 |
| 8 | 2021-01-09 | 9.539117 |
| 9 | 2021-01-10 | 8.987155 |

plt.figure(figsize=(10, 5))
sns.lineplot(x=df["日付"], y=df["y"])


## Forecast Time Series Data with XGBoost

df["曜日"] = df["日付"].dt.weekday
df["年初からの日数%14"] = df["日付"].dt.dayofyear % 14
df["年初からの日数%28"] = df["日付"].dt.dayofyear % 28
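One detail of the day-of-year features may be worth checking: the day of the year restarts at 1 every January, so "day of year modulo 14" is not the same thing as "elapsed days modulo 14" once a series crosses a year boundary. The sketch below illustrates this with the standard library's `datetime` instead of pandas; the helper names `day_of_year_mod` and `elapsed_mod` are hypothetical and only exist for this example.

```python
from datetime import date, timedelta

start = date(2021, 1, 1)  # same start date as the sample data


def day_of_year_mod(d, period):
    """Day of the year (1-based) modulo `period`."""
    return d.timetuple().tm_yday % period


def elapsed_mod(d, period):
    """Days elapsed since `start` modulo `period`."""
    return (d - start).days % period


# Walk across the 2021 -> 2022 boundary and print both encodings side by side
for offset in (363, 364, 365, 366):
    d = start + timedelta(days=offset)
    print(d, day_of_year_mod(d, 14), elapsed_mod(d, 14))
```

Around the new year the day-of-year encoding yields the same value on two consecutive days, so its phase relative to the elapsed-day count shifts by one. The sample data spans 720 days from 2021-01-01 and therefore crosses one such boundary; whether this matters in practice depends on the model and the data.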

def get_trend(timeseries, deg=3, trainN=0):
    """Create a trend line for time-series data

    Args:
        timeseries(pd.Series) : Time series data
        deg(int) : Degree of polynomial
        trainN(int): Number of data used to estimate the coefficients of the polynomial

    Returns:
        pd.Series: Time series data corresponding to trends
    """
    # Use the whole series when no training length is given
    if trainN == 0:
        trainN = len(timeseries)

    x = list(range(len(timeseries)))
    y = timeseries.values
    # Fit on the first trainN points only, then evaluate over the full range
    coef = np.polyfit(x[:trainN], y[:trainN], deg)
    trend = np.poly1d(coef)(x)
    return pd.Series(data=trend, index=timeseries.index)

trainN = 500
df["Trend"] = get_trend(df["y"], trainN=trainN, deg=2)

plt.figure(figsize=(10, 5))
sns.lineplot(x=df["日付"], y=df["y"])
sns.lineplot(x=df["日付"], y=df["Trend"])
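As a sanity check on the polynomial fit, the following minimal sketch fits `np.polyfit` on the first part of a series only and then extrapolates with `np.poly1d`. It assumes a noise-free quadratic trend with made-up coefficients, which is an idealized case and not the sample data above.

```python
import numpy as np

x = np.arange(100)
# Hypothetical quadratic trend with known coefficients (highest degree first)
true_coef = [0.002, 0.1, 5.0]
true_trend = np.poly1d(true_coef)(x)

train_n = 60  # estimate the coefficients from the first 60 points only

coef = np.polyfit(x[:train_n], true_trend[:train_n], deg=2)
extrapolated = np.poly1d(coef)(x)  # evaluate over the full range

# Largest gap between the extrapolated and the true trend on the unseen part
max_err = np.abs(extrapolated[train_n:] - true_trend[train_n:]).max()
print("estimated coefficients:", coef)
print("largest extrapolation error:", max_err)
```

When the true trend really is a polynomial of the chosen degree, the extrapolation matches almost exactly. When it is not, as with the `(i / 10) ** 1.1 / 20` term in the sample data, the fit is only an approximation and the gap can grow the further the polynomial is extrapolated.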


X = df[["曜日", "年初からの日数%14", "年初からの日数%28"]]
y = df["y"]

trainX, trainy = X[:trainN], y[:trainN]
testX, testy = X[trainN:], y[trainN:]
trend_train, trend_test = df["Trend"][:trainN], df["Trend"][trainN:]


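XGBoost has no way of knowing that the data drifts slowly between the training period and the test period. Its trees predict from values learned on the training targets, so the model cannot follow a trend that continues past the training data. As a result, the further into the future you forecast, the further off the predictions become. For XGBoost to forecast well, the distribution of y must be similar in the training and test data.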
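The limitation can be illustrated without XGBoost itself. The sketch below uses a toy piecewise-constant model (the training mean of y for each phase of a 7-step cycle) as a stand-in for a tree's leaf values. The data and the model are assumptions made for this example only.

```python
import numpy as np

t = np.arange(200)
phase = t % 7  # the only feature the toy model sees
y = 0.05 * t + np.cos(2 * np.pi * t / 7)  # upward trend + 7-step cycle
train_n = 140

# Toy piecewise-constant model: one value per phase, equal to the mean of the
# training targets with that phase (a stand-in for tree leaves, not XGBoost).
leaf_value = np.array(
    [y[:train_n][phase[:train_n] == p].mean() for p in range(7)]
)
prediction = leaf_value[phase[train_n:]]

print("max training y:", y[:train_n].max())
print("max prediction:", prediction.max())
print("min test y    :", y[train_n:].min())
```

Every prediction is an average of training targets, so it stays inside the range seen during training, while the test values have already moved above that range because of the trend.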

import xgboost as xgb
from sklearn.metrics import mean_squared_error

regressor = xgb.XGBRegressor(max_depth=5).fit(trainX, trainy)
prediction = regressor.predict(testX)

plt.figure(figsize=(10, 5))
sns.lineplot(x=df["日付"][trainN:], y=prediction)
sns.lineplot(x=df["日付"][trainN:], y=testy)

plt.legend(["Model output", "Ground truth"], bbox_to_anchor=(0.0, 0.78, 0.28, 0.102))
print(f"MSE = {mean_squared_error(testy, prediction)}")

MSE = 2.815118389938834


#### Forecasting with the trend taken into account

We first subtract the trend from the observed values and train XGBoost to predict the detrended values. The trend is then added back to the XGBoost prediction to obtain the final prediction.

regressor = xgb.XGBRegressor(max_depth=5).fit(trainX, trainy - trend_train)
prediction = regressor.predict(testX)
prediction = [pred_i + trend_i for pred_i, trend_i in zip(prediction, trend_test)]

plt.figure(figsize=(10, 5))
sns.lineplot(x=df["日付"][trainN:], y=prediction)
sns.lineplot(x=df["日付"][trainN:], y=testy)

plt.legend(["Model output + Trend", "Ground truth"], bbox_to_anchor=(0.0, 0.78, 0.28, 0.102))
print(f"MSE = {mean_squared_error(testy, prediction)}")

MSE = 0.46014173311011325
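The same subtract, predict, add-back pattern can be reproduced on a small synthetic series using only NumPy. This is a sketch under stated assumptions: the data is made up, the trend is assumed to be linear, and `phase_mean_model` is a hypothetical stand-in for the regressor, not XGBoost.

```python
import numpy as np

t = np.arange(200)
phase = t % 7
y = 0.05 * t + np.cos(2 * np.pi * t / 7)  # linear trend + 7-step cycle
train_n = 140


def phase_mean_model(train_phase, train_target, test_phase):
    """Toy regressor: predict the per-phase mean of the training target."""
    means = np.array([train_target[train_phase == p].mean() for p in range(7)])
    return means[test_phase]


def mse(actual, predicted):
    return float(np.mean((actual - predicted) ** 2))


# Estimate the trend from the training span only, then extrapolate it
coef = np.polyfit(t[:train_n], y[:train_n], deg=1)
trend = np.poly1d(coef)(t)

# 1) Predict y directly
raw_pred = phase_mean_model(phase[:train_n], y[:train_n], phase[train_n:])

# 2) Predict the detrended values, then add the extrapolated trend back
resid_pred = phase_mean_model(
    phase[:train_n], (y - trend)[:train_n], phase[train_n:]
)
detrended_pred = resid_pred + trend[train_n:]

mse_raw = mse(y[train_n:], raw_pred)
mse_detrended = mse(y[train_n:], detrended_pred)
print("MSE without detrending:", mse_raw)
print("MSE with detrending   :", mse_detrended)
```

Here the detrended variant has a much lower error because the extrapolated line carries the part of the signal that the per-phase averages cannot represent. How large the gain is on real data depends on how well the fitted trend extrapolates.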

