import datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
from statsmodels.tsa import stattools
data = pd.read_csv("sample.csv")
data.head(10)
 | Date | Temp |
---|---|---|
0 | 1981-01-01 | 20.7 |
1 | 1981-01-02 | 17.9 |
2 | 1981-01-03 | 18.8 |
3 | 1981-01-04 | 14.6 |
4 | 1981-01-05 | 15.8 |
5 | 1981-01-06 | 15.8 |
6 | 1981-01-07 | 15.8 |
7 | 1981-01-08 | 17.4 |
8 | 1981-01-09 | 21.8 |
9 | 1981-01-10 | 20.0 |
The Date column is currently read as an object dtype, i.e., as strings. To treat it as a timestamp, convert it to a datetime type with datetime.datetime.strptime from Python's standard datetime module.
data["Date"] = data["Date"].apply(
    lambda x: datetime.datetime.strptime(str(x), "%Y-%m-%d")
)
print(f"Date column dtype: {data['Date'].dtype}")
Date column dtype: datetime64[ns]
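As an alternative to applying strptime row by row, pandas.to_datetime can convert the whole column in one vectorized call. The sketch below is only an illustration: it uses a small hypothetical frame named demo instead of sample.csv, and assumes the dates follow the same %Y-%m-%d format.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of sample.csv
demo = pd.DataFrame(
    {
        "Date": ["1981-01-01", "1981-01-02", "1981-01-03"],
        "Temp": [20.7, 17.9, 18.8],
    }
)

# Vectorized conversion; with an explicit format, malformed dates raise an error
demo["Date"] = pd.to_datetime(demo["Date"], format="%Y-%m-%d")
print(f"Date column dtype: {demo['Date'].dtype}")
```

Passing parse_dates=["Date"] to pandas.read_csv is another option if the conversion should happen at load time.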
To begin, we briefly review what the data looks like. We will use pandas.DataFrame.describe to check some simple statistics for the Temp column.
data.describe()
 | Temp |
---|---|
count | 3650.000000 |
mean | 11.177753 |
std | 4.071837 |
min | 0.000000 |
25% | 8.300000 |
50% | 11.000000 |
75% | 14.000000 |
max | 26.300000 |
Use seaborn.lineplot to see what the cycle looks like.
plt.figure(figsize=(12, 6))
sns.lineplot(x=data["Date"], y=data["Temp"])
plt.ylabel("Temp")
plt.grid(axis="x")
plt.grid(axis="y", color="r", alpha=0.3)
plt.show()
plt.figure(figsize=(12, 6))
plt.hist(x=data["Temp"], rwidth=0.8)
plt.xlabel("Temp")
plt.ylabel("Number of days")
plt.grid(axis="y")
plt.show()
Use pandas.plotting.autocorrelation_plot to check the autocorrelation and, from it, the periodicity of the time series. Roughly speaking, autocorrelation measures how well a signal matches a time-shifted copy of itself, expressed as a function of the size of the time shift (the lag). The vertical line at lag 365 marks a shift of one year.
plt.figure(figsize=(12, 6))
pd.plotting.autocorrelation_plot(data["Temp"])
plt.grid()
plt.axvline(x=365)
plt.xlabel("lag")
plt.ylabel("autocorrelation")
plt.show()
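To make the definition concrete, the autocorrelation at a single lag can be computed by hand. The following is a minimal sketch, not the code behind autocorrelation_plot: the function name autocorr is hypothetical, and the input is a synthetic sine wave with a 365-day period rather than the temperature data.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    # Autocovariance at the given lag, normalized by the lag-0 autocovariance
    c_lag = np.sum(centered[: n - lag] * centered[lag:]) / n
    c_0 = np.sum(centered ** 2) / n
    return c_lag / c_0


# Synthetic daily series with a 365-day cycle (an assumption for illustration)
t = np.arange(3650)
series = 10 + 5 * np.sin(2 * np.pi * t / 365)

r_year = autocorr(series, 365)  # shift by one full cycle
r_half = autocorr(series, 182)  # shift by roughly half a cycle
print(f"lag 365: {r_year:.2f}, lag 182: {r_half:.2f}")
```

For a series with a yearly cycle, the value is strongly positive when the shift is one full cycle and strongly negative when it is about half a cycle, which is the pattern to look for around the line at lag 365 in the plot.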
We check whether the data follow a unit root process. The Augmented Dickey-Fuller (ADF) test, available as statsmodels.tsa.stattools.adfuller, tests the null hypothesis that the series has a unit root.
stattools.adfuller(data["Temp"], autolag="AIC")
(-4.444804924611697,
0.00024708263003610177,
20,
3629,
{'1%': -3.4321532327220154,
'5%': -2.862336767636517,
'10%': -2.56719413172842},
16642.822304301197)
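The returned tuple holds, in order, the test statistic, the p-value, the number of lags used, the number of observations used, the critical values, and the best information criterion value. As a rough sketch of how to read it, the snippet below unpacks the values copied from the output above; the 0.05 significance level and the variable names are assumptions made for illustration.

```python
# Values copied from the adfuller output shown above
result = (
    -4.444804924611697,
    0.00024708263003610177,
    20,
    3629,
    {
        "1%": -3.4321532327220154,
        "5%": -2.862336767636517,
        "10%": -2.56719413172842,
    },
    16642.822304301197,
)

adf_stat, p_value, used_lag, n_obs, critical_values, ic_best = result

alpha = 0.05  # assumed significance level
reject_unit_root = p_value < alpha

print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.5f}, lags used: {used_lag}")
for level, threshold in critical_values.items():
    # The null is rejected at a given level when the statistic is below the threshold
    print(f"  {level}: statistic below critical value -> {adf_stat < threshold}")
print(f"Reject the unit-root null at alpha={alpha}: {reject_unit_root}")
```

Here the statistic lies below all three critical values and the p-value is far below 0.05, so the unit-root null hypothesis is rejected for this series.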
The trend line is drawn by fitting a polynomial to the time series with numpy.polyfit and evaluating it with numpy.poly1d; the function below uses degree 3 by default. Since the data in this case is close to stationary, the fitted trend is almost flat.
def get_trend(timeseries, deg=3):
    """Create a trend line for time-series data

    Args:
        timeseries(pd.Series) : time-series data
        deg(int) : degree of the fitted polynomial

    Returns:
        pd.Series: trend line
    """
    # Use the integer position of each observation as the x-axis
    x = list(range(len(timeseries)))
    y = timeseries.values
    coef = np.polyfit(x, y, deg)
    trend = np.poly1d(coef)(x)
    return pd.Series(data=trend, index=timeseries.index)
data["Trend"] = get_trend(data["Temp"])
# Plot the graph
plt.figure(figsize=(12, 6))
sns.lineplot(x=data["Date"], y=data["Temp"], alpha=0.5, label="Temp")
sns.lineplot(x=data["Date"], y=data["Trend"], label="Trend")
plt.grid(axis="x")
plt.legend()
plt.show()
The smooth line drawn over the Temp series is the trend line.
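To see what the polynomial fit recovers, it helps to run it on a series whose trend is known. The sketch below is an illustration under stated assumptions: the data is synthetic (a linear trend of 0.002 per day plus a 365-day seasonal term), and a degree-1 fit is used instead of the degree-3 default of get_trend.

```python
import numpy as np

# Synthetic series: known linear trend plus a seasonal term (illustrative assumption)
x = np.arange(3650)
y = 0.002 * x + 10 + 4 * np.sin(2 * np.pi * x / 365)

# np.polyfit returns coefficients ordered from the highest degree to the lowest
coef = np.polyfit(x, y, deg=1)
trend = np.poly1d(coef)(x)

slope, intercept = coef
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
```

The fitted slope comes out close to, but not exactly, the true 0.002, because the seasonal term is not perfectly uncorrelated with time over a finite window.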
data_sub = data.copy()
data_sub["Temp"] = (
    data_sub["Temp"] + np.log(data_sub["Date"].dt.year - 1980) * 10
)  # Add a dummy logarithmic trend to the copy
data_sub["Trend"] = get_trend(data_sub["Temp"])
plt.figure(figsize=(12, 6))
sns.lineplot(x=data_sub["Date"], y=data_sub["Temp"], alpha=0.5, label="Temp")
sns.lineplot(x=data_sub["Date"], y=data_sub["Trend"], label="Trend")
plt.grid(axis="x")
plt.legend()
plt.show()