Time Series Analysis of Solar Radiation
Solar Panel Performance (Part 2)¶
Introduction¶
In a previous blog post (https://coolum001.github.io/solar.html), I looked at the long term efficiency of my solar panels (over almost ten year of operation). Using Time Series Analysis libraries, I concluded there was a very small long-term decline in efficiency, but ended by saying that maybe climate change had played a role.
Then I realized that (courtesy of the Queensland Government) it is possible to get :
Solar radiation - total incoming downward shortwave radiation on a horizontal surface (MJ/m2)
from https://www.longpaddock.qld.gov.au/silo/point-data/#responseTab2.
I learnt a few things about Pandas wrangling in the process, mainly in the area of lining up time-series to compare them.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.seasonal import STL
from statsmodels.formula.api import ols
import pathlib
%load_ext watermark
%load_ext lab_black
Solar radiation data load and explorationa¶
I specified logpaddock
grid point nearest to me (a distance of under 1K shouldn't make much difference to solar radiation), and downloaded a CSV file.
data_dir = pathlib.Path('../data/')
file1 = 'LongPaddockSolar.csv'
data_path = data_dir / file1
Read the CSV file into pandas.
solar = pd.read_csv(
data_path,
header=0,
usecols=[2, 4],
parse_dates={'PDate': [0]},
dayfirst=True,
)
Explore the data, and the data types
solar.head()
solar.dtypes
solar.columns
We note that one of the column names has a pesky trailing space, so get rid of it.
solar.rename(
columns={'Solar ': 'Solar'},
inplace=True,
)
solar.columns
Visualization¶
Do a quick and easy pandas plot. As with my energy generated graphics, there are very low dips due to cloudy days.
solar.plot(
'PDate',
'Solar',
linewidth=0.1,
)
Check the date range of this dataset.
solar.head(1), solar.tail(1)
Analysis¶
First up, we process the dataset to find any seasonal and trend components. Note that I have not created a date-time index for the data series being processed, so we have to specify we want a seaonal analysis based on years.
tsa_df = seasonal_decompose(
solar['Solar'],
model='add',
period=365,
)
_ = tsa_df.plot()
As we might expect, the presence of major dips in the curve make Time Series Analysis doubtful. The Residual 'errors' are large, and exhibit a clear periodic effect, maybe indiating that the analysis did not capture all the seasonal affect.
There is another time series analysis module STL
:
decompose a time series into three components: trend, season(al) and residual. STL uses LOESS (locally estimated scatterplot smoothing) to extract smooths estimates of the three components.
Ths gives us essentially the same result, but with a smoothed trend.
res = STL(
solar.drop(columns='PDate')['Solar'],
period=365,
).fit()
res.plot()
plt.show()
Plotting the residuals for the two approaches doesn't reveal any amjor improvement in one over the other.
tsa_df.resid.plot(
kind='line',
linewidth=0,
marker='o',
alpha=0.2,
)
plt.axhline(1.0, color='black')
res.resid.plot(
kind='line',
linewidth=0,
marker='o',
alpha=0.2,
)
plt.axhline(
0.0,
color='black',
)
Solar panels data load and exploration¶
So let's go back to our original solar energy generation from my panels. Last time, I went to some trouble to filter out the dips caused by clouds, but here I want to compare raw solar radiation data with my energy generation data.
def read_year_file(year: int):
'''
read_year_file: read the solar panel data in CSV format for the given year
We construct a file name from the supplied year value and read the CSV data,
skipping over row 0 (first row) as a header,
and only using the first two columns, that first of which contains a date.
The day number comes first (i.e. DD/MM/YYY, as opposed to MM/DD/YYYY)
Parameters:
year: int - the notional year of the data set to be loaded
note - each 365 day data set spans two years, due the the way data download was
done from the PVOutput website
Returns:
pandas dataframe
Limitations:
Does not check for file actually existing
Does not check for successful pandas CSV read
'''
data_dir = pathlib.Path('../data/SolarDaily')
file1 = 'PVOutput-COOLUMSUNSPOT-' + str(year) + '.csv'
data_path = data_dir / file1
df = pd.read_csv(
data_path,
header=0,
usecols=[0, 1],
parse_dates={'Date': [0]},
dayfirst=True,
)
return df
# end read_year_file
Data load and visualization.
year_dfs = [
read_year_file(year) for year in range(2014, 2024, 1)
]
panels = pd.concat(year_dfs)
panels.sort_values(
'Date',
inplace=True,
)
panels.plot(
'Date',
'GENERATED',
linewidth=0.1,
)
Let's get the date range covered by my panels dataset.
panels.head(1), panels.tail(1)
panels.dtypes
panels.sort_values(by='Date', inplace=True)
panels = panels.reset_index().drop(columns=['index'])
panels.head()
panels.tail()
Now get the date range covered by my solar panel data.
start_date = panels['Date'].min()
start_date
end_date = panels['Date'].max()
end_date
Now we create an integer index for the Solar Radiation dataframe, sorted by date.
solar.sort_values(
by='PDate',
inplace=True,
)
solar.reset_index().drop(columns=['index'])
Now get the Solar Radiation row that matches the date of my last Panel dataframe
solar[solar['PDate'] == pd.Timestamp('2023-12-08 00:00:00')]
Now get the row from the Panels daraframe that matches the first Solar Radiation dataframe
panels[
panels['Date'] == pd.Timestamp('2015-01-01 00:00:00')
]
Check that the two truncated dataframe align.
panels.iloc[
234:,
].head(1)
panels.iloc[
234:,
].tail(1)
solar.iloc[
:3264,
].tail(1)
solar.iloc[0:].head(1)
Check that the shapes are the same
solar.iloc[0:3264].shape
panels.iloc[
234:,
].shape
Visualize two datasets¶
We plot the Solar Radiation for each day, against the Panel energy generation.
plt.plot(
solar.iloc[
0:3264,
]['Solar'],
panels.iloc[
234:,
]['GENERATED'],
marker='+',
linewidth=0,
alpha=0.2,
)
Run a linear regression. Note that if we don't explicity turn each Series into a List, pandas will try to align the index values, which is not what we want.
sun = list(
solar.iloc[
0:3264,
]['Solar']
)
generated = list(
panels.iloc[
234:,
]['GENERATED']
)
df_dict = {'Solar': sun, 'Panel': generated}
df = pd.DataFrame(df_dict)
Perform Ordinary Least Squares
res3 = ols(
'Panel ~ Solar',
data=df,
).fit()
res3.summary()
Plot the result.
plt.plot(
df['Solar'],
df['Panel'],
marker='+',
linewidth=0,
alpha=0.2,
)
plt.plot(
df['Solar'],
res3.predict(),
'r-',
)
The graphics (and R^2 ) values show there is fairly good linear relationship between my panels energy generation and solar radiation. The relationship is not perfectly linear maybe due to tempereature effects (panels are more efficent at cold temperatures), rain cleaning the panels, etc).
Solar radiation trend versus energy generation trends¶
So first we set up a date-time index on the Panel dataframe
panels = panels.set_index(pd.DatetimeIndex(panels['Date']))
The we get the smoothed trend line for Panel energy generation
res2 = STL(
panels.drop(columns='Date'),
period=365,
).fit()
res2.plot()
plt.show()
Now we plot the trend from solar radiation, and panel enegry generation, using left and right Y axis on the same graph. We first
set a date-time index on the Solar radiation dataframe, so STL
will associate a date-time with the trend values.
solar = solar.set_index(pd.DatetimeIndex(solar['PDate']))
res = STL(
solar.drop(columns='PDate')['Solar'],
period=365,
).fit()
Do the plotting
fig, ax1 = plt.subplots()
ax1.plot(
res.trend.index,
res.trend.values,
label='Solar radiation trend',
)
ax2 = ax1.twinx()
ax2.plot(
res2.trend.index,
res2.trend.values,
label='Solar generation trend',
color='green',
)
ax1.grid(
visible=True,
which='both',
axis='both',
)
ax1.set_ylabel('Solar radiation (MJ/m^2)')
ax2.set_ylabel('Energy generated (Wh)')
fig.legend()
Conclusions¶
I am actually surprised that the trend lines for these two datasets match as well as they do, especially since 2021.
I am sceptical about the trend lines produced by the TSA
call, as they are missing error bars, that might give some idea as the true confidence interval for the lines.
Note that both dataframe are heavily 'polluted' by cloudy days that generate low values essentially at random. Whether the seasonal decomposition algorithms can handle this is beyond my expertise.
Still it is food for thought: why is solar radiation trending up? Maybe my panel performance changes are to with solar radiation changes?
Reproducibility¶
%watermark
%watermark -co
%watermark -iv