Seven lessons learned from working with IoT data

Here are seven lessons I learned after working for a couple of weeks with solar energy production data and weather data, building a statistical model to correlate the two. During this work I used the following three data sets:

- The energy production data of the solar panels (data every 5 minutes of the day)

- The weather observations historical data (hourly data)

- The sunrise and sunset times (daily data)

I worked in four steps:

- In a first step, I reduced the energy production data to hourly data, so the measurements can be aligned with the weather observation data.

- In a second step, I merged the data by hour, so that there is a single record for every hour of the day containing the energy production details, the weather observation details, and the sunrise and sunset hours.

- In a third step, I found that the data needed to be improved and enriched.

- In a fourth step, I built several models to determine the correlation between the energy production (kWh) and the rest of the data.
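
The first two steps can be sketched in a few lines of Python. This is only an illustration under assumptions, not the original processing script: the function names, the `(timestamp, kwh)` tuple layout and the dictionary of weather records keyed by hour are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def to_hourly(readings):
    """Sum 5-minute (timestamp, kwh) readings into one total per hour.

    Assumption: each reading holds the energy produced in its interval.
    """
    totals = defaultdict(float)
    for ts, kwh in readings:
        # Truncate to the start of the hour so every reading lands in a bucket.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[hour] += kwh
    return dict(totals)


def merge_by_hour(hourly_kwh, weather_by_hour):
    """Build one record per hour that is present in both data sets."""
    merged = []
    for hour in sorted(hourly_kwh.keys() & weather_by_hour.keys()):
        record = {"hour": hour, "kwh": hourly_kwh[hour]}
        record.update(weather_by_hour[hour])  # add the weather fields
        merged.append(record)
    return merged
```

Hours that exist in only one of the two data sets are dropped here; whether that is the right choice depends on how gaps should be treated later on.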

It was a bumpy ride, but I learned a lot. Here is the list of seven lessons:
1. Bigger is not always better.
The first models were very promising: I reached a quite impressive correlation of 97.2%, which I unfortunately learned afterwards was very unrealistic. My naïve initial hope of easily improving this great result by using a larger data set turned out differently. See the table below.
Occurrences    Linear correlation
8,760          0.972
20,424         0.969
29,184         0.965
The more data I added, the worse it got. Furthermore, I noticed that the mean absolute error and the standard deviation were very large. I had to tackle both issues by improving the data quality and by looking into models with a better fit.
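
A high correlation and a large mean absolute error can coexist, because correlation only measures how well two series move together, not how close they are. The small Python sketch below shows both measures side by side; the function names are illustrative and not taken from any tool used in this project.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

For example, a prediction that is always twice the actual value plus a constant is perfectly correlated with the actual values, yet its mean absolute error can be arbitrarily large.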
2. Improving data quality takes time
I inspected the individual data records by sorting the values of the different fields. For the energy production data, I found several values that were out of range, such as values higher than technically possible or unexpected zeros. I concluded that some records contained bad values, probably for the following reasons:

- After an outage of the data-collecting device, the first reading after recovery is too high: that reading from the grid converters is the cumulated value of all missed readings. I cleaned the data by smoothing out those values.

- During an outage the values were zero, and those zero values cannot be correlated correctly with the weather data. To solve this, I skipped those records.

- The first step I described aggregates the 5-minute data to hourly data. Occasionally the data records were not measured exactly on the 5-minute mark, e.g. at 12:01 instead of 12:00. I don't know the exact reason, probably some delay in the writing of the inverter data, but I had to change the aggregation so that it did not skip those records.

Finding those anomalies is one thing; fixing them was sometimes far more difficult!
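
One possible way to handle the first two anomalies is sketched below in Python. It rests on assumptions that are not part of the original clean-up: `max_kwh` is a hypothetical technical limit per 5-minute reading, a zero between the hypothetical `sunrise_hour` and `sunset_hour` is treated as an outage, and "smoothing" is taken to mean replacing a spike by the mean of its valid neighbours.

```python
from datetime import datetime


def clean_readings(readings, max_kwh, sunrise_hour=6, sunset_hour=20):
    """Return cleaned (timestamp, kwh) pairs from time-ordered readings."""
    # Step 1: skip daytime zeros, which are assumed to be outage records.
    kept = [
        (ts, kwh)
        for ts, kwh in readings
        if not (kwh == 0 and sunrise_hour <= ts.hour < sunset_hour)
    ]

    # Step 2: smooth readings above the technical limit.
    cleaned = []
    for i, (ts, kwh) in enumerate(kept):
        if kwh > max_kwh:
            # Previous and next kept readings, if they exist and are valid.
            neighbours = [
                value
                for _, value in kept[i - 1:i] + kept[i + 1:i + 2]
                if value <= max_kwh
            ]
            if not neighbours:
                continue  # nothing sensible to smooth with, so drop it
            kwh = sum(neighbours) / len(neighbours)
        cleaned.append((ts, kwh))
    return cleaned
```

The off-grid timestamps from the third bullet need no special treatment if the hourly aggregation truncates each timestamp to its hour instead of matching exact 5-minute marks.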

3. Tooling or programming
I used various tools to improve the dataset, but working with the full dataset became so slow that I decided to tackle the issues in a different way. I had all the data stored in a NoSQL database on a server in the cloud, so instead of using desktop tools I started to write JavaScript scripts to process the data. That saved me hours and hours of waiting! So, knowing how to write code or scripts helps!
4. Generate and derive additional data
At first I could not find a satisfying model. The weather influences the energy production, but so do daytime length, season, elevation of the sun, etc. To incorporate those into the model I had to add some data. I calculated two additional values for every entry: (1) the daytime length, derived from the sunrise and sunset of the day, and (2) the theoretical maximum energy production for every hour, depending on daytime length and solar elevation. This approach results in a far better model, because the negative impact of the weather (wind, rain, cloud coverage, etc.) can now be expressed as the difference between the theoretical amount of energy that could be produced and the real amount of energy produced.
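
The derived values can be illustrated with a rough Python sketch. The exact formulas used for this project are not given here, so everything below is an assumption: the declination uses a common textbook approximation, `solar_hour` is local solar time, `peak_kw` is a hypothetical panel rating, and the theoretical maximum is a crude clear-sky proxy that scales with the sine of the solar elevation only.

```python
import math
from datetime import datetime


def daytime_hours(sunrise, sunset):
    """Daytime length in hours from two datetimes on the same day."""
    return (sunset - sunrise).total_seconds() / 3600


def solar_elevation_deg(latitude_deg, day_of_year, solar_hour):
    """Approximate elevation of the sun above the horizon, in degrees."""
    # Approximate solar declination for the given day of the year.
    decl = math.radians(
        -23.44 * math.cos(math.radians(360 / 365 * (day_of_year + 10)))
    )
    # The sun moves 15 degrees per hour relative to solar noon.
    hour_angle = math.radians(15 * (solar_hour - 12))
    lat = math.radians(latitude_deg)
    sin_elev = (
        math.sin(lat) * math.sin(decl)
        + math.cos(lat) * math.cos(decl) * math.cos(hour_angle)
    )
    return math.degrees(math.asin(sin_elev))


def theoretical_max_kwh(peak_kw, elevation_deg):
    """Crude hourly upper bound: zero when the sun is below the horizon."""
    return peak_kw * max(0.0, math.sin(math.radians(elevation_deg)))


def weather_loss_kwh(theoretical_kwh, actual_kwh):
    """Production lost relative to the theoretical maximum."""
    return theoretical_kwh - actual_kwh
```

The loss value is what a model can then try to explain from the weather fields, instead of having to explain the raw production including its daily and seasonal cycle.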
5. Prepare data
I had lost sight of the fact that some parameters of the model have small values while others have much larger ones. To have a balanced calculation, I normalized each parameter between its min and max value. I had also overlooked that for some parameters, such as the wind direction, it makes no sense to calculate with the number of degrees; instead I converted the wind direction into categories (N, NE, E, SE, S, SW, W, and NW).
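
Both preparation steps are small enough to sketch in Python. The function names are illustrative, and the sector boundaries (45-degree sectors centred on each compass point) are an assumption about how the eight categories are cut.

```python
COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def wind_category(degrees):
    """Map a wind direction in degrees to one of eight compass categories."""
    # Shift by half a sector so that e.g. 350 and 10 degrees both map to N.
    index = int((degrees % 360 + 22.5) // 45) % 8
    return COMPASS[index]
```

Working in degrees directly would treat 359 and 1 as far apart even though both are northerly winds, which is the reason for switching to categories.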
6. Selection of statistical model
What model to use? Linear regression, neural net, time series, ...? It took me a while to figure out how to work with all these models. In the end, I went with the most logical choice, a time series analysis model, which worked out very well. Don't be fooled by tools that sometimes suggest different models.
7. Seek a second opinion
Four eyes see more than two, and two brains know more than one. By discussing my approach and the models used with a colleague, I could improve the prediction model. Seeking a second opinion keeps you from getting locked into a single way of looking at the problem.