Data analysis: Home Assistant and Jupyter Notebook

How to analyze energy data usage with Home Assistant

Energy data analysis and other examples with Home Assistant.

Today we are going to draw conclusions from the data stored by our Home Assistant system. We will see in detail how to analyze our electricity consumption, and we will enter the realm of visualization with Seaborn.

Today we will touch on several topics at a glance. On the one hand, we will see that this ever-growing database is a very interesting source of information that can be used to create new automations, or even to save money with our Home Assistant.

We will start with the basics: a brief look at Google Colab, the tool we will rely on, which gives us all the power of Python and its packages (including Seaborn for visualization) without having to install anything on our computer, connecting to our database and looking at some temperatures as an example.

Then we will see something more complex: analyzing the volume of a TV and arranging the data according to the hour of day.

Finally, we will look at the energy data from our electricity consumption and, step by step, transform those values into something meaningful to graph and draw conclusions from this energy consumption analysis.

Here are some links to the sensors and devices that we will see and work with:

The first of them is the well-known Shelly EM, the best-selling consumption meter in home automation thanks to its features. It includes a 50A clamp, more than enough for most homes. It is a high-quality device at an affordable price:

Another sensor we will use is the Xiaomi temperature and humidity meter, one of the cheapest sensors on the market. It uses Bluetooth Low Energy (BLE), so the battery lasts for years. Here you have a pack:

We will use as an example a Samsung Smart TV connected to our network, so in addition to being able to control it, we will be able to obtain the data of the selected source and volume level:

We will honor the saying «Torture the data until they confess».

HOME ASSISTANT DATABASE

We will use the database option that comes by default (sqlite) but it is very easy to adapt the analysis to other databases, since the queries that we will perform will remain the same and you will only have to change the connection to the database.

We use the File Editor addon, navigating to the config folder and downloading the file «home-assistant_v2.db»:

Download sqlite database Home Assistant

Depending on its size, the download may take a while, but eventually we will have this file on our computer.

Using Google Colab

Today there are many very good tools for data analysis, both in Python and R. The choice will depend on many factors (prior knowledge, the aim of the analysis, etc.). After all, a simple spreadsheet can also help us on many occasions.

My recommendation is to learn how to use widely established and well-known tools, based on Python with well-documented libraries and plenty of examples (Pandas, NumPy, …). If you are not familiar with all this jargon, don't worry: the steps are very simple and you do not need to be an expert to start obtaining results, as we will see.

The tool that we will use, Google Colab, has many advantages and will allow us to use Jupyter notebooks online from the browser, without having to install anything on our computer and with all the power of Python in our hands. We only need a Google account.

Here’s how to start, but I’ll guide you through our example step by step.

Jupyter Notebook is a graphical interface where we can mix text and code in the same environment and execute it. It is an ideal tool to show how to analyze any data (in this case, extracted from our Home Assistant system).

Once we have access to Google Colab, you will see that it creates a folder in Google Drive to store the notebooks:

Google Colab

Let’s create a new notebook for our analysis. When opening it, on the left menu, we will upload the file of our database (home-assistant_v2.db):

Google Colab Home Assistant

There is also the option of uploading the database file to Google Drive beforehand and mapping it. This second option has the advantage of not losing the file between sessions, so it won't be necessary to upload it every time we want to work with it. It is the option we have chosen in these examples.

Setting up libraries and connection

The first thing to do is to import the necessary libraries:

# necessary packages
import pandas as pd
import seaborn as sns
import sqlite3

In each cell of code, we will have the option to execute it:

Pandas Seaborn Home Assistant

At any time, if you see a problem, we can always reset the session and execute all the cells from the Runtime menu:

Jupyter Notebook Home Assistant

At the end of this post I’ll leave you a link with the complete jupyter notebook that we are going to use, in case you need to adapt it.

The connection with our database is very simple using the sqlite3 library. In our case we have previously uploaded it to Google Drive, so we must specify the full path:

# We create the connection with the Home Assistant database
# Change to your own path
con=sqlite3.connect('/content/drive/MyDrive/Colab Notebooks/Explore HA DB sqlite/home-assistant_v2.db') 
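Before querying, it can be handy to confirm the connection actually works. This is a small sketch with a hypothetical helper (not part of the original notebook) that lists the tables of any SQLite database; with the Home Assistant file we expect to see the states table, among others:

```python
import sqlite3
import pandas as pd

# Hypothetical helper: list the tables of any SQLite database,
# useful as a sanity check before running real queries
def list_tables(con: sqlite3.Connection) -> list:
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", con)
    return names['name'].tolist()

# With the Home Assistant connection created above:
# list_tables(con)  # expect to see 'states' in the result
```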

PLAYING WITH TEMPERATURES

Once we have a connection to our database, let’s see a very basic example on how to access the temperature sensor records. That will give us an idea about the queries that we can use.

Temperature sensors. Raw data

In a single command we pass, as a text string, the SQL query that we want to execute. In this case we want to show all the records between two dates from sensors whose name contains «ble_temperature»:

# Reading data to a dataframe using a SQL query against the Home Assistant database
# Filtering for a specific week and sensors that contain "ble_temperature"
temperature_df=pd.read_sql_query('SELECT last_changed,entity_id,state FROM states WHERE last_changed BETWEEN "2021-06-01" AND "2021-06-07" and entity_id LIKE "%ble_temperature%"',con)

Note how we combined single and double quotes.
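If the quote juggling becomes error-prone, sqlite3 placeholders (?) let us pass the values separately from the SQL text. A sketch on an in-memory database with the same columns (the entity name and dates are made up); with the real file you would reuse the con created earlier:

```python
import sqlite3
import pandas as pd

# Demo database with the same columns as the Home Assistant 'states' table
con_demo = sqlite3.connect(':memory:')
con_demo.execute("CREATE TABLE states (last_changed TEXT, entity_id TEXT, state TEXT)")
con_demo.executemany("INSERT INTO states VALUES (?,?,?)", [
    ("2021-06-02 10:00:00", "sensor.salon_ble_temperature", "21.5"),
    ("2021-06-10 10:00:00", "sensor.salon_ble_temperature", "23.0"),
])

# The ? placeholders keep the SQL string free of nested quotes
query = ("SELECT last_changed, entity_id, state FROM states "
         "WHERE last_changed BETWEEN ? AND ? AND entity_id LIKE ?")
temperature_df = pd.read_sql_query(
    query, con_demo, params=("2021-06-01", "2021-06-07", "%ble_temperature%"))
print(temperature_df)  # only the June 2nd row matches the filter
```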

Visualizing records count

We can see the first records of the dataframe we just created using the head() function:

# We see the first rows of the dataframe
temperature_df.head()

We can now see in the dataframe the date/time, the name of the sensor in each row and, of course, the temperature value.

And finally we can render a nice graphic with Seaborn. In this case, a category chart:

# Our first graph, counting the number of observations from each sensor
g1=sns.catplot(data=temperature_df,x='entity_id', kind="count", aspect=3)
g1.set_xticklabels(rotation=90)
Temperature Home Assistant

As you can see, with very few lines of code we had access to all the data that our Home Assistant system has recorded (who said those records were not useful?).

The game starts…

PLAYING WITH OUR TV VOLUME

Our smart TV is a very common integration in today’s home automation systems, either to automatically turn it on or off based on presence, or to display messages.

We are going to analyze the volume intensity that we set on our TV, depending on the selected source and come to some conclusions.

TV volume. Raw data

As always, we start by making a query to our sqlite database with the connection we have already established in a previous step. In our specific case, we directly look for the Home Assistant entity called «media_player.tv_samsung»:

# We create the dataframe with the data of the specific sensor
# In this case we are interested in the date / time, and the attributes
samsung_df=pd.read_sql_query('SELECT last_changed,entity_id,attributes FROM states WHERE entity_id="media_player.tv_samsung"',con)

The «last_changed» and «attributes» fields will be necessary to know the date / time as well as all the data related to this entity.

Let’s see an instance of the attributes content, including the volume and source values; we will see how to extract them in the next paragraphs:

{"source_list": ["TV", "HDMI", "Prime Video", "Netflix", "Universal Guide", "Rakuten TV", "YouTube", "e-Manual", "Internet", "Gallery", "Apple TV", "PrivacyChoices", "Samsung Promotion", "DAZN", "Apple Music", "Calm", "alacarta RTVE", "Disney+"], "volume_level": 0.11, "is_volume_muted": false, "media_content_type": "video", "media_title": "TV/HDMI", "app_id": "TV/HDMI", "source": "TV/HDMI", "ip_address": "192.168.X.X", "device_model": "UE43RU7105KXXC", "device_name": "[TV] Samsung 7 Series (43)", "friendly_name": "TV Samsung", "icon": "mdi:television", "supported_features": 24509, "device_class": "tv"}

Getting the source and timestamp

We now want to extract the volume value buried in the attributes field. For this, we will use the split function. We will look for a delimiter in the entire field to divide it. Usually the delimiter is a single character (a comma, a tab, a carriage return, …).

The trick here is to use a whole string as the first delimiter: «"volume_level":» (quotes included). This divides the field into 2 columns, with the volume value at the start of the second part.

The next step is to use split again with the comma (,) as the second delimiter, keeping only the float numeric data:

# We split into 2 columns using the text '"volume_level":' as a delimiter, so the numerical data remains in the second column [1].
# We split again, now using the comma ',' as a delimiter, and this time we keep the first generated column [0].
# Finally we convert to float type.
# Now our dataframe contains the volume levels

vol=samsung_df['attributes'].str.split('"volume_level":', expand=True)
vol=vol[1].str.split(',', expand=True)
vol=vol[0].astype(float)

vol.describe()

Using the describe() function on a dataframe with numeric data, we get a brief summary: the quartiles, minimum and maximum, as well as the mean and standard deviation:

Note: If we assume a normal distribution (I have not checked it), it would be defined as N(mean = 0.090967, std = 0.065467).
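As a tiny sketch of what describe() reports, here it is on some made-up volume values (these numbers are illustrative, not from the TV data):

```python
import pandas as pd

# Made-up volume values just to illustrate describe()
vol_demo = pd.Series([0.05, 0.10, 0.15, 0.20])
stats = vol_demo.describe()

# describe() returns count, mean, std, min, quartiles and max
print(stats["mean"], stats["std"])  # mean 0.125, sample std ~0.0645
```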

We follow the same procedure to extract the selected source from our smart TV data:

source=samsung_df['attributes'].str.split('"source":', expand=True)
source=source[1].str.split(',', expand=True)
source=source[0]
source

With the value_counts() function, we see how many records we have for each source. This already gives us a quick first idea of the most used source:

Now we have to extract the time data. For now, we focus only on the hour, discarding the day, year, minute and second:

# We convert the date / time from string; for now we are only interested in the hour
hour=pd.to_datetime(samsung_df['last_changed']).dt.hour

And we join all the data that we have separately (the hour, the volume and the source) into a single dataframe:

test=pd.concat([hour,vol, source],axis=1)
test.columns = ['hour','volume','input']
test
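Since the attributes column holds JSON, an alternative to the split() trick is to parse it with json.loads, which handles quoting and nesting for us. A minimal sketch on two demo rows shaped like the attributes strings above (the values are made up), assuming every row contains well-formed JSON:

```python
import json
import pandas as pd

# Demo rows shaped like the 'attributes' strings (truncated to two keys)
attrs = pd.Series([
    '{"volume_level": 0.11, "source": "TV/HDMI"}',
    '{"volume_level": 0.25, "source": "Netflix"}',
])

# json.loads turns each string into a dict; no delimiter juggling needed
parsed = attrs.apply(json.loads)
vol_json = parsed.apply(lambda d: d.get("volume_level")).astype(float)
source_json = parsed.apply(lambda d: d.get("source"))
print(vol_json.tolist(), source_json.tolist())
```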

Visualizing the TV usage

Now that we have our complete dataframe, we are going to go ahead with the representations using Seaborn.

Volume vs source as a start:

# Volume vs input
g2=sns.catplot(x='input', y='volume', data=test, aspect=3)
g2.set_xticklabels(rotation=90)
Smart TV volume Home Assistant

Also interesting volume vs hour:

# Volume vs hour
g3=sns.catplot(x='hour', y='volume', data=test, aspect=3)
g3.set_xticklabels(rotation=90)
Smart TV volume Home Assistant

And finally hour vs source:

# hour vs input
g4=sns.catplot(x='input', y='hour', data=test, aspect=3)
g4.set_xticklabels(rotation=90)
Smart TV volume Home Assistant

We can even add a legend to distinguish the third dimension in any of those 2-dimensional graphs. For example, in the «volume vs hour» chart, we can color according to the selected TV source:

# input categories
g5=sns.relplot(x="hour", y="volume", hue="input", data=test, aspect=3)
Smart TV volume Home Assistant

Another possibility, which I'm sure you'll use more than once, is filtering.

Let’s first see which are the most used sources:

# We are going to select those sources with at least 10 data, applying a lambda filter
test2=test.groupby('input').filter(lambda x: len(x) >= 10)

# We see the new input category count
test2['input'].value_counts()

It's crystal clear: we basically see TV (antenna in this case, not HDMI), Netflix and Prime.

Let's then visualize only those 3 main sources:

# We graph with the 3 main sources
g6=sns.relplot(x="hour", y="volume", hue="input", data=test2, aspect=3)
Smart TV volume Home Assistant

Last but not least, in the next section we are going to interpret those results and visualizations, and see how we can use them in our Home Assistant system.

TV use analysis and conclusions

We saw in the previous section how to easily obtain graphs of the usage data from our smart TV, and how to filter them.

Obviously these techniques can be used to analyze any pattern or behavior that our home automation system is able to store.

No analysis is complete without drawing some conclusions and finding a practical use. Let's not forget that our data is ours alone and it is worth a lot. We just have to take advantage of it.

In this specific case, looking at how this appliance is used, we clearly see that:

  • The main use is focused on 3 sources: TV, Netflix and Prime
  • Volume is set louder when antenna is selected than when Netflix is selected
  • The highest volume hours are around 12:00 and around 17:00, which makes sense.

These data, together with the numerical values, can be used to automatically set the volume based on the hour of day and the selected source with a high degree of accuracy, for example using the average volume from the historical data.
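That last idea can be sketched in a few lines: group by source and hour and take the mean volume, producing a lookup table an automation could read. The sample values below are hypothetical, standing in for the test dataframe built earlier:

```python
import pandas as pd

# Hypothetical sample with the same columns as the 'test' dataframe above
usage = pd.DataFrame({
    "hour":   [12, 12, 17, 17, 21],
    "volume": [0.15, 0.17, 0.20, 0.22, 0.10],
    "input":  ["TV", "TV", "TV", "Netflix", "Netflix"],
})

# Average volume per (input, hour): the value an automation would apply
suggested = usage.groupby(["input", "hour"])["volume"].mean()
print(suggested.loc[("TV", 12)])  # mean of 0.15 and 0.17, ~0.16
```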

ANALYZING THE ELECTRIC CONSUMPTION

We now arrive at the aim of this guide. One of the uses of data analysis is becoming aware of our own behavior, and few sensors give more information about it than a consumption meter like the Shelly EM.

Electric energy data. Raw data

As in previous examples, we start by querying our database for the raw data, while eliminating non-numeric values just in case:

# We query our Home Assistant database using SQL
cons_df=pd.read_sql_query('SELECT last_changed,entity_id,state FROM states WHERE entity_id="sensor.shelly_shem_5e1a16_2_total_consumption"',con)

# We filter out possible non-numeric values in the state field
cons_df = cons_df[pd.to_numeric(cons_df['state'],errors='coerce').notna()]

cons_df.head()

This is what the data extracted from the table looks like:

Energy consumption analysis

As we can see, in each reading the meter increases the value (state) measured in kWh.

Setting up our dataframe

For now, to analyze the behavior, we focus on the following data: the date, the day of the week (calculated from the date), the hour and, of course, the value of the Shelly meter:

# We are only interested in:
# The date
# The day of the week (weekday -> 0 = Monday .. 6 = Sunday)
# The time
# The value of the energy meter

date_cons = pd.to_datetime(cons_df['last_changed']).dt.strftime('%Y.%m.%d')
weekday_cons = pd.to_datetime(cons_df['last_changed']).dt.weekday
hour_cons = pd.to_datetime(cons_df['last_changed']).dt.hour
counter_cons = cons_df['state'].astype(float)
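To check what those three time columns look like, here is one sample timestamp (a made-up reading time) parsed with pd.to_datetime:

```python
import pandas as pd

# One sample timestamp; 2021-06-07 was a Monday, so weekday should be 0
ts = pd.Series(["2021-06-07 09:30:00"])
dt = pd.to_datetime(ts)

# date as 'YYYY.MM.DD', weekday as 0..6, hour as int
print(dt.dt.strftime("%Y.%m.%d")[0], dt.dt.weekday[0], dt.dt.hour[0])
```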

Now we have to join these individual dataframes to form a single one:

# We join everything to form a new dataframe
test1=pd.concat([date_cons,weekday_cons,hour_cons, counter_cons],axis=1)
test1.columns = ['date','weekday','hour','counter']

test1.head()

This being the result, showing the first 5 rows:

Total energy calculation

As we can see, the kWh counter is an ever-increasing value, so we need to calculate the energy in each hour by subtracting the initial value from the final one.

The calculation of the energy consumed in each hour requires grouping and choosing the minimum and maximum values of the counter within each specific hour (these are called aggregate functions because they return a single value based on several):

# We group by date, day of the week and hour, and take the maximum value (final counter reading) and the minimum (initial counter reading in that hour)
# We create a pivot table in this way

test2 = test1.groupby(['date','weekday','hour'])['counter'].agg(cnt_ini="min", cnt_end="max")

test2.head()

Let’s see the outcome:

By grouping, we have created what is known as a pivot table, where the indexes of our table are now labels.

The energy consumption in each hour will be the maximum value of the counter in that hour minus the minimum value. Once the energy has been calculated in each hour, we may get rid of the minimum and maximum columns since we no longer need them:

# We add the energy column as the difference between the final value and the initial value of the counter
test2['energy']= test2['cnt_end'] - test2['cnt_ini']

# Once the energy has been calculated, we no longer need the specific init and end values
test2.drop(['cnt_ini', 'cnt_end'], axis=1, inplace=True)

test2.head()

Now we are getting closer to what we need:
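The min/max-then-subtract step can be sketched on a tiny synthetic counter (the readings below are invented, but increase just like the real Shelly counter):

```python
import pandas as pd

# Synthetic, always-increasing kWh counter: two readings per hour
df_cnt = pd.DataFrame({
    "date":    ["2021.06.01"] * 4,
    "weekday": [1] * 4,
    "hour":    [10, 10, 11, 11],
    "counter": [100.0, 100.4, 100.4, 101.1],
})

# Per-hour energy = max(counter) - min(counter) within that hour
g = df_cnt.groupby(["date", "weekday", "hour"])["counter"].agg(cnt_ini="min", cnt_end="max")
g["energy"] = g["cnt_end"] - g["cnt_ini"]
print(g["energy"].round(2).tolist())  # [0.4, 0.7]
```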

We still have to carry out one more operation to classify by day of the week and hour. We will group again on those two fields and aggregate the energy data for each hour (using the mean to make it more representative):

# For each day of the week and hour, we obtain the corresponding average consumption
test3=test2.groupby(['weekday','hour'])['energy'].agg(cons="mean")

test3.head()

Well, this now seems almost ready to represent… let's go.

Visualizing electric consumption

To perform the energy consumption analysis, the first thing we will do is use the unstack function to turn the pivot table back into a regular dataframe, and reset the indexes so that they are more representative (days of the week and hours):

# Finally we convert back to a normal dataframe from the pivot table

test3=test3.unstack(level=0)


# We make it friendlier by renaming the indexes and columns
test4=test3.reset_index(drop=True)
test4.columns = ['mon','tue','wed','thu','fri','sat','sun']
test4.index=['00','01','02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23']

test4.head()

And this is the final dataframe, ready to use directly in the renderings with Seaborn:
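The unstack step deserves a tiny sketch of its own: a series indexed by (weekday, hour) becomes an hours-by-weekdays table, which is exactly the shape a heatmap wants. The values below are made up:

```python
import pandas as pd

# A (weekday, hour) indexed series, like the grouped consumption above
s = pd.Series(
    [0.2, 0.3, 0.5, 0.4],
    index=pd.MultiIndex.from_tuples(
        [(0, 10), (0, 11), (1, 10), (1, 11)], names=["weekday", "hour"]),
    name="cons",
)

# unstack(level=0) moves the weekday level into the columns
wide = s.unstack(level=0)
print(wide.shape)  # (2, 2): 2 hours x 2 weekdays
```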

Let’s draw it:

# We already have everything we need to represent the map of our consumption

# We set a good size
sns.set(rc={'figure.figsize':(12,12)})

# We represent the map together with the average consumption values,
# using our favorite color palette (magma) and hiding the color bar
g7=sns.heatmap(test4, annot=True, fmt="0.2f", cbar=False, cmap="magma")

We have used a heatmap chart, labeling each cell with the kWh value (2 decimal places):

Energy consumption Home Assistant

Electric consumption analysis

We have ended up with a very cool chart representing the average energy consumption over a couple of weeks. The range of colors goes from darker (low consumption) to lighter (high consumption).

  • It is clear that during the week our consumption runs from 10 in the morning until approximately 23:00, with the exception of Friday afternoon, when we do not consume as much.
  • In the mornings, before 10, the highest consumption is due to breakfasts (between 5 and 6 AM)

So far, so good. Each analysis you make will be very specific and the conclusions will vary, but the tools used are very powerful, and data visualization is essential for reaching those conclusions.

Best of all, once you have written your Jupyter notebook, you just have to reload your database file and run it again. In a matter of seconds a new energy consumption analysis graph will appear with the updated data.

I leave you a link to the Explore HA DB_EN.ipynb file and the devices that we have used this time:
