So, my openhab system periodically decides to leave the building. It appears that from time to time, when the z-wave binding loses communication with the z-wave stick, it gets upset and tells openhab to take a hike.
This is bad. For one, it exposed something I missed in my fault tolerance. I had compensated for network issues and full machine failover. But the actual process going belly up…. ooops. My bad.
Soooo I see it crash while at the gym today and the only thing in my head….
So I appear to have done that.
Let me bring you up to speed on the current state of my home automation. After the great NAS failing of 2015 I was forced to reduce some of my virtual environment. I have not brought my secondary HA controller back online yet. However, it appears that even with just the one box, keepalived can still help address this random problem.
I have added a new option to my keepalived.conf:
vrrp_script chk_hahealth {
    script "/usr/local/sbin/healthcheck.sh"
    interval 10   # check every 10 seconds
    fall 2        # require 2 failures for KO
    rise 2        # require 2 successes for OK
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 220
    priority 150
    notify /usr/local/sbin/notify-keepalived.sh
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass fakepass
    }

    virtual_ipaddress {
        192.168.2.90
    }
    track_script {
        chk_hahealth
    }
}
So what this does is add a keepalived health check. Every 10 seconds keepalived runs the script /usr/local/sbin/healthcheck.sh and gets an exit code of 0 or 1: 0 if all is good, 1 if the world fell apart.
The code for this script is:
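#!/bin/sh
# keepalived health check: exit 0 if openhab is running, 1 otherwise.
SERVICE=openhab

# Look for the service in the process list, excluding the grep itself.
if ps ax | grep -v grep | grep "$SERVICE" > /dev/null
then
    echo "$SERVICE service running, everything is fine"
    /usr/bin/logger "$SERVICE service running, everything is fine"
    exit 0
else
    echo "$SERVICE is not running"
    /usr/bin/logger "$SERVICE is not running"
    # Try to bring the service back, but still report this check as failed.
    /etc/init.d/openhab restart
    exit 1
fi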
Explanation:
So this script just checks to see if the openhab process is running. If it's good, exit 0. If it's not, exit 1, but go ahead and try to restart openhab first. When keepalived gets the exit 1 code it keeps track of it. You will see in the config that there is a fall 2 line. That means that if there are 2 exit 1 statuses in a row, keepalived will go into a failed state. When the second HA box is back online, this will force openhab to move over to the other one. However, I have not seen this happen so far: openhab loads pretty quickly, and since there are 10 seconds between the checks, the second check comes back with an exit 0 and resets the fall count.
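If the fall/rise counting is hard to picture, here is a rough sketch of it as a shell function. To be clear, this is my own simplified model of the behaviour described above, not keepalived's actual code. The function name simulate_checks and its arguments are made up for illustration, and it assumes the check starts out in the OK state. You give it the fall and rise thresholds followed by a sequence of exit codes, and it prints the state it ends up in.

```shell
#!/bin/sh
# Simplified model of fall/rise counting (hypothetical helper, NOT keepalived code).
# Usage: simulate_checks FALL RISE exit_code [exit_code ...]
simulate_checks() {
    fall=$1
    rise=$2
    shift 2
    state=OK    # assumption: the check starts out healthy
    fails=0     # consecutive non-zero exit codes seen so far
    passes=0    # consecutive zero exit codes seen so far
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            # A success resets the failure streak.
            passes=$((passes + 1))
            fails=0
            if [ "$state" = KO ] && [ "$passes" -ge "$rise" ]; then
                state=OK
            fi
        else
            # A failure resets the success streak.
            fails=$((fails + 1))
            passes=0
            if [ "$state" = OK ] && [ "$fails" -ge "$fall" ]; then
                state=KO
            fi
        fi
    done
    echo "$state"
}
```

With fall 2 and rise 2, running simulate_checks 2 2 1 0 1 0 prints OK, which matches what I see in practice: a single failed check followed by a good one never trips the failed state. Running simulate_checks 2 2 1 1 prints KO, and it takes two good checks after that to get back to OK.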