So, my openHAB system periodically decides to leave the building. It appears that from time to time, when the Z-Wave binding loses communication with the Z-Wave stick, it gets upset and tells openHAB to take a hike.

This is bad, not least because it exposed something I missed in my fault tolerance. I had compensated for network issues and full machine failover. But the actual process going belly up... oops. My bad.

Soooo I see it crash while at the gym today and the only thing in my head….

[image: gSHIj]

So I appear to have done that.

Let me bring you up to speed on the current state of my home automation. After the great NAS failing of 2015 I was forced to scale back some of my virtual environment, and I have not brought my secondary HA controller back online yet. Even so, it appears that keepalived on the one remaining box can still help with this random problem.

I have added a new health check block to my keepalived.conf:

 

vrrp_script chk_hahealth {
    script "/usr/local/sbin/healthcheck.sh"
    interval 10   # check every 10 seconds
    fall 2        # require 2 failures for KO
    rise 2        # require 2 successes for OK
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 220
    priority 150
    notify /usr/local/sbin/notify-keepalived.sh
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass fakepass
    }

    virtual_ipaddress {
        192.168.2.90
    }
    track_script {
        chk_hahealth
    }
}

So what this does is add a keepalived health check. Every 10 seconds keepalived runs the script /usr/local/sbin/healthcheck.sh and looks at its exit code: 0 if all is good, 1 if the world fell apart.
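If you want to see a check the way keepalived sees it, you can run it by hand and look at the exit status. The helper below is only a sketch: run_check is a made-up name, and true and false stand in for a passing and a failing check so the snippet runs anywhere.

```sh
#!/bin/sh
# run_check is a hypothetical helper, not part of keepalived.
# It runs any command, hides its output, and reports the exit status,
# which is the only thing keepalived looks at for a tracked script.
run_check() {
    "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "healthy (exit 0)"
    else
        echo "failed (exit $status)"
    fi
    return "$status"
}

# Stand-ins for a passing and a failing check.
run_check true
# "|| true" keeps the shell going even if it runs with "set -e".
run_check false || true
```

On the real box the call would be run_check /usr/local/sbin/healthcheck.sh. Keep in mind that running the real script while openhab is down will also trigger the restart.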


The code for this script is:

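#!/bin/sh
# Health check for keepalived: exit 0 if openhab is running, exit 1 if not.
SERVICE=openhab

# Look for the service name in the process list.
# "grep -v grep" drops the grep process itself from the results.
if ps ax | grep -v grep | grep "$SERVICE" > /dev/null
then
    echo "$SERVICE service running, everything is fine"
    /usr/bin/logger "$SERVICE service running, everything is fine"
    exit 0
else
    echo "$SERVICE is not running"
    /usr/bin/logger "$SERVICE is not running"
    # Try to bring it back, but still report this check as failed.
    /etc/init.d/openhab restart
    exit 1
fi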
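
As a side note, the ps | grep chain can be swapped for pgrep, which never matches itself. This is only a sketch of that alternative, not what runs on my box: it assumes pgrep (from procps) is installed, and is_running is a made-up helper name.

```sh
#!/bin/sh
# Sketch: process check built on pgrep instead of ps | grep.
# Assumes pgrep (procps) is installed; is_running is a hypothetical helper.
is_running() {
    # -f matches the pattern against the full command line, which matters
    # when the service runs inside a generic process such as java.
    pgrep -f "$1" >/dev/null 2>&1
}

SERVICE=openhab
if is_running "$SERVICE"; then
    echo "$SERVICE service running, everything is fine"
else
    echo "$SERVICE is not running"
fi
```

The same caveat applies as with grep: -f matches any command line that contains the pattern, so something like an editor with an openhab config file open would also count as "running". A tighter pattern helps if that is a concern.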

Explanation:

So this script just checks to see if the openhab process is running. If it's good, exit 0. If it's not, exit 1, but go ahead and try to restart openhab first. When keepalived gets the exit 1 code it keeps track of it. You will see in the config that there is a fall 2 line. That means that after 2 consecutive exit 1 statuses keepalived will go into a failed state. When the second HA box is back online, this will force openhab to move over to the other one. However, I have not seen this happen so far: openhab loads pretty quickly, and with 10 seconds between checks the second check comes back with an exit 0 and resets the fall count.
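
If the fall and rise counters are hard to picture, here is a small sketch that models them in plain shell. It is a simplified model of the behaviour described above (consecutive failures trip KO, consecutive successes bring back OK), not keepalived's actual code, and simulate_track is a made-up name.

```sh
#!/bin/sh
# Simplified model of fall/rise bookkeeping (not keepalived's real code).
# Usage: simulate_track FALL RISE exitcode1 exitcode2 ...
simulate_track() {
    fall=$1
    rise=$2
    shift 2
    state=OK    # assumption: the check starts out healthy
    fails=0     # consecutive non-zero results
    passes=0    # consecutive zero results
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            passes=$((passes + 1))
            fails=0
            if [ "$state" = KO ] && [ "$passes" -ge "$rise" ]; then
                state=OK
            fi
        else
            fails=$((fails + 1))
            passes=0
            if [ "$state" = OK ] && [ "$fails" -ge "$fall" ]; then
                state=KO
            fi
        fi
        echo "exit=$code state=$state"
    done
}

# One failed check, then openhab is back: the state never leaves OK.
simulate_track 2 2 0 1 0 0

# Two failed checks in a row: KO until two checks pass again.
simulate_track 2 2 0 1 1 0 0
```

The first run is what I have been seeing in practice: a single exit 1, a restart, and the next check passes before the fall threshold is reached.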