OpenHab crashing with Z-Wave FIX IT FIX IT FIX IT FIX IT!!

So, my openhab system periodically decides to leave the building. From time to time the z-wave binding loses communication with the z-wave stick, gets upset, and tells openhab to take a hike.

This is bad. For one, it exposed something I missed in my fault tolerance. I had compensated for network issues and full machine failover. But the actual process going belly up... oops. My bad.

Soooo I saw it crash while at the gym today, and the only thing in my head was: FIX IT, FIX IT, FIX IT!

So I appear to have done that.

Let me bring you up to speed on the current state of my home automation. After the great NAS failing of 2015 I was forced to cut back some of my virtual environment, and I have not brought my secondary HA controller back online yet. However, it appears that by still using keepalived I am able to help address this random problem.

I have added a new option to my keepalived.conf:

 


vrrp_script chk_hahealth {
    script "/usr/local/sbin/healthcheck.sh"
    interval 10 # check every 10 seconds
    fall 2 # require 2 failures for KO
    rise 2 # require 2 successes for OK
}

vrrp_instance VI_1 {
   state MASTER
   interface eth0
   virtual_router_id 220
   priority 150
   notify /usr/local/sbin/notify-keepalived.sh
   advert_int 1
   authentication {
        auth_type PASS
        auth_pass fakepass
   }

   virtual_ipaddress {
      192.168.2.90
   }
   track_script {
     chk_hahealth
   }
}

So what this does is add a keepalived health check. Every 10 seconds keepalived runs the script /usr/local/sbin/healthcheck.sh and looks at its exit code: 0 if all is good, 1 if the world fell apart.
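To make the exit code idea concrete, here is a minimal sketch of how a check's exit status maps to OK or KO. The report_check helper is made up for illustration and is not part of keepalived; true and false just stand in for a passing and a failing health check.

```shell
#!/bin/sh
# Hypothetical helper (not part of keepalived): run a command and report
# how its exit status would be read. 0 means success, anything else failure.
report_check() {
    if "$@" > /dev/null 2>&1; then
        echo "OK"
    else
        echo "KO"
    fi
}

report_check true    # stands in for a healthy check
report_check false   # stands in for a failed check
```

You could point it at the real script the same way, for example report_check /usr/local/sbin/healthcheck.sh, to see what keepalived would conclude from a single run.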


The code for this script is:


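#!/bin/sh
# Health check for keepalived: exit 0 if openhab is running, 1 if it is not.
SERVICE=openhab

# Look for the service in the process list, ignoring the grep itself.
if ps ax | grep -v grep | grep -q "$SERVICE"
then
 echo "$SERVICE service running, everything is fine"
 /usr/bin/logger "$SERVICE service running, everything is fine"
 exit 0
else
 echo "$SERVICE is not running"
 /usr/bin/logger "$SERVICE is not running"
 # Try to bring it back, but still report this check as failed.
 /etc/init.d/openhab restart
 exit 1
fi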


Explanation:

So this script just checks to see if the openhab process is running. If it's good, exit 0. If it's not, exit 1, but go ahead and try to restart openhab first. When keepalived gets the exit 1 code it keeps track of it. You will see in the config that there is a fall 2 line. That means that after 2 failed checks in a row keepalived will go into a failed state. When the second HA box is back online, this will force openhab to move over to the other one. However, I have not seen this happen so far: openhab loads pretty quickly, and with 10 seconds between checks the second check comes back with an exit 0 and resets the fall count.
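The fall/rise counting can be sketched as a tiny simulation. This is a simplified model I am assuming for illustration, not keepalived's actual code: simulate_checks is a made-up function that takes a series of exit codes and prints the state you would end up in, with FALL and RISE matching the values in the config above.

```shell
#!/bin/sh
# Simplified, assumed model of fall/rise counting (not keepalived's real code).
FALL=2   # consecutive failures needed to go DOWN
RISE=2   # consecutive successes needed to come back UP

# simulate_checks CODE...: feed in a sequence of check exit codes,
# print the resulting state (UP or DOWN).
simulate_checks() {
    state=UP
    fails=0
    oks=0
    for code in "$@"; do
        if [ "$code" -eq 0 ]; then
            oks=$((oks + 1))
            fails=0   # a success resets the fall count
            if [ "$state" = "DOWN" ] && [ "$oks" -ge "$RISE" ]; then
                state=UP
            fi
        else
            fails=$((fails + 1))
            oks=0     # a failure resets the rise count
            if [ "$state" = "UP" ] && [ "$fails" -ge "$FALL" ]; then
                state=DOWN
            fi
        fi
    done
    echo "$state"
}

simulate_checks 0 1 0 0   # one failure, then openhab is back: UP
simulate_checks 0 1 1     # two failures in a row: DOWN
simulate_checks 1 1 0     # a single success is not enough to recover: DOWN
```

The first call is the case I keep seeing in practice: one failed check, a quick restart, and the next check passes before the fall count ever reaches 2.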

 

 

 

 
