lwolf.org/stderr

Debug notes from day-to-day systems operations

01 May 2020

K3s nodes won't join the master

Background: a 4-node ARM cluster with a Raspberry Pi as the k3s master and 3 Odroid HC-1 nodes serving as workers. All nodes are set up to auto-update using k3OS.

Problem: All worker nodes stopped connecting to the master and entered the NotReady state.

kubectl get nodes -o wide
NAME           STATUS                     ROLES    AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION   CONTAINER-RUNTIME
odroid-hc-01   NotReady                   <none>   29d   v1.17.4+k3s1   192.168.11.14   <none>        k3OS v0.10.0   4.14.150-170     containerd://1.3.3-k3s2
odroid-hc-02   NotReady                   <none>   32d   v1.17.4+k3s1   192.168.11.15   <none>        k3OS v0.10.0   4.14.150-170     containerd://1.3.3-k3s2
odroid-hc-03   NotReady                   <none>   29d   v1.17.4+k3s1   192.168.11.16   <none>        k3OS v0.10.0   4.14.150-170     containerd://1.3.3-k3s2
rpi3-01        Ready,SchedulingDisabled   master   32d   v1.17.2+k3s1   192.168.11.20   <none>        k3OS v0.9.1    4.19.75-v7+      containerd://1.3.3-k3s1

k3s kept printing the same unhelpful log message:

error level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool"
error level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool"
error level=error msg="json: cannot unmarshal array into Go struct field Control.Skips of type map[string]bool"

Resolution: After some time I found a GitHub issue with a helpful comment suggesting that the cause could be a version mismatch between the master and the workers.

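That turned out to be the case. I had set up the k3s auto-update daemon on all nodes except the master, and at some point the workers did update. As the node listing above shows, they ended up on v1.17.4+k3s1 (k3OS v0.10.0) while the master stayed on v1.17.2+k3s1 (k3OS v0.9.1).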
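A quick way to spot this kind of skew is to count how many nodes report each version. The sketch below is only an illustration: the helper name count_node_versions is made up for this note, and it assumes the default table layout of kubectl get nodes, where VERSION is the fifth whitespace-separated column.

```shell
# Hypothetical helper (not part of kubectl or k3s): reads the table printed by
# `kubectl get nodes` on stdin and prints "<node count> <version>" per version.
# Assumes VERSION is the 5th column, as in the listing earlier in this note.
count_node_versions() {
  awk 'NR > 1 { count[$5]++ } END { for (v in count) print count[v], v }' | sort -k2
}

# Usage against a live cluster (needs working kubectl access):
#   kubectl get nodes | count_node_versions
```

More than one line of output means the nodes are not all on the same k3s version, which is worth checking before digging into agent logs.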

Marking the master with the auto-update label immediately triggered an update of that node, and within a few minutes all nodes were able to join.
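For reference, labeling a node looks roughly like the sketch below. The label key and value here are a placeholder, not the real one: the exact label depends on the k3OS release and its upgrade mechanism, so take it from the k3OS documentation for your version. The function name is also made up for this note.

```shell
# PLACEHOLDER label, an assumption for illustration only. Replace it with the
# key=value that the upgrade mechanism of your k3OS release actually watches.
UPGRADE_LABEL="example.com/auto-upgrade=enabled"

# Hypothetical helper: apply the upgrade label to one node, then list the
# node's labels so you can confirm it was set.
label_node_for_upgrade() {
  kubectl label node "$1" "$UPGRADE_LABEL" --overwrite
  kubectl get node "$1" --show-labels
}

# Usage (node name taken from the listing earlier in this note):
#   label_node_for_upgrade rpi3-01
```

After the master has upgraded, kubectl get nodes -o wide should show the same VERSION on every node.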