Kubernetes Watches will ghost you without warning

Alternate title: if you’re ahead of Clayton you’re doing well, at least for a few hours.

Consider the following code in the Kubernetes e2e test framework:

ctx, cancel := context.WithTimeout(context.Background(), csiPodRunningTimeout)
defer cancel()
pvcWatch, err := f.ClientSet.CoreV1().PersistentVolumeClaims(f.Namespace.Name).Watch(ctx, metav1.ListOptions{})
framework.ExpectNoError(err, "create PVC watch")
defer pvcWatch.Stop()


loop:
for {
	select {
	case event := <-pvcWatch.ResultChan():
		framework.Logf("PVC event %s: %#v", event.Type, event.Object)
		switch event.Type {
		case watch.Modified:
			pvc, ok := event.Object.(*v1.PersistentVolumeClaim)
			if !ok {
				framework.Failf("PVC watch sent %#v instead of a PVC", event.Object)
			}
			_, set := pvc.Annotations["volume.kubernetes.io/selected-node"]
			if set {
				nodeAnnotationSet = true
			} else if nodeAnnotationSet {
				nodeAnnotationReset = true
			}
		case watch.Deleted:
			break loop
		case watch.Error:
			// Can this occur when the apiserver is under heavy load?
			// If yes, then we should bail out of the test here early and
			// skip further checks instead of treating it as a test failure.
			framework.Failf("PVC watch failed prematurely: %v", event.Object)
		}
	case <-ctx.Done():
		framework.Failf("Timeout while waiting to observe PVC list")
	}
}
The problem is hard to spot unless you’re familiar with Kubernetes watches, and perhaps even if you are familiar but don’t work on OpenShift.

Watches and Channels

Watches can and do terminate at any time, gracefully or not. Sometimes a new apiserver leader is elected and the old one terminates watches and clients must reconnect to the new leader. Sometimes the leader just goes away because its node got rebooted. Sometimes there’s a network hiccup and the HTTP connection backing the watch times out. Regardless of the cause, they happen and your code needs to handle them. OpenShift CI forces frequent leader elections to specifically catch these issues before they get to customers.

A watch stuffs events into a Go channel. The code using the watch reads events out of the channel, usually in a for loop (to continuously grab events) containing a select block (to ensure individual read operations don’t block, which enables cancelation when the channel returned by ctx.Done() is closed). A receive from a Go channel can use a two-value form (event, ok := <-pvcWatch.ResultChan()) where the second boolean indicates whether the channel has been closed.

The testcase loop doesn’t exit until either the testcase times out and the ctx.Done() channel is closed, one of the event handler cases fails the testcase, or the PersistentVolumeClaim is deleted. So what happens if the Watch is closed unexpectedly and nothing checks whether the channel is closed?

The read returns the zero value immediately, because receives from a closed channel never block. The code continuously executes the framework.Logf("PVC event %s: %#v", event.Type, event.Object) line for 30 seconds until the test terminates. Depending on how fast your machine is, this can be millions of log lines and lots of CPU.
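You can see this closed-channel behavior with a few lines of plain Go. The snippet below (a standalone sketch; drainClosed is a hypothetical helper, not part of the e2e framework) reads repeatedly from a channel that has already been closed: every receive returns instantly with the zero value and ok == false, which is exactly what fuels the hot loop above.

```go
package main

import "fmt"

// drainClosed reads n times from an already-closed channel and records
// what each receive returned.
func drainClosed(n int) (values []string, oks []bool) {
	ch := make(chan string)
	close(ch)
	for i := 0; i < n; i++ {
		// Never blocks: a closed channel yields the zero value
		// ("" for string) and ok == false on every receive.
		v, ok := <-ch
		values = append(values, v)
		oks = append(oks, ok)
	}
	return values, oks
}

func main() {
	values, oks := drainClosed(3)
	fmt.Println(values, oks)
}
```

Without the ok check, a loop like the testcase’s has no way to tell these phantom reads apart from real events.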

How do we fix it?

  1. Assume watches can terminate at any time, and that you need to restart the watch if it does. That’s what the SharedInformer framework does for you.
  2. If you’re going to use watches directly, always handle channel closure and restart your watch.
for {
	select {
	case event, ok := <-pvcWatch.ResultChan():
		if !ok {
			framework.Failf("PVC watch ended prematurely")
		}

		framework.Logf("PVC event %s: %#v", event.Type, event.Object)
		switch event.Type {