How slow HDDs caused various issues in Talos and k3s

Leon Nunes - Sep 15 - Dev Community

Hi there fellow people,

It's been a while since I've been back here. I keep leaving this place to start my own blog, but I never go through with it. Which sucks, but it's alright.

I'm still homelabbing; it's not completely where I'd like it to be, but I'm trying.

This weekend, I tried to get my Proxmox cluster running with Talos Linux. Somewhere down the line I added HDDs to the cluster, and then went ahead and built a Ceph cluster on top of them. That was the first mistake :0

Mistakes happen.

Yes, they do, and they're a part of life. A few days ago even my k3s cluster wasn't working correctly, while a plain VM on the same host worked exactly the way it was supposed to. It never hit me that I was running on HDDs, because, well, I forgot.

How'd I fix it?

Well, I didn't. My friends at the Kargo Discord server (we're building cool stuff there) pointed out that the errors could be disk I/O related. Bear in mind that up until this point I was still under the impression that I had SSDs.

So what are these errors I'm talking about?

Timeouts, Timeouts and Timeouts :D

I had containers failing with timeouts, a lot of them. etcd was slow, the kube-apiserver was slow.

E0912 18:33:33.703841       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0912 18:33:38.701599       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0912 18:33:38.701663       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0912 18:33:42.475565       1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
E0912 18:33:42.475595       1 server.go:242] "Leaderelection lost"
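If you want to confirm it's the disk and not something else, etcd's own metrics are a quick sanity check. This is only a sketch, assuming etcd's metrics endpoint is reachable on the control-plane node (many setups expose it on 127.0.0.1:2381; adjust host and port for yours). The two histograms below are the usual disk-latency suspects.

# Assumes etcd serves plain-HTTP metrics on 127.0.0.1:2381 on this node
curl -s http://127.0.0.1:2381/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'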

The kube-scheduler was dying on Talos too

E0913 08:34:02.257514       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0913 08:34:07.256320       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0913 08:34:07.256364       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0913 08:34:09.375301       1 server.go:242] "Leaderelection lost"
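There's no SSH on Talos, but talosctl gives you enough to poke at the control plane from the outside. A rough sketch of what I'd check here, with $NODE standing in for the control-plane node's IP:

talosctl -n $NODE services       # overall health of Talos-managed services
talosctl -n $NODE service etcd   # etcd state and recent events
talosctl -n $NODE logs etcd      # watch for warnings like "apply request took too long"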

Looking at the Proxmox stats, nothing stood out either.

[Screenshot: Proxmox node stats]
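In hindsight, per-device stats on the Proxmox host itself would have told the story better than the dashboard graphs. A hedged sketch, assuming the sysstat package is (or can be) installed on the host:

# Run on the Proxmox host, not inside the VM
apt install sysstat   # provides iostat, if it isn't already there
iostat -x 2           # extended per-device stats, refreshed every 2 seconds
# High %util plus a large w_await on the HDD-backed device is the tell
# that fsync-heavy workloads like etcd are stuck waiting on the disk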

I checked with dd too.

dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.60134 s, 671 MB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.739437 seconds, 1.4GB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=append
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.487405 seconds, 2.1GB/s
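Those numbers look great, and that's exactly the trap: a single 1 GiB block with dsync only pays the sync penalty once, and the append run isn't syncing at all, so the cache soaks it up. etcd does the opposite, lots of tiny writes, each followed by an fdatasync. A rough way to get dd closer to that pattern (still only an approximation) is to sync after every small write:

# Sync after every 4 KiB write instead of once at the end. On a decent SSD
# this still finishes quickly; on a slow HDD the MB/s figure collapses.
dd if=/dev/zero of=/tmp/test2.img bs=4k count=1000 oflag=dsync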

So what helped?

Well, ultimately it was the fio tool that helped me, but before that the first hint came from apt while installing some packages.

So with Talos, you can do something like:

kubectl debug -n kube-system -it --image debian node/$NODE
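From inside that debug container you can install fio and run the etcd-style fdatasync test described in the article linked in the references. This is a sketch with a couple of assumptions: the node's filesystem is mounted at /host in a node debug pod, and the path I picked below actually lands on the node's real disk (adjust it for your layout).

# Inside the Debian debug container
apt update && apt install -y fio

# etcd-style write latency test: small writes, fdatasync after each one
mkdir -p /host/var/tmp/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/host/var/tmp/fio-test --size=22m --bs=2300 --name=etcd-disk-check

The number that matters in the output is the fdatasync latency percentile; the usual guidance for etcd is a 99th percentile somewhere under roughly 10 ms.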

This is where I figured out that, okay, something was really SLOOOOW.

fio estimated that its tests would take 2 hours, when the same tests on my laptop finished within 10 seconds.

Apt was taking 20 minutes to install a package.

So, all in all: yes, slow disks can cause all sorts of problems. Thank you for reading.

Until next time, if you'd like to talk more, I'm @mediocredevops on Twitter.

References:
Excellent article here on etcd and fio
