How slow HDDs caused various issues in Talos and k3s

Leon Nunes - Sep 15 - Dev Community

Hi there fellow people,

It's been a while since I've been back here. I keep leaving this place to start my own blog, but I never go through with it. Which sucks, but it's alright.

I'm still homelabbing; it's not completely where I'd like it to be, but I'm trying.

This weekend, I tried to get my Proxmox cluster running with Talos Linux. Somewhere down the line I added HDDs to the cluster, and then went ahead and built a Ceph cluster on top of them. That was the first mistake :0

Mistakes happen.

Yes, they do, and they're a part of life. A few days ago even my k3s cluster wasn't working correctly, while a plain VM on the same host worked exactly the way it was supposed to. It never hit me that I was running on HDDs, because, well, I forgot.

How'd I fix it?

Well, I didn't. My friends at the Kargo Discord server (we're building cool stuff there) pointed out that the errors could be disk I/O related. Bear in mind that up until this point I was still under the impression that I had SSDs.

So what are these errors I'm talking about?

Timeouts, Timeouts and Timeouts :D

I had containers failing with timeouts, a lot of them. etcd was slow, the kube-apiserver was slow.

E0912 18:33:33.703841       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0912 18:33:38.701599       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0912 18:33:38.701663       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0912 18:33:42.475565       1 leaderelection.go:308] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-scheduler": the object has been modified; please apply your changes to the latest version and try again
E0912 18:33:42.475595       1 server.go:242] "Leaderelection lost"
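If you want to confirm it's the disk and not something else, etcd's own metrics are a quick sanity check. This is only a sketch, assuming etcd's metrics endpoint is reachable on the control-plane node (many setups expose it on 127.0.0.1:2381; adjust host and port for yours). The two histograms below are the usual disk-latency suspects.

# Assumes etcd serves plain-HTTP metrics on 127.0.0.1:2381 on this node
curl -s http://127.0.0.1:2381/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'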

The kube-scheduler was dying on Talos too

E0913 08:34:02.257514       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0913 08:34:07.256320       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:7445/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0913 08:34:07.256364       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0913 08:34:09.375301       1 server.go:242] "Leaderelection lost"
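There's no SSH on Talos, but talosctl gives you enough to poke at the control plane from the outside. A rough sketch of what I'd check here, with $NODE standing in for the control-plane node's IP:

talosctl -n $NODE services       # overall health of Talos-managed services
talosctl -n $NODE service etcd   # etcd state and recent events
talosctl -n $NODE logs etcd      # watch for warnings like "apply request took too long"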

Looking at the Proxmox stats, nothing stood out either.

[Screenshot: Proxmox node stats]
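In hindsight, per-device stats on the Proxmox host itself would have told the story better than the dashboard graphs. A hedged sketch, assuming the sysstat package is (or can be) installed on the host:

# Run on the Proxmox host, not inside the VM
apt install sysstat   # provides iostat, if it isn't already there
iostat -x 2           # extended per-device stats, refreshed every 2 seconds
# High %util plus a large w_await on the HDD-backed device is the tell
# that fsync-heavy workloads like etcd are stuck waiting on the disk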

I checked with dd too.

dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.60134 s, 671 MB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.739437 seconds, 1.4GB/s
/ # dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=append
1+0 records in
1+0 records out
1073741824 bytes (1.0GB) copied, 0.487405 seconds, 2.1GB/s
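Those numbers look great, and that's exactly the trap: a single 1 GiB block with dsync only pays the sync penalty once, and the append run isn't syncing at all, so the cache soaks it up. etcd does the opposite, lots of tiny writes, each followed by an fdatasync. A rough way to get dd closer to that pattern (still only an approximation) is to sync after every small write:

# Sync after every 4 KiB write instead of once at the end. On a decent SSD
# this still finishes quickly; on a slow HDD the MB/s figure collapses.
dd if=/dev/zero of=/tmp/test2.img bs=4k count=1000 oflag=dsync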

So what helped?

Well, ultimately it was the fio tool that helped me, but before that the first hint came from apt while installing some packages.

So with Talos, you can do something like:

kubectl debug -n kube-system -it --image debian node/$NODE
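From inside that debug container you can install fio and run the etcd-style fdatasync test described in the article linked in the references. This is a sketch with a couple of assumptions: the node's filesystem is mounted at /host in a node debug pod, and the path I picked below actually lands on the node's real disk (adjust it for your layout).

# Inside the Debian debug container
apt update && apt install -y fio

# etcd-style write latency test: small writes, fdatasync after each one
mkdir -p /host/var/tmp/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/host/var/tmp/fio-test --size=22m --bs=2300 --name=etcd-disk-check

The number that matters in the output is the fdatasync latency percentile; the usual guidance for etcd is a 99th percentile somewhere under roughly 10 ms.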

This is where I figured out that, okay, something was really SLOOOOW.

fio estimated that its tests would take 2 hours, when the same tests on my laptop finished within 10 seconds.

Apt was taking 20 minutes to install a package.

So, all in all: yes, slow disks can cause all sorts of problems. Thank you for reading.

Until next time, if you'd like to talk more, I'm @mediocredevops on Twitter.

References:
Excellent article here on etcd and fio
