Linux: editing systemd unit files, restart on failure, and email notifications

Arseny Zinchenko - Mar 3 '19 - Dev Community

We have a RabbitMQ service that sometimes goes down.

So we need to:

  1. restart it if it exits with a failure
  2. send an email notification

Let’s do it via RabbitMQ’s systemd service (there are other options, e.g. monit; see the post Monit: monitoring and restarting NGINX).

We will use two options here:

  • RestartSec=: the delay before a restart, to give any pending disk I/O operations a chance to finish, just in case
  • Restart=: the condition under which the service is restarted

The available values for Restart= are:

Table 2. Exit causes and the effect of the Restart= settings on them

Restart settings/Exit causes   no   always   on-success   on-failure   on-abnormal   on-abort   on-watchdog
Clean exit code or signal      -    X        X            -            -             -          -
Unclean exit code              -    X        -            X            -             -          -
Unclean signal                 -    X        -            X            X             X          -
Timeout                        -    X        -            X            X             -          -
Watchdog                       -    X        -            X            X             -          X
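
To make the table easier to check against a concrete case, here is a minimal sketch that encodes it as a shell function. The function name and the exit-cause labels (clean, unclean-exit, unclean-signal, timeout, watchdog) are made up for this illustration and are not systemd keywords.

```shell
#!/bin/sh
# would_restart SETTING CAUSE
# Returns 0 (true) if the given Restart= setting restarts the service for the
# given exit cause, following Table 2. The CAUSE labels are hypothetical names
# chosen for this sketch.
would_restart() {
    case "$1:$2" in
        always:*)                                          return 0 ;;
        on-success:clean)                                  return 0 ;;
        on-failure:unclean-exit|on-failure:unclean-signal) return 0 ;;
        on-failure:timeout|on-failure:watchdog)            return 0 ;;
        on-abnormal:unclean-signal|on-abnormal:timeout)    return 0 ;;
        on-abnormal:watchdog)                              return 0 ;;
        on-abort:unclean-signal)                           return 0 ;;
        on-watchdog:watchdog)                              return 0 ;;
        *)                                                 return 1 ;;
    esac
}

# A process killed with SIGKILL is an "unclean signal" exit:
if would_restart on-failure unclean-signal; then
    echo "on-failure + unclean-signal: restart"
fi

# A clean exit (exit code 0) is not a failure:
if ! would_restart on-failure clean; then
    echo "on-failure + clean: no restart"
fi
```

This is why Restart=on-failure fits the task here: it covers crashes, kills and timeouts, but leaves the service alone after a clean exit.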

Editing systemd unit files

The default RabbitMQ unit file is /lib/systemd/system/rabbitmq-server.service.

You can view it with systemctl cat:



$ admin@bttrm-production-console:~$ systemctl cat rabbitmq-server.service
/lib/systemd/system/rabbitmq-server.service

[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

[Install]
WantedBy=multi-user.target



Do not edit it directly in /lib/systemd/system/, nor any other file there, as it will be overwritten by the next upgrade of the rabbitmq-server package.

When you need to change a service’s default behavior, put your new files in the /etc/systemd/system directory.

To edit an existing service, use systemctl edit foo.service with the --full option:



# root@bttrm-dev-console:/home/admin# systemctl edit --full rabbitmq-server.service



This opens a temporary file such as /etc/systemd/system/rabbitmq-server.service.d/.#override.conf6a0bfbaa5ed8b8d8 containing the current content of /lib/systemd/system/rabbitmq-server.service, which you can then edit.

Restart on failure

Add both options here, Restart=on-failure and RestartSec=60s:



[Unit]
Description=RabbitMQ Messaging Server
After=network.target

[Service]
Type=simple
User=rabbitmq
SyslogIdentifier=rabbitmq
LimitNOFILE=65536
ExecStart=/usr/sbin/rabbitmq-server
ExecStartPost=/usr/lib/rabbitmq/bin/rabbitmq-server-wait
ExecStop=/usr/sbin/rabbitmqctl stop

Restart=on-failure
RestartSec=60s

[Install]
WantedBy=multi-user.target



Re-read systemd’s config files:



# root@bttrm-dev-console:/home/admin# systemctl daemon-reload



systemd will create a /etc/systemd/system/rabbitmq-server.service file with the new content.
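
As an alternative to keeping a full copy of the unit, systemd also reads drop-in files from a <unit>.d/ directory, and running systemctl edit without --full creates such an override.conf. The sketch below writes only the two restart options into a drop-in. The function name is made up for this example, and the sample run writes to a scratch directory rather than to /etc.

```shell
#!/bin/sh
# write_restart_dropin DIR
# Writes an override.conf containing only the two restart options into DIR.
# On a real host DIR would be /etc/systemd/system/rabbitmq-server.service.d
# (needs root, and "systemctl daemon-reload" afterwards).
write_restart_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/override.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=60s
EOF
}

# Dry run into a scratch directory, so nothing under /etc is touched.
dropin_dir="$(mktemp -d)/rabbitmq-server.service.d"
write_restart_dropin "$dropin_dir"
cat "$dropin_dir/override.conf"
```

With a drop-in, the packaged unit file stays in effect and only the listed options are overridden, so later package changes to the rest of the unit are still picked up.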

Now get RabbitMQ’s server PID:



# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service | grep PID
Main PID: 14668 (rabbitmq-server)
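
If you want to script this step instead of copying the PID by hand, the number can be cut out of the status output. A minimal sketch; the function name is an assumption made for this example:

```shell
#!/bin/sh
# main_pid_from_status
# Reads "systemctl status" output on stdin and prints the Main PID number.
main_pid_from_status() {
    sed -n 's/^[[:space:]]*Main PID:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Fed with the status line shown above:
echo "Main PID: 14668 (rabbitmq-server)" | main_pid_from_status

# On a live host (not run here) it could be combined with the kill step:
#   pid=$(systemctl status rabbitmq-server.service | main_pid_from_status)
#   kill -9 "$pid"
```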



Kill it with SIGKILL (see the post Linux&FreeBSD: the kill and nohup commands, signals and process management) so that the on-failure condition is triggered:



# root@bttrm-dev-console:/home/admin# kill -9 14668



Check its status now:



# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: signal) since Thu 2019-02-28 12:08:32 EET; 4s ago
Process: 7093 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Main PID: 14668 (code=killed, signal=KILL)



Logs:



...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Mar 01 13:26:00 bttrm-dev-console rabbitmq[27392]: Stopping and halting node 'rabbit@bttrm-dev-console'
...
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Mar 01 13:26:00 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
...



And after one minute:



# root@bttrm-dev-console:/home/admin# systemctl status rabbitmq-server.service
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Thu 2019-02-28 12:09:33 EET; 2s ago
...
Feb 28 12:09:33 bttrm-stage-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 12:09:33 bttrm-stage-console systemd[1]: Starting RabbitMQ Messaging Server
...



Logs again:



Mar 01 13:27:01 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Mar 01 13:27:01 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: Waiting for 'rabbit@bttrm-dev-console' 
...
Mar 01 13:27:01 bttrm-dev-console rabbitmq[27526]: pid is 27533 ...
Mar 01 13:27:04 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...



“Service hold-off time over, scheduling restart” – here is our 60 seconds delay.

Email notification

Now let’s add an email notification to be sent if RabbitMQ goes down with an error.

Send a test email first:



# root@bttrm-dev-console:/home/admin# echo "Stage RabbitMQ restarted on failure!" | mailx -s "RabbitMQ failure notice" admin@example.com



Now you can use ExecStopPost= or OnFailure=. OnFailure looks better – let’s use it.

Create the /etc/systemd/system/rabbitmq-notify-email@.service file:



[Unit]
Description=%i failure email notification 

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
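
The same pipeline can be tried outside systemd before wiring it into the unit. Below is a rough sketch of a helper that mirrors the ExecStart= line; the function names and the STATUS_CMD and MAIL_CMD variables are hypothetical hooks added here so that a dry run prints the message instead of mailing it.

```shell
#!/bin/sh
# notify_failure UNIT RECIPIENT
# Pipes the unit's status into a mail command, like the ExecStart= line above.
# STATUS_CMD and MAIL_CMD are hypothetical override hooks for dry runs; they
# default to the commands used in the unit file.
notify_failure() {
    # Left unquoted on purpose, so a hook may carry arguments ("echo status of").
    ${STATUS_CMD:-/bin/systemctl status} "$1" 2>&1 |
        ${MAIL_CMD:-/usr/bin/mailx} -s "[$1] failure notification" "$2"
}

# Stand-in for mailx: prints its arguments, then the message body from stdin.
print_mail() {
    echo "args: $*"
    cat
}

# Dry run: nothing is sent and systemctl is not called.
dry_run_output=$(STATUS_CMD="echo status of" MAIL_CMD=print_mail \
    notify_failure rabbitmq-server admin@example.com)
echo "$dry_run_output"
```

Without the two variables set, the function runs the real systemctl and mailx commands, so the dry run only checks the subject line and recipient handling.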



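Add the OnFailure= option to the [Unit] block of rabbitmq-server.service using systemctl edit. Because rabbitmq-server.service is not a template unit, %i would expand to an empty string there, so use %p (the unit name without the type suffix) to pass the name to the notification template:



[Unit]
Description=RabbitMQ Messaging Server
After=network.target
OnFailure=rabbitmq-notify-email@%p.service
...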
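
The difference between the %p and %i specifiers is easy to get wrong: inside a non-template unit such as rabbitmq-server.service, %i is empty, while %p gives rabbitmq-server, and inside the instantiated rabbitmq-notify-email@rabbitmq-server.service, %i is the part after the "@". The sketch below only emulates that string handling in shell for illustration. The function names are made up, and systemd's escaping rules are ignored.

```shell
#!/bin/sh
# Rough emulation of two systemd specifiers for a given unit file name.
# unit_prefix   ~ %p: the name before the first "@", without the type suffix
# unit_instance ~ %i: the text between the first "@" and the type suffix
unit_prefix() {
    name="${1%.*}"          # strip the type suffix (.service)
    echo "${name%%@*}"      # strip "@instance" if present
}

unit_instance() {
    name="${1%.*}"
    case "$name" in
        *@*) echo "${name#*@}" ;;
        *)   echo "" ;;     # empty for a non-template unit
    esac
}

unit_prefix   rabbitmq-server.service
unit_instance rabbitmq-server.service
unit_instance rabbitmq-notify-email@rabbitmq-server.service
```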



Do not forget to reload systemd files:



# root@bttrm-dev-console:/home/admin# systemctl daemon-reload



Kill RabbitMQ again:



# root@bttrm-dev-console:/home/admin# kill -9 29970



Check logs:



...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Main process exited, code=killed, status=9/KILL
Feb 28 13:55:33 bttrm-dev-console rabbitmq[30476]: Stopping and halting node 'rabbit@bttrm-dev-console' ...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Unit entered failed state.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Triggering OnFailure= dependencies.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Failed with result 'signal'.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting rabbitmq-server failure email notification...
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Started rabbitmq-server failure email notification.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: rabbitmq-server.service: Service hold-off time over, scheduling restart.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Stopped RabbitMQ Messaging Server.
Feb 28 13:55:33 bttrm-dev-console systemd[1]: Starting RabbitMQ Messaging Server
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: Waiting for 'rabbit@bttrm-dev-console'
...
Feb 28 13:55:34 bttrm-dev-console rabbitmq[30619]: pid is 30625 ...
Feb 28 13:55:37 bttrm-dev-console systemd[1]: Started RabbitMQ Messaging Server.
...


The two lines to look for:

  1. Triggering OnFailure= dependencies.
  2. Started rabbitmq-server failure email notification.

Okay, everything works.

Mail logs:



# root@bttrm-dev-console:/home/admin# tail /var/log/exim4/mainlog
2019-02-28 13:48:58 1gzK7S-0007Td-Bt H=alt2.aspmx.l.google.com [2a00:1450:400b:c01::1b] Network is unreachable
2019-02-28 13:51:09 1gzK7S-0007Td-Bt H=alt1.aspmx.l.google.com [172.217.192.27] Connection timed out
2019-02-28 13:51:42 1gzK7S-0007Td-Bt => admin@example.com R=dnslookup T=remote_smtp H=alt2.aspmx.l.google.com [74.125.193.27] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354702 x34si4667116edb.147 - gsmtp"
2019-02-28 13:51:42 1gzK7S-0007Td-Bt Completed
2019-02-28 13:53:53 1gzK16-0006pp-NU H=alt2.aspmx.l.google.com [74.125.193.27] Connection timed out
2019-02-28 13:53:53 1gzK16-0006pp-NU H=aspmx2.googlemail.com [2800:3f0:4003:c02::1a] Network is unreachable
2019-02-28 13:54:59 1gzK16-0006pp-NU => admin@example.com R=dnslookup T=remote_smtp H=aspmx3.googlemail.com [74.125.193.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551354899 s45si1200185edm.357 - gsmtp"
2019-02-28 13:54:59 1gzK16-0006pp-NU Completed
2019-02-28 13:54:59 End queue run: pid=29201
2019-02-28 13:55:33 1gzKHl-0007xl-Lm <= root@dev.backend-console-internal.example.com U=root P=local S=1331



If you didn’t get an email, check exim’s queue:



# root@bttrm-dev-console:/home/admin# exim -bp
0m  1.2K 1gzL3R-0000dn-5h 
<root@dev.backend-console-internal.example.com>
admin@example.com



The message is stuck in the queue.

Run the queue manually:



# root@bttrm-dev-console:/home/admin# runq



Check logs again:



# root@bttrm-dev-console:/home/admin# cat /var/log/exim4/mainlog | grep 1gzL3R-0000dn-5h
2019-02-28 14:44:49 1gzL3R-0000dn-5h <= root@dev.backend-console-internal.example.com U=root P=local S=1241
2019-02-28 14:46:48 1gzL3R-0000dn-5h H=aspmx.l.google.com [2607:f8b0:400d:c0f::1a] Network is unreachable
2019-02-28 14:46:49 1gzL3R-0000dn-5h => admin@example.com R=dnslookup T=remote_smtp H=aspmx.l.google.com [173.194.68.26] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551358009 w11si208223qvc.68 - gsmtp"
2019-02-28 14:46:49 1gzL3R-0000dn-5h Completed



And the email is delivered.

To work around the delivery issue (I’m not sure why exim doesn’t send the messages on its own), add a dirty “hack” to /etc/systemd/system/rabbitmq-notify-email@.service, the ExecStartPost= option:



...
ExecStart=/bin/bash -c '/bin/systemctl status %i | /usr/bin/mailx -s "[%i] failure notification" admin@example.com'
ExecStartPost=runq
...



To remove old messages from the queue, use their IDs:



# root@bttrm-dev-console:/home/admin# exim -Mrm 1gzVar-0003oO-Rf
Message 1gzVar-0003oO-Rf has been removed
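
When several test messages have piled up, the IDs can be pulled out of the exim -bp listing rather than copied one by one. A small sketch; the function name is made up, and it assumes the usual listing layout where the ID is the third field of each message's first line:

```shell
#!/bin/sh
# queue_ids
# Reads "exim -bp" output on stdin and prints one message ID per line.
# The ID is the third field (after age and size) of a message's first line.
queue_ids() {
    awk '$3 ~ /^[0-9A-Za-z]+-[0-9A-Za-z]+-[0-9A-Za-z]+$/ { print $3 }'
}

# Fed with the queue listing shown earlier:
printf '%s\n' \
    '0m  1.2K 1gzL3R-0000dn-5h' \
    '<root@dev.backend-console-internal.example.com>' \
    'admin@example.com' | queue_ids

# On a live host (not run here) each ID could then be passed to exim -Mrm:
#   exim -bp | queue_ids | while read -r id; do exim -Mrm "$id"; done
```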



Done.
