GBase 8c is a multi-mode, distributed database that supports horizontal scaling, allowing for expansion and contraction operations. This article outlines how to troubleshoot issues when scaling operations fail.
Normally, when you use the gha_ctl tool for scaling, two success messages are expected. The first reports the result of the pre-checks: whether the input parameters are correct and whether the data directory is empty. If this step fails, you can correct the parameters and try again. The second relates to the subprocesses that carry out the expansion or contraction, which can take some time.
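Because the two messages arrive as separate JSON objects in the same terminal, it can help to pull them apart programmatically. The following is a minimal sketch, not part of gha_ctl: the function names are hypothetical, and the reading of `ret` (0 for success, anything else for failure) is an assumption inferred from the outputs shown in the case studies of this article.

```python
import json


def parse_gha_ctl_messages(captured):
    """Return every top-level JSON object found in captured terminal text."""
    decoder = json.JSONDecoder()
    messages = []
    pos = 0
    while True:
        start = captured.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value and reports where it ended.
            obj, end = decoder.raw_decode(captured, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here, keep scanning
            continue
        messages.append(obj)
        pos = end
    return messages


def describe_outcome(messages):
    """Summarize the two-message pattern (assumes ret == 0 means success)."""
    if not messages:
        return "no message captured"
    if messages[0].get("ret") != 0:
        return "pre-check failed: " + str(messages[0].get("msg"))
    if len(messages) < 2:
        return "pre-check passed; scaling subprocess has not reported yet"
    if messages[1].get("ret") != 0:
        return "scaling failed: " + str(messages[1].get("msg"))
    return "scaling succeeded"


# Terminal text as it appears in Case Study 1, shell prompt included.
sample = """{
"ret": 0,
"msg": "Success"
}
[gbase@gbase8c-82 script]$ {
"ret": -1,
"msg": "Init fail"
}"""

print(describe_outcome(parse_gha_ctl_messages(sample)))
```

Scanning for each opening brace and decoding from there lets the sketch skip the shell prompt that ends up between the two objects.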
1. Overview of Scaling Operations
Expansion
When expanding, the operation progresses through several phases. To see the phase in which a failure occurred, run gha_ctl get expand history -l $dcslist. The phases are:
- add_primary: Initialize the DN node and add the DN host.
- prepare: Check the node group to be expanded, create the target node group, and set the source and target node groups for expansion.
- execute: Use the gs_redis tool to redistribute data.
- add_standby: Add the standby DN node.
- clean_data: Change the expansion status to "end."
Contraction
For contraction, the process also progresses through several phases. You can check the phase of failure using the same command, gha_ctl get expand history -l $dcslist. Contraction only removes nodes without adding new ones:
- prepare: Check the node group to be contracted, create the target node group, and set the source and target node groups for contraction.
- execute: Use the gs_redis tool to redistribute data.
- drop_group: Drop the shrinking datanode group.
- clean_data: Change the expansion status to "end."
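Once the failed phase is known, the two phase lists above show what had already finished and what never started. A small sketch follows, assuming the phases run in the order listed here and that the phase names reported by the history command match these identifiers; the helper name is hypothetical.

```python
# Phase order as listed in this article (assumed to be the execution order).
EXPAND_PHASES = ["add_primary", "prepare", "execute", "add_standby", "clean_data"]
SHRINK_PHASES = ["prepare", "execute", "drop_group", "clean_data"]


def split_at_failure(phases, failed_phase):
    """Return (phases before the failed one, phases after it)."""
    idx = phases.index(failed_phase)  # raises ValueError for an unknown phase
    return phases[:idx], phases[idx + 1:]


done, pending = split_at_failure(EXPAND_PHASES, "execute")
print("completed before failure:", done)
print("not yet run:", pending)
```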
2. Failure Overview and Case Studies
When a failure occurs in the add_primary, add_standby, or prepare phase, check /var/log/messages and /tmp/gha_ctl/gha_ctl.log on the gha_server (or on the primary gha_server in multi-server setups). On the added DN nodes, check the logs under $GAUSSLOG/gbase/om/gs_expansion***.
If a failure occurs during the execute phase, the error message will often state that "gs_redis failed on...". In this case, check the gs_redis logs, which can be found in the $GAUSSLOG/bin/gs_redis directory on one of the CN nodes. Additionally, check that CN node's pg_log directory for more detailed error information.
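The guidance above amounts to a lookup from the failed phase to the logs worth opening first. A minimal sketch of that lookup is shown below. The dictionary keys and the function name are hypothetical, the paths are copied as written in this article (including the elided gs_expansion*** suffix), and phases for which the article names no log return an empty result.

```python
# Log locations as named in this article, grouped by where they live.
GHA_SERVER_LOGS = ["/var/log/messages", "/tmp/gha_ctl/gha_ctl.log"]
NEW_DN_LOGS = ["$GAUSSLOG/gbase/om/gs_expansion***"]
CN_LOGS = ["$GAUSSLOG/bin/gs_redis", "pg_log"]

SETUP_PHASE_LOGS = {
    "gha_server (primary in multi-server setups)": GHA_SERVER_LOGS,
    "added DN nodes": NEW_DN_LOGS,
}

LOGS_BY_PHASE = {
    "add_primary": SETUP_PHASE_LOGS,
    "add_standby": SETUP_PHASE_LOGS,
    "prepare": SETUP_PHASE_LOGS,
    "execute": {"CN node": CN_LOGS},
}


def logs_to_check(phase):
    """Return {location: [paths]} for a failed phase, or {} if not covered."""
    return LOGS_BY_PHASE.get(phase, {})


for location, paths in logs_to_check("execute").items():
    print(location, "->", ", ".join(paths))
```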
(1) Case Study 1
The following issue occurred:
[gbase@gbase8c-82 script]$ ./gha_ctl expand datanode 'dn4 (dn4_1 100.0.0.84 30010 /home/gbase/data/dn4/dn4_1 8020)' -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379 -u 40ac7d83-6be3-486c-83c4-8942a16d3590
{
"ret": 0,
"msg": "Success"
}
[gbase@gbase8c-82 script]$ {
"ret": -1,
"msg": "Init fail"
}
Troubleshooting Steps:
First, check which phase the failure occurred in:
[gbase@gbase8c-82 script]$ ./gha_ctl get expand history -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379
{
"state": "idle",
"current": "",
"history": [
{
"time": "2022-12-29 10:27:59",
"uuid": "40ac7d83-6be3-486c-83c4-8942a16d3590",
"phase": "add_primary",
"status": "failed",
"info": {
"dn4": [
{
"name": "dn4_1",
"host": "100.0.0.84",
"port": "30010",
"work_dir": "/home/gbase/data/dn4/dn4_1",
"agent_port": "8020",
"role": "primary",
"agent_host": "100.0.0.84"
}
]
}
}
]
}
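When the history contains many entries, the failed ones can be filtered out of this JSON automatically. The sketch below assumes the output has the shape shown above (a "history" list whose entries carry "status", "phase", and an "info" map of node groups); the function name is hypothetical and the sample is a trimmed copy of the output above.

```python
import json


def failed_entries(history_json):
    """Return a short summary of every history entry whose status is 'failed'."""
    data = json.loads(history_json)
    failures = []
    for entry in data.get("history", []):
        if entry.get("status") != "failed":
            continue
        # "info" maps a node group name to a list of node descriptions.
        nodes = [
            f'{node.get("name")}@{node.get("host")}'
            for group in entry.get("info", {}).values()
            for node in group
        ]
        failures.append({
            "time": entry.get("time"),
            "uuid": entry.get("uuid"),
            "phase": entry.get("phase"),
            "nodes": nodes,
        })
    return failures


sample = """
{
  "state": "idle",
  "current": "",
  "history": [
    {
      "time": "2022-12-29 10:27:59",
      "uuid": "40ac7d83-6be3-486c-83c4-8942a16d3590",
      "phase": "add_primary",
      "status": "failed",
      "info": {"dn4": [{"name": "dn4_1", "host": "100.0.0.84", "role": "primary"}]}
    }
  ]
}
"""

for failure in failed_entries(sample):
    print(failure["time"], failure["phase"], failure["nodes"])
```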
The failure occurred during the "add_primary" phase. The gs_expansion*** log on node 84 showed no errors. However, /tmp/gha_ctl/gha_ctl.log on the gha_server revealed the following:
2022-12-29 10:28:04 gaussdb.py expansion 89 DEBUG 345309 Execute expansion command in [100.0.0.84]: source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4
2022-12-29 10:28:08 command_util.py execute 249 DEBUG 345309 cmd:ssh -E /dev/null -p 22 gbase@100.0.0.84 "source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4", status:1, output:[GAUSS-51100] : Failed to verify SSH trust on these nodes:
gbase8c-82, gbase8c-83, gbase8c-84, 100.0.0.82, 100.0.0.83, 100.0.0.84 by individual user.
2022-12-29 10:28:08 instance.py init 1614 INFO 345309 Node dn4_1 init error:Failed to execute the command: source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4. Error:
Run cmd failed:cmd[ssh -E /dev/null -p 22 gbase@100.0.0.84 "source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4"], msg[[GAUSS-51100] : Failed to verify SSH trust on these nodes:
gbase8c-82, gbase8c-83, gbase8c-84, 100.0.0.82, 100.0.0.83, 100.0.0.84 by individual user.]
2022-12-29 10:28:08 common.py add_one_node 190 ERROR 345309 init one node dn4_1 failed, code: -1, response: Init fail
The issue was caused by SSH trust not being configured between the nodes. After configuring SSH trust, the expansion operation succeeded.
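In a long gha_ctl.log, the two useful signals in this case were the ERROR-level record and the [GAUSS-51100] code buried inside a DEBUG record. The following sketch picks out both. The record layout (timestamp, file, function, line number, level, process ID, message) is an assumption inferred from the excerpt above, not a documented format, and the function name is hypothetical.

```python
import re

# Layout inferred from the excerpt: date time file function lineno LEVEL pid message
RECORD = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) (\S+) (\d+) ([A-Z]+) (\d+) (.*)$"
)
GAUSS_CODE = re.compile(r"\[(GAUSS-\d+)\]\s*:\s*(.*)")


def scan_gha_ctl_log(lines):
    """Return (ERROR records, {GAUSS code: first message seen})."""
    errors = []
    codes = {}
    for line in lines:
        line = line.rstrip("\n")
        record = RECORD.match(line)
        if record and record.group(5) == "ERROR":
            errors.append((record.group(1), record.group(3), record.group(7)))
        code = GAUSS_CODE.search(line)
        if code and code.group(1) not in codes:
            codes[code.group(1)] = code.group(2)
    return errors, codes


# Three lines copied from the excerpt above.
sample = [
    '2022-12-29 10:28:08 command_util.py execute 249 DEBUG 345309 cmd:ssh -E /dev/null -p 22 gbase@100.0.0.84 "source ~/.bashrc;gs_expansion -U gbase -G gbase -X /tmp/gs_gha_2022-12-29_10:28:02_796027/clusterconfig.xml -h 100.0.0.84 --from-gha --inst-name dn4_1 --group-name dn4", status:1, output:[GAUSS-51100] : Failed to verify SSH trust on these nodes:',
    'gbase8c-82, gbase8c-83, gbase8c-84, 100.0.0.82, 100.0.0.83, 100.0.0.84 by individual user.',
    '2022-12-29 10:28:08 common.py add_one_node 190 ERROR 345309 init one node dn4_1 failed, code: -1, response: Init fail',
]

errors, codes = scan_gha_ctl_log(sample)
print(errors)
print(codes)
```

To run it against a real file, pass an open file object, for example `scan_gha_ctl_log(open("/tmp/gha_ctl/gha_ctl.log"))`.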
(2) Case Study 2
The following issue occurred:
[gbase@gbase8c-82 script]$ ./gha_ctl expand datanode 'dn4 (dn4_1 100.0.0.84 30010 /home/gbase/data/dn4/dn4_1 8020)' -l http://100.0.0.82:2379,http://100.0.0.83:2379,http://100.0.0.84:2379 -u 40ac7d83-6be3-486c-83c4-8942a16d3590
{
"ret": 0,
"msg": "Success"
}
[gbase@gbase8c-82 script]$ {
"ret": -1,
"msg": "gs_redis on cn1 failed"
}
Troubleshooting Steps:
Based on the error message, the failure occurred while executing gs_redis. Checking the gs_redis log on cn1 at $GAUSSLOG/bin/gs_redis, we found the following:
tid[392445]: INFO: redistributing database "postgres"
tid[392445]: INFO: lock schema postgres.public
INFO: please do not close this session until you are done adding the new node
CONTEXT: referenced column: pgxc_lock_for_transfer
tid[392445]: INFO: redistributing table "spatial_ref_sys"
tid[392445]: INFO: ---- 1. setup table spatial_ref_sys ----
tid[392445]: ERROR: query failed: ERROR: dn4: relation "public.spatial_ref_sys" does not exist
DETAIL: query was: ALTER TABLE public.spatial_ref_sys SET (append_mode=on,rel_cn_oid =17324)
We logged into dn4 and found that the postgres database indeed did not have the public.spatial_ref_sys table.
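The same check can be applied to a whole gs_redis log to list every relation that a node reported as missing. This is a minimal sketch that assumes the error wording matches the excerpt above; the function name is hypothetical, and the sample lines are copied from that excerpt.

```python
import re

# Matches 'ERROR: <node>: relation "<schema.table>" does not exist'
MISSING = re.compile(r'ERROR:\s+(\w+):\s+relation "([^"]+)" does not exist')
DETAIL = re.compile(r"^DETAIL:\s+query was:\s+(.*)$")


def find_missing_relations(lines):
    """Return one dict per 'relation does not exist' error, with its query if logged."""
    findings = []
    for line in lines:
        line = line.strip()
        missing = MISSING.search(line)
        if missing:
            findings.append(
                {"node": missing.group(1), "relation": missing.group(2), "query": None}
            )
            continue
        detail = DETAIL.match(line)
        # Attach a DETAIL line to the most recent error that has no query yet.
        if detail and findings and findings[-1]["query"] is None:
            findings[-1]["query"] = detail.group(1)
    return findings


sample = [
    'tid[392445]: INFO: redistributing table "spatial_ref_sys"',
    'tid[392445]: INFO: ---- 1. setup table spatial_ref_sys ----',
    'tid[392445]: ERROR: query failed: ERROR: dn4: relation "public.spatial_ref_sys" does not exist',
    'DETAIL: query was: ALTER TABLE public.spatial_ref_sys SET (append_mode=on,rel_cn_oid =17324)',
]

for finding in find_missing_relations(sample):
    print(finding["node"], finding["relation"], finding["query"])
```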