1. Cluster Status Monitoring
1.1 Viewing Cluster Status
To monitor the running status of each node in the cluster, including the status of the gcware, gcluster, gnode, and syncserver services, as well as data consistency and node availability, you can use the following three methods:
1) Frontend Method:
Execution Frequency: Every hour
Execution Environment: Unified monitoring platform of the Big Data Platform
Execution Steps:
- Log in to the unified monitoring platform of the Big Data Platform and check the overall running status of the cluster on the "Home" page. Green indicates a normal cluster status, while red indicates an abnormal cluster status.
- To check the running status of all nodes in a specific cluster, click on "Monitoring Center --> GBase Environment --> Cluster Status Monitoring".
2) Backend Method:
Execution Frequency: Every hour
Execution Environment: ORACLE monitoring database
Execution Steps:
Description: The DWMS_DATABASE_INFO table in the monitoring database records the status of all production environment nodes, collected every minute via gcadmin. The field meanings are consistent with those in the gcadmin output.
- To check for node abnormalities:
select cluster_id, node_id, gnode, gcluster, data_state, node_state from dwms_database_info where valid_flag = 0 and (gnode = 'CLOSE' or gcluster = 'CLOSE' or data_state = 1 or node_state = 'Offline');
If there are records, it indicates an abnormal cluster status.
- To check if the loader has any abnormalities:
select cluster_id, node_id from dwms_node_conf where (cluster_id, node_id) not in (select cluster_id, node_id from dwms_database_info where valid_flag = 0);
If there are records, it indicates a loader abnormality. Both checks can be scripted and scheduled, as sketched below.
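For automation, the node check above can be wrapped in a shell script and scheduled hourly via cron. The following is a minimal sketch, assuming sqlplus access to the monitoring database; the connection string monitor/password@MONDB is a placeholder, and the loader check can be run the same way.

#!/bin/bash
# Minimal hourly check sketch (e.g., run from cron).
# monitor/password@MONDB is a placeholder connection string.
result=$(sqlplus -s monitor/password@MONDB <<'EOF'
set heading off feedback off pagesize 0
select cluster_id, node_id, gnode, gcluster, data_state, node_state
  from dwms_database_info
 where valid_flag = 0
   and (gnode = 'CLOSE' or gcluster = 'CLOSE'
        or data_state = 1 or node_state = 'Offline');
EOF
)
# Any non-blank output means at least one node is abnormal.
if [ -n "$(echo "$result" | tr -d '[:space:]')" ]; then
    echo "Abnormal cluster status detected:"
    echo "$result"
fi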
3) Manual Script Execution
Execution Frequency: Every hour
Execution Environment: Each node in the cluster
Execution Steps:
- Log in to any node in the cluster you want to check and use the gcadmin command to view the current status of the cluster (a scripted version of this check is sketched after the descriptions below).
Descriptions:
- gcware: Online is normal, Offline is abnormal
- gnode: OPEN is normal, CLOSE is abnormal
- gcluster: OPEN is normal, CLOSE is abnormal
- syncserver: OPEN is normal, CLOSE is abnormal
- datastate: 0 indicates normal node data, 1 indicates abnormal node data. If 1 appears, the cluster sync tool will automatically recover the data.
- nodestate: displays the health status of the node host. 0 is normal, 1 indicates a fault status.
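As referenced above, a minimal scripted version of this check might look as follows. It assumes gcadmin prints the state keywords described above; note that a plain keyword grep does not catch datastate or nodestate values of 1, which would need a column-aware check.

#!/bin/bash
# Sketch: flag abnormal keywords in the gcadmin output.
# Assumes gcadmin prints the fields described above; this grep does not
# cover datastate/nodestate = 1, which need a column-aware check.
if gcadmin | grep -iE 'CLOSE|OFFLINE'; then
    echo "WARNING: abnormal node state detected (see matching lines above)."
else
    echo "Cluster status appears normal."
fi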
1.2 Handling Cluster Status Abnormalities
Any abnormalities in gcware, gnode, gcluster, syncserver, datastate, or nodestate during monitoring require attention. Issues must be promptly addressed, reported, and resolved collaboratively to recover the abnormal node.
2. SQL Monitoring
2.1 Viewing Current SQL
When the cluster is running slowly, you can view all SQL currently being executed in the cluster to determine which SQL statements are taking too long and whether the overall execution of SQL in the cluster is slow. Monitoring the SQL running status can help identify cluster problems or potential issues.
Three conditions to determine if the cluster's overall SQL execution is slow:
- More than 10 SQL statements have been running for over 1000 seconds, and no SQL completes in under 10 seconds.
- A high proportion of DDL statements (create, drop), with most taking longer than 30 seconds.
- Simple SQL (e.g., select count(*) from table_a) taking longer than 30 seconds.
You can view the running SQL using the following three methods:
1) Frontend Method:
Execution Frequency: Every hour
Execution Environment: Unified monitoring platform of the Big Data Platform
Execution Steps:
- Log in to the unified monitoring platform of the Big Data Platform and click on "Monitoring Center --> GBase Environment --> GBase Concurrency Monitoring".
2) Backend Method:
Execution Frequency: Every hour
Execution Environment: ORACLE monitoring database
Execution Steps:
Description: The DWMS_SQL_INFO table in the monitoring database records the SQL information currently being executed in all production environments, collected every minute.
- To view currently executing SQL:
select exe_node_ip, session_pid, session_time, sql_info from dwms_sql_info where valid_flag = 0 and sql_type = 'TRANS' and sql_info not like '%sub_step%' and sql_info not like '%GCLUSTER_LOCAL%' and cluster_id='BDPA' order by to_number(session_time);
- To view current loading processes (a slowness check built on this table is sketched after these queries):
select exe_node_ip, session_pid, session_time, sql_info from dwms_sql_info where valid_flag = 0 and sql_type = 'LOAD' and sql_info not like '%sub_step%' and sql_info not like '%GCLUSTER_LOCAL%' and cluster_id='BDPA' order by to_number(session_time);
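Combining this table with the slowness conditions listed in section 2.1, the first condition (SQL running over 1000 seconds) can be checked directly. The sketch below assumes session_time is recorded in seconds and reuses the placeholder connection string from section 1.1.

#!/bin/bash
# Sketch: count SQL running longer than 1000 seconds (slowness condition 1
# in section 2.1). monitor/password@MONDB is a placeholder connection string.
sqlplus -s monitor/password@MONDB <<'EOF'
set heading off feedback off pagesize 0
select count(*)
  from dwms_sql_info
 where valid_flag = 0
   and sql_type = 'TRANS'
   and cluster_id = 'BDPA'
   and to_number(session_time) > 1000;
EOF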
3) Manual Script Execution:
Execution Frequency: Every minute
Execution Environment: First loader of each cluster
Execution Steps:
- Log in to the first loader of each environment and execute tail -f gbase/ping_node/always_result.txt to view all SQL currently being executed in the cluster. In the output, non-load process number: 46 indicates the number of non-load SQL processes, and load process number: 0 indicates the number of load processes.
Note:
Using the above three methods, you can view the problematic SQL execution time, originating node, and SQL statement. However, the SQL displayed is truncated. To view the full SQL, log in to the node where the SQL originated and execute:
show full processlist;
Or, directly from the Linux command line:
gbase -u username -p password -h node_IP -e "show full processlist;" | grep -ivE "sleep|Daemon"
Since the GBase cluster uses a distributed flat architecture, each node can act as the cluster entry point. Therefore, to view all SQL currently being executed in the entire cluster, you need to connect to each node and execute the above SQL.
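Such a sweep can be scripted, for example as below. This is a minimal sketch: nodes.txt (one node IP per line), username, and password are all placeholders to be replaced with your own values.

#!/bin/bash
# Sketch: run "show full processlist" on every node listed in nodes.txt.
# nodes.txt, username, and password are placeholders.
while read -r node_ip; do
    echo "=== ${node_ip} ==="
    gbase -u username -p password -h "${node_ip}" \
        -e "show full processlist;" | grep -ivE "sleep|Daemon"
done < nodes.txt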
2.2 Handling SQL Abnormalities
SQL monitoring mainly focuses on SQL statements with long execution times. Long execution times may be due to:
- SQL statements not optimized for GBase features, resulting in low execution efficiency.
- Cartesian product issues; if present, they can starve the cluster of resources, drastically reducing performance.
To address SQL statement issues, focus on:
- Improper table types or distribution columns
- Cartesian product problems
If an SQL statement is problematic and its continued execution may affect cluster usage, use the kill command to terminate the problematic SQL.
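For example, assuming the offending session's Id was found to be 12345 via show full processlist on its originating node (the Id, username, password, and node_IP below are all placeholders):

# Sketch: terminate session 12345 from the Linux command line.
# 12345, username, password, and node_IP are placeholders.
gbase -u username -p password -h node_IP -e "kill 12345;"

Alternatively, since GBase follows MySQL syntax here, kill 12345; can be issued from within a gbase session on that node.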