1.1 Data Inconsistency Error
Description
Cluster nodes report data inconsistency alarms.
Analysis
Data inconsistency may occur if a node experiences a brief network interruption. Typically, data synchronization happens automatically when the network is restored. However, if the inconsistency persists for an extended period, manual synchronization is required.
Emergency Procedure
After the network recovers, the affected nodes should resynchronize automatically. If the data is still inconsistent one hour after recovery, consider manual synchronization using the steps below.
1. Notify the operations department, the open platform, and the GBase vendor for assistance.
2. For alarms on temporary tables, wait 10 minutes. If the cluster synchronizes automatically, the issue is resolved. Otherwise, GBase support may need to decide whether to stop the cluster service and running tasks, and then perform steps 3-6 (typically 1-4 hours, depending on task size).
3. Stop the GBase database service (20 minutes).
4. The GBase vendor analyzes the inconsistent tables and synchronizes them manually (2-8 hours, depending on table size and the number of inconsistencies).
5. Start the GBase database service and verify data consistency (30 minutes).
6. Notify the operations department that the system is restored, and resume the suspended tasks.
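The waiting rules and time estimates in this procedure can be condensed into a small decision helper. This is a minimal Python sketch, not a GBase tool: next_action, downtime_window, and the alarm flags are hypothetical names, and the duration table only restates the estimates given for stopping the service, manual synchronization, and restart with verification. The window it computes does not include the time needed to stop running tasks.

```python
from datetime import timedelta
from typing import Tuple

# Waiting periods restated from the procedure above.
TEMP_TABLE_WAIT = timedelta(minutes=10)   # alarms on temporary tables
GENERAL_WAIT = timedelta(hours=1)         # all other inconsistency alarms

# Estimated (min, max) minutes for each manual step, restated from the procedure.
STEP_MINUTES = {
    "stop_service": (20, 20),
    "manual_sync": (120, 480),
    "start_and_verify": (30, 30),
}


def next_action(is_temp_table: bool, elapsed: timedelta, still_inconsistent: bool) -> str:
    """Decide what to do with an inconsistency alarm.

    elapsed is the time since the network recovered (hypothetical input;
    how it is measured depends on the monitoring system in use).
    """
    if not still_inconsistent:
        return "resolved"        # automatic synchronization succeeded
    wait = TEMP_TABLE_WAIT if is_temp_table else GENERAL_WAIT
    if elapsed < wait:
        return "wait"            # give automatic synchronization more time
    return "escalate"            # hand over to GBase support for manual sync


def downtime_window() -> Tuple[int, int]:
    """Return the (min, max) minutes spent on the manual synchronization steps."""
    low = sum(lo for lo, _ in STEP_MINUTES.values())
    high = sum(hi for _, hi in STEP_MINUTES.values())
    return low, high


if __name__ == "__main__":
    print(next_action(True, timedelta(minutes=12), True))   # escalate
    print(downtime_window())                                # (170, 530)
```

With these figures, the manual path takes roughly 170 to 530 minutes (about 2 h 50 min to 8 h 50 min), before counting the time to stop running tasks.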
1.2 Data Error
Description
An SQL statement returns an incorrect result set.
Analysis
A bug in the GBase database execution plan causes the SQL statement to return an incorrect result set.
Emergency Procedure
If this issue is detected, work with the application team to assess its impact and plan follow-up fixes.
- Notify the operations department, open platform, and GBase vendor for assistance.
- The GBase vendor analyzes and identifies the issue, and provides a detailed explanation, a fix, and a workaround.
- The application department evaluates the scope of impact based on the vendor's explanation.
- The operations department and the GBase vendor correct the erroneous data and modify the affected programs to avoid the issue.
- The GBase vendor provides a patched version that fixes the issue.
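To scope the impact, the application team may find it useful to compare the rows returned by the affected statement with those returned by the vendor's workaround, assuming the workaround's output is correct. The Python sketch below compares two result sets as multisets, so row order is ignored and duplicate rows are counted. Fetching the rows from GBase is left out, and diff_result_sets is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Row = Tuple[object, ...]  # one result row; values must be hashable


def diff_result_sets(suspect: Iterable[Row], reference: Iterable[Row]) -> Dict[str, Counter]:
    """Compare two result sets as multisets (order ignored, duplicates counted)."""
    s = Counter(suspect)
    r = Counter(reference)
    return {
        # Counter subtraction keeps only positive counts.
        "missing": r - s,  # rows the suspect statement failed to return
        "extra": s - r,    # rows the suspect statement returned but should not have
    }


if __name__ == "__main__":
    suspect = [(1, "a"), (2, "b"), (2, "b")]
    reference = [(1, "a"), (2, "b"), (3, "c")]
    result = diff_result_sets(suspect, reference)
    print(result["missing"])  # Counter({(3, 'c'): 1})
    print(result["extra"])    # Counter({(2, 'b'): 1})
```

An empty "missing" and "extra" for a given input suggests that input is not affected by the bug.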
1.3 Execution Error
Description
An SQL statement fails with an error during execution.
Analysis
The error is caused by a bug in the GBase database.
Emergency Procedure
If this issue is detected, ask the vendor to analyze the bug and provide a timeframe for the fix.
- Notify the operations department, open platform, and GBase vendor for assistance.
- The GBase vendor analyzes the issue and provides a workaround.
- The application department implements the workaround following the vendor's instructions.
- The GBase vendor provides a patched version that fixes the issue.
1.4 High Concurrency Leading to High Database Node Load
High concurrency can be identified by the following indicators:
- System CPU usage averages over 90%.
- Disk I/O is nearly saturated.
- The output of show processlist shows high concurrency, with 1-3 long-running tasks (running for close to or more than 1 hour).
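The three indicators can be checked together in a script. The Python sketch below assumes the show processlist rows have already been fetched as dictionaries with MySQL-style keys (Id, Command, and Time in seconds), and that average CPU and disk utilization come from the host monitoring system. Only the 90% CPU threshold comes from this runbook; the 95% disk threshold and the 50-minute cut-off for "approaching 1 hour" are assumptions to tune locally, and requiring all three indicators at once is a judgment call.

```python
from dataclasses import dataclass
from typing import Dict, List

CPU_AVG_THRESHOLD = 90.0      # percent, from the runbook
DISK_UTIL_THRESHOLD = 95.0    # percent, assumed meaning of "nearly saturated"
LONG_TASK_SECONDS = 50 * 60   # assumed cut-off for "approaching 1 hour"


@dataclass
class LoadReport:
    active_sessions: int
    long_running_ids: List[int]
    cpu_high: bool
    disk_saturated: bool

    @property
    def overloaded(self) -> bool:
        # Treat the node as overloaded only when all indicators are present.
        return self.cpu_high and self.disk_saturated and len(self.long_running_ids) > 0


def assess_load(processlist: List[Dict], cpu_avg: float, disk_util: float) -> LoadReport:
    """Summarize processlist rows (MySQL-style keys assumed) plus host metrics."""
    # Idle connections are not part of the running workload.
    active = [row for row in processlist if row.get("Command") != "Sleep"]
    long_ids = [row["Id"] for row in active if int(row.get("Time") or 0) >= LONG_TASK_SECONDS]
    return LoadReport(
        active_sessions=len(active),
        long_running_ids=long_ids,
        cpu_high=cpu_avg > CPU_AVG_THRESHOLD,
        disk_saturated=disk_util >= DISK_UTIL_THRESHOLD,
    )


if __name__ == "__main__":
    rows = [
        {"Id": 1, "Command": "Query", "Time": 3500},
        {"Id": 2, "Command": "Query", "Time": 10},
        {"Id": 3, "Command": "Sleep", "Time": 7200},
    ]
    report = assess_load(rows, cpu_avg=93.5, disk_util=98.0)
    print(report.long_running_ids, report.overloaded)  # [1] True
```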
Solution
- Reduce the concurrency level in the scheduling system.
- Reorder long and short jobs so they are interleaved, preventing many long-running jobs from executing at the same time.
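Both adjustments can be sketched in Python. This assumes each job has an estimated runtime and that the scheduler exposes a concurrency limit, represented here by the max_workers setting of a ThreadPoolExecutor; the job names and the runner callable are hypothetical. The interleave function alternates the longest and shortest remaining jobs, a simple heuristic that spreads long jobs across the queue rather than an optimal schedule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Job = Tuple[str, int]  # (hypothetical job name, estimated runtime in minutes)


def interleave(jobs: List[Job]) -> List[Job]:
    """Alternate the longest and shortest remaining jobs to spread out long ones."""
    ordered = sorted(jobs, key=lambda job: job[1])
    result: List[Job] = []
    lo, hi = 0, len(ordered) - 1
    take_long = True
    while lo <= hi:
        if take_long:
            result.append(ordered[hi])
            hi -= 1
        else:
            result.append(ordered[lo])
            lo += 1
        take_long = not take_long
    return result


def run_jobs(jobs: List[Job], runner: Callable[[Job], object], max_workers: int = 4) -> list:
    """Run jobs in interleaved order with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in submission order.
        return list(pool.map(runner, interleave(jobs)))


if __name__ == "__main__":
    jobs = [("a", 60), ("b", 5), ("c", 45), ("d", 10), ("e", 1)]
    print([name for name, _ in interleave(jobs)])  # ['a', 'e', 'c', 'b', 'd']
```

Lowering max_workers plays the role of reducing the concurrency level; the interleaved order addresses the job mix.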