Emergency Handling for GBase Database Failures (3) - Database Service Anomalies & Data Loss

Cong Li - Jul 12 - Dev Community

1. Database Service Anomalies

1.1 GBase Cluster Service Process Crash

Description

The cluster node services gclusterd, gbased, gcware, gcrecover, and gc_sync_server crash unexpectedly.

Analysis

A crash of any of the five processes (gclusterd, gbased, gcware, gcrecover, gc_sync_server) usually indicates a GBase bug triggered by a specific SQL query or scenario.
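Before escalating, it helps to confirm exactly which of the five processes are down. Below is a minimal watchdog sketch in Python, assuming a standard Linux `ps`; the process names come from the description above, and the alert output is a placeholder for whatever monitoring hook you actually use.

```python
# Minimal watchdog sketch; process names are from the GBase cluster
# description above, the alerting below is a placeholder.
import subprocess

GBASE_PROCESSES = ["gclusterd", "gbased", "gcware", "gcrecover", "gc_sync_server"]

def check_gbase_processes():
    """Return the expected GBase processes that are not currently running."""
    ps_output = subprocess.run(
        ["ps", "-eo", "comm"], capture_output=True, text=True, check=True
    ).stdout
    running = {line.strip() for line in ps_output.splitlines()}
    return [name for name in GBASE_PROCESSES if name not in running]

if __name__ == "__main__":
    missing = check_gbase_processes()
    if missing:
        print(f"ALERT: GBase processes down: {', '.join(missing)}")
    else:
        print("All GBase cluster processes are running.")
```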

Emergency Handling Procedure

This issue is typically caused by a GBase bug triggered by a particular SQL query or scenario. Application assistance is needed to diagnose the root cause.

  1. Notify the open platform and GBase vendor to assist in diagnosing the issue.
  2. The operations team analyzes the abnormal SQL running in the system (see the sketch after this list).
  3. The operations team stops the problematic SQL.
  4. The GBase vendor analyzes the issue scenario and provides a short-term solution and a timeline for a permanent fix.
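Steps 2 and 3 amount to finding long-running or abnormal statements and terminating them. A minimal sketch follows; it assumes GBase 8a's MySQL-compatible protocol (hence `pymysql`), an assumed default cluster port of 5258, and MySQL-style `information_schema.PROCESSLIST` / `KILL` semantics. Verify all of these against your installation before use.

```python
# Sketch for steps 2-3: list long-running SQL, then stop confirmed offenders.
# Connection details and the PROCESSLIST/KILL behavior are assumptions based
# on GBase 8a's MySQL compatibility; check them against your version.
import pymysql

LONG_RUNNING_SECONDS = 600  # threshold for "abnormal" SQL; tune per workload

conn = pymysql.connect(host="gcluster-vip", user="ops", password="***", port=5258)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ID, USER, TIME, INFO FROM information_schema.PROCESSLIST "
            "WHERE COMMAND = 'Query' AND TIME > %s ORDER BY TIME DESC",
            (LONG_RUNNING_SECONDS,),
        )
        for thread_id, user, seconds, sql_text in cur.fetchall():
            print(f"[{seconds}s] id={thread_id} user={user}: {sql_text[:120]}")
            # Only kill after the operations team confirms the statement:
            # cur.execute("KILL %s", (thread_id,))
finally:
    conn.close()
```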

1.2 GBase Cluster Services Unable to Start

Description

The cluster node services gclusterd, gbased, gcware, gcrecover, and gc_sync_server are unable to start.

Analysis

The inability to start these services usually indicates a GBase cluster product bug.

Emergency Handling Procedure

This issue is typically due to a GBase cluster product bug.

  1. The operations team notifies the open platform and GBase vendor to assist in diagnosing the issue.
  2. The operations team and GBase vendor analyze the running logs and the operational scenario (a log-collection sketch follows this list).
  3. The GBase vendor analyzes the issue scenario and provides a short-term solution and a timeline for a permanent fix.
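To speed up step 2, the tail of each service log can be bundled into a single file to hand to the vendor. The sketch below is built on heavy assumptions: the log paths follow a typical `/opt/gcluster` and `/opt/gbase` layout, which varies by installation, so substitute the real paths from your environment.

```python
# Sketch for step 2: collect the tail of each service log for vendor diagnosis.
# All log locations below are assumptions; replace them with the actual
# paths from your GBase installation.
import pathlib

LOG_FILES = [
    "/opt/gcluster/log/gcluster/system.log",  # assumed gclusterd log
    "/opt/gbase/log/gbase/system.log",        # assumed gbased log
    "/opt/gcware/log/gcware.log",             # assumed gcware log
]

def tail(path, lines=200):
    """Return the last `lines` lines of a log file, if it exists."""
    p = pathlib.Path(path)
    if not p.exists():
        return f"--- {path}: not found ---"
    content = p.read_text(errors="replace").splitlines()
    return "\n".join([f"--- {path} (last {lines} lines) ---"] + content[-lines:])

with open("gbase_startup_diag.txt", "w") as out:
    for log in LOG_FILES:
        out.write(tail(log) + "\n")
```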

2. Data Loss

2.1 Cluster Data Loss Due to Multiple Node Failures

Description

Multiple node failures in the cluster lead to data loss.

Analysis

In extreme cases, multiple node failures in the GBase database can result in irrecoverable data loss.

Emergency Handling Procedure

Recover the data from the most recent backup.

  1. Notify the open platform and GBase vendor to assist in diagnosing the issue.
  2. The operations team stops running tasks. (10 minutes)
  3. The GBase vendor stops the database services.
  4. The GBase vendor restores the most recent backup data from the backup media. (Duration depends on data volume; typically 12 to 24 hours.)
  5. The GBase vendor starts the services and verifies cluster data consistency; see the sketch after this list. (30 minutes)
  6. The operations team restores services and notifies the application team to restart tasks.
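Step 5's consistency check can be as coarse as comparing per-table row counts against counts recorded when the backup was taken. A sketch follows; the baseline file `rowcounts_at_backup.json` is hypothetical, and the connection details again assume GBase's MySQL-compatible protocol and an assumed default port.

```python
# Sketch for step 5: coarse post-restore consistency check via row counts.
# The baseline file and its format are hypothetical; capture it yourself
# at backup time. Connection details are assumptions.
import json
import pymysql

conn = pymysql.connect(host="gcluster-vip", user="ops", password="***", port=5258)
with open("rowcounts_at_backup.json") as f:  # hypothetical pre-backup baseline
    baseline = json.load(f)                  # {"db.table": row_count, ...}

with conn.cursor() as cur:
    for qualified_name, expected in baseline.items():
        db, table = qualified_name.split(".")
        cur.execute(f"SELECT COUNT(*) FROM `{db}`.`{table}`")
        actual = cur.fetchone()[0]
        status = "OK" if actual == expected else "MISMATCH"
        print(f"{status} {qualified_name}: expected={expected} actual={actual}")
conn.close()
```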