The GBase 8c distributed database architecture fully leverages the computational resources of each node, and its overall performance scales linearly with the number of nodes. To maximize performance and resource utilization in a distributed architecture, GBase 8c offers three distributed execution plans: FQS (Fast Query Shipping), Stream, and Remote-Query. Among these, FQS and Stream plans can be pushed down, meaning all Data Nodes (DNs) in the cluster participate in SQL execution.
Both execution plans fully utilize node resources. The difference is that the FQS plan sends the original statement directly from the Coordinator Node (CN) to all or part of the DNs, which execute the statement independently without data exchange. In contrast, the Stream plan generates an execution plan on the CN, which is then sent to the DNs. The DNs use Stream operators to exchange data during execution.
The Remote-Query plan is a compromise. When the first two plans cannot be generated, the CN creates a plan, sends part of the original statement to the DNs, which execute it independently, and then sends the results back to the CN to execute the remaining plan. This plan generally has poorer performance and is used less frequently.
Distributed databases often mean diverse application scenarios, massive data scales (PB-level), complex job logic, and long execution times, especially in the financial industry. The large data volumes and complex operations determine the extensive use of the Stream plan in production environments.
GBase 8c's Stream execution plan includes three Stream operators: gather, redistribute, and broadcast. The proper use of Stream operators makes large-scale data processing possible in a distributed architecture. However, these solutions often introduce new problems, making the optimization of Stream operators a crucial part of SQL optimization in GBase 8c. To optimize SQL performance in actual development, we often use the EXPLAIN
command to analyze execution plans, striving to use the streaming execution plan in a distributed environment to enhance the utilization of distributed computational resources.
Below, we analyze two specific cases of subplan optimization in distributed execution plans.
Case 1
Original SQL
Execution Plan Analysis, Using EXPLAIN ANALYZE
The current SQL execution time is approximately 28 seconds. As highlighted in the diagram, the three subqueries in the SQL each use the pgxc execution plan, failing to effectively utilize the distributed streaming execution plan of GBase 8c, resulting in low efficiency. The three subplans correspond to the above execution plan, leading to inefficient execution. The key issue is eliminating the pgxc execution plan of the subplans. By modifying the SQL to use temporary tables and addressing the issue of the wm_contact system function not being pushable, we rewrite the SQL to use temporary tables as follows:
New Execution Plan Analysis
After applying EXPLAIN ANALYZE
again, we see a changed execution plan:
The new execution plan eliminates the pgxc execution plans of the three subqueries and adopts the streaming execution plan. The execution time drops to approximately 600ms, significantly improving efficiency and demonstrating the advantages of distributed systems.
Case 2
Business development feedback indicated severe performance degradation compared to Oracle when executing the following SQL to update a large table. The execution time was on the order of hours:
UPDATE zh_dhcplat.sp_org SET Subtype = NULL
UPDATE zh_dhcplat.sp_org SET Subtype = (SELECT qylb FROM zh_cyzt.tpp_cyqyxx b WHERE a.obj_id = b.id) WHERE EXISTS (SELECT 1 FROM zh_cyzt.tpp_cyqyxx b WHERE a.obj_id = b.id)
During SQL2 execution, approximately 68,512 records were updated, but it took 3,686 seconds, roughly an hour, indicating very slow update speed. The environment is a three-shard GBase 8c distributed database cluster.
Execution Plan Analysis
Using EXPLAIN ANALYZE
, we found the issue mainly caused by the subquery. After rewriting the SQL to use an UPDATE FROM
approach, the subplan was eliminated, and the new execution plan showed that the update completed in less than 3 seconds:
Conclusion
The above cases illustrate the significant impact of execution plan changes on SQL execution performance in a distributed environment. These cases highlight inefficiencies caused by subqueries in GBase 8c's distributed environment and how rewriting SQL to eliminate subplans or pgxc execution plans can leverage the distributed streaming plan, enhancing SQL execution efficiency through multi-node concurrency. In GBase 8c development, it's crucial to use the EXPLAIN
tool to analyze SQL execution plans to develop high-efficiency SQL.