The GBase database (GBase 数据库) is suitable for PB-level big data and TB-level small data scenarios, including massive data applications in industries such as telecom., finance, transportation, energy, etc.
Overview
This article introduces the loading module of the GBase 8a cluster from the perspectives of principles and architecture. It provides a performance tuning guide and an approach for loading data from Kafka sources, serving as a reference for users to understand and use the GBase 8a loading function.
Key Features
- Supports various data sources including local files, ftp/sftp, http/https, hdfs, kafka, and S3.
- Compatible with plaintext and avro/orc (not yet released) file formats.
- Supports compressed formats such as gzip, snappy, and lzo.
- Offers standard text, fixed-length, and flexible loading modes.
- Supports kerberos authentication and multi-namenode high availability.
- Enables parallel loading for both single and multiple tables, with multiple data sources.
- Supports multi-point transmission, multi-node parallel data parsing, and single-node multithreading.
Architecture and Principles
The architecture of GBase 8a loading functionality is as follows:
1.) Issuing Data Load SQL: The GCluster receives the SQL task.
2.1) URL Parsing: The GCluster parses the URL to generate a specific list of source data files.
2.2) Data Distribution by Node: The GCluster evenly distributes source data files based on the number of GNode loading nodes. For example, if there is a 15GB data file to be split among 3 loading nodes, the GCluster will divide the data, assigning a 5GB loading task to each node. Once the data is divided, the GCluster issues the loading SQL task to the GNode loading nodes.
3) Data Retrieval by GNode: After receiving the loading SQL task, each GNode retrieves the specified data from the file server.
3.1) Data Parsing by GNode: The GNode parses the loaded data it retrieves.
3.2.1) Data Distribution to Shards: Valid data is allocated according to the table and hash column, and directed to the corresponding table shard’s Data Center (DC).
3.2.2) Compressed File Transfer to Primary Shard: The GNode forwards the compressed DC files to the corresponding primary shard node.
3.2.3) Real-time Forwarding to Replica Nodes: The primary shard GNode node assembles the received DC files and forwards them in real time to the replica nodes.
Loading Performance Tuning
With multi-point transmission, each node in the GBase 8a cluster participates in data parsing and loading. Performance scales linearly with the number of nodes, achieving up to 33TB/hour on an 8-server setup through tuning.
Key Performance Parameters
-
gcluster_loader_max_data_processors
Description: Number of parallel loading machines per task
Default: 16
Recommended: 4–8, especially in high-concurrency or multi-node setups.
-
gbase_loader_parallel_degree
Description: Parallelism of a single loading task on data nodes
Default: Half the CPU cores (or the thread pool size if set higher).
Recommended: 4–6, can be set via SET
or PARALLEL
in the load statement.
-
_gbase_dc_sync_size
Description: Disk space allocation for data files
Default: 1TB
Recommended: Lower to 10MB if data node disk IO is busy, performance is low, or CPU utilization is minimal.
-
gbase_parallel_max_thread_in_pool
Description: Total threads in the pool for each gnode
Default: Twice the number of CPU cores
Recommended: 4–8 times the CPU cores per gnode.
-
gcluster_enable_serial_load
&gcluster_serial_exec_query
Description: Limits concurrent tasks; controls the SQL load to gnode based on the gcluster_serial_exec_query
value.
-
gbase_loader_max_line_length
Description: Max size for long lines
Default: 4MB
Recommended: Increase if data files contain lines longer than 4MB; lines exceeding this limit will be logged as errors.
-
gbase_loader_read_timeout
Description: Timeout for reading files (ftp/http/sftp)
Default: 300 seconds, 0 for unlimited.
Recommended: Increase if network or IO issues occur under heavy load.
Data Source Concurrency Evaluation
The maximum concurrent loading tasks (N
) and max_data_processors
(M
) should guide data source concurrency; recommended data source concurrency should be at least M * N
.
Chunked Loading
Chunked Loading Principle
The gcluster
handles file splitting using URL-represented logical chunks, which are then loaded in parallel by multiple data nodes. The file server does not physically split the files.
Data loading in GBase 8a is processed row by row. Each data node loads the data from its assigned logical chunk, excluding rows that belong to the previous node (up to the first newline character) and reading complete rows until the first newline character of the next node.
Data Load Size
The amount of data that can be read and processed in a single load is controlled by the following parameters:
-
gbase_loader_buffer_count
- Description: Number of memory blocks used during loading. Each data node maintains a linked list of memory blocks.
- Default Value: 16
-
gbase_loader_line_length
- Description: Size of each memory block. The data node reads 4M of data from the source in one go and places it in the memory block list for subsequent processing.
- Default Value: 4M
Use Cases for Chunked Loading
The chunked loading configuration can be adjusted based on different scenarios:
For scenarios with fewer loading nodes but a large amount of data,
gcluster
splits the total data volume among the loading nodes, significantly improving loading performance.-
For scenarios with many loading nodes but a small amount of data, avoid small chunk sizes to maintain performance. Use the following parameter for optimization:
- gcluster_loader_min_chunk_size
-
Description: Sets the minimum chunk size for large files, ranging from 4M to 128M, similar to the
MIN_CHUNK_SIZE
SQL parameter. - Default Value: 64M, meaning files smaller than 128M will not be split.
Scenarios Where Files Are Not Split
In the following cases, data files will not be split during loading:
- The
NOSPLIT
parameter is used in the load statement to disable parallel chunked loading. - The load statement includes
having lines separator
orformat 5
. - The file being loaded is a
.gz
or.snappy
compressed file; it's recommended to package smaller files.
When chunked loading is disabled, each data node processes one file, as shown in the following diagram:
Load Balancing Configuration
GBase 8a supports load balancing by automatically selecting the optimal loading nodes based on the number of tasks. The following parameters are available for configuring load balancing to improve performance:
-
gcluster_loader_max_data_processors
-
Description: Number of loading nodes in the cluster, similar to the
MAX_DATA_PROCESSORS
SQL parameter. - Range: 1-300
- Default Value: 16
- Recommended Value: 4-6
-
Description: Number of loading nodes in the cluster, similar to the
-
gcluster_load_rebalance_seed
- Description: Controls load balancing for loading tasks, allowing fine-tuning of node selection.
- Default Value: 5
- Recommended Value: Randomly selects the data nodes with the fewest tasks to serve as loading nodes. In high-concurrency scenarios, this value can be increased to add more data nodes for loading.
Example: In a 12-node cluster with MAX_DATA_PROCESSORS=4
, a loading task randomly selects up to four data nodes. If gcluster_load_rebalance_seed
is set to 5, and the minimum task count across nodes is 20, nodes with 25 or fewer tasks are eligible for selection. If fewer than four nodes meet this requirement, only those that do are used. Increase rebalance_seed
as needed but generally keep it under 10.
Monitoring Load Status
GBase 8a provides system tables and logs to monitor load status and results, allowing for progress tracking, result queries, and tracing erroneous data.
Viewing Load Status Information
After starting a load task, you can check its status via SQL. The information_schema.load_status
table records the status of all running load tasks.
Load Results
The results of load tasks are stored in the load_result.log
file, and can be viewed through the information_schema.load_result
system view.
- LOAD_RESULT Table: Records load information for the current coordinator node.
- CLUSTER_LOAD_RESULT Table: Records load information across all coordinator nodes.
Note: Queries on information_schema
tables consume memory. The load_result
table is session-specific, so each concurrent query loads its own copy into session memory. Manage information_schema
queries and clear outdated logs as needed.
Load Interface Development
GBase 8a supports data loading through the gccli
tool and JDBC/.NET interfaces. Available development interfaces include:
-
JDBC Interface
-
executeUpdate(loadSql)
– Executes the load statement. -
stmt.getSkippedLines()
– Retrieves the number of skipped error rows. -
stmt.getLoadTaskID()
– Retrieves the load task ID.
-
-
ADO.NET Interface
-
cmd.ExecuteNonQuery()
– Executes the load SQL. -
cmd.RecordsSkipped
– Retrieves the number of skipped error rows. -
cmd.RecordsTaskID
– Retrieves the load task ID.
-
-
ODBC Interface
-
SQLROWCOUNT(hstmt, &rowCount)
– Retrieves the number of successfully loaded rows. -
SQLGetStmtAttr()
forSQL_RES_GBASE_LOAD_TASK_ID
– Retrieves the task ID. -
SQL_RES_LOAD_TASK_SKIPPED_LINES
– Retrieves the number of skipped rows.
-
-
Python API Interface
- Provides the
getInfo
interface to retrieve the count of successfully loaded rows, skipped rows, and task ID.
- Provides the
Additional Loading Issues
Data File Read-Write Consistency
When using software like FreeNAS to set up a data file server, it is recommended to use the same file access interface to avoid inconsistencies caused by using both NFS and FTP on the same data file.-
SQL Parameter
[MAX_BAD_RECORDS number]
- Description: Sets the maximum number of error rows allowed.
- Range: 0 to 4294967295, with 0 indicating that any error causes an exit and rollback of the load task.
- Tuning: Each cluster node independently counts loading errors. If any node reaches the limit, all nodes' tasks are terminated. Before committing a clustered load, check if the total error count exceeds the limit; if so, the task is abandoned.
Optimization for Loading Multiple Small Files (Versions 862.33R39, 953+)
Files under 10M are considered small, while files over 1G are considered large. Performance improves as file count increases, particularly with smaller file sizes. The system reuses network connections for the same data source, and for files within the same directory, converts them to wildcard patterns for bulk retrieval.
Kafka Data Integration
GBase 8a supports using a Kafka cluster as a data source for loading and exporting data. Load task details for Kafka data sources can be viewed in the gclusterdb
system database under topicname_dbname_tbname
.
Example SQL for Loading Text Data from Kafka
kafka://broker/topic[?duration=1000][#frombeginning]
load data infile 'kafka://192.168.146.20:9092/vct?duration=1000#frombeginning' into table vc1.testdb.t
Example SQL for Exporting Text Data to Kafka
kafka://broker1/topicname?[brokers=broker2|broker3|...][partition=pt1]
select * from test.lineitem into outfile 'kafka://192.168.8.127:9092/test?brokers=192.168.8.127:9092|192.168.8.127:9093&partition=0' [OPTIONS]
Kafka Broker High Availability
GBase 8a supports configuring Kafka brokers for high availability, including Kerberos authentication. To set this up:
1) Add the following parameter in the gcluster
configuration file:
gcluster_kafka_brokers '192.168.146.20:9092,192.168.146.21:9092'
2) Restart gcluster
services to apply:
gcluster_services all start
Kafka Stream Loading
GBase 8a supports Kafka stream loading through custom scripts:
while true do load_sql done
load data infile ‘kafka://192.168.146.20:9092/topic1?duration=2000#frombeginning’ into table t;
By mastering the data loading architecture and performance tuning techniques, users can fully leverage the power of the GBase database (GBase 数据库) to optimize data operations. We hope this guide provides you with practical insights to enhance your GBase 8a MPP Cluster performance and achieve more efficient data management.