The GBase database (GBase 数据库) is suitable for PB-level big data and TB-level small data scenarios, including massive data applications in industries such as telecom., finance, transportation, energy, etc.

Overview

This article introduces the loading module of the GBase 8a cluster from the perspectives of principles and architecture. It provides a performance tuning guide and an approach for loading data from Kafka sources, serving as a reference for users to understand and use the GBase 8a loading function.

Key Features

Supports various data sources including local files, ftp/sftp, http/https, hdfs, kafka, and S3.
Compatible with plaintext and avro/orc (not yet released) file formats.
Supports compressed formats such as gzip, snappy, and lzo.
Offers standard text, fixed-length, and flexible loading modes.
Supports kerberos authentication and multi-namenode high availability.
Enables parallel loading for both single and multiple tables, with multiple data sources.
Supports multi-point transmission, multi-node parallel data parsing, and single-node multithreading.

Architecture and Principles

The architecture of GBase 8a loading functionality is as follows:

1.) Issuing Data Load SQL: The GCluster receives the SQL task.
2.1) URL Parsing: The GCluster parses the URL to generate a specific list of source data files.
2.2) Data Distribution by Node: The GCluster evenly distributes source data files based on the number of GNode loading nodes. For example, if there is a 15GB data file to be split among 3 loading nodes, the GCluster will divide the data, assigning a 5GB loading task to each node. Once the data is divided, the GCluster issues the loading SQL task to the GNode loading nodes.
3) Data Retrieval by GNode: After receiving the loading SQL task, each GNode retrieves the specified data from the file server.
3.1) Data Parsing by GNode: The GNode parses the loaded data it retrieves.
3.2.1) Data Distribution to Shards: Valid data is allocated according to the table and hash column, and directed to the corresponding table shard’s Data Center (DC).
3.2.2) Compressed File Transfer to Primary Shard: The GNode forwards the compressed DC files to the corresponding primary shard node.
3.2.3) Real-time Forwarding to Replica Nodes: The primary shard GNode node assembles the received DC files and forwards them in real time to the replica nodes.

Loading Performance Tuning

With multi-point transmission, each node in the GBase 8a cluster participates in data parsing and loading. Performance scales linearly with the number of nodes, achieving up to 33TB/hour on an 8-server setup through tuning.

Key Performance Parameters

gcluster_loader_max_data_processors

Description: Number of parallel loading machines per task

Default: 16

Recommended: 4–8, especially in high-concurrency or multi-node setups.

gbase_loader_parallel_degree

Description: Parallelism of a single loading task on data nodes

Default: Half the CPU cores (or the thread pool size if set higher).

Recommended: 4–6, can be set via SET or PARALLEL in the load statement.

_gbase_dc_sync_size

Description: Disk space allocation for data files

Default: 1TB

Recommended: Lower to 10MB if data node disk IO is busy, performance is low, or CPU utilization is minimal.

gbase_parallel_max_thread_in_pool

Description: Total threads in the pool for each gnode

Default: Twice the number of CPU cores

Recommended: 4–8 times the CPU cores per gnode.

gcluster_enable_serial_load & gcluster_serial_exec_query

Description: Limits concurrent tasks; controls the SQL load to gnode based on the gcluster_serial_exec_query value.

gbase_loader_max_line_length

Description: Max size for long lines

Default: 4MB

Recommended: Increase if data files contain lines longer than 4MB; lines exceeding this limit will be logged as errors.

gbase_loader_read_timeout

Description: Timeout for reading files (ftp/http/sftp)

Default: 300 seconds, 0 for unlimited.

Recommended: Increase if network or IO issues occur under heavy load.

Data Source Concurrency Evaluation

The maximum concurrent loading tasks (N) and max_data_processors (M) should guide data source concurrency; recommended data source concurrency should be at least M * N.

Chunked Loading

Chunked Loading Principle

The gcluster handles file splitting using URL-represented logical chunks, which are then loaded in parallel by multiple data nodes. The file server does not physically split the files.

Data loading in GBase 8a is processed row by row. Each data node loads the data from its assigned logical chunk, excluding rows that belong to the previous node (up to the first newline character) and reading complete rows until the first newline character of the next node.

Data Load Size

The amount of data that can be read and processed in a single load is controlled by the following parameters:

gbase_loader_buffer_count
- Description: Number of memory blocks used during loading. Each data node maintains a linked list of memory blocks.
- Default Value: 16
gbase_loader_line_length
- Description: Size of each memory block. The data node reads 4M of data from the source in one go and places it in the memory block list for subsequent processing.
- Default Value: 4M

Use Cases for Chunked Loading

The chunked loading configuration can be adjusted based on different scenarios:

For scenarios with fewer loading nodes but a large amount of data, gcluster splits the total data volume among the loading nodes, significantly improving loading performance.
For scenarios with many loading nodes but a small amount of data, avoid small chunk sizes to maintain performance. Use the following parameter for optimization:
- gcluster_loader_min_chunk_size
- Description: Sets the minimum chunk size for large files, ranging from 4M to 128M, similar to the MIN_CHUNK_SIZE SQL parameter.
- Default Value: 64M, meaning files smaller than 128M will not be split.

Scenarios Where Files Are Not Split

In the following cases, data files will not be split during loading:

The NOSPLIT parameter is used in the load statement to disable parallel chunked loading.
The load statement includes having lines separator or format 5.
The file being loaded is a .gz or .snappy compressed file; it's recommended to package smaller files.

When chunked loading is disabled, each data node processes one file, as shown in the following diagram:

Load Balancing Configuration

GBase 8a supports load balancing by automatically selecting the optimal loading nodes based on the number of tasks. The following parameters are available for configuring load balancing to improve performance:

gcluster_loader_max_data_processors
- Description: Number of loading nodes in the cluster, similar to the MAX_DATA_PROCESSORS SQL parameter.
- Range: 1-300
- Default Value: 16
- Recommended Value: 4-6
gcluster_load_rebalance_seed
- Description: Controls load balancing for loading tasks, allowing fine-tuning of node selection.
- Default Value: 5
- Recommended Value: Randomly selects the data nodes with the fewest tasks to serve as loading nodes. In high-concurrency scenarios, this value can be increased to add more data nodes for loading.

Example: In a 12-node cluster with MAX_DATA_PROCESSORS=4, a loading task randomly selects up to four data nodes. If gcluster_load_rebalance_seed is set to 5, and the minimum task count across nodes is 20, nodes with 25 or fewer tasks are eligible for selection. If fewer than four nodes meet this requirement, only those that do are used. Increase rebalance_seed as needed but generally keep it under 10.

Monitoring Load Status

GBase 8a provides system tables and logs to monitor load status and results, allowing for progress tracking, result queries, and tracing erroneous data.

Viewing Load Status Information
After starting a load task, you can check its status via SQL. The information_schema.load_status table records the status of all running load tasks.

Load Results
The results of load tasks are stored in the load_result.log file, and can be viewed through the information_schema.load_result system view.

LOAD_RESULT Table: Records load information for the current coordinator node.
CLUSTER_LOAD_RESULT Table: Records load information across all coordinator nodes.

Note: Queries on information_schema tables consume memory. The load_result table is session-specific, so each concurrent query loads its own copy into session memory. Manage information_schema queries and clear outdated logs as needed.

Load Interface Development

GBase 8a supports data loading through the gccli tool and JDBC/.NET interfaces. Available development interfaces include:

JDBC Interface
- executeUpdate(loadSql) – Executes the load statement.
- stmt.getSkippedLines() – Retrieves the number of skipped error rows.
- stmt.getLoadTaskID() – Retrieves the load task ID.
ADO.NET Interface
- cmd.ExecuteNonQuery() – Executes the load SQL.
- cmd.RecordsSkipped – Retrieves the number of skipped error rows.
- cmd.RecordsTaskID – Retrieves the load task ID.
ODBC Interface
- SQLROWCOUNT(hstmt, &rowCount) – Retrieves the number of successfully loaded rows.
- SQLGetStmtAttr() for SQL_RES_GBASE_LOAD_TASK_ID – Retrieves the task ID.
- SQL_RES_LOAD_TASK_SKIPPED_LINES – Retrieves the number of skipped rows.
Python API Interface
- Provides the getInfo interface to retrieve the count of successfully loaded rows, skipped rows, and task ID.

Additional Loading Issues

Data File Read-Write Consistency
When using software like FreeNAS to set up a data file server, it is recommended to use the same file access interface to avoid inconsistencies caused by using both NFS and FTP on the same data file.
SQL Parameter [MAX_BAD_RECORDS number]
- Description: Sets the maximum number of error rows allowed.
- Range: 0 to 4294967295, with 0 indicating that any error causes an exit and rollback of the load task.
- Tuning: Each cluster node independently counts loading errors. If any node reaches the limit, all nodes' tasks are terminated. Before committing a clustered load, check if the total error count exceeds the limit; if so, the task is abandoned.
Optimization for Loading Multiple Small Files (Versions 862.33R39, 953+)
Files under 10M are considered small, while files over 1G are considered large. Performance improves as file count increases, particularly with smaller file sizes. The system reuses network connections for the same data source, and for files within the same directory, converts them to wildcard patterns for bulk retrieval.

Kafka Data Integration

GBase 8a supports using a Kafka cluster as a data source for loading and exporting data. Load task details for Kafka data sources can be viewed in the gclusterdb system database under topicname_dbname_tbname.

Example SQL for Loading Text Data from Kafka

kafka://broker/topic[?duration=1000][#frombeginning]
load data infile 'kafka://192.168.146.20:9092/vct?duration=1000#frombeginning' into table vc1.testdb.t

Example SQL for Exporting Text Data to Kafka

kafka://broker1/topicname?[brokers=broker2|broker3|...][partition=pt1]
select * from test.lineitem into outfile 'kafka://192.168.8.127:9092/test?brokers=192.168.8.127:9092|192.168.8.127:9093&partition=0' [OPTIONS]

Kafka Broker High Availability

GBase 8a supports configuring Kafka brokers for high availability, including Kerberos authentication. To set this up:

1) Add the following parameter in the gcluster configuration file:

gcluster_kafka_brokers '192.168.146.20:9092,192.168.146.21:9092'

2) Restart gcluster services to apply:

gcluster_services all start

Kafka Stream Loading

GBase 8a supports Kafka stream loading through custom scripts:

while true   do   load_sql   done
load data infile ‘kafka://192.168.146.20:9092/topic1?duration=2000#frombeginning’ into table t;

By mastering the data loading architecture and performance tuning techniques, users can fully leverage the power of the GBase database (GBase 数据库) to optimize data operations. We hope this guide provides you with practical insights to enhance your GBase 8a MPP Cluster performance and achieve more efficient data management.

GBASE数据库 | Introduction to GBase 8a MPP Cluster Data Loading and Performance Tuning