1. Disk Storage Space Evaluation

The storage space requirements for a GBase cluster are calculated based on the data volume of the business system, the choice of compression algorithm, and the number of cluster replicas. The data volume of a business system usually includes the following aspects:

Historical data volume
Incremental data volume and the size of each increment
Data storage period and total data volume for the entire period
Data growth rate and reserved storage space

Example

Considering the above aspects, assume that the total data volume for the entire period of a certain business system is 30TB. The calculation method for the physical disk capacity of the GBase cluster is as follows:
Minimum Disk Space Requirements (MDSR) = Total Data Volume × Database and Related Workspace Factor × Replica Option Factor × RAID Factor × Operating System and File System Factor × Database Compression Factor.

Specific Parameter Description

Total Data Volume:
(Historical Data + Incremental Data) * (1 + Data Growth Rate)
For example, assume the estimated total data volume over the data lifecycle is 30TB.
Database and Related Workspace Factor:
This considers system buffers, workspace, logs, secondary indexes, temporary tables, etc. The factor varies depending on the application, typically ranging from 1.2 to 2.0. For instance, for 100GB of user data space, 20GB to 100GB of database management and workspace is reserved. Based on engineering experience, this factor is set to 1.5.
Replica Option Factor:
Replication is the basis of the GBase cluster's high-availability mechanism. When replication is used, the GBase cluster automatically maintains multiple copies of each data record on physical disks managed by different nodes. Thus, if a node's disk system (including its RAID protection) fails, client applications can keep working by accessing the replica of the data held on the failed disk. GBase cluster allows up to 2 replicas, meaning there can be 3 copies of the same data in the entire cluster. With 2 replicas, the factor is 3; with 1 replica, it is 2; and without replicas, the factor is 1. Considering system data reliability requirements, a replica factor of 2 (1 replica, that is, 2 copies of the data) is recommended.
RAID Factor:
Based on actual project experience, it is recommended:
1) Use a separate RAID for the operating system, such as two 600GB 10K SAS disks in RAID 1 for the OS installation.
2) For RAID 5 configurations, if the number of disks (n) exceeds 10, use RAID 50, which involves creating two RAID 5 arrays and then combining them into RAID 0.
3) For RAID 5 configurations, set up a hot spare disk with the same specifications as the other disks in the RAID 5 array.
For example, with thirteen 600GB 15K SAS disks, configure two RAID 5 arrays of 6 disks each, combine them into RAID 0, and keep the remaining disk as a separate hot spare. Excluding the OS disks and the hot spare, the RAID factor for a GBase cluster is n/(n-1) for a single RAID 5 array and n/(n-2) for two RAID 5 arrays combined as RAID 50, where n is the total number of disks in the array set. For an n=12 RAID 50 setup, the RAID factor is 12/10.
Operating System and File System Factor:
The Linux operating system requires space for software installation and operation, and the GBase cluster needs additional disk space within the Linux file system to manage user data. Based on GBase's actual usage, this factor ranges from 1.2 to 1.6. For high-performance and security requirements, a factor of 1.6 is recommended; in no scenario should it be set below 1.2.
Database Compression Factor:
GBase cluster offers data compression technology that stores user data in a compressed format, reducing the required physical storage space and decreasing I/O during database operations, thus improving performance. The compression factor (compressed size as a fraction of the original size) typically ranges from 10% to 70%. With the 55 compression algorithm, the compression ratio is between 1:3 and 1:5. The more conservative ratio of 1:3 is chosen here, so the compression factor is 33%.

Thus, the minimum disk space requirement (MDSR) can be summarized as:
Minimum Disk Space Requirements (MDSR)
= Total Data Volume × 1.5 × 2 × 12/10 × 1.2 × 33%
= Total Data Volume × 1.4256.
Combining these calculations, a system with a total data volume of 30TB requires a disk capacity of:
MDSR = 30TB × 1.4256 = 42.768TB.
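
The calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function and parameter names (raid_factor, min_disk_space_tb, and so on) are hypothetical and not part of GBase, and the default values are simply the factors chosen in this example.

```python
from fractions import Fraction

def raid_factor(total_disks: int, raid5_groups: int) -> Fraction:
    """Raw-to-usable capacity ratio: each RAID 5 group gives up one disk to parity."""
    return Fraction(total_disks, total_disks - raid5_groups)

def min_disk_space_tb(total_data_tb: float,
                      workspace: float = 1.5,      # database and related workspace factor
                      replica: int = 2,            # 1 replica, i.e. 2 copies of the data
                      raid: Fraction = Fraction(12, 10),
                      os_fs: float = 1.2,          # operating system and file system factor
                      compression: float = 0.33):  # 1:3 compression ratio
    """Minimum Disk Space Requirement (MDSR) in TB, as the product of all factors."""
    return total_data_tb * workspace * replica * float(raid) * os_fs * compression

# RAID 50 built from two RAID 5 arrays over 12 disks: 12 / (12 - 2)
assert raid_factor(12, 2) == Fraction(12, 10)

mdsr = min_disk_space_tb(30, raid=raid_factor(12, 2))
print(f"MDSR for 30TB of data: {mdsr:.3f} TB")
```

Changing a single argument, for example replica=3 or os_fs=1.6, shows how sensitive the result is to each factor.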

2. Cluster Network Bandwidth Estimation

The GBase cluster requires a high-speed network to ensure overall performance. A 10Gbps network or even a 25Gbps network is recommended.

3. Disk I/O Requirements Evaluation

Disk configuration needs to consider two aspects: ensuring high availability and providing enough I/O performance to meet disk I/O demands. The following example illustrates how to evaluate disk I/O performance requirements:

Consider a telecom operator's marketing analysis system with 30 million users (phone numbers) and 10KB of data per user. If complex ad-hoc queries filter out 90% of the data, the I/O read requirement per query is 30 million × 10KB × 10% = 30GB.

The disk I/O requirements for this marketing analysis system depend on the following aspects:

A. Database concurrency: 20
B. Average data volume accessed per complex ad-hoc query: 30GB
C. Average time taken for each complex ad-hoc query: 180 seconds

I/O throughput calculation: A × B / C = 20 × 30 × 1024 / 180 ≈ 3413MB/s, with B converted from GB to MB. Reserving 30% of system I/O capacity, the disk I/O performance requirement is 3413 / 70% ≈ 4876MB/s, which this example rounds to roughly 4800MB/s.

With a GBase cluster node read/write I/O performance of 200MB/s, a 24-node GBase cluster is required to meet the I/O demands of this marketing analysis system. If the cluster size is set to 12 nodes during initial design, each server must have I/O read/write performance of 400MB/s.

Note: The I/O read/write performance of 200MB/s and 400MB/s refers to random access read/write performance under 20 concurrent operations.
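
As a rough sketch, the throughput and node-count estimate can be reproduced as follows. The function names are hypothetical, and the inputs are the example's own figures (20 concurrent queries, 30GB per query, 180 seconds, 30% reserve).

```python
import math

def required_throughput_mb_s(concurrency, gb_per_query, seconds_per_query, reserve=0.30):
    """Aggregate read throughput (MB/s) with a share of I/O capacity held in reserve."""
    raw_mb_s = concurrency * gb_per_query * 1024 / seconds_per_query
    return raw_mb_s / (1 - reserve)

def nodes_needed(total_mb_s, per_node_mb_s):
    """Smallest node count whose combined throughput covers the requirement."""
    return math.ceil(total_mb_s / per_node_mb_s)

required = required_throughput_mb_s(concurrency=20, gb_per_query=30, seconds_per_query=180)
print(f"required throughput: {required:.0f} MB/s")
print("nodes at 200 MB/s each:", nodes_needed(required, 200))
print(f"per-node MB/s with 12 nodes: {required / 12:.0f}")
```

Without intermediate rounding the requirement comes out slightly above 4800MB/s, so the computed node count at 200MB/s per node rounds up to 25; with the rounded figure of 4800MB/s it is 24, as in the text.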

4. Memory Requirements Evaluation

4.1. Complex Application Memory Configuration Recommendations

The memory required by each operator on a single GBase cluster node (gnode) during database operations depends on the following aspects (assuming a 10-node cluster):

Data volume involved in operations:
For example, consider a join between a 200 million-row table and a 30 million-row table, followed by a group by aggregation on the join results that yields 150 million rows. The operation involves 230 million rows, exceeding 100GB (excluding fields not involved in the operation), and the result set also exceeds 80GB. In a 10-node cluster, each node therefore handles over 8GB of operation data, conservatively estimated at 10GB per node.
Intermediate result set size during SQL execution:
This includes the size of the hash table generated by joining two tables, temporary tables generated during SQL execution, etc. These intermediate result sets are usually not smaller than the original data volume involved in operations. In the above example, the intermediate result set size per gnode is also assumed to be 10GB.
SQL concurrency:
Clients typically require the database to support 5 to 100 concurrent operations.

In summary, the memory requirement for a single cluster node in the above scenario is 10-20GB. For 10 similar SQL scenarios running concurrently, a single gnode requires over 100GB of memory for database operations, with an additional 20GB or more allocated for data caching.

Assuming the operating system's total physical memory utilization should stay at 60-80%, the recommended total physical memory for a single server in the above scenario is (100GB + 20GB) / 0.8 = 150GB or more.
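
The per-node estimate can be written out as a small sketch. The function name and its parameters are hypothetical; the defaults are the figures assumed in this scenario (10GB per SQL, 10 concurrent SQL statements, 20GB of data cache, 80% maximum memory utilization).

```python
def node_memory_gb(per_query_gb=10, concurrency=10, cache_gb=20, max_utilization=0.8):
    """Recommended physical memory (GB) for one gnode server."""
    working_gb = per_query_gb * concurrency          # memory for concurrent operators
    return (working_gb + cache_gb) / max_utilization  # leave headroom for the OS

print(f"recommended memory per node: {node_memory_gb():.0f} GB")
```

Lowering max_utilization to 0.6, the other end of the 60-80% range, raises the recommendation to about 200GB, which matches the upper end of the 128GB to 200GB range of server configurations discussed next.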

Based on project experience, complex applications often involve hash join, group by, order by, and other database operations. For gnode server memory configurations between 128GB and 200GB, the buffer sizes for various GBase cluster operators can be set as follows (considering concurrent scenarios with concurrency between 10 and 20):

gbase_buffer_distgrby=2G
gbase_buffer_hgrby=4G
gbase_buffer_hj=4G
gbase_buffer_sj=2G
gbase_buffer_sort=4G
gbase_buffer_result=2G
gbase_buffer_rowset=2G

In concurrent scenarios, the parameters gbase_parallel_degree and gbase_parallel_max_thread_in_pool for GBase cluster configuration must also be considered.

4.2. Memory Configuration Recommendations for Simple Query Applications

Main application scenarios include telecom industry call record query services.
Memory evaluation for such scenarios mainly depends on the proportion and volume of hot data. The total volume of hot data equals the memory requirement.

For example, in a telecom operator's cloud call record system, data is stored in monthly tables covering 6+1 months of history, and the daily data volume is 600GB. The total data volume for a month is 600GB/day × 30 days/month = 18TB. The total data volume over the entire data lifecycle is 600GB/day × 30 days/month × 7 months = 126TB.

Cloud call detail record (CDR) queries primarily focus on current-month data, so "hot data" in this context means the current month's CDR data. Assuming that the fields queried in a CDR account for only one-third of all fields, the total volume of hot data is approximately 6TB (18TB / 3). Therefore, under ideal conditions, a GBase cluster would require 6TB of memory to cache all hot data.

GBase's columnar storage, with its high compression ratios and intelligent indexing, combined with high-performance disk I/O, makes it possible to meet high-performance query requirements without depending heavily on large amounts of memory. Based on project experience, caching approximately 50% of hot data in memory is sufficient to meet the performance demands of cloud CDR queries. Thus, the overall memory requirement for the GBase cluster is 6TB × 50% = 3TB.

Considering a database server's memory utilization rate of 60% to 80%, it is recommended that the total memory for the GBase cluster be 3TB / 0.8 = 3.75TB. Assuming there are 15 nodes in the cluster, each GBase cluster node should be configured with 3750GB / 15 = 250GB of memory.
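
The chain of estimates in this section can be sketched as follows. The function name and its parameters are hypothetical; the defaults are the assumptions of the CDR example, using decimal units (1TB = 1000GB) as the text does.

```python
def cdr_memory_plan(daily_gb=600, days_per_month=30, months_kept=7,
                    queried_field_fraction=1 / 3, cached_fraction=0.5,
                    max_utilization=0.8, nodes=15):
    """Walk from daily data volume to per-node memory for a hot-data query workload."""
    monthly_tb = daily_gb * days_per_month / 1000    # current month = hot data source
    lifecycle_tb = monthly_tb * months_kept          # total over the data lifecycle
    hot_tb = monthly_tb * queried_field_fraction     # only queried fields count as hot
    cached_tb = hot_tb * cached_fraction             # share of hot data kept in memory
    cluster_tb = cached_tb / max_utilization         # headroom for the OS
    return {
        "monthly_tb": monthly_tb,
        "lifecycle_tb": lifecycle_tb,
        "hot_tb": hot_tb,
        "cluster_memory_tb": cluster_tb,
        "per_node_gb": cluster_tb * 1000 / nodes,
    }

for name, value in cdr_memory_plan().items():
    print(f"{name}: {value:.2f}")
```

The cached_fraction of 50% is the experience-based figure from the text; it is the input most worth revisiting for a workload with a different access pattern.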

GBase 8a Implementation Guide: Resource Assessment