Strategy Recall Platform

Foster Guo · Aug 21 · Dev Community

💼 I’m currently looking for a Data Engineer position that can offer visa sponsorship. If you find this article inspiring and you’re willing to help, please contact me on my LinkedIn profile 🔗Foster Guo, or refer to my online resume Foster Guo - Data Engineer to find my email or phone number.


Apple is my nickname, because my Chinese name is ‘富士’, which can refer to Fuji Apple, Mount Fuji, Fujitsu, FujiFilm, etc.

🍎 General, Elegant, Efficient —— Designed and developed by Apple

Basically, it’s a framework designed to assist content teams in retrieving data independently, without relying on data analysts or engineers.

Train of thought

📌 Apple said, let there be a platform that implements strategy recall, and there was a strategy recall platform.

Origin

Background

Content platforms like TikTok, Twitter, and Facebook often encounter several challenges:

  • Keyword Search: How can you efficiently find specific comments under a particular post using keywords?
  • User-specific Comments: How can you locate comments made by a specific user under a specific post?
  • Complex Data Retrieval: How can you retrieve data using complex strategies without relying on data analysts?

Content operations and content safety teams face these issues on a daily basis.

So, What’s the Solution?

The solution is straightforward:

  1. Freeloader: Specify your needs and ask data engineers to write a SQL query that retrieves the data for you.

No no no no no, no way. I don’t want to become a SQL machine. I want an automated solution where you retrieve the data yourself. If you’re not familiar with SQL, don’t worry: let’s design it together.

So the practical solutions are:

  1. Abstract Features: Abstract features related to different entities.
  2. Combine and Search: Integrate and search these features effectively.

More specifically:

  • Complete Data Access: Ensure you have access to complete content data to avoid missing crucial information.
  • Target Retrieval Capabilities: Implement methods to retrieve specific targets from large datasets, including word matching, image/video/audio similarity matching, and model threshold matching.
  • Entity Combination: Enable the ability to combine different entities (such as user, post, comment, etc.) for more precise searches.
  • Speed: Ensure that the system operates as quickly as possible.

Abstraction

  • Near Real-Time data
    • The volume of data is immense, making it impractical to calculate features on a case-by-case basis.
    • Implement a system where data is added to storage daily or even hourly to keep up with the influx of information.
  • Entities
    • Within the same product or business, different entities can be combined using one or multiple keys. For example, use post_id to link post and comment, or user_id to connect user, post, and comment.
    • Each entity carries different features that can be sourced from a Data Warehouse.
  • Feature → Rule → Strategy
    • Feature = Characteristics of an entity
      • Main Features: Defined by users or sourced from the data warehouse. Examples include title, content, OCR text, ASR text, image, video, etc. These are static and serve as inputs for derived features.
      • Behavior Features: Defined by user interactions and sourced from the data warehouse. Examples include view_count, comment_count, like_count, etc. These are dynamic and are used when constructing rules.
      • Environment Features: Defined by physical devices and sourced from the data warehouse. Examples include location, post_time, device_identity, etc. These are static and are used in rule construction.
      • Derived Features: Defined and calculated by the platform using various methods. Examples include title matches word1, image1 is similar to image2, etc. These are dynamic and used in rules.
    • Rule = Feature + Match + Value
      • Each feature has its own characteristics:
        • ID Feature: post_id, comment_id, user_id, etc.
        • Text Feature: title, content, ASR, OCR, text, user_name, user_bio, etc.
        • Metric Feature: like_cnt, view_cnt, etc.
        • Enumeration Feature: is_promoted, city, risk_level, etc.
        • Array Feature: hit_word_id_list, tag_id_list, etc.
        • Media Feature: image, video, etc.
      • Categorize Feature, assign appropriate Match, collect reusable Value.
      • Examples
        • title matches wordlist 792
        • view_count ≥ 100
        • comment_id in (32174921883, 98327197428, 84921375721)
    • Strategy = {Entity’s [Logic + Rule]} ^ n (see the sketch after this list)
      • {post’s [title matches wordlist 792 AND view_count ≥ 100]}
      • {user’s [post_count > 10 OR comment_count > 1000]} ^ {comment’s [comment_text matches wordlist 659]}
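
To make Feature → Rule → Strategy concrete, here is a minimal sketch of how a rule and an entity filter could be represented and compiled into a SQL condition. The class names, the to_sql helpers, and the UDF naming convention are my own illustrative assumptions, not the platform’s actual schema.

# Hypothetical sketch: represent rules/entity filters and compile them to SQL.
from dataclasses import dataclass, field

@dataclass
class Rule:
    feature: str   # e.g. "title", "view_count", "comment_id"
    match: str     # e.g. "word_match", ">=", "in"
    value: object  # e.g. a word list id, a number, a list of ids

    def to_sql(self) -> str:
        # Derived features (word match) call a UDF; other matches map to plain SQL operators.
        if self.match == "word_match":
            return f"{self.feature}_word_match({self.feature})['{self.value}'] is not null"
        if self.match == "in":
            ids = ", ".join(str(v) for v in self.value)
            return f"{self.feature} in ({ids})"
        return f"{self.feature} {self.match} {self.value}"

@dataclass
class EntityFilter:
    entity: str  # e.g. "post", "comment", "user"
    logic: str   # "AND" / "OR"
    rules: list = field(default_factory=list)

    def to_sql(self) -> str:
        return f" {self.logic} ".join(rule.to_sql() for rule in self.rules)

# {post's [title matches wordlist 792 AND view_count >= 100]}
post_filter = EntityFilter("post", "AND", [
    Rule("title", "word_match", "wordlist_792"),
    Rule("view_count", ">=", 100),
])
print(post_filter.to_sql())
# title_word_match(title)['wordlist_792'] is not null AND view_count >= 100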

📌 Features are the foundational elements that are reusable across different strategies.
Strategies constructed from these features can be applied consistently to various entities, enhancing efficiency and flexibility.

Image description

For example

  • TikTok and Twitter share the entity Post. Post has the feature title, and the rules view_count_is_greater_than_100 and title_matches_hello are built on Post’s features; title_matches_hello forms strategy_3, so strategy_3 can be applied to both TikTok and Twitter.
  • TikTok, Twitter, and YouTube share the entities Post and Comment, whose features (title, view_count, text) are shared too, so the rules (text_not_matches_wordlist_659, title_matches_wordlist_792) are shared, and finally strategy_2 is shared.

Unified Solution

We aim to have an integrated platform that supports strategy-based searches and exhibits the following characteristics:

  • Completeness: The data to be searched must be comprehensive. Any missing data could lead to serious consequences.
  • Generality: Abstracted features and combined strategies must be preserved for quick reuse in other products.
  • Accuracy: Building on completeness, the results retrieved must be precise and error-free.
  • Real-Time: The platform must ensure efficient searches, ideally within half an hour.

In addition, based on the different characteristics of derived features, the platform needs:

  • Wordlist Library: Manage word lists for different topics, including word list types, the word lists themselves, exemption word lists, and matching methods.
  • Material Library: Manage materials for various topics, including defining material types and the materials themselves.
  • Model Library: Manage models for different topics, including specifying the required features for each model and the methods for invoking these models.

This platform can save relevant features based on risk control instructions, establish recall strategies, and perform searches across all business data. We refer to this process as strategy recall and the platform as the strategy recall platform (SRP).

Present

Strategy Recall Platform

SRP contains seven modules: word list management, material management, model management, strategy management, task management, feature management, and entity and product management. It is:

  • General: Reusable word lists, materials, models, features, and strategies across different use cases.
  • Elegant: Aesthetic design, elegant framework, and graceful implementation.
  • Efficient: Efficient strategy construction and effective recall case handling.

Modules

  • Word List Management: Add, update, or delete word lists; define the topic, type, and content of each word list.
  • Material Management: Add, update, or delete materials; define material topics, descriptions, URLs, and other relevant details.
  • Model Management: Managed by data scientists and data engineers, not designed for platform users.
  • Strategy Management: Add, update, or delete strategies; define applicable products, topics, and the strategy itself.
  • Task Management: Add, update, or delete tasks; select a strategy to trigger a task for data retrieval.
  • Feature Management: Add, update, or delete features. Each feature has attributes and can be of type string, date, int, etc. Define applicable match methods, such as word match, =, >=, in, or similar to. Additionally, create composite features based on existing features, e.g., concat(title, content).
  • Entity and Product Management: Used by platform managers to oversee products, entities, and their connections, not intended for normal users.

Terminology

  • Strategy Recall: Constructing strategies based on features to search through all business data.
  • Strategy Recall Platform: The platform that implements strategy-based recall.
  • Word List: Abstracted and standardized general business word lists, which can be categorized as:
    • Simple Words: Match exactly.
    • Text Deduplication: Calculate text similarity using edit distance.
    • Similar Characters: Regex matching for visually or phonetically similar words.
    • Acrostic Poems: Regex matching for acrostic poems.
    • Regex: General regex matching.
  • Material: Abstracted and standardized general business materials, which can be categorized as images, audio, and video.
  • Model: Abstracted and standardized general business model.
  • Combined Strategies: Abstracted and standardized general business strategies that can be applied across different products.
  • Basic Objects: Classified based on different business entities, including posts, comments, and users. For example, shorts can be treated as posts due to the similar business meaning.
  • General Objects: Refers to business entities.
  • Filtered Objects: Further classification of general objects. For example, comments can be divided into comments and replies, users can be divided into post_users, comment_users and reply_users. A strategy can filter multiple filtered objects, such as posts, comments, and replies simultaneously.
  • Recall Objects: The targets of a strategy’s recall, selected from filtered objects. For instance, a strategy may filter posts and comments but recall only comments.
  • Strategy Pattern: Combinations of filtered objects in a strategy, which can be extended. The validity of these combinations is defined by the platform.

User Story

  • User: Content operations teams, analysts, and content safety teams.
  • User Stories:
    • I have multiple word lists for different topics and want to know the number of posts related to each topic:
      • Use word list management to add the relevant word lists.
      • On the strategy management page, select the products, choose post as the filter object, apply the rule that title_description matches word lists, and commit the task.
    • I have a video seed and want to find out how many positive comments are under similar videos:
      • Use word list management to add word lists related to positive topics.
      • Use material management to add the video seed.
      • On the strategy management page, select the products and choose post and comment as filter objects, select comment as the recall object. For post, apply the rule video is similar to material. For comment, apply the rule text matches word lists, and commit the task.
    • I want to analyze the behavior of teenagers on videos with titles containing specific words from a word list:
      • Use word list management to add word lists related to a specific topic.
      • On the strategy management page, select the products and choose post, comment, and comment_user as filter objects, select comment as the recall object. For post, apply the rule title matches word lists. For comment, apply no rules. For user, apply the rule that user is a teenager, and commit the task.

Procedure

Image description

Modules E-R

Image description

Filtered Object E-R

📌 Generally speaking, a strategy pattern is a subgraph of the filtered object E-R.

Image description

Technical Details

📌 This section primarily discusses the backend implementation, especially the data pipelines. Although I have some knowledge of React and can write frontend code, I consider myself a rookie in that area.

Data access permissions are controlled by an RBAC service, so I won’t discuss them either.

Technology Selection

  • Overview
    • Backend Service: Node.js for MongoDB CRUD.
    • Data Service: Python for task aggregation, SQL combination, scheduling task and submitting Spark / Presto jobs.
    • Frontend: React.js + https://semi.design/zh-CN.
  • Business Data Storage
    • Hudi MergeOnRead & Hive: Compatible with SQL, offering comprehensive metadata management and ease of calculating bucket numbers.
    • Parquet: After several benchmarks, Parquet was chosen; its columnar storage is efficient and optimizes data retrieval and storage.
  • Online Database
    • MongoDB: Initially, I didn’t design the data structure optimally, leading to numerous changes. Despite this, MongoDB has proven to be a suitable choice due to its flexibility and performance.
  • Language
    • JavaScript: MongoDB CRUD & frontend code
    • Python3: Used for PySpark code, chosen for its ease of use and integration with big data processing.
    • Java + Rust: Employed for high-performance word matching algorithms, balancing speed and efficiency.

Data Flow

Image description

Limitations

Real-time (T) data only has basic features and doesn’t support overly complex strategies, so we offer users a switch that determines whether a task may retrieve real-time data.

Combined Strategy Framework

📌 The strategy pattern involves combining filter objects. By defining join conditions between entities in the strategy pattern according to the join order, you can create an extensible framework for combined strategies.

----------------------------- Insert SQL Block -------------------------------
insert into table abcdef

select
    yy_.*
from
--------------------------- First Filter SQL Block ---------------------------
    (
        select
            xxx
        from
            xxx
        where
            xxx
    ) xx_
--------------------------- Second Filter SQL Block --------------------------
    join (
        select
            yyy
        from
            yyy
        where
            yyy
    ) yy_ on xx_.xxx = yy_.yyy
--------------------------- Third Filter SQL Block ---------------------------
    join (
        select
            zzz
        from
            zzz
        where
            zzz
    ) zz_ on yy_.yyy = zz_.zzz
// for example
"post+all_comment+all_reply": {
    // post is the first filter sql block, it doesn't need a join condition.
    "all_comment": "all_comment_.post_id = post_.post_id",
    "all_reply": "all_reply_.post_id = all_comment_.post_id and (all_comment_.comment_id = all_reply_.parent_id or all_comment_.comment_id = all_reply_.sibling_id)"
}
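
The join-condition map above is enough to drive SQL assembly. Below is a minimal, hypothetical sketch of how the data service might stitch the per-entity filter SQL blocks together according to a strategy pattern; build_combined_sql, the filter block texts, and the table names are illustrative assumptions, not the real implementation.

# Hypothetical sketch: assemble the combined-strategy SQL from per-entity
# filter blocks plus the join-condition map.
JOIN_CONDITIONS = {
    "post+all_comment+all_reply": {
        # post is the first filter block, so it needs no join condition.
        "all_comment": "all_comment_.post_id = post_.post_id",
        "all_reply": ("all_reply_.post_id = all_comment_.post_id and "
                      "(all_comment_.comment_id = all_reply_.parent_id or "
                      "all_comment_.comment_id = all_reply_.sibling_id)"),
    }
}

def build_combined_sql(pattern: str, filter_blocks: dict, recall_object: str) -> str:
    """filter_blocks maps each filtered object to its inner SELECT text."""
    objects = pattern.split("+")
    parts = [f"({filter_blocks[objects[0]]}) {objects[0]}_"]
    for obj in objects[1:]:
        parts.append(f"join ({filter_blocks[obj]}) {obj}_ on {JOIN_CONDITIONS[pattern][obj]}")
    return f"select {recall_object}_.*\nfrom\n" + "\n".join(parts)

# Usage: filter posts, comments, and replies, but recall only comments.
blocks = {
    "post": "select post_id from post_table where date = '$v_curr_date'",
    "all_comment": "select post_id, comment_id, text from comment_table where date = '$v_curr_date'",
    "all_reply": "select post_id, parent_id, sibling_id from reply_table where date = '$v_curr_date'",
}
print(build_combined_sql("post+all_comment+all_reply", blocks, "all_comment"))
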
// real sql example
// I use dynamic partition here, it's a part of task aggregation.
insert overwrite table xxx partition (partition_id)
select
    post_.post_id as case_id,
    map_concat(post_.hit_text_map) as hit_text_map,
    map_concat(post_.hit_word_map) as hit_word_map,
    map_concat(post_.extra_column_map) as extra_map,
    post_.partition_id
from
    (
        select
            post_.*,
            map(
                'post_title_1', title,
                'post_ocr_1', extra_map['ocr']
            ) as hit_text_map,
            map(
                'post_title_1', hit_word_title_1[partition_id],
                'post_ocr_1', hit_word_ocr_1[partition_id]
            ) as hit_word_map,
            map(
                'post',
                to_json(
                    named_struct(
                        'post_id', post_id,
                        'title', title,
                        'publish_time', publish_time,
                        'user_id', user_id
                    )
                )
            ) as extra_column_map,
            partition_id
        from
            (
                select
                    *,
                    post_title_1_word_match(title) as hit_word_title_1,
                    post_ocr_1_word_match(extra_map['ocr']) as hit_word_ocr_1
                from
                    yyy
                where
                    date = '$v_curr_date'
                    and product = 'tiktok'
            ) post_ lateral view explode (
                filter(
                    array(
                        if(
                            hit_word_title_1['partition_id_1'] is not null
                            and hit_word_ocr_1['partition_id_1'] is not null,
                            'partition_id_1',
                            null
                        ),
                        if(
                            hit_word_title_1['partition_id_2'] is not null
                            and publish_time >= '2024-01-01',
                            'partition_id_2',
                            null
                        ),
                        if(
                            hit_word_ocr_1['partition_id_3'] is not null
                            and comment_count >= 100,
                            'partition_id_3',
                            null
                        )
                    ),
                    x -> x is not null
                )
            ) explode_partition_id as partition_id
    ) post_

Performance Optimization

📌 "There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." —— Donald Knuth

Storage

Data Split

  • Reduce Data Volume: Limit the amount of data involved in querying to improve performance.
  • Visibility-Based Split: Separate visible content from non-visible content, as non-visible content may often be irrelevant.
  • Time-Based Split: Divide data by publish time, focusing on recent data when needed.
  • Trade-Off: While data duplication may occur, it is generally acceptable given the performance benefits (a minimal sketch of this layout follows).
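
As an illustration of these splits (the table names, the status column, and the 7-day window are hypothetical), visibility and publish time can be expressed as partition columns when writing the daily snapshot, so a task only scans the partitions its strategy needs:

# Hypothetical PySpark sketch: write posts partitioned by date and visibility,
# so strategies that only care about recent, visible content scan less data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("srp_data_split").getOrCreate()

posts = spark.table("dwd.post_daily")  # assumed source table

(posts
    .withColumn("is_visible", F.col("status") == "visible")
    .write
    .mode("overwrite")
    .partitionBy("date", "is_visible")
    .saveAsTable("srp.post_split"))

# A task over the last 7 days of visible posts touches only those partitions.
recent_visible = spark.table("srp.post_split").where(
    (F.col("date") >= F.date_sub(F.current_date(), 7)) & F.col("is_visible")
)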

Image description

Bucket

  • Point Query Acceleration: Improve the speed of point queries.
  • Join Shuffle Reduction: Eliminate or significantly reduce shuffle time in join operations.
  • Data Bucketing: Ensures uniform distribution of data, effectively preventing data skew.
  • Additional Benefits: Data within each bucket is ordered and compressed efficiently, saving storage space.
  • Trade-Off: Writing to Hudi tables introduces an extra shuffle layer, but in strategy recall scenarios, read operations are more common.
| Entity | Bucket Key | Why |
| --- | --- | --- |
| Post | post_id | Primary key |
| Comment | post_id | Visible Post + Visible Comment; for Comment + Reply, add post_id as the join key and utilize bucket super-set join |
| User | user_id | Primary key |
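
As a sketch of the bucketing in the table above (table names and the bucket count are my assumptions; the real tables are Hudi MergeOnRead), bucketing both sides of a join on the same key with the same bucket count lets Spark plan the join without a shuffle:

# Hypothetical PySpark sketch: bucket posts and comments by post_id so the
# post-comment join avoids a shuffle and point queries on post_id stay fast.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("srp_bucketing").getOrCreate()

(spark.table("dwd.post_daily")
    .write.mode("overwrite")
    .bucketBy(512, "post_id").sortBy("post_id")
    .saveAsTable("srp.post_bucketed"))

(spark.table("dwd.comment_daily")
    .write.mode("overwrite")
    .bucketBy(512, "post_id").sortBy("post_id")
    .saveAsTable("srp.comment_bucketed"))

# With matching bucket keys and counts, this join can be planned shuffle-free.
joined = spark.sql("""
    select c.comment_id, p.title
    from srp.post_bucketed p
    join srp.comment_bucketed c on p.post_id = c.post_id
""")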

Image description

UDF

Simplify UDF

  • UDF Efficiency: Python UDFs are inherently less efficient than native Spark functions.
  • UDF Simplicity: Keep UDFs simple, focusing only on essential tasks. For example, Matcher should only perform matching.
  • Operator Handling: Implement other operators (AND, OR, NOT, >=, REGEXP, etc.) directly in Spark SQL (see the sketch after this list).
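
A minimal sketch of that split of responsibilities (the word list, UDF name, and table are illustrative assumptions): the UDF only returns matched words, while thresholds and boolean logic stay in Spark SQL, where the optimizer can handle them.

# Hypothetical sketch: the UDF only does word matching; AND/OR/>= stay in Spark SQL.
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("srp_simple_udf").getOrCreate()

WORDLIST_792 = ["hello", "world"]  # illustrative word list

def wordlist_792_match(text):
    # Matching only: no thresholds, no boolean logic inside the UDF.
    if text is None:
        return []
    return [w for w in WORDLIST_792 if w in text]

spark.udf.register("wordlist_792_match", wordlist_792_match, ArrayType(StringType()))

# Operators live in SQL, where Spark can optimize and combine them.
hits = spark.sql("""
    select post_id, title
    from srp.post_split
    where size(wordlist_792_match(title)) > 0
      and view_count >= 100
""")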

Image description

Python

  • Faster packages: orjson, msgspec, rapidfuzz, ahocorasick_rs.

Image description

Replace Python with Java + Rust

  • Java: Avoids data serialization overhead between JVM and Python programs.
  • Rust: Offers performance comparable to C++, enhancing speed and efficiency.
  • Matcher: https://github.com/Lips7/Matcher

Image description

Word List Aggregation

  • Maximize Rust Matcher Use: Leverage the Rust matcher to its fullest extent.
  • Aggregation Process: Combine word lists from various strategies and tasks into a single integrated input for the matcher, then expand the results as needed (a minimal sketch follows).
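
A minimal, hypothetical sketch of that aggregation (the task structure and matching loop are stand-ins for the real Rust matcher): word lists from several tasks are merged into one pattern set, matched in a single pass, and the hits are expanded back per task and word list.

# Hypothetical sketch: merge word lists from many tasks into one matcher pass.
tasks = {
    "task_1": {"wordlist_792": ["hello", "world"]},
    "task_2": {"wordlist_659": ["great", "awesome"]},
}

# 1. Aggregate: one flat pattern set, remembering which task/word list owns each word.
merged = {}
for task_id, wordlists in tasks.items():
    for wl_id, words in wordlists.items():
        for w in words:
            merged.setdefault(w, []).append((task_id, wl_id))

def match_once(text: str) -> dict:
    """Single matching pass over the merged patterns (stand-in for the Rust matcher)."""
    hits = {}
    for word, owners in merged.items():
        if word in text:
            for task_id, wl_id in owners:
                hits.setdefault(task_id, {}).setdefault(wl_id, []).append(word)
    return hits

# 2. Expand: one pass produces per-task, per-word-list results.
print(match_once("hello, what an awesome world"))
# {'task_1': {'wordlist_792': ['hello', 'world']}, 'task_2': {'wordlist_659': ['awesome']}}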

Image description

Spark SQL

  • Project Pushdown: Retrieve only the necessary columns to reduce I/O overhead.
  • Filter Pushdown: Apply filters as early as possible to minimize the amount of data processed.
  • ParquetVectorizedReader: Utilize vectorized technology to read Parquet files more efficiently.
  • MergeGetMultiMapValue: Read multiple values from a map in a single operation to enhance performance.
  • Cost-Based Optimization: Reorder join sequences for better SQL performance (the relevant session settings are sketched below).
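
Several of these are engine features that mainly need to be enabled and then exploited by selecting and filtering early; below is a minimal sketch of the relevant session settings (values and table names are illustrative, and MergeGetMultiMapValue is an internal optimization not shown here).

# Hypothetical sketch of the Spark settings behind these optimizations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("srp_recall")
    # Vectorized Parquet reads.
    .config("spark.sql.parquet.enableVectorizedReader", "true")
    # Cost-based optimization and join reordering.
    .config("spark.sql.cbo.enabled", "true")
    .config("spark.sql.cbo.joinReorder.enabled", "true")
    .getOrCreate())

# Project and filter pushdown come from selecting and filtering as early as possible.
df = (spark.table("srp.post_split")
    .select("post_id", "title", "view_count")             # project pushdown
    .where("date = '2024-01-01' and view_count >= 100"))  # filter pushdown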

Image description

Arrow

  • Serialization/Deserialization Acceleration: Use Arrow to speed up the process of serialization and deserialization between JVM and Python programs.
  • Configuration: Increase spark.sql.execution.arrow.maxRecordsPerBatch to optimize performance by processing larger batches.
  • Trade-Off: If the value is too large, out-of-memory (OOM) errors occur (see the sketch below).
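
A minimal sketch of the Arrow settings (the batch size is illustrative and should be tuned against executor memory):

# Hypothetical sketch: enable Arrow between the JVM and Python with larger batches.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("srp_arrow")
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Larger batches amortize serialization cost; too large risks executor OOM.
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "20000")
    .getOrCreate())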

Image description

Further Optimization

Composite Text Features

  • Feature Abstraction: Allows for the creation of composite features such as title_content = concat(title, content) and concat(asr, ocr). Running a single UDF over the concatenated text is generally faster than running two separate UDFs.
  • Trade-Off: Logical word matching may result in false positives, for example when a logical rule is satisfied only because one word appears in the title and the other appears only in the content.

Image description

Task Aggregation

  • Efficient Processing: Read data once and compute results for multiple different tasks to optimize performance and reduce overall redundancy.
  • Implementation: Create a 5-minute time window during which tasks are grouped by product, strategy pattern, and partition. These are then aggregated into a single Spark task and submitted to YARN (a minimal sketch follows).
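
A minimal, hypothetical sketch of the windowed aggregation (the queue, grouping key, and submit function are illustrative; in the real pipeline each task becomes one dynamic partition, identified by partition_id, inside the aggregated job):

# Hypothetical sketch: drain a 5-minute window of tasks, group them by product
# and strategy pattern, and submit one aggregated Spark job per group.
import time
from collections import defaultdict

WINDOW_SECONDS = 300
pending_tasks = []  # filled elsewhere by the task management module

def submit_spark_job(group_key, tasks):
    # Stand-in for building the aggregated SQL and submitting it to YARN.
    print(f"submitting {len(tasks)} tasks for {group_key}")

def aggregation_loop():
    while True:
        time.sleep(WINDOW_SECONDS)
        window = pending_tasks[:]
        del pending_tasks[:]
        groups = defaultdict(list)
        for task in window:
            groups[(task["product"], task["strategy_pattern"])].append(task)
        for key, tasks in groups.items():
            submit_spark_job(key, tasks)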

Image description

Go further and beyond

  • Java UDF Integration: By implementing Java UDFs, we can leverage faster engines like Presto for executing SQL queries.
  • Pre-Compute Text Transformations: Since text processing accounts for 99% of our tasks, we can pre-compute transformations (such as converting Traditional Chinese to Simplified Chinese, text deletion, and normalization) in advance.
  • Dynamic Time Window: Instead of using a fixed 5-minute window, design a dynamic window that triggers aggregation immediately when the number of tasks reaches a threshold. Otherwise, wait for the 5-minute interval.
  • Persistent Spark Driver: Submit tasks to the same Spark driver rather than starting a new Spark driver for each aggregated task.

Operation

Target

The goal is to improve the retrieval process, focusing on precision rate and recall rate. Our query is defined as a strategy, which primarily consists of word lists. The retrieval process also involves material search (video/image/audio similarity via vector search) and models trained by data scientists. Since the focus here is on word lists, we will only discuss this aspect.

Precision Rate

  • Word List: We provide various methods for word matching. For more details, please refer to https://github.com/Lips7/Matcher.
    • Simple Matching: Supports exact and logical matches. It only matches when the text exactly contains the word. For example, hello&world&hello will match hellohelloworld and worldhellohello, but not helloworld due to the repeated occurrence of hello.
    • Regex Matching:
      • SimilarChar: Matches similar characters using regex. Given the many variations of characters and words, simple word lists cannot cover all possibilities. For example, ["hello,hallo,hollo,hi", "word,world,wrd,🌍", "!,?,~"] will match helloworld!, hollowrd?, and hi🌍~. A simple word list would require writing 4 x 4 x 3 = 48 entries to cover all cases.
      • Acrostic: Matches acrostic poems using regex. Although less common, this method is valid.
      • Regex: Supports general regex matching without additional explanation.
    • Similar Matching:
      • Levenshtein: Matches similar words using Levenshtein edit distance, which accounts for variations in character edits.

Recall Rate

  • Word List: We offer various text transformations while using word list match.
    • Fanjian: Simplify traditional Chinese characters to simplified ones. Example: 蟲艸 -> 虫艹
    • Delete: Remove specific characters. Example: Fu&*iii&^%%*&kkkk -> Fuiiikkkk
    • Normalize: Normalize special characters to identifiable characters. Example: 𝜢𝕰𝕃𝙻𝝧 𝙒ⓞᵣℒ𝒟! -> hello world!
    • PinYin: Convert Chinese characters to Pinyin with spaces for fuzzy matching. Example: 西安 -> xi an, which matches 洗按 -> xi an, but not single characters whose pinyin is xian.
    • PinYinChar: Convert Chinese characters to Pinyin without spaces. Example: 西安 -> xian, which also matches 洗按 and single characters whose pinyin is xian.

Business Intelligence

We provide Business Intelligence (BI) tools to users for detailed analysis of word, material, model, and strategy matches. Key features include:

  • Match Details: View detailed information on how each word, material, model, and strategy performs.
  • Statistics: Access useful statistics such as the recall count of each task or strategy, precision rate based on user marks (if provided), and the distribution across feature buckets, for example view_count < 100, 100 <= view_count < 1000, and view_count >= 1000.
  • Auto Word Recommendation: Receive recommendations for words based on retrieved text.

Summary

In a nutshell, this is a general and highly abstracted framework. Although it was implemented using content data, it is not limited to this domain; it’s applicable to various types of data. In fact, I later applied this framework successfully to a content moderation system to help moderators sample data efficiently. Even if you only have one product, this framework can still be useful.

It empowers users with limited technical skills to retrieve data easily, leveraging their business knowledge. It enriches the variety of strategies available and inspires user creativity. This is the essence of the Strategy Recall Platform.
