Hugging Face Zero GPU Spaces: ShieldGemma Application

Prashant Iyengar | CTO - Sep 9

What is ZeroGPU

  • Hugging Face Spaces now offers a new hardware option called ZeroGPU.
  • ZeroGPU uses Nvidia A100 GPUs under the hood, with 40GB of VRAM available for each workload.
  • This is achieved by making Spaces efficiently hold and release GPUs as needed, as opposed to a classical GPU Space that holds exactly one GPU at any given time.
  • You can explore and use existing public ZeroGPU Spaces for free. The list of public ZeroGPU Spaces can be found at https://huggingface.co/spaces/enzostvs/zero-gpu-spaces

Hosting models on ZeroGPU Spaces comes with the following restrictions:

  • ZeroGPU is currently in beta and only works with the Gradio SDK. It lets us deploy Gradio applications that run on ZeroGPU for free.
  • It is only available to Personal accounts subscribed to Hugging Face Pro. It will appear in the hardware list when you select the Gradio SDK.
  • Personal accounts with a PRO subscription can host at most 10 ZeroGPU Spaces.
  • Though the documentation mentions that ZeroGPU uses an A100, you may observe significantly lower performance than a standard A100, as GPU allocation may be time-sliced.

Spaces and ZeroGPU

To make your Space work with ZeroGPU, you need to decorate the Python functions that require a GPU with @spaces.GPU:

import spaces

@spaces.GPU
def my_inference_function(input_data, output_data, mode, max_length, max_new_tokens, model_size):
    # GPU-backed inference logic goes here
    ...

When a decorated function is invoked, the Space is attributed a GPU, and the GPU is released when the function completes. You can find complete instructions for making your code compatible with ZeroGPU Spaces at the link below.

ZeroGPU Explorers on Hugging Face
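
Putting this together, the usual ZeroGPU pattern is to load the model once at startup and run inference inside the decorated function. Below is a minimal sketch, assuming a Transformers text-generation pipeline; the model name and parameters are illustrative assumptions, not the article's actual worker code.

import spaces
import torch
from transformers import pipeline

# Load the model once at startup; under ZeroGPU you can already place it on
# CUDA here, even though a GPU is only attached while a decorated function runs.
# Illustrative model choice; the article's worker serves ShieldGemma instead.
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    torch_dtype=torch.bfloat16,
    device="cuda",
)

@spaces.GPU
def my_inference_function(prompt, max_new_tokens=128):
    # A GPU is attributed for the duration of this call and released afterwards
    return pipe(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]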

If your Space is running on ZeroGPU, you can see the status on the Space's project page, along with the CPU and RAM consumption.

The software versions compatible with Gradio SDK ZeroGPU Spaces are:

  • Gradio: 4+
  • PyTorch: All versions from 2.0.0 to 2.2.0
  • Python: 3.10.13

Duration

A duration parameter in the spaces.GPU decorator allows us to specify the GPU time a function needs.

  • The default is 60 seconds.
  • If you expect your GPU function to take more than 60 seconds, you need to specify a longer duration.
  • If you know that your function will take far less than the 60-second default, specifying a shorter duration will give visitors to your Space higher priority in the queue.
@spaces.GPU(duration=20)
def generate(prompt):
    return pipe(prompt).images
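Conversely, if your function needs more than the default, you can raise the limit. A sketch, reusing the pipe object from the snippet above:

@spaces.GPU(duration=120)
def generate_long(prompt):
    return pipe(prompt).images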

This sets the maximum duration of the function call to 120 seconds.

Hosting Private ZeroGPU Spaces

  • We will create a Space pi19404/ai-worker, which is a ZeroGPU Space (Space A). Its visibility is private.
  • The Gradio server hosted on Space A provides a ShieldGemma model inference endpoint.

For more details on ShieldGemma, refer to the article below:

LLM Content Safety Evaluation using ShieldGemma

  • We can create a Space pi19404/shieldgemma-demo (Space B) that programmatically calls the application hosted on Space pi19404/ai-worker. The visibility of Space B is public.
  • We configure the Hugging Face token as a secret named API_TOKEN in the project settings of Space pi19404/shieldgemma-demo.
  • We can call the Gradio server API using the Gradio client as described below:
import os
from gradio_client import Client

API_TOKEN = os.getenv("API_TOKEN")

# Initialize the Gradio client
# This connects to the private ZeroGPU Hugging Face Space "pi19404/ai-worker"
client = Client("pi19404/ai-worker", hf_token=API_TOKEN)
# Make a prediction using the client
# The predict method calls the specified API endpoint with the given parameters
result = client.predict(
    # Input parameters for the my_inference_function API
    input_data="Hello!!",     # The input text to be evaluated
    output_data="Hello!!",    # The output text to be evaluated (if applicable)
    mode="scoring",           # The mode of operation: "scoring" or "generative"
    max_length=150,           # Maximum length of the input prompt
    max_new_tokens=1024,      # Maximum number of new tokens to generate
    model_size="2B",          # Size of the model to use: "2B", "9B", or "27B"
    api_name="/my_inference_function"  # The specific API endpoint to call
)
# Print the result of the prediction
print(result)

Explaining Rate Limits for ZeroGPU

The Hugging Face platform rate-limits ZeroGPU Spaces to ensure that a single user does not hog all available GPUs. The limit is controlled by a special token that the Hugging Face Hub infrastructure adds to all incoming requests to Spaces. This token is a request header called X-IP-Token, and its value changes depending on the user who requests the ZeroGPU Space.

With the Python client, you will quickly exhaust your rate limit, as all the requests to the ZeroGPU Space will have the same token. To avoid this, we need to extract each user's token in Space pi19404/shieldgemma-demo before we call Space pi19404/ai-worker programmatically.

When a new user visits the page

  • We use the load event to extract the user’s x-ip-token header when the user visits the page.
import gradio as gr

with gr.Blocks() as demo:
    """
    Main Gradio interface setup. This block sets up the Gradio interface, including:
    - A State component to store the client for the session.
    - A JSON component to display request headers for debugging.
    - Other UI components (not shown in this snippet).
    - A load event that calls set_client_for_session when the interface is loaded.
    """

    gr.Markdown("## LLM Safety Evaluation")
    client = gr.State()
    with gr.Tab("ShieldGemma2"):

        input_text = gr.Textbox(label="Input Text")
        output_text = gr.Textbox(
            label="Response Text",
            lines=5,
            max_lines=10,
            show_copy_button=True,
            elem_classes=["wrap-text"]
        )
        mode_input = gr.Dropdown(choices=["scoring", "generative"], label="Prediction Mode")
        max_length_input = gr.Number(label="Max Length", value=150)
        max_new_tokens_input = gr.Number(label="Max New Tokens", value=1024)
        model_size_input = gr.Dropdown(choices=["2B", "9B", "27B"], label="Model Size")
        response_text = gr.Textbox(
            label="Output Text",
            lines=10,
            max_lines=20,
            show_copy_button=True,
            elem_classes=["wrap-text"]
        )
        text_button = gr.Button("Submit")
        text_button.click(
            fn=my_inference_function,
            inputs=[client, input_text, output_text, mode_input,
                    max_length_input, max_new_tokens_input, model_size_input],
            outputs=response_text,
        )
    demo.load(set_client_for_session, None, client)
demo.launch(share=True)

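Note that the my_inference_function wired to the Submit button above is the demo-side wrapper, not the worker's decorated function. A minimal sketch of what it might look like, assuming the worker Space exposes the /my_inference_function endpoint shown earlier:

def my_inference_function(client, input_data, output_data, mode, max_length, max_new_tokens, model_size):
    # Hypothetical demo-side wrapper; forwards the request to the private
    # ZeroGPU worker Space through the per-user Gradio client
    return client.predict(
        input_data=input_data,
        output_data=output_data,
        mode=mode,
        max_length=max_length,
        max_new_tokens=max_new_tokens,
        model_size=model_size,
        api_name="/my_inference_function"
    )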
  • We create a new Gradio client with this header passed to the headers parameter.
from collections import OrderedDict

# Create an OrderedDict to store clients, limited to 15 entries
client_cache = OrderedDict()
MAX_CACHE_SIZE = 15

# Client and API_TOKEN are defined as in the earlier snippet
default_client = Client("pi19404/ai-worker", hf_token=API_TOKEN)

def get_client_for_ip(ip_address, x_ip_token):
    """
    Retrieve or create a client for the given IP address.

    This function implements a caching mechanism that stores up to
    MAX_CACHE_SIZE clients. If a client for the given token exists in the
    cache, it is returned and moved to the end of the cache (marking it as
    most recently used). If not, a new client is created, added to the cache,
    and the least recently used client is removed if the cache is full.

    Args:
        ip_address (str): The IP address of the client.
        x_ip_token (str): The X-IP-Token header value for the client.

    Returns:
        Client: A Gradio client instance for the given IP address.
    """
    # Fall back to the IP address as a cache key when the header is absent
    if x_ip_token is None:
        x_ip_token = ip_address
    # If neither is available, use the shared default client
    if x_ip_token is None:
        new_client = default_client
    else:
        if x_ip_token in client_cache:
            # Move the accessed item to the end (most recently used)
            client_cache.move_to_end(x_ip_token)
            return client_cache[x_ip_token]
        # Create a new client that forwards the user's X-IP-Token header
        new_client = Client("pi19404/ai-worker", hf_token=API_TOKEN,
                            headers={"X-IP-Token": x_ip_token})
        # Add to cache, removing the oldest entry if necessary
        if len(client_cache) >= MAX_CACHE_SIZE:
            client_cache.popitem(last=False)
        client_cache[x_ip_token] = new_client

    return new_client

  • This ensures all subsequent predictions pass this header to the ZeroGPU Space.
  • The client is saved in a State variable so that it is independent of other users. It is deleted automatically when the user exits the page.
  • We also save the Gradio client in an in-memory cache so that we do not need to create a client again if the user loads the page from the same IP address, as shown in get_client_for_ip above and the sketch below.
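For completeness, here is a minimal sketch of set_client_for_session, which is referenced in the load event above but not shown in the original snippets. It assumes the function extracts the header from the Gradio request and delegates to get_client_for_ip:

import gradio as gr

def set_client_for_session(request: gr.Request):
    # Hypothetical reconstruction; the actual Space code may differ
    # Extract the per-user rate-limit token injected by the Hugging Face Hub
    x_ip_token = request.headers.get("x-ip-token")
    # The client IP serves as a fallback cache key when the header is absent
    ip_address = request.client.host
    return get_client_for_ip(ip_address, x_ip_token)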

You can find the full Gradio client code at:

ShieldGemma Demo - a Hugging Face Space by pi19404

Public Gradio Interface and Code

You can find the link to the Gradio interface at:

ShieldGemma Demo - a Hugging Face Space by pi19404
