Granite 3.0

Granite 3.0 is IBM's open-source family of lightweight generative language models designed for enterprise tasks. It natively supports multilingual text, coding, reasoning, and tool use.

I ran the models to see which tasks they can handle.

Environment Setup

I set up the Granite 3.0 environment in Google Colab and installed the necessary libraries using the following commands:

!pip install torch torchvision torchaudio
!pip install accelerate
!pip install -U transformers

Execution

I tested the performance of both the 2B and 8B models of Granite 3.0.

2B Model

Here is the code I ran for the 2B model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "auto"
model_path = "ibm-granite/granite-3.0-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

chat = [
    { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=100)
output = tokenizer.batch_decode(output)
print(output[0])

Output

<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>1. IBM Research - Austin, Texas<|end_of_text|>

8B Model

The 8B model can be used by replacing 2b with 8b in the model path. Here is a code sample for the 8B model that prints only the generated answer, without the role markers and the echoed user input:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "auto"
model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

chat = [
    { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

input_tokens = tokenizer(chat, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=100)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)

Output

1. IBM Almaden Research Center - San Jose, California
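
The 2B sample above prints the full transcript, role markers included, whereas the 8B sample decodes only the newly generated tokens. When only the decoded string is available, the assistant reply can also be recovered from the markers. The following is a minimal sketch of that idea; the marker strings are copied from the 2B output above, and the helper names split_turns and last_assistant_reply are my own, not part of any library.

```python
import re

# Marker strings copied from the decoded 2B output shown above.
ROLE_PATTERN = re.compile(
    r"<\|start_of_role\|>(?P<role>.*?)<\|end_of_role\|>"
    r"(?P<content>.*?)(?:<\|end_of_text\|>|$)",
    re.DOTALL,
)

def split_turns(decoded: str) -> list[tuple[str, str]]:
    """Return (role, content) pairs found in a decoded chat transcript."""
    return [
        (m.group("role"), m.group("content").strip())
        for m in ROLE_PATTERN.finditer(decoded)
    ]

def last_assistant_reply(decoded: str) -> str:
    """Return the content of the final assistant turn, or an empty string."""
    replies = [content for role, content in split_turns(decoded) if role == "assistant"]
    return replies[-1] if replies else ""

raw = (
    "<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory "
    "located in the United States. You should only output its name and location."
    "<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>1. IBM Research - Austin, Texas<|end_of_text|>"
)
print(last_assistant_reply(raw))
```

Slicing the output tensor by the prompt length, as in the 8B sample, is simpler when the tensors are at hand; the string-based approach is only useful for logs or saved transcripts.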

Function Calling

I explored the Function Calling feature, testing it with a dummy function. Here, get_current_weather is defined to return mock weather data.

Dummy Function

import json

def get_current_weather(location: str) -> dict:
    """
    Retrieves current weather information for the specified location (returns mock data).
    Args:
        location (str): Name of the city to retrieve weather data for.
    Returns:
        dict: Dictionary containing weather information (temperature, description, humidity).
    """
    print(f"Getting current weather for {location}")

    try:
        weather_description = "sample"
        temperature = "20.0"
        humidity = "80.0"

        return {
            "description": weather_description,
            "temperature": temperature,
            "humidity": humidity
        }
    except Exception as e:
        print(f"Error fetching weather data: {e}")
        return {"weather": "NA"}

Prompt Creation

I created a prompt to call the function:

functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and country code, e.g. San Francisco, US",
                }
            },
            "required": ["location"],
        },
    },
]
query = "What's the weather like in Boston?"
payload = {
    "functions_str": [json.dumps(x) for x in functions]
}
chat = [
    {"role":"system","content": f"You are a helpful assistant with access to the following function calls. Your task is to produce a sequence of function calls necessary to generate response to the user utterance. Use the following function calls as required.{payload}"},
    {"role": "user", "content": query }
]

Response Generation

Using the following code, I generated a response:

instruction_1 = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(instruction_1, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=1024)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)

Output

{'name': 'get_current_weather', 'arguments': {'location': 'Boston'}}

This confirmed the model’s ability to generate the correct function call based on the specified city.
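
The output above is plain text in Python dict notation (single quotes), so it is not valid JSON and has to be parsed before the function can actually be called. Below is a minimal sketch of one way to do that, assuming the model returns either a single dict-like string or an ordinary reply. The names parse_function_call, dispatch, and AVAILABLE_FUNCTIONS are hypothetical helpers for illustration, and get_current_weather is a shortened stand-in for the dummy function defined earlier.

```python
import ast
import json

def parse_function_call(text: str):
    """Return the parsed call as a dict, or None if the text is an ordinary reply."""
    text = text.strip()
    # Try strict JSON first, then fall back to Python literal syntax (single quotes).
    for loader in (json.loads, ast.literal_eval):
        try:
            candidate = loader(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(candidate, dict) and "name" in candidate:
            return candidate
    return None

def get_current_weather(location: str) -> dict:
    # Stand-in for the dummy function defined earlier in the article.
    return {"description": "sample", "temperature": "20.0", "humidity": "80.0"}

# Hypothetical registry mapping function names to callables.
AVAILABLE_FUNCTIONS = {"get_current_weather": get_current_weather}

def dispatch(text: str):
    """Run the requested function, or return the text unchanged if it is not a call."""
    call = parse_function_call(text)
    if call is None or call["name"] not in AVAILABLE_FUNCTIONS:
        return text
    arguments = call.get("arguments") or {}
    return AVAILABLE_FUNCTIONS[call["name"]](**arguments)

model_output = "{'name': 'get_current_weather', 'arguments': {'location': 'Boston'}}"
print(dispatch(model_output))
print(dispatch("Hello! How can I help?"))
```

Using ast.literal_eval rather than eval keeps the parsing limited to literals, which matters because the string comes from model output.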

Format Specification for Enhanced Interaction Flow

Granite 3.0 follows output format instructions given in the prompt, which makes it possible to get responses in a structured form. In this section, I use [UTTERANCE] for spoken responses and [THINK] for inner thoughts.

Function calls, by contrast, are emitted as plain text, so a separate mechanism may be needed to distinguish function calls from regular text responses.

Specifying Output Format

Here’s a sample prompt for guiding the AI’s output:

prompt = """You are a conversational AI assistant that deepens interactions by alternating between responses and inner thoughts.
<Constraints>
* Record spoken responses after the [UTTERANCE] tag and inner thoughts after the [THINK] tag.
* Use [UTTERANCE] as a start marker to begin outputting an utterance.
* After [THINK], describe your internal reasoning or strategy for the next response. This may include insights on the user's reaction, adjustments to improve interaction, or further goals to deepen the conversation.
* Important: **Use [UTTERANCE] and [THINK] as a start signal without needing a closing tag.**
</Constraints>

Follow these instructions, alternating between [UTTERANCE] and [THINK] formats for responses.
<output example>
example1:
  [UTTERANCE]Hello! How can I assist you today?[THINK]I’ll start with a neutral tone to understand their needs. Preparing to offer specific suggestions based on their response.[UTTERANCE]Thank you! In that case, I have a few methods I can suggest![THINK]Since I now know what they’re looking for, I'll move on to specific suggestions, maintaining a friendly and approachable tone.
...
</output example>

Please respond to the following user_input.
<user_input>
Hello! What can you do?
</user_input>
"""

Execution Code Example

Here is the code to generate a response:

chat = [
    { "role": "user", "content": prompt },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

input_tokens = tokenizer(chat, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=1024)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)

Example Output

The output is as follows:

[UTTERANCE]Hello! I'm here to provide information, answer questions, and assist with various tasks. I can help with a wide range of topics, from general knowledge to specific queries. How can I assist you today?
[THINK]I've introduced my capabilities and offered assistance, setting the stage for the user to share their needs or ask questions.

The [UTTERANCE] and [THINK] tags were used as instructed, so the response can be split into spoken text and inner thoughts.

Depending on the prompt, closing tags (such as [/UTTERANCE] or [/THINK]) sometimes appear in the output, but the output generally follows the specified format.
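
To make use of the tagged output in an application, the text has to be split by tag. The following is a minimal sketch of such a parser, assuming the only tags are [UTTERANCE] and [THINK] and that stray closing tags should simply be ignored. The name parse_segments is a hypothetical helper, and the sample string is a shortened version of the output above with a closing tag added for illustration.

```python
import re

# An opening tag, then its text, stopping at the next tag (opening or closing) or end of input.
SEGMENT_PATTERN = re.compile(
    r"\[(UTTERANCE|THINK)\](.*?)(?=\[/?(?:UTTERANCE|THINK)\]|\Z)",
    re.DOTALL,
)

def parse_segments(text: str) -> list[tuple[str, str]]:
    """Split model output into (tag, text) pairs, skipping stray closing tags."""
    return [(tag, body.strip()) for tag, body in SEGMENT_PATTERN.findall(text)]

sample = (
    "[UTTERANCE]Hello! How can I assist you today?[/UTTERANCE]\n"
    "[THINK]I've introduced my capabilities and offered assistance."
)
for tag, body in parse_segments(sample):
    print(f"{tag}: {body}")
```

With the segments separated like this, only the UTTERANCE parts need to be shown to the user, while the THINK parts can be logged or fed back into the next turn.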

Streaming Code Example

Let’s also look at how to output streaming responses.

The following code uses the asyncio and threading libraries to asynchronously stream responses from Granite 3.0.

import asyncio
from threading import Thread
from typing import AsyncIterator
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TextIteratorStreamer,
)

device = "auto"
model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()

async def generate(chat) -> AsyncIterator[str]:
    # Apply chat template and tokenize input
    chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    input_tokens = tokenizer(chat, add_special_tokens=False, return_tensors="pt").to(model.device)

    # Set up the streamer
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    generation_kwargs = dict(
        **input_tokens,
        streamer=streamer,
        max_new_tokens=1024,
    )
    # Generate response in a separate thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for output in streamer:
        if not output:
            continue
        await asyncio.sleep(0)
        yield output

# Execute asynchronous generation in the main function
async def main():
    chat = [
        { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
    ]
    generator = generate(chat)
    async for output in generator:  # Use async for to retrieve responses sequentially
        print(output, end="|")

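# Top-level await works in notebooks such as Colab; in a plain script, use asyncio.run(main()) instead.
await main()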

Example Output

Running the above code prints the response chunk by chunk, with | marking the chunk boundaries:

1. |IBM |Almaden |Research |Center |- |San |Jose, |California|

This example demonstrates successful streaming. Each chunk of text is printed as soon as it is generated, so users can watch the response take shape in real time.
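
One caveat: as far as I can tell, iterating over the streamer with a plain for loop blocks while waiting for each chunk, so other coroutines on the same event loop cannot run in the meantime. The sketch below shows one possible way around this by fetching each item in a worker thread with asyncio.to_thread (Python 3.9+). To keep it self-contained, slow_chunks is a hypothetical stand-in for the blocking streamer, and stream_async and collect are names I made up for illustration.

```python
import asyncio
import time
from typing import AsyncIterator, Iterator

_DONE = object()  # sentinel marking the end of the blocking iterator

def slow_chunks() -> Iterator[str]:
    """Hypothetical stand-in for a blocking streamer that yields text chunks."""
    for chunk in ["1. ", "IBM ", "Almaden ", "Research ", "Center"]:
        time.sleep(0.01)  # simulate waiting for the next chunk
        yield chunk

async def stream_async(blocking_iter: Iterator[str]) -> AsyncIterator[str]:
    """Yield items from a blocking iterator without stalling the event loop."""
    iterator = iter(blocking_iter)
    while True:
        # next() runs in a worker thread, so other coroutines keep running while it waits.
        chunk = await asyncio.to_thread(next, iterator, _DONE)
        if chunk is _DONE:
            break
        yield chunk

async def collect() -> list[str]:
    return [chunk async for chunk in stream_async(slow_chunks())]

chunks = asyncio.run(collect())
print("|".join(chunks))
```

In a notebook, where an event loop is already running, replace asyncio.run(collect()) with await collect().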

Summary

Granite 3.0 provides reasonably strong responses even with the 8B model. The Function Calling and Format Specification features also work quite well, indicating its potential for a wide range of applications.
