Granite 3.0
Granite 3.0 is IBM's open-source family of lightweight generative language models aimed at enterprise tasks. It natively supports multiple languages, coding, reasoning, and tool use.
I ran the models to see which tasks they can handle.
Environment Setup
I set up the Granite 3.0 environment in Google Colab and installed the necessary libraries using the following commands:
!pip install torch torchvision torchaudio
!pip install accelerate
!pip install -U transformers
Execution
I tried both the 2B and 8B models of Granite 3.0.
2B Model
Here’s the code I ran for the 2B model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "auto"
model_path = "ibm-granite/granite-3.0-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
chat = [
    { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to("cuda")
output = model.generate(**input_tokens, max_new_tokens=100)
output = tokenizer.batch_decode(output)
print(output[0])
Output
<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory located in the United States. You should only output its name and location.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>1. IBM Research - Austin, Texas<|end_of_text|>
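The raw decoded string contains the role markers and the prompt as well as the answer. As a minimal sketch, the transcript could be split into role/content pairs with a regular expression. The helper name split_roles and the pattern are assumptions for illustration, based only on the marker strings visible in the output above.

```python
import re
from typing import List, Tuple

# Pattern built from the markers visible in the output above (an assumption,
# not an official grammar for the chat template).
ROLE_BLOCK = re.compile(
    r"<\|start_of_role\|>(?P<role>.*?)<\|end_of_role\|>"
    r"(?P<content>.*?)(?:<\|end_of_text\|>|$)",
    re.DOTALL,
)

def split_roles(decoded: str) -> List[Tuple[str, str]]:
    """Return (role, content) pairs found in a decoded transcript."""
    return [(m["role"], m["content"].strip()) for m in ROLE_BLOCK.finditer(decoded)]

decoded = (
    "<|start_of_role|>user<|end_of_role|>Please list one IBM Research laboratory."
    "<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>1. IBM Research - Austin, Texas"
    "<|end_of_text|>"
)
pairs = split_roles(decoded)
print(pairs[-1][1])  # prints: 1. IBM Research - Austin, Texas
```

Slicing off the prompt tokens before decoding avoids the need for this kind of post-processing.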
8B Model
The 8B model can be used by replacing 2b with 8b in the model path. Here’s a code sample for the 8B model that prints only the generated text, without the role markers and the user input:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "auto"
model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
chat = [
    { "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**input_tokens, max_new_tokens=100)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)
Output
1. IBM Almaden Research Center - San Jose, California
Function Calling
I explored the Function Calling feature, testing it with a dummy function. Here, get_current_weather is defined to return mock weather data.
Dummy Function
import json
def get_current_weather(location: str) -> dict:
    """
    Retrieves current weather information for the specified location.
    Args:
        location (str): Name of the city to retrieve weather data for.
    Returns:
        dict: Dictionary containing weather information (temperature, description, humidity).
    """
    print(f"Getting current weather for {location}")
    try:
        # Mock values stand in for a real weather API call
        weather_description = "sample"
        temperature = "20.0"
        humidity = "80.0"
        return {
            "description": weather_description,
            "temperature": temperature,
            "humidity": humidity
        }
    except Exception as e:
        print(f"Error fetching weather data: {e}")
        return {"weather": "NA"}
Prompt Creation
I created a prompt to call the function:
functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and country code, e.g. San Francisco, US",
                }
            },
            "required": ["location"],
        },
    },
]
query = "What's the weather like in Boston?"
payload = {
    "functions_str": [json.dumps(x) for x in functions]
}
chat = [
    {"role":"system","content": f"You are a helpful assistant with access to the following function calls. Your task is to produce a sequence of function calls necessary to generate response to the user utterance. Use the following function calls as required.{payload}"},
    {"role": "user", "content": query }
]
Response Generation
Using the following code, I generated a response:
instruction_1 = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(instruction_1, return_tensors="pt").to("cuda")
output = model.generate(**input_tokens, max_new_tokens=1024)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)
Output
{'name': 'get_current_weather', 'arguments': {'location': 'Boston'}}
This confirmed the model’s ability to generate the correct function call based on the specified city.
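To act on this output, the text still has to be parsed and routed to the real function. Here is a minimal sketch, assuming the model returns exactly the single-quoted dict shown above. The TOOLS registry and the dispatch helper are hypothetical names, and the stand-in get_current_weather only mirrors the dummy function's mock data.

```python
import ast

def get_current_weather(location: str) -> dict:
    # Stand-in mirroring the dummy function: returns fixed mock data.
    return {"description": "sample", "temperature": "20.0", "humidity": "80.0"}

# Hypothetical registry: function names the model may emit -> Python callables.
TOOLS = {"get_current_weather": get_current_weather}

def dispatch(generated_text: str) -> dict:
    # The output uses single quotes, so it is a Python literal rather than
    # strict JSON; ast.literal_eval parses it without executing any code.
    call = ast.literal_eval(generated_text.strip())
    func = TOOLS[call["name"]]  # raises KeyError for an unknown function name
    return func(**call["arguments"])

result = dispatch("{'name': 'get_current_weather', 'arguments': {'location': 'Boston'}}")
print(result)
```

Looking the name up in an explicit registry, rather than calling whatever the model names, keeps the set of callable functions under the application's control.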
Format Specification for Enhanced Interaction Flow
Granite 3.0 can follow an output format specified in the prompt, which makes structured responses possible. This section uses [UTTERANCE] to mark spoken responses and [THINK] to mark inner thoughts.
Function calls, on the other hand, are emitted as plain text, so you may need a separate mechanism to tell function calls apart from regular text responses.
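One possible mechanism is sketched here, under the assumption that a function call always arrives as a single dict with name and arguments keys, as in the function calling output earlier. The helper name parse_function_call is hypothetical. Anything that does not parse into that shape is treated as a regular text response.

```python
import ast
import json
from typing import Optional

def parse_function_call(text: str) -> Optional[dict]:
    """Return the call as a dict if text looks like a function call, else None."""
    candidate = text.strip()
    if not candidate.startswith("{"):
        return None
    # Try strict JSON first, then a Python literal (single-quoted dicts).
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue
        if (
            isinstance(value, dict)
            and isinstance(value.get("name"), str)
            and isinstance(value.get("arguments"), dict)
        ):
            return value
    return None

call = parse_function_call("{'name': 'get_current_weather', 'arguments': {'location': 'Boston'}}")
plain = parse_function_call("It is sunny in Boston today.")
print(call is not None, plain is None)
```

A caller can branch on the result: a dict goes to the function dispatcher, and None means the text is shown to the user as-is.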
Specifying Output Format
Here’s a sample prompt for guiding the AI’s output:
prompt = """You are a conversational AI assistant that deepens interactions by alternating between responses and inner thoughts.
<Constraints>
* Record spoken responses after the [UTTERANCE] tag and inner thoughts after the [THINK] tag.
* Use [UTTERANCE] as a start marker to begin outputting an utterance.
* After [THINK], describe your internal reasoning or strategy for the next response. This may include insights on the user's reaction, adjustments to improve interaction, or further goals to deepen the conversation.
* Important: **Use [UTTERANCE] and [THINK] as a start signal without needing a closing tag.**
</Constraints>
Follow these instructions, alternating between [UTTERANCE] and [THINK] formats for responses.
<output example>
example1:
[UTTERANCE]Hello! How can I assist you today?[THINK]I’ll start with a neutral tone to understand their needs. Preparing to offer specific suggestions based on their response.[UTTERANCE]Thank you! In that case, I have a few methods I can suggest![THINK]Since I now know what they’re looking for, I'll move on to specific suggestions, maintaining a friendly and approachable tone.
...
</output example>
Please respond to the following user_input.
<user_input>
Hello! What can you do?
</user_input>
"""
Execution Code Example
Here’s the code to generate a response:
chat = [
    { "role": "user", "content": prompt },
]
chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(chat, return_tensors="pt").to("cuda")
output = model.generate(**input_tokens, max_new_tokens=1024)
generated_text = tokenizer.decode(output[0][input_tokens["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)
Example Output
The output is as follows:
[UTTERANCE]Hello! I'm here to provide information, answer questions, and assist with various tasks. I can help with a wide range of topics, from general knowledge to specific queries. How can I assist you today?
[THINK]I've introduced my capabilities and offered assistance, setting the stage for the user to share their needs or ask questions.
The [UTTERANCE] and [THINK] tags were used as instructed, so the response came back in the requested format.
Depending on the prompt, closing tags (such as [/UTTERANCE] or [/THINK]) sometimes appear in the output, but in general the model follows the specified format.
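Since closing tags can show up unexpectedly, any code that consumes this format should tolerate them. The following is a minimal sketch of such a parser. The function name parse_segments is hypothetical, and it assumes only the two tags used in this prompt.

```python
import re
from typing import List, Tuple

# Matches [UTTERANCE], [THINK], and their optional closing forms.
TAG = re.compile(r"\[(/?)(UTTERANCE|THINK)\]")

def parse_segments(text: str) -> List[Tuple[str, str]]:
    """Split model output into (tag, body) pairs, ignoring stray closing tags."""
    segments = []
    current = None  # tag of the segment currently open, if any
    pos = 0
    for m in TAG.finditer(text):
        if current is not None:
            body = text[pos:m.start()].strip()
            if body:
                segments.append((current, body))
        # A closing tag ends the segment; an opening tag starts a new one.
        current = None if m.group(1) else m.group(2)
        pos = m.end()
    if current is not None:
        body = text[pos:].strip()
        if body:
            segments.append((current, body))
    return segments

sample = "[UTTERANCE]Hello![/UTTERANCE]\n[THINK]Plan the next step."
segments = parse_segments(sample)
print(segments)
```

Text that appears outside any open tag is dropped in this sketch. Whether that is the right policy depends on the application.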
Streaming Code Example
Let’s also look at how to stream responses.
The following code uses the asyncio and threading libraries to stream responses from Granite 3.0 asynchronously.
import asyncio
from threading import Thread
from typing import AsyncIterator
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TextIteratorStreamer,
)
device = "auto"
model_path = "ibm-granite/granite-3.0-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
async def generate(chat) -> AsyncIterator[str]:
    # Apply chat template and tokenize input
    chat = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    input_tokens = tokenizer(chat, add_special_tokens=False, return_tensors="pt").to("cuda")
    # Set up the streamer
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    generation_kwargs = dict(
        **input_tokens,
        streamer=streamer,
        max_new_tokens=1024,
    )
    # Generate response in a separate thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for output in streamer:
        if not output:
            continue
        await asyncio.sleep(0)
        yield output
# Execute asynchronous generation in the main function
async def main():
    chat = [
        { "role": "user", "content": "Please list one IBM Research laboratory located in the United States. You should only output its name and location." },
    ]
    generator = generate(chat)
    async for output in generator:  # Use async for to retrieve responses sequentially
        print(output, end="|")
# Top-level await works in a notebook such as Colab; in a script, use asyncio.run(main())
await main()
Example Output
Running the above code prints the response chunk by chunk, in the following format:
1. |IBM |Almaden |Research |Center |- |San |Jose, |California|
This example demonstrates successful streaming. Each chunk of text is displayed as soon as it is generated, so users can watch the response take shape in real time.
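One caveat: as far as I understand, iterating a blocking streamer with a plain for loop inside an async function holds the event loop while waiting for each chunk. A possible refinement, sketched here with the standard library only, is to fetch each chunk in a worker thread. The slow_chunks generator is a hypothetical stand-in for the streamer, and aiter_blocking is an assumed helper name.

```python
import asyncio
import time
from typing import AsyncIterator, Iterable, Iterator, List

_DONE = object()  # sentinel returned by next() when the source is exhausted

def slow_chunks() -> Iterator[str]:
    # Hypothetical stand-in for a blocking streamer: waits between chunks.
    for chunk in ["1. ", "IBM ", "Almaden ", "Research ", "Center"]:
        time.sleep(0.01)
        yield chunk

async def aiter_blocking(source: Iterable[str]) -> AsyncIterator[str]:
    """Expose a blocking iterable as an async iterator."""
    it = iter(source)
    loop = asyncio.get_running_loop()
    while True:
        # next() runs in the default thread pool, so the event loop stays free
        # to serve other tasks while the next chunk is being produced.
        chunk = await loop.run_in_executor(None, next, it, _DONE)
        if chunk is _DONE:
            break
        yield chunk

async def collect() -> List[str]:
    return [chunk async for chunk in aiter_blocking(slow_chunks())]

# In a script; inside a notebook, use `chunks = await collect()` instead.
chunks = asyncio.run(collect())
print("|".join(chunks))
```

In the streaming code above, the same wrapper could be applied to the streamer object in place of the plain for loop, though I have not tested that combination here.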
Summary
Granite 3.0 gives reasonably strong responses even at the 8B size. Function calling and format specification also worked well in these tests, which suggests potential for a wide range of applications.