We’re in the final instalment of our five-part tutorial on creating a bookmarker app. In Part 1, we scaffolded a basic Python serverless app with Spin, which is built into a WebAssembly binary. Then in Part 2 and Part 3, we built all of the logic for storing, retrieving, and displaying our bookmarks. In Part 4, we added a summary field. Now we’re going to take things one step further: we’re going to sprinkle some AI goodness on our app.
Let’s use Spin’s LLM (Large Language Model) support to add some generative AI. Instead of using a page’s <title> tag as the source of the summary, we’ll collect a little more of the page’s content and then send it to an LLM to summarize for us.
Here’s the strategy:
- summarize_page() will still call summarize()
- But summarize() will now extract a little more data from an HTML page, and then pass it on to an LLM:
  - We’ll still use the <title>
  - But we’ll also add any content found in an <article> tag
- The summary generated by the LLM will be returned to the add_url() function and stored as the summary in the JSON document
A Note on Running AI Workloads
When using LLMs with Spin, you have three options for configuration:
- You can install an LLM locally and use spin up or spin watch to run against it. This involves some downloading and configuration of an LLM.
- You can use spin deploy to deploy to Fermyon Cloud, using the AI-grade GPUs provided there (for free).
- You can install the Spin Cloud GPU plugin and continue local development, but send just the AI inferences to Fermyon Cloud (again, no charge).
I’ll be using spin deploy since it is by far the easiest.
Granting Permissions to Use LLMs
Once again, Spin’s security model is to restrict by default. So if we want to enable LLM support, we need to add a quick entry to spin.toml to tell it which LLM it is allowed to use:
[component.bookmarker]
source = "app.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://*:*"]
ai_models = ["llama2-chat"] # <-- Added this line
files = ["index.html"]
[component.bookmarker.build]
command = "spin py2wasm app -o app.wasm"
watch = ["app.py", "Pipfile"]
The ai_models directive tells Spin it is allowed to load llama2-chat, which is the model we’ll be working with. As always, when changing spin.toml, it’s best to restart your spin instance to make sure the changes propagate all the way through your app.
The Updated HTML Parser
In the previous part, we created an HTMLTitleParser that grabbed the text inside of the <title> element. Here, we’ll rewrite the logic so that it extracts both the <title> and the <article> content.
Here’s what this new parser looks like:
from html.parser import HTMLParser


class HTMLTitleParser(HTMLParser):
    # Flags that tell handle_data() which element we are currently inside
    track_title = False
    track_article = False
    # Accumulated text for each element
    title_data = ""
    article_data = ""

    def get_content(self):
        # Title on the first line, article text after it
        return f"{self.title_data}\n{self.article_data}"

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "article":
            self.track_article = True
        if tag == "title":
            self.track_title = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "article":
            self.track_article = False
        if tag == "title":
            self.track_title = False

    def handle_data(self, data: str) -> None:
        if self.track_title:
            self.title_data += data
        if self.track_article:
            self.article_data += data
Essentially, we now store the text of the <title> in self.title_data and the text of the <article> in self.article_data. Then we add a new getter method called get_content() that returns a minimally formatted string containing the title and article text. The goal is to get that content into a plain-text form that is easy to pass to the LLM.
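To see exactly what the parser hands to the LLM, you can feed it a small HTML document. The sketch below is a self-contained variant of the parser above with the same tag logic; the only change is that its state is initialized in __init__ instead of as class attributes, which makes the per-instance state explicit. The sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class HTMLTitleParser(HTMLParser):
    """Collects the text inside <title> and <article> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.track_title = False
        self.track_article = False
        self.title_data = ""
        self.article_data = ""

    def get_content(self) -> str:
        # Title on the first line, article text after it
        return f"{self.title_data}\n{self.article_data}"

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.track_article = True
        if tag == "title":
            self.track_title = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.track_article = False
        if tag == "title":
            self.track_title = False

    def handle_data(self, data):
        if self.track_title:
            self.title_data += data
        if self.track_article:
            self.article_data += data


# Made-up sample page: only <title> and <article> text should survive
sample = """<html><head><title>Introducing Spin</title></head>
<body><nav>Home</nav>
<article>
<h1>Spin</h1>
<p>Build WebAssembly apps.</p>
</article>
<footer>Copyright</footer></body></html>"""

parser = HTMLTitleParser()
parser.feed(sample)
print(parser.get_content())
```

Running it prints the title followed by the article text, while the <nav> and <footer> text is dropped. Text from adjacent elements is concatenated as-is, so it is the whitespace between tags that keeps words from running together.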
Now it’s time to rewrite our summarize() function so that it uses an LLM to generate the summary.
Summarizing Text with an AI LLM
In generative AI, an LLM is a tool that takes some text as input and generates an appropriate text response. In this case, we are going to use the LLaMa2-Chat LLM from Meta. This one is built into Fermyon Cloud and well supported in Spin, so it’s an easy default choice.
We are going to perform an inference, which means we send the model some text, ask it to do something with that text, and get a result back.
Although we tend to think of AI as highly complex, using it is really easy. Here’s what our summarize() function looks like when we add an LLM inference:
import json
from html.parser import HTMLParser
from http_router import Router
from jinja2 import Environment, FileSystemLoader, select_autoescape
from spin_http import Response, Request, http_send
from spin_key_value import kv_open_default
from spin_llm import LLMInferencingParams, llm_infer # NEW
from urllib.parse import urlparse, parse_qs
# Omitted a bunch of code
# Summarize an HTML document
def summarize(doc):
    parser = HTMLTitleParser()
    parser.feed(doc)
    text = parser.get_content()
    # Now we have the title and article text. Let's see if the LLM can handle this.
    # (The string continues at column 0 so no indentation leaks into the prompt.)
    prompt = f"""<s><<SYS>>You are an academic text summarizer. Your style is concise and minimal. Succinctly summarize the article.<</SYS>>
[[INST]]{text}[[/INST]]
"""
    # max tokens is raised; the other five values are the defaults
    opts = LLMInferencingParams(50000, 1.1, 64, 0.8, 40, 0.9)
    return llm_infer("llama2-chat", prompt, options=opts).text
The beginning of the summarize() function looks familiar from last time: we create a new HTMLTitleParser and parse our HTML document. This time, though, we retrieve the combined title and article content and store it in text.
Next, we have the prompt, which contains the instructions that we are going to send to the LLM. For LLaMa2, the prompt used here takes this form:
- <s> to start the prompt
- <<SYS>>INSTRUCTIONS<</SYS>> to give the LLM instructions about how to answer
- And [[INST]]TEXT[[/INST]] to send the LLM the text we want it to process.
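If you want to experiment with different instructions, it can help to pull prompt assembly into its own function. The sketch below uses two hypothetical helpers: build_post_prompt() reproduces the layout used in this post, and build_meta_chat_prompt() produces the layout from Meta's reference chat code for Llama 2, where the <<SYS>> block sits inside the first [INST] and [/INST] pair and INST uses single square brackets. Which layout gives better summaries on Fermyon Cloud is something to test rather than assume.

```python
def build_post_prompt(system: str, text: str) -> str:
    """Hypothetical helper reproducing the prompt layout used in summarize()."""
    # <s>, the system instructions in <<SYS>>...<</SYS>>, then the page text
    return f"<s><<SYS>>{system}<</SYS>>\n[[INST]]{text}[[/INST]]\n"


def build_meta_chat_prompt(system: str, text: str) -> str:
    """Meta's reference Llama 2 chat layout, for comparison.

    The system block sits inside the first [INST] ... [/INST] pair,
    and INST uses single square brackets.
    """
    return f"<s>[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{text.strip()} [/INST]"


system = "You are an academic text summarizer. Succinctly summarize the article."
page_text = "Introducing Spin | Fermyon Developer\nIntroducing Spin"
print(build_post_prompt(system, page_text))
print(build_meta_chat_prompt(system, page_text))
```

Keeping the prompt in one helper means switching layouts, or trying a new system message, touches a single line instead of the body of summarize().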
💡 I recently wrote a detailed Dev.to post explaining LLaMa2 prompts.
So in the case above, our prompt has instructed the LLM to act like this:
You are an academic text summarizer. Your style is concise and minimal. Succinctly summarize the article.
Then, inside of [[INST]], we supply the title and article text that we got back from the HTML parser. For example, while testing this, one of my complete prompts looked like this:
<s><<SYS>>You are a succinct text summarizer. In one or two sentences, summarize the main content of an article document.<</SYS>>
[[INST]]Introducing Spin | Fermyon Developer
Introducing Spin
Checklist Sample App
A checklist app that persists data in a key value store
Zola SSG Template
A template for using Zola framework to create a static webpage
AI-assisted News Summarizer
Read an RSS newsfeed and have AI summarize it for you
[[/INST]]
At a glance, you can see how it fetched the <title> (Introducing Spin | Fermyon Developer) and some page content, and combined them into the prompt.
When it comes to running the AI inference, these are the two important lines:
opts = LLMInferencingParams(50000, 1.1, 64, 0.8, 40, 0.9)
return llm_infer("llama2-chat", prompt, options=opts).text
The first line sets up the options for our inference. I needed to do this because the first parameter (max tokens) has to be set fairly high, since we are uploading a potentially large piece of text. The rest of the values are just the defaults and are explained in detail in the documentation.
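Six positional numbers are easy to mix up. One hedged way to keep them readable is to name them in ordinary Python and unpack them when calling LLMInferencingParams. InferenceOptions and as_args() are hypothetical; the field names follow the parameters of Spin's inferencing-params record (max tokens, repeat penalty, repeat penalty last-n token count, temperature, top-k, top-p), and the defaults are simply the values this post passes. Check the Spin SDK documentation before relying on the order.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class InferenceOptions:
    """Hypothetical named wrapper for the six LLMInferencingParams values.

    Field order matches the positional call used in summarize():
    LLMInferencingParams(50000, 1.1, 64, 0.8, 40, 0.9)
    """

    max_tokens: int                              # maximum tokens for the inference
    repeat_penalty: float = 1.1                  # penalty for repeating recent tokens
    repeat_penalty_last_n_token_count: int = 64  # how far back the penalty looks
    temperature: float = 0.8                     # higher values give more varied output
    top_k: int = 40                              # sample from the k most likely tokens
    top_p: float = 0.9                           # nucleus sampling threshold

    def as_args(self) -> tuple:
        # Positional tuple, ready to unpack into LLMInferencingParams(*...)
        return astuple(self)


summary_options = InferenceOptions(max_tokens=50000)
args = summary_options.as_args()
# Inside the Spin component (where spin_llm is importable):
#   opts = LLMInferencingParams(*args)
print(args)
```

Because max_tokens has no default here, every call site has to state it explicitly, which is the one value this post changes.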
The second line just runs the LLM inference using the LLaMa2 chat model, the prompt we created above, and the options. The llm_infer() function returns an object that stores the result of the inference in text, along with some performance information. We only need to return the text.
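Because llm_infer() is only available inside a Spin component, it can be handy to keep the prompt-building logic separate from the call itself so it can be exercised with plain Python. This is a hypothetical sketch: summarize_content(), HasText, FakeResult, and fake_infer() are illustration names, and the only thing it assumes about llm_infer()'s result is the text attribute described above.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class HasText(Protocol):
    """Anything exposing a .text attribute, like the result of llm_infer()."""

    text: str


# (model name, prompt) -> inference result
InferFn = Callable[[str, str], HasText]

SYSTEM = (
    "You are an academic text summarizer. Your style is concise and minimal. "
    "Succinctly summarize the article."
)


def summarize_content(content: str, infer: InferFn, model: str = "llama2-chat") -> str:
    """Build the post's prompt around content and return only the generated text."""
    prompt = f"<s><<SYS>>{SYSTEM}<</SYS>>\n[[INST]]{content}[[/INST]]\n"
    result = infer(model, prompt)
    # Any performance information on the result is ignored; only the text is returned
    return result.text


@dataclass
class FakeResult:
    """Stand-in for an inference result in local tests."""

    text: str


def fake_infer(model: str, prompt: str) -> FakeResult:
    # Pretend inference: report which model was asked and the prompt length
    return FakeResult(text=f"{model} summarized {len(prompt)} characters")


print(summarize_content("Introducing Spin\nBuild WebAssembly apps.", fake_infer))
```

Inside the component, you might call summarize_content(parser.get_content(), lambda model, prompt: llm_infer(model, prompt, options=opts)); in a local test, fake_infer stands in for the real inference.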
Note that this function can take a long time to run. When I tested it locally, it took between 2 and 10 minutes (depending on how much text was on the bookmarked page). That is why, in the next section, I deploy and test on Fermyon Cloud instead of locally, which cuts the time down to seconds.
Testing Out our AI-assisted Bookmarker
That’s all the code we need to change to add AI! Next, we can build and run it. It is possible, as mentioned before, to build and run locally. But on my older M1 MacBook Pro, running these AI inferences sometimes takes 10 minutes or more. So instead, we’ll deploy the app to Fermyon Cloud, where the free tier has a generous allocation of AI-grade GPUs. These same inferences will take only a second or two.
$ spin build
Building component bookmarker with `spin py2wasm app -o app.wasm`
Spin-compatible module built successfully
Finished building all Spin components
$ spin deploy
Uploading bookmarker version 0.1.0 to Fermyon Cloud...
Deploying...
Waiting for application to become ready..................... ready
Available Routes:
bookmarker: https://bookmarker-XXXXXXXX.fermyon.app (wildcard)
The spin deploy command returns a URL that we can now test. After adding a few bookmarks, we see a page that looks like this:
Our LLaMa2-powered text summarizer provides relatively succinct summaries of the pages that we are bookmarking.
If you’d like to try out my version, I have left it running, but I have added some reset logic to protect it from abuse.
Series Conclusion
In this five-part series, we have built a bookmarking tool using Python, Spin, Jinja2, http_router, and Spin’s Key Value Store and Serverless AI. We’ve also installed a Key Value Explorer component and looked at a variety of Python APIs. We went from writing a very simple bookmarking tool to creating an AI-assisted one that can summarize pages for us.
All of this code is available in my GitHub repo under the open source Apache 2 license, so feel free to clone, fork, or otherwise repurpose this code for your own needs.
I hope this gets you in the creative mood to go and create your own Spin-powered Python applications. In fewer than 150 lines of code, we’ve built something fairly sophisticated. I’m sure you have other ideas for things you can create.