Part 4: Adding Summaries to the Python Bookmarker App

Matt Butcher - Jan 4 - Dev Community

We’re up to Part 4 of our 5 part series on building a bookmarker app with Python, WebAssembly, and the open source Spin framework. In previous parts, we combined Spin’s key value storage with Jinja2 templates and a router to build a fully functional app.

Let’s go one more step. Let’s add a summary to our bookmarks. But instead of requiring the end user to generate the summary, let’s do it automatically. We’ll do this in two phases. First, we’ll just parse the HTML and return some text from the bookmarked page. Then in Part 5 we’ll go one more step and use an AI-powered LLM (Large Language Model) to generate a summary for us.

As we’ll see, even though this sounds sophisticated, it’s not terribly complex. By the end, our total code will still be under 150 lines.

Fetching Web Content

Let’s amp up our bookmarking app by fetching a bookmark and saving a content preview. In Part 5 we’ll add LLM support, but to start with, let’s do something a little easier: we’ll fetch the remote page and grab just the title from the HTML document.

To do this, we will change the structure of our KV Store object to look like this:

{
    "title": "SOME TITLE",
    "url": "SOME URL",
    "summary": "THIS IS NEW and will be the summary"
}

So we need to change our add_url() function:

def add_url(request):
    # This gets us the encoded form data
    params = parse_qs(request.body, keep_blank_values=True)
    title = params[b"title"][0].decode()
    url = params[b"url"][0].decode()

    # Open key value storage
    store = kv_open_default()

    # Get the existing bookmarks or initialize an empty bookmark list
    bookmark_entry = store.get("bookmarks") or b"[]"
    bookmarks = json.loads(bookmark_entry)

    # THE NEW PART
    # Generate a page summary
    summary_text = summarize_page(url)

    # Add our new entry.
    bookmarks.append({"title": title, "url": url, "summary": summary_text})
    # THAT'S ALL

    # Store the modified list in key value store
    new_bookmarks = json.dumps(bookmarks)
    store.set("bookmarks", bytes(new_bookmarks, "utf-8"))

    # Direct the client to go back to the index.html
    return Response(303, {"location": "/index.html"})

We only add one line and change one line:

# Generate a page summary
summary_text = summarize_page(url)

# Add our new entry.
bookmarks.append({"title": title, "url": url, "summary": summary_text})
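To see the data handling in isolation, here is a minimal sketch that runs outside Spin. It is not the app's code: `fake_store` (a plain dict standing in for the key value store), `fake_summarize_page()` (a stub for the real summarizer), and `add_bookmark()` are hypothetical names used only for illustration.

```python
import json
from urllib.parse import parse_qs

# Hypothetical stand-in for Spin's key value store: a plain dict of bytes.
fake_store = {}


def fake_summarize_page(url):
    # Stub (assumption): the real summarize_page() fetches the page over HTTP.
    return f"summary of {url}"


def add_bookmark(body):
    # A form-encoded request body arrives as bytes, so parse_qs returns
    # bytes keys and bytes values. That is why the keys are b"title" and b"url".
    params = parse_qs(body, keep_blank_values=True)
    title = params[b"title"][0].decode()
    url = params[b"url"][0].decode()

    # Load the existing list, or start with an empty one.
    bookmarks = json.loads(fake_store.get("bookmarks") or b"[]")

    # Each record now carries three fields instead of two.
    bookmarks.append(
        {"title": title, "url": url, "summary": fake_summarize_page(url)}
    )

    # Serialize back to bytes, as the key value store expects.
    fake_store["bookmarks"] = bytes(json.dumps(bookmarks), "utf-8")
    return bookmarks


result = add_bookmark(b"title=Example&url=https%3A%2F%2Fexample.com%2F")
print(result)
```

The only difference from the earlier record shape is the extra `summary` key; everything else about loading and storing the list is unchanged.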

But now we need to write the summarize_page() function. For this first go-around, what we want to do is:

  • Fetch the URL
  • Parse the returned HTML
  • Get just the title tag’s content

Again, this is just our first pass. We’ll make it better later. But even this requires a couple of functions and a class:

import json
from html.parser import HTMLParser # NEW
from http_router import Router
from jinja2 import Environment, FileSystemLoader, select_autoescape
from spin_http import Response, Request, http_send # NEW
from spin_key_value import kv_open_default
from urllib.parse import urlparse, parse_qs

# Omitted the rest of the code

def summarize_page(url):
    req = Request("GET", url, {}, None)
    res = http_send(req)
    match res.status:
        # This is to support Spin runtimes that don't automatically
        # follow redirects. For Spin itself, it works fine without
        # this case.
        case 301 | 302 | 303 | 307 | 308:
            loc = res.headers["location"]
            print(f"following redirect to {loc}")
            return summarize_page(loc)
        case 200:
            return summarize(res.body.decode("utf-8"))
        case _:
            return "Unable to load preview"

def summarize(doc):
    parser = HTMLTitleParser()
    parser.feed(doc)
    return parser.title_data

class HTMLTitleParser(HTMLParser):
    title_data = ""
    track = False

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "title":
            self.track = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self.track = False

    def handle_data(self, data: str) -> None:
        if self.track:
            self.title_data = data

Let’s start with the summarize_page() function. It is our utility function for fetching the remote URL and then getting the page body. It uses Spin’s built-in HTTP client. Again, Spin’s security model requires us to grant some permissions to the app before it is allowed to make external HTTP requests. So we need to add this to spin.toml:

[component.bookmarker]
source = "app.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://*:*"] # NEW
files = ["index.html"]
[component.bookmarker.build]
command = "spin py2wasm app -o app.wasm"
watch = ["app.py", "Pipfile"]

The allowed_outbound_hosts parameter lets us declare which external hosts our app is allowed to access. Using "https://*:*" lets us access any HTTPS endpoint. The Spin HTTP documentation covers the format in more detail.

In the case where we can successfully fetch the remote URL (and status is 200), we pass the HTML body on to summarize(). In the case of 3XX-level responses (redirects), we follow the redirect to the new location. In all other cases (404, 500, 403, etc.), we just return a message that says we couldn’t load a preview.
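The status handling described above can be exercised without a network by injecting the fetch step. The sketch below is a variant under stated assumptions, not the article's code: `fetch` is a hypothetical callable returning `(status, headers, body)`, `max_redirects` is an added safety limit so a redirect loop cannot recurse forever, and `urljoin` is used so that a relative `location` value resolves against the current URL. The redirect codes listed are the standard ones that carry a `Location` header.

```python
from urllib.parse import urljoin

# Standard redirect status codes that point at a new location.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(url, fetch, max_redirects=5):
    """Return the body of the final 200 response, or None on failure.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers, body).
    """
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status == 200:
            return body
        if status in REDIRECT_CODES and "location" in headers:
            # Resolve relative redirects against the URL we just requested.
            url = urljoin(url, headers["location"])
            continue
        # Any other status (404, 500, 403, ...) means no preview.
        return None
    # Gave up after too many redirects.
    return None


# Canned responses standing in for real servers (illustrative data only).
pages = {
    "http://example.com/": (301, {"location": "https://example.com/"}, b""),
    "https://example.com/": (302, {"location": "/home"}, b""),
    "https://example.com/home": (200, {}, b"<title>Home</title>"),
    "https://example.com/loop": (302, {"location": "/loop"}, b""),
}


def fake_fetch(url):
    return pages.get(url, (404, {}, b""))


body = fetch_following_redirects("http://example.com/", fake_fetch)
missing = fetch_following_redirects("https://example.com/nope", fake_fetch)
looping = fetch_following_redirects("https://example.com/loop", fake_fetch)
```

In the app itself, the caller would turn a `None` result into the "Unable to load preview" message.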

Now let’s take a look at the first version of summarize(). We are going to build a better one later, but for now, it will simply grab the title text out of the HTML:

def summarize(doc):
    parser = HTMLTitleParser()
    parser.feed(doc)
    return parser.title_data

This creates a new HTMLTitleParser, parses the doc, and then returns the title. In a few moments we will rewrite this one. In this version, though, it uses a basic HTML parser that we wrote:

class HTMLTitleParser(HTMLParser):
    title_data = ""
    track = False

    def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None:
        if tag == "title":
            self.track = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "title":
            self.track = False

    def handle_data(self, data: str) -> None:
        if self.track:
            self.title_data = data

Python’s core libraries provide an event-based HTML parser. The parser walks through a document and calls a handler method for each token it encounters. By extending that parser, we can intercept three events that we care about:

  • When the parser hits the start of a tag (handle_starttag())
  • When the parser hits the end of a tag (handle_endtag())
  • When the parser gets character data (text) between tags (handle_data())

What we do in our parser extension is check whether we’re in the <title> tag, and if so, get the text data until we hit the </title> tag.
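To make the event order concrete, here is a small sketch that records every event instead of acting on it. `EventLogger` is a hypothetical helper written for this illustration; it uses only the same three `HTMLParser` hooks described above.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records each parser event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<head><title>Example Domain</title></head>")
logger.close()

for event in logger.events:
    print(event)
```

The title text shows up as a data event sandwiched between the start and end events for the title tag, which is exactly the window our track flag opens and closes.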

Putting all of this together, each time we add a new bookmark:

  • The add_url() function will call summarize_page() with the URL of the page we want to bookmark
  • summarize_page() will fetch the HTML from the URL, and then pass it to summarize()
  • And summarize() will use HTMLTitleParser to get the title out of the document.
  • That data is then returned back to add_url(), which will store the summary alongside title and url in our JSON document.

All that is left to do now is alter our template to show the summary.

Add Summary to the Template

In our index.html Jinja template, we print a list of all of the bookmarks. To display our new summary field, all we need to do is add it to the output:

{% for bookmark in bookmarks %}
<li><a href="{{bookmark.url}}">{{bookmark.title}}</a>: {{bookmark.summary}}</li>
{% endfor %}

We made a minor formatting change, adding : after the </a>, and then printing the summary with {{bookmark.summary}}.

At this point, if we save a new bookmark, the main index.html page of our app will now look like this:

Screenshot showing our first auto-generated summary

Note that the last, newly added link now has a summary. The previous ones do not because they were created before we added the new summary logic.
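If you would rather give older entries a consistent shape, one option is to fill in a default when loading the list. This is a minimal sketch and not part of the app as written: `with_default_summary()` is a hypothetical helper, and the empty-string default is an assumption.

```python
import json


def with_default_summary(bookmarks, default=""):
    """Return a new list in which every bookmark has a "summary" key."""
    # Keys from the stored bookmark win, so existing summaries are preserved.
    return [{"summary": default, **bookmark} for bookmark in bookmarks]


# Illustrative stored value: one record from before the change, one from after.
stored = (
    b'[{"title": "Old", "url": "https://example.com/"},'
    b' {"title": "New", "url": "https://example.org/", "summary": "Example Domain"}]'
)

bookmarks = with_default_summary(json.loads(stored))
```

Because the helper builds new dictionaries, the list read from storage is left untouched until you choose to write the result back.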

It's time to move on to part 5, where we'll use AI (specifically, the LLaMa2 LLM) to read a webpage and generate a summary for us.
