Behind the Scenes: Implementing AI to Streamline Retrospectives

Johanna Both - Aug 29 - Dev Community

In this article, I’ll walk you through how we implemented artificial intelligence in Power Retro, our Jira add-on designed to help facilitate retrospectives. Let me show you what challenges we faced along the way and what we learned in the process.

As software developers, we are obsessed with improving and optimizing every little detail of our processes. While developing Power Retro, we realized that we spent a significant part of our retrospectives (run with Power Retro, pretty meta, I know) grouping, one by one, ideas we had already discussed, then creating action items for those groups based on what we had talked about half an hour earlier. Lots of redundancy and monotonous work. There had to be a better way, right?

Fixing retrospectives

Last year we set out to fix these problems slowing us down. We knew we wanted some kind of AI to do the heavy lifting. We debated a lot about what technology to use: we could have trained our own model, but that would have required far more time and resources than we had available. Though skeptical at first, we agreed to try a commercial service. Luckily, the timing was perfect: OpenAI had just released their GPT-4 model, which opened the door to a lot of exciting opportunities. We quickly threw together a proof of concept of how we could integrate artificial intelligence into a retrospective’s flow. And the results were… mixed at first.

Early challenges

Our initial prompt was very basic, simply asking the AI to group the given cards and return the output as JSON. Even though we gave it examples of what the JSON array should look like, it sometimes ignored them, made up new fields, or changed the structure entirely.
For example, the newer versions of the models love ignoring the instructions about how they should insert JSON into the response. Our prompt explicitly asked for the JSON to be wrapped in three double quotes (something like """[...]"""), but the models will, by default, wrap it in a Markdown code block instead (```[...]```).
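As an aside, a common defensive workaround (a sketch of the general technique, not necessarily what Power Retro ships) is to normalize the response before parsing it, stripping whichever wrapper the model decided to use:

```typescript
// Hypothetical helper: strip a Markdown code fence or triple-double-quote
// wrapper from a model response before parsing it as JSON.
function extractJson(raw: string): unknown {
  const unwrapped = raw
    .trim()
    .replace(/^(```(?:json)?|"{3})\s*/, "") // leading ``` / ```json / """
    .replace(/\s*(```|"{3})$/, "");         // trailing ``` / """
  return JSON.parse(unwrapped);
}

// extractJson('```json\n[{"category": "food", "cards": ["ramen"]}]\n```')
// => [ { category: "food", cards: [ "ramen" ] } ]
```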

To fix this, we had to adjust the temperature and top_p values to make the AI more deterministic and strictly follow the prompt. LLMs work by guessing what the next token (the unit of text they operate on) should be, based on the prompt and the previously generated tokens. A low sampling temperature forces the model to be more focused and deterministic (0 always generates the same output for a given input), while higher values allow it to be more creative but also more unpredictable. top_p controls nucleus sampling, which limits the available tokens by cumulative probability: a top_p value of 0.1 only allows tokens from the top 10% of the probability mass to be chosen.
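For illustration, here’s a minimal sketch of how these sampling parameters are passed to the Chat Completions API with the official Node SDK; the prompt and card list are placeholders, not our production prompt:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Deterministic-leaning sampling settings for a simple grouping request.
const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo-16k",
  temperature: 0, // 0 = focused and repeatable, higher = more creative but unpredictable
  top_p: 1,       // nucleus sampling; lower values restrict choices to the most likely tokens
  messages: [
    { role: "system", content: "Group the given cards and answer with a JSON array." },
    { role: "user", content: "actor, port, teapot, broccoli, ramen, sneakers" },
  ],
});

console.log(completion.choices[0].message.content);
```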

Here are some examples of the same prompt at different sampling temperatures. The prompt for the tests was intentionally kept simple to show the effect of sampling temperature, and I used the gpt-3.5-turbo-16k model, since that was the latest model generally available at the time:

[Image: the prompt used for the temperature tests]

The cards were 10 randomly generated nouns: actor, port, teapot, broccoli, ramen, sneakers, nectarine, archaeologist, plane, ship.
With the temperature set to 0, the results are pretty boring:

[Image: grouping output at temperature 0]

actor was categorised as “person”, archaeologist as “occupation” and everything else as “object” when there were clearly more connections that could’ve been made. Let’s look at what happens when temperature is set to 1, the default value for the API:

[Image: grouping output at temperature 1]

Getting better: the model now identified that both actor and archaeologist are professions and that broccoli, ramen and nectarine are food, and it put everything else into “object”. But remember, the output is no longer deterministic: in later runs it went back to the same grouping as the previous example, switched between uppercase and lowercase category names, made a separate group for each card, and returned the JSON in the wrong format. There’s a tradeoff between creativity and consistency. What if we push creativity too far? It was a challenge just to avoid a 500 status code at the maximum temperature value of 2, but here’s the masterpiece the AI generated:

[Image: grouping output at temperature 2]

It’s generally not recommended to set the temperature any higher than 1: the web interface of ChatGPT, for example, is set at around 0.7, while code generation tools such as GitHub Copilot use even lower values.

Back to Power Retro. To avoid these issues, we decided to set the temperature to a low value. Making the model deterministic fixed our problem of inconsistent outputs, but it also limited the AI’s creativity: it couldn’t find more subtle (or even obvious) connections between the cards, and retrying the grouping was pointless since the output was exactly the same every time. We’ll get back to this later.

Why group cards for minutes when the AI can do it in… also minutes?

The other issue was speed. GPT-3.5 Turbo and GPT-4 were both painfully slow in an actual retrospective (we average around 30-50 cards), in extreme cases taking two minutes to return the whole output. This isn’t an issue when using ChatGPT, since it uses streaming, a technique that keeps the HTTP request open and delivers the response in chunks instead of waiting for the whole thing to finish. In our case, watching cards fly across the retro board and magically group themselves might be cool the first time, but it doesn’t actually improve the user experience. We realized we could speed up the grouping by generating the response ahead of time instead of waiting for user interaction. Think of how YouTube starts uploading and processing a video in the background while you’re still filling out the details.
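For context, streaming with the API looks roughly like this (a sketch of the general technique, not something Power Retro relies on; the prompt is illustrative):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// With stream: true the SDK returns an async iterable of chunks, so tokens
// can be shown as they are generated instead of after the full response.
const stream = await openai.chat.completions.create({
  model: "gpt-4",
  stream: true,
  messages: [{ role: "user", content: "Group these cards: actor, ramen, teapot" }],
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```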

In Power Retro, once the cards are created and the retro moves to the “Presentation” step, they can no longer be changed, so we can start grouping them right away. We introduced a state machine to keep track of the grouping’s state and moved the OpenAI API call into a background job so it doesn’t block any other actions. By the time we reach the “Grouping” step, the processing has already finished, and clicking the “AI Grouping” button feels instantaneous, with no minutes-long loading screen.
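As a rough sketch of the idea (the names, storage and job wiring below are invented for illustration, not Power Retro’s actual code): record the grouping state, kick off the API call in the background when the presentation step starts, and let the UI simply read the stored result later.

```typescript
// Illustrative state machine for pre-computing the grouping in the background.
type GroupingState = "idle" | "processing" | "ready" | "failed";

const groupingStates = new Map<string, GroupingState>();
const groupingResults = new Map<string, unknown>();

// Hypothetical wrapper around the OpenAI grouping call.
declare function callOpenAiGrouping(cards: string[]): Promise<unknown>;

// Called when the retro enters the "Presentation" step (cards are now frozen).
async function startGroupingInBackground(retroId: string, cards: string[]) {
  groupingStates.set(retroId, "processing");
  try {
    groupingResults.set(retroId, await callOpenAiGrouping(cards));
    groupingStates.set(retroId, "ready");
  } catch {
    groupingStates.set(retroId, "failed");
  }
}

// Called when the user clicks "AI Grouping" in the "Grouping" step.
function getPrecomputedGroups(retroId: string): unknown | undefined {
  return groupingStates.get(retroId) === "ready"
    ? groupingResults.get(retroId)
    : undefined; // fall back to a loading state or manual grouping
}
```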

At this point the grouping felt pretty good and we used it in every single one of our retrospectives, only having to move around a couple of cards, if any at all.

Generating action items

Seeing the success of the grouping feature we began working on the action items. After the grouping step, participants can vote on issues or groups of issues that they feel are important and create action items for resolving them. This is also something that the AI should be able to help with.

Remember how we were struggling with our prompt to get the required output earlier? Just a few days after we kicked off work on the action items, OpenAI launched what was, at least for us, the most important feature of their API: function calling. Instead of getting a text response and trying to force it into a JSON object that’s usable in our code, we can define “functions” that the AI can call. When the model decides to call a function, it returns a special response containing the function’s arguments in a format that follows the schema we provided.

The function’s result can then be fed back into the model’s context to be used later; for example, an analytics chatbot could run a SQL query against the database based on the user’s input and generate a human-readable report. For our use case, we are only interested in the structured JSON output. Let’s look at a very simplified example:

[Image: a simplified function definition for generating action items]
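As a hedged sketch of what such a function definition might look like with the Node SDK at the time (the schema, prompt and field names below are illustrative, not Power Retro’s real ones):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "user", content: "Create action items for the voted card groups." },
  ],
  functions: [
    {
      name: "create_action_items",
      description: "Create action items for the given retrospective card groups",
      parameters: {
        type: "object",
        properties: {
          actionItems: {
            type: "array",
            items: {
              type: "object",
              properties: {
                title: { type: "string" },
                description: { type: "string" },
              },
              required: ["title"],
            },
          },
        },
        required: ["actionItems"],
      },
    },
  ],
  function_call: { name: "create_action_items" }, // force the model to call our function
});

// The arguments arrive as a JSON string that follows the schema above.
const call = completion.choices[0].message.function_call;
const actionItems = call ? JSON.parse(call.arguments) : null;
```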

When we ask the model in the prompt to generate action items, it returns an object following this schema. This was a game changer: we were able to give the model more creativity and significantly improve the quality of the results without any downsides. We discarded almost everything we had built in the few days before function calling, but the new implementation came together quickly, and we started testing it.

There was still one problem that made this feature unreliable (even though the output format was now dependable). The titles of the cards are usually short, with very limited context, and the AI had difficulty handling ambiguous titles. For instance, we often had cards about unit testing in both positive and negative contexts; the model couldn’t differentiate between them and defaulted to the negative sentiment. The solution was to assign a “feeling” value to each card. The feelings were already present in the application: a retrospective board has different columns for different sentiments (Start/Stop/Continue, What went well?/What didn’t go well? and so on); we just needed to assign numerical values to them. Now the AI responds with correct and (with some prompt engineering) actually helpful action items. We used these same techniques to give more creativity back to the AI in the grouping feature as well.
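For illustration only (the columns, numeric scale and shape here are invented, not the exact mapping we use), attaching a feeling value to each card before sending it to the model could look something like this:

```typescript
// Enrich each card with a numeric sentiment derived from its column before
// handing the list to the model. Column names and values are examples.
type Card = { title: string; column: "Start" | "Stop" | "Continue" };

const columnFeeling: Record<Card["column"], number> = {
  Start: 1,    // positive: something we want to begin doing
  Continue: 1, // positive: something that went well
  Stop: -1,    // negative: something that isn't working
};

function toPromptPayload(cards: Card[]) {
  return cards.map((card) => ({
    title: card.title,
    feeling: columnFeeling[card.column],
  }));
}

// e.g. { title: "unit tests", column: "Continue" } -> { title: "unit tests", feeling: 1 }
//      { title: "unit tests", column: "Stop" }     -> { title: "unit tests", feeling: -1 }
```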

What’s next?

Thank you for making it all the way through! It’s hard to keep up with LLMs getting better day by day, but it also means we always have something to improve. Since we released these features, OpenAI has introduced their GPT-4o and GPT-4o mini models, outperforming GPT-4 in quality, speed and even price. Just as I was writing this post, they also released Structured Outputs, which lets developers get JSON responses directly, without relying on function calls, and even integrates with libraries such as Zod for complete type safety. The future of LLMs is more exciting than ever, and I can’t wait to see what comes next.
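As a minimal sketch of what that could look like with the Node SDK’s Zod helper (the model name and schema below are placeholders, not what we run in production):

```typescript
import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Placeholder schema for grouped cards.
const Grouping = z.object({
  groups: z.array(
    z.object({
      name: z.string(),
      cards: z.array(z.string()),
    })
  ),
});

const completion = await openai.beta.chat.completions.parse({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "Group the given retrospective cards." },
    { role: "user", content: "actor, archaeologist, broccoli, ramen" },
  ],
  response_format: zodResponseFormat(Grouping, "grouping"),
});

// `parsed` is typed according to the Zod schema, no manual JSON handling needed.
const grouping = completion.choices[0].message.parsed;
```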

More about Power Retro

Power Retro is a Jira extension built to make agile retrospectives more efficient and less time-consuming by automating repetitive tasks and enhancing collaboration. Designed with distributed teams in mind, it helps identify pain points and generate actionable items that integrate seamlessly into Jira. With its AI capabilities, Power Retro can reduce the duration of retros by up to 60%, enabling teams to focus on meaningful discussions and continuous improvement without getting bogged down by process.


Author: László Bucsai | Full Stack Developer @Tappointment
The cover picture was generated with DALL-E and Canva.
