On building a digital assistant for the rest of us (part 4)

Thomas Künneth - Sep 22 - Dev Community

Welcome to the fourth part of On building a digital assistant for the rest of us. Last time, we looked at what it takes to become a digital assistant on Android. I introduced you to the RoleManager class, showed you how to check if certain roles are present, and how to acquire them.

The version of viewfAInder accompanying this final part leverages Gemini to extract data from business cards. You will learn what you can do with this on Android. More importantly, I will show you how I built a conversation with the LLM that adapts along the way: extract a business card if one was captured, while still obtaining other important information from the image.

I also cleaned up the source code and enhanced and beautified the user interface. Here's what the app currently looks like:

I never cease to be excited about how easy animations have become when using Jetpack Compose:

@Composable
fun AnimatedCapturedImage(viewModel: MainViewModel) {
  val uiState by viewModel.uiState.collectAsState()
  var enabled by remember { mutableStateOf(true) }
  // Animate the weight of the image between 1F (full size)
  // and WEIGHT_IMAGE (shrunken)
  val weight by animateFloatAsState(
    if (enabled) 1f else WEIGHT_IMAGE,
    label = "weight"
  )
  Column(
    modifier = Modifier
      .fillMaxSize()
      .background(color = MaterialTheme.colorScheme.background)
  ) {
    Box(modifier = Modifier.weight(weight)) {
      CapturedImage(viewModel)
    }
    // A weight of 0F would throw an IllegalArgumentException,
    // so the Spacer is only added while weight is below 1F
    if (weight < 1F) {
      Spacer(
        modifier = Modifier
          .weight(1F - weight)
      )
    }
  }
  // Start the animation once the UI state changes
  LaunchedEffect(uiState) {
    enabled = false
  }
}

weight (animateFloatAsState()) controls the size of the image and the area below it. Both are children of a Column() composable. The sum of all weights is 1F, so I can compute the second one as 1F - weight. Please note that if (weight < 1F) { is required to avoid an IllegalArgumentException at runtime (invalid weight 0.0; must be greater than zero). The animation is started inside a LaunchedEffect() by setting enabled to false.

All viewfAInder versions are available on GitHub. Development takes place on the main branch, but you can also check out the corresponding tags to access the codebase of a particular part: part_one, part_two, part_three, and part_four.

Analysing business cards

Scanning business cards is nothing new; we have seen it for many years. While individual implementations may vary greatly, the general approach has been to use OCR (Optical Character Recognition) and carefully trained models. viewfAInder does none of this by itself, yet it produces amazing results. See for yourself:

Isn't that cool? This pretty nicely illustrates the power of Gemini 1.5 Pro. But how does it work? Here's the first prompt viewfAInder sends:

private val prompt_01 = """
 Describe what is contained inside the thick red line inside the
 image. Give a short description, followed by a bullet point list
 with all important details. Add web links with additional 
 information for each bullet point items when available.
 Choose Wikipedia if possible. If there are details related to
 appointments, locations, addresses, mention these explicitly
""".trimIndent()

This returns a nice description of the image the user has taken, with a focus on the area inside the red shape.

Here's how we continue:

private val prompt_02 = """
  Does the following text contain information that looks like
  a business card? Please answer only with yes or no.
  Here is the text: %s
""".trimIndent()

%s is a placeholder that will receive the answer to the first prompt. I do this to avoid resubmitting the image, but also to make it easier for the LLM to remember what it has told us. To understand what comes next, let me show the code that drives the conversation:

private fun sendPrompt(bitmap: Bitmap) {
  _uiState.update { UiState.Loading }
  viewModelScope.launch(Dispatchers.IO) {
    try {
      val actions = mutableListOf<Pair<Action, String>>()
      // First step: send the bitmap and get the description
      val description = generativeModel.generateContent(content {
        image(bitmap)
        text(prompt_01)
      }).text ?: ""
      // Second step: Does the description
      // look like a business card?
      with(generativeModel.generateContent(content {
        text(String.format(prompt_02, description))
      })) {
        if (text?.toLowerCase(Locale.current)?.contains("yes")
                == true) {
          with(generativeModel.generateContent(content {
            text(String.format(prompt_03, description))
          }).text) {
            // Strip the Markdown code fences that may
            // surround the generated VCARD data
            val data = this?.replace("```vcard", "")
                           ?.replace("```", "") ?: ""
            if (data.isNotEmpty()) {
              actions.add(Pair(Action.VCARD, data))
            }
          }
        }
      }
      // Final step: update ui
      _uiState.value = UiState.Success(
        description = description,
        actions = actions.toImmutableList()
      )
    } catch (e: Exception) {
      _uiState.value = UiState.Error(e.localizedMessage ?: "")
    }
  }
}
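These snippets assume a readily configured generativeModel. Its setup is not shown in this part; a minimal sketch using the com.google.ai.client.generativeai library could look like this (the exact model name string and the way the API key is obtained are assumptions):

import com.google.ai.client.generativeai.GenerativeModel

// A multimodal Gemini model that can process images and text;
// the API key is assumed to be exposed via BuildConfig
val generativeModel = GenerativeModel(
  modelName = "gemini-1.5-pro",
  apiKey = BuildConfig.apiKey
)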

Communication with the LLM happens through invocations of generativeModel.generateContent(). For simplicity, I am using good old String.format() to fill previous answers into a new prompt. Checking if Gemini found business card-related data can be done with a simple if:

if (text?.toLowerCase(Locale.current)?.contains("yes")

A yes triggers the following prompt:

private val prompt_03 = """
  Please create a data structure in VCARD format.
  Do not add any explanations. Instead, make sure that
  your answer only contains the VCARD data structure, nothing else.
  Use the information that follows after the colon: %s
""".trimIndent()

At this point, you may be wondering why I add pretty detailed commands like not adding any explanations. Well, during my testing I found these to be essential to always get the desired answers.
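Even with these instructions, Gemini may wrap its answer in Markdown fences, which is why sendPrompt() strips ```vcard and ``` from the result. A raw answer could look like this (the contact data is made up for illustration):

```vcard
BEGIN:VCARD
VERSION:3.0
FN:Jane Doe
ORG:Example Corp
TITLE:Product Manager
TEL;TYPE=WORK:+1-555-0100
EMAIL:jane.doe@example.com
END:VCARD
```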

That's basically it. The com.google.ai.client.generativeai library offers way more features than viewfAInder is leveraging. I encourage you to dig deeper. To conclude this part, and the series as a whole, let's look at what my app does with the business card data.

Business cards on Android

Originally, I was hoping that I could just send business card data to other apps using the share sheet. Unfortunately, none of the standard apps felt responsible for text/vcard. At least Google Contacts can import VCARDs from files. That's why viewfAInder saves the data to the shared documents folder. Importing can then be triggered by opening the file in Files. Here's the code that achieves this:

uiState.actions.forEach { action ->
  when (action.first) {
    Action.VCARD -> {
      TextButton(
        onClick = {
          scope.launch { saveVCF(action.second) }
        },
        modifier = Modifier
          .padding(top = 16.dp)
          .align(Alignment.CenterHorizontally)
      ) {
        Text(text = stringResource(id = R.string.save_as_vcf_file))
      }
    }
  }
}

To allow viewfAInder to ask Gemini about other specific content (like events or appointments) in a later version, the MainViewModel maintains a list of Actions. For now, the app knows just one action, VCARD. That's why we iterate over the actions list and build the user interface depending on its contents.
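For reference, here is a sketch of the supporting types as they can be inferred from the snippets above; the names match the article, but the exact declarations are assumptions:

import kotlinx.collections.immutable.ImmutableList

// Currently the only action; events or appointments could be added later
enum class Action { VCARD }

sealed interface UiState {
  data object Loading : UiState
  data class Error(val message: String) : UiState
  data class Success(
    val description: String,
    val actions: ImmutableList<Pair<Action, String>>
  ) : UiState
}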

Saving the file is pretty mundane:

private fun saveVCF(data: String) {
  // Shared Documents folder; create it if it doesn't exist yet
  val parent = Environment.getExternalStoragePublicDirectory(
    Environment.DIRECTORY_DOCUMENTS
  ).also {
    it.mkdirs()
  }
  // Write the VCARD data to a uniquely named .vcf file
  File(
    parent, "vcard_${currentDateAndTimeAsString()}.vcf"
  ).also { file ->
    FileOutputStream(file).use { fos ->
      BufferedOutputStream(fos).use { bos ->
        bos.write(data.toByteArray())
        bos.flush()
      }
    }
  }
}
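The helper currentDateAndTimeAsString() is not shown here; a plausible implementation (the timestamp format is an assumption) might be:

import java.text.SimpleDateFormat
import java.util.Date
import java.util.Locale

// Produces a sortable timestamp such as 20240922_142305
private fun currentDateAndTimeAsString(): String =
  SimpleDateFormat("yyyyMMdd_HHmmss", Locale.US).format(Date())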

Conclusion

Using Gemini to build a basic digital assistant on Android has been pure fun. When I started the project, I did not expect to be able to build something as cool as a (limited) Circle to Search clone with so little effort. Possible additions might include the appointment or event detection I already mentioned. What else would you like to see? Please share your thoughts in the comments.

This series of articles and the viewfAInder app participate in Google's #AISprint 2024. Google Cloud credits are provided for this project.
