On building a digital assistant for the rest of us (part 2)

Thomas Künneth - Sep 10 - Dev Community

Welcome to the second part of On building a digital assistant for the rest of us. Last time, I explained a couple of terms and showed you the first increment of viewfAInder, the project we are building in this series. That version uses CameraX to obtain continuous preview images. Tapping the screen sends the current image to Gemini and asks the LLM to provide a detailed description. In this part of the series, we refine the user experience: the user will be able to highlight an area of the image, and the app will then ask Gemini to focus on that selection.

Before we dive in, allow me to highlight the power of Gemini by pointing out an omission in my source code. The following snippet shows how CameraX is used to obtain a preview and an image analyzer (which provides the images we send to the LLM).

val previewView = PreviewView(ctx)
val executor = ContextCompat.getMainExecutor(ctx)
cameraProviderFuture.addListener({
  val cameraProvider = cameraProviderFuture.get()
  val preview = Preview.Builder().build().also {
    it.setSurfaceProvider(previewView.surfaceProvider)
  }

  val imageAnalyzer = ImageAnalysis.Builder()
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
    .build()
    .also {
      it.setAnalyzer(executor) { imageProxy ->
        setBitmap(imageProxy.toBitmap())
        imageProxy.close()
      }
    }

  try {
    cameraProvider.unbindAll()
    cameraProvider.bindToLifecycle(
      lifecycleOwner, CameraSelector.DEFAULT_BACK_CAMERA,
      preview, imageAnalyzer
    )
  } catch (e: Exception) {
    // Handle exceptions, e.g., log the error
  }
}, executor)

It works pretty well. However, while debugging the ViewModel, I noticed that I had missed something:

Screenshot of a debugging session in Android Studio

While the preview works flawlessly, the analysis image isn't rotated properly. Still, Gemini can describe its contents. Pretty cool. Anyway, let's fix it.

Aligning the device orientation with the camera sensor orientation isn't one of the most self-explanatory tasks on Android, and the arrival of foldables makes it even more challenging. Google provides great guidance on how to deal with it properly. To verify that our fix works, we can add a small preview of the analysis image. The ViewModel needs just one additional line of code.

val bitmap = _bitmap.asStateFlow()
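For context, here is a minimal sketch of the state this property is derived from. The names _bitmap and setBitmap() match the snippets in this series, but the exact code in the repository may differ slightly.

class MainViewModel : ViewModel() {
  // Backing state for the latest analysis frame (sketch; names taken from the snippets above)
  private val _bitmap = MutableStateFlow<Bitmap?>(null)
  val bitmap = _bitmap.asStateFlow()

  fun setBitmap(bitmap: Bitmap?) {
    _bitmap.update { bitmap }
  }
}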

We can make use of this property in MainScreen(). Instead of

else -> {}

we would have

is UiState.Initial -> {
  val bitmap by viewModel.bitmap.collectAsState()
  bitmap?.let {
    Image(
      bitmap = it.asImageBitmap(),
      contentDescription = null,
      contentScale = ContentScale.Inside,
      modifier = Modifier
        .align(Alignment.TopStart)
        .safeContentPadding()
        .size(200.dp)
    )
  }
}

Now let's turn to the rotation fix. While browsing through my source code, you may have wondered why quite a bit of CameraX setup is done inside the CameraPreview() composable. Writing a tutorial is always a trade-off; sometimes you need to cut corners to keep things comprehensible. While configuring CameraX inside the factory function of AndroidView works, onCreate() is a more natural place. Here's the refactored code:

override fun onCreate(savedInstanceState: Bundle?) {
  super.onCreate(savedInstanceState)
  val executor = ContextCompat.getMainExecutor(this)
  val previewView = PreviewView(this)
  val future = ProcessCameraProvider.getInstance(this)
  enableEdgeToEdge()
  setContent {
    MaterialTheme {
      Surface(
        modifier = Modifier.fillMaxSize(),
        color = MaterialTheme.colorScheme.background,
      ) {
        val hasCameraPermission by
                cameraPermissionFlow.collectAsState()
        val mainViewModel: MainViewModel = viewModel()
        val uiState by mainViewModel.uiState.collectAsState()
        LaunchedEffect(future) {
          setupCamera(
            future = future,
            lifecycleOwner = this@MainActivity,
            previewView = previewView,
            executor = executor,
            rotation = display.rotation
          ) { mainViewModel.setBitmap(it) }
        }
        MainScreen(uiState = uiState,
          hasCameraPermission = hasCameraPermission,
          previewView = previewView,
          askGemini = { mainViewModel.askGemini() },
          reset = { mainViewModel.reset() })
      }
    }
  }
}

All variables related to CameraX are now defined inside onCreate() and passed to setupCamera().

private fun setupCamera(
  future: ListenableFuture<ProcessCameraProvider>,
  lifecycleOwner: LifecycleOwner,
  previewView: PreviewView,
  executor: Executor,
  rotation: Int,
  setBitmap: (Bitmap?) -> Unit
) {
  future.addListener({
    val cameraProvider = future.get()
    val preview = Preview.Builder().build().also {
      it.setSurfaceProvider(previewView.surfaceProvider)
    }
    val imageAnalyzer = ImageAnalysis.Builder()
      .setBackpressureStrategy(
          ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
      .build()
      .also {
        it.targetRotation = rotation
        it.setAnalyzer(executor) { imageProxy ->
          val matrix = Matrix().also { matrix ->
            matrix.postRotate(
                imageProxy.imageInfo.rotationDegrees.toFloat())
          }
          val bitmap = imageProxy.toBitmap()
          val rotatedBitmap = Bitmap.createBitmap(
            bitmap, 0, 0, bitmap.width, bitmap.height, matrix, true
          )
          setBitmap(rotatedBitmap)
          imageProxy.close()
        }
      }
    try {
      cameraProvider.unbindAll()
      cameraProvider.bindToLifecycle(
        lifecycleOwner,
        CameraSelector.DEFAULT_BACK_CAMERA,
        preview, imageAnalyzer
      )
    } catch (e: Exception) {
      // Handle exceptions, e.g., log the error
    }
  }, executor)
}

Can you spot the rotation fix? Passing the device rotation (rotation) to setupCamera() and assigning it to targetRotation makes sure that imageProxy.imageInfo.rotationDegrees provides the correct value for matrix.postRotate(). The final step is to create a new bitmap from the old one, applying a matrix that performs the rotation.
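As a nice side effect of the refactoring, the CameraPreview() composable mentioned earlier no longer needs to configure anything itself. A minimal sketch (the exact signature in the repository may differ) simply hosts the PreviewView that was created in onCreate():

@Composable
fun CameraPreview(previewView: PreviewView, modifier: Modifier = Modifier) {
  // The PreviewView is created and configured in onCreate(); here we only embed it
  AndroidView(factory = { previewView }, modifier = modifier)
}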

Drawing shapes

To implement a Circle to Search-like feature, the user must be able to draw on screen. Once the shape (a circle or box) is complete, the app must apply it to the analysis bitmap and then send the result to Gemini. Here's how a simple drawing area could be implemented:

@Composable
fun DrawingArea(drawComplete: (IntSize, List<Offset>) -> Unit) {
  val points = remember { mutableStateListOf<Offset>() }
  Canvas(modifier = Modifier
    .fillMaxSize()
    .pointerInput(Unit) {
      awaitPointerEventScope {
        while (true) {
          // Collect pointer positions while the user is drawing
          val event = awaitPointerEvent()
          val touch = event.changes.first()
          points.add(touch.position)
          if (!touch.pressed) {
            // Finger lifted: report the shape and start over
            if (points.size > 2) {
              drawComplete(size, points.toList())
            }
            points.clear()
          }
        }
      }
    }) {
    if (points.size > 2) {
      // Enough points: draw a closed path
      drawPath(
        path = Path().apply {
          moveTo(points[0].x, points[0].y)
          for (i in 1..points.lastIndex) {
            lineTo(points[i].x, points[i].y)
          }
          close()
        },
        color = DRAWING_COLOR,
        style = Stroke(width = STROKE_WIDTH)
      )
    } else {
      // Too few points for a path: plot individual dots
      points.forEach { point ->
        drawCircle(
          color = DRAWING_COLOR,
          center = point,
          radius = STROKE_WIDTH / 2F
        )
      }
    }
  }
}

Using Canvas(), we draw a closed path (drawPath()) once we have received at least three points (represented by Offset instances). Until then, we plot individual circles (drawCircle()). When the user stops drawing, we pass the list to a callback (drawComplete). In addition, we need to pass the measured size of the pointer input region, because CameraX analysis bitmaps may have different dimensions than the preview. Here's how the drawing area is composed and what happens with the list of Offsets:

DrawingArea { size, offsets ->
  viewModel.getCopyOfBitmap()?.let {
    // Scale the points from the pointer input area to the bitmap dimensions
    val xRatio = it.width.toFloat() / size.width.toFloat()
    val yRatio = it.height.toFloat() / size.height.toFloat()
    val scaledOffsets = offsets.map { point ->
      PointF(point.x * xRatio, point.y * yRatio)
    }
    // Draw the closed path onto the bitmap copy
    val canvas = Canvas(it)
    val path = android.graphics.Path()
    if (scaledOffsets.isNotEmpty()) {
      path.moveTo(scaledOffsets[0].x, scaledOffsets[0].y)
      for (i in 1 until scaledOffsets.size) {
        path.lineTo(scaledOffsets[i].x, scaledOffsets[i].y)
      }
      path.close()
    }
    canvas.drawPath(path, Paint().apply {
      style = Paint.Style.STROKE
      strokeWidth = STROKE_WIDTH
      color = DRAWING_COLOR.toArgb()
    })
    viewModel.askGemini(it)
  }
}

Once the list of Offsets has been scaled to fit the bitmap size, a Path instance is created, populated (moveTo(), lineTo()), and closed. It is then drawn onto a Canvas that wraps the bitmap. askGemini() sends the bitmap, including the drawing, to Gemini.
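getCopyOfBitmap() isn't shown above. A minimal sketch, assuming the ViewModel keeps the latest analysis frame in _bitmap, could look like this; copying ensures we draw onto a mutable bitmap instead of the frame the analyzer keeps replacing:

// Sketch: return a mutable copy of the most recent analysis frame,
// so the path can be drawn onto it without modifying the original
fun getCopyOfBitmap(): Bitmap? =
  _bitmap.value?.copy(Bitmap.Config.ARGB_8888, true)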

Talking to Gemini

The askGemini() function immediately calls this one:

private fun sendPrompt(bitmap: Bitmap) {
  _uiState.update { UiState.Loading }
  viewModelScope.launch(Dispatchers.IO) {
    try {
      val response = generativeModel.generateContent(content {
        image(bitmap)
        text(prompt)
      })
      response.text?.let { outputContent ->
        _uiState.value = UiState.Success(outputContent)
      }
    } catch (e: Exception) {
      _uiState.value = UiState.Error(e.localizedMessage ?: "")
    }
  }
}
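askGemini() itself is little more than a pass-through. A sketch matching the call sites shown above could look like this; that the parameterless variant used in MainScreen() falls back to the latest analysis frame is an assumption on my part:

// Sketch: forward a bitmap to sendPrompt(); when called without an argument,
// fall back to the latest analysis frame (assumption, not necessarily the exact repo code)
fun askGemini(bitmap: Bitmap? = _bitmap.value) {
  bitmap?.let { sendPrompt(it) }
}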

We pass just two things: the bitmap and the prompt. The latter looks like this:

private const val prompt = """
  Please describe what is contained inside the area of the image
  that is surrounded by a red line. If possible, add web links with
  additional information
"""

This concludes the second part of this series about building a digital assistant for the rest of us. The GitHub repo contains two tags, part_one and part_two. Development takes place on main. The next part will further refine the user interface and show how to set up the app as an assistant on Android.
