how i built a local-first audio transcription tool: a privacy-first voice processing pipeline

Louis Beaumont - Oct 30

I built a local-first audio processing pipeline in rust that captures, processes, and transcribes audio while keeping everything on-device by default. here's how it works:

🎤 audio capture & device management

  • supports both input (microphones) and output devices (system audio)
  • handles multi-channel audio devices through smart channel mixing
  • implements device hot-plugging and graceful error handling
  • uses tokio channels for efficient async communication (a capture sketch follows this list)
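
the post doesn't show the capture code, so here's a minimal sketch of the pattern it describes, assuming the cpal crate for device access and an f32 input format (names and error handling are illustrative, not screenpipe's actual code):

```rust
use cpal::traits::{DeviceTrait, HostTrait, StreamTrait};
use tokio::sync::mpsc;

fn start_capture(
    tx: mpsc::UnboundedSender<Vec<f32>>,
) -> Result<cpal::Stream, Box<dyn std::error::Error>> {
    let host = cpal::default_host();
    // grab the default microphone; a real pipeline would enumerate
    // devices and watch for hot-plug events
    let device = host.default_input_device().ok_or("no input device")?;
    let config = device.default_input_config()?.config();

    // the callback runs on the audio thread and must not block;
    // an unbounded tokio channel makes the send effectively free
    let stream = device.build_input_stream(
        &config,
        move |data: &[f32], _: &cpal::InputCallbackInfo| {
            let _ = tx.send(data.to_vec());
        },
        |err| eprintln!("stream error: {err}"),
        None,
    )?;
    stream.play()?;
    Ok(stream)
}
```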
🔊 audio processing pipeline

  1. channel conversion (see the downmix sketch after this list)
     - converts multi-channel audio to mono using weighted averaging
     - handles various sample formats (f32, i16, i32, i8)
     - implements real-time resampling to 16khz for whisper compatibility
  2. signal processing
     - normalizes audio using RMS and peak normalization
     - implements spectral subtraction for noise reduction
     - uses realfft for efficient fourier transforms
     - maintains audio quality while reducing background noise
  3. voice activity detection (vad) (see the smoothing sketch after this list)
     - dual vad engine support: webrtc (lightweight) and silero (more accurate)
     - configurable sensitivity levels (low/medium/high)
     - uses sliding window analysis for robust speech detection
     - implements frame history for better context awareness
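
to make steps 1 and 2 concrete, here's a minimal sketch of the downmix and RMS-normalization steps (equal channel weights for simplicity; the real pipeline uses weighted averaging and also handles i16/i32/i8 input):

```rust
// collapse interleaved multi-channel samples to mono; a weighted
// average would replace the plain sum here
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

// scale the whole buffer so its rms level hits a target, clamping to
// avoid clipping after the gain is applied
fn normalize_rms(samples: &mut [f32], target_rms: f32) {
    let rms = (samples.iter().map(|s| s * s).sum::<f32>() / samples.len() as f32).sqrt();
    if rms > 0.0 {
        let gain = target_rms / rms;
        for s in samples.iter_mut() {
            *s = (*s * gain).clamp(-1.0, 1.0);
        }
    }
}
```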
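
and a sketch of the sliding-window idea from step 3 (my interpretation of "frame history", not the exact implementation): a single noisy frame can't flip the speech decision, because the smoothed result is a majority vote over recent frames:

```rust
use std::collections::VecDeque;

struct VadSmoother {
    history: VecDeque<bool>, // raw per-frame decisions from webrtc/silero
    window: usize,
}

impl VadSmoother {
    fn new(window: usize) -> Self {
        Self { history: VecDeque::with_capacity(window), window }
    }

    // feed one raw frame decision, get the smoothed decision back
    fn update(&mut self, frame_is_speech: bool) -> bool {
        if self.history.len() == self.window {
            self.history.pop_front();
        }
        self.history.push_back(frame_is_speech);
        let speech = self.history.iter().filter(|&&s| s).count();
        speech * 2 > self.history.len() // majority vote over the window
    }
}
```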
🤖 transcription engine

  • primary: whisper (tiny/large-v3/large-v3-turbo)
  • fallback: deepgram api integration
  • smart overlap handling:
```rust
// handles cases where audio chunks might cut sentences in half
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
    // strip the overlapping words and merge the two transcripts
}
```
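the helper itself isn't shown in the post; a naive sketch of what it might look like (quadratic scan over words, fine for short chunks) would be:

```rust
// find the longest run of words shared by the tail of `previous` and
// the head of `current`; returns word indices into each, or None if
// the overlap is too short to be a real duplicate
fn longest_common_word_substring(previous: &str, current: &str) -> Option<(usize, usize)> {
    let prev: Vec<&str> = previous.split_whitespace().collect();
    let cur: Vec<&str> = current.split_whitespace().collect();
    let mut best = (0, 0, 0); // (length, prev_idx, cur_idx)
    for i in 0..prev.len() {
        for j in 0..cur.len() {
            let mut k = 0;
            while i + k < prev.len() && j + k < cur.len() && prev[i + k] == cur[j + k] {
                k += 1;
            }
            if k > best.0 {
                best = (k, i, j);
            }
        }
    }
    // require a few matching words before treating it as an overlap
    if best.0 >= 3 { Some((best.1, best.2)) } else { None }
}
```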

💾 storage & optimization

  • uses h265 encoding for efficient audio storage
  • implements a local sqlite database for metadata (see the schema sketch below)
  • stores raw audio chunks with timestamps
  • maintains reference to original audio for verification
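
the schema isn't in the post; a minimal sketch of what the metadata table might look like, using the rusqlite crate (table and column names are my guesses):

```rust
use rusqlite::Connection;

fn init_db(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    // hypothetical schema: one row per stored audio chunk, keeping a
    // pointer back to the raw file for verification
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS audio_chunks (
            id          INTEGER PRIMARY KEY,
            file_path   TEXT NOT NULL,
            started_at  INTEGER NOT NULL, -- unix timestamp in ms
            duration_ms INTEGER NOT NULL,
            transcript  TEXT
        );",
    )?;
    Ok(conn)
}
```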

🔒 privacy features

  • completely local processing by default

  • optional pii removal

  • configurable data retention policies (see the config sketch after this list)

  • no cloud dependencies unless explicitly enabled
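
as a rough illustration of how those options might hang together (field names are hypothetical, not screenpipe's actual config):

```rust
#[derive(Debug, Clone)]
struct PrivacyConfig {
    // cloud apis (e.g. the deepgram fallback) stay off unless opted in
    cloud_enabled: bool,
    // strip names, emails, etc. from transcripts before storing them
    remove_pii: bool,
    // delete stored chunks older than this many days; None keeps forever
    retention_days: Option<u32>,
}

impl Default for PrivacyConfig {
    fn default() -> Self {
        // private by default: everything stays on-device
        Self { cloud_enabled: false, remove_pii: false, retention_days: None }
    }
}
```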

🧠 experimental features

  • context-aware post-processing using llama-3.2-1b

  • speaker diarization using voice embeddings (see the matching sketch after this list)

  • local vector db for speaker identification

  • adaptive noise profiling
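
the embedding-matching step can be sketched like this (assumed approach: nearest known speaker by cosine similarity above a threshold; the vector db would just make this lookup faster):

```rust
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// compare a fresh voice embedding against known speakers and return
// the best match above a similarity threshold
fn identify_speaker<'a>(
    embedding: &[f32],
    known: &'a [(String, Vec<f32>)],
    threshold: f32,
) -> Option<&'a str> {
    known
        .iter()
        .map(|(name, emb)| (name.as_str(), cosine_similarity(embedding, emb)))
        .filter(|(_, sim)| *sim >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(name, _)| name)
}
```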

🔧 technical stack

  • rust + tokio for async processing

  • tauri for cross-platform support

  • onnx runtime for ml inference

  • crossbeam channels for thread communication

📊 performance considerations

  • efficient memory usage through streaming processing (sketched after this list)

  • minimal cpu overhead through smart buffering

  • configurable quality/performance tradeoffs

  • automatic resource management
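
"streaming processing" here boils down to never holding the whole recording in memory; one simple sketch of that (my illustration, not the actual code) uses a bounded channel so a slow transcriber back-pressures the capture side:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // at most 16 chunks can be buffered; when the consumer falls
    // behind, `send` waits instead of letting memory grow
    let (tx, mut rx) = mpsc::channel::<Vec<f32>>(16);

    tokio::spawn(async move {
        for _ in 0..100 {
            let chunk = vec![0.0f32; 16_000]; // stand-in for ~1s of 16khz audio
            if tx.send(chunk).await.is_err() {
                break; // receiver dropped, stop producing
            }
        }
        // tx is dropped here, which ends the receive loop below
    });

    while let Some(chunk) = rx.recv().await {
        // process one chunk at a time; memory use stays bounded
        let _ = chunk.len();
    }
}
```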


it's open source btw!

https://github.com/mediar-ai/screenpipe

drop any questions!
