I implemented a sophisticated local-first audio processing pipeline that captures, processes, and transcribes audio while respecting privacy, written in rust. here's how it works:
🎤 audio capture & device management
- supports both input (microphones) and output devices (system audio)
- handles multi-channel audio devices through smart channel mixing
- implements device hot-plugging and graceful error handling
- uses tokio channels for efficient async communication ### 🔊 audio processing pipeline
- channel conversion  - converts multi-channel audio to mono using weighted averaging  - handles various sample formats (f32, i16, i32, i8)  - implements real-time resampling to 16khz for whisper compatibility
- signal processing  - normalizes audio using RMS and peak normalization  - implements spectral subtraction for noise reduction  - uses realfft for efficient fourier transforms  - maintains audio quality while reducing background noise
- voice activity detection (vad)  - dual vad engine support: webrtc (lightweight) and silero (more accurate)  - configurable sensitivity levels (low/medium/high)  - uses sliding window analysis for robust speech detection  - implements frame history for better context awareness ### 🤖 transcription engine
- primary: whisper (tiny/large-v3/large-v3-turbo)
- fallback: deepgram api integration
- smart overlap handling:
// handles cases where audio chunks might cut sentences
if let Some((prev_idx, cur_idx)) = longest_common_word_substring(previous, current) {
// strip overlapping content and merge transcripts
}
đź’ľ storage & optimization
- uses h265 encoding for efficient audio storage
- implements a local sqlite database for metadata
- stores raw audio chunks with timestamps
-
maintains reference to original audio for verification
🔒 privacy features
completely local processing by default
optional pii removal
configurable data retention policies
-
no cloud dependencies unless explicitly enabled
🧠experimental features
context-aware post-processing using llama-3.2–1b
speaker diarization using voice embeddings
local vector db for speaker identification
-
adaptive noise profiling
🔧 technical stack
rust + tokio for async processing
tauri for cross-platform support
onnx runtime for ml inference
-
crossbeam channels for thread communication
đź“Š performance considerations
efficient memory usage through streaming processing
minimal cpu overhead through smart buffering
configurable quality/performance tradeoffs
automatic resource management
result:
it's open source btw!
https://github.com/mediar-ai/screenpipe
drop any question!