1) I took words from a list of 50 difficult Portuguese words (41 of them appear below):

Filantropo, Filaucioso, Graçolar, Hebdomadário, Horrípilo, Iconoclasta, 
Idiossincrasia, Inócuo, Jocoso, Juvenilizante, Kafkaesco, Lancinante, 
Loquaz, Mendacioso, Modorrento, Nitidificar, Numismática, Odiento, 
Ósculo, Prognóstico, Putrefato, Quimera, Quintessência, Recôndito, 
Rufião, Sectário, Sumidade, Taciturno, Tergiversar, Ufanismo, 
Urdidura, Verossimilhança, Vicissitude, Vitupério, Warrantagem, Xaropear, 
Xifópago, Yanomami, Zaragatoa, Zeugma, Zoomórfico

https://www.todamateria.com.br/palavras-dificeis

2) I created the 16000 Hz audio.wav

I recorded Google Translate's voice with Audacity.

3) Whisper's 'speech to text' medium model

Model: "models/ggml-medium.bin"
Size: 1,497 MB

Result:

filantropo, filáucioso, grassolar, ebdomadário, orripilo, iconoclasta, 
idiosincrasia, inoco, jocoso, juvenilizante, cafcaesco, lancinante, 
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento, 
ósculo, prognóstico, putrefato, quimera, quinta essência, recôndito,
rufião, sectário, sumidade, taciturno, tergiversar, ufanismo, 
urdidura, verossimilhança, vicissitude, vitupério, uorantagem, charopia, 
chifópago, yanomame, zaragatoa, zeugma, zoomórfico.

Total errors: 179
Accuracy: -336.585%
Total time: 127835.84 ms

4) Whisper's small model

Model: "models/ggml-small.bin"
Size: 476 MB

Result:

filantropo, filaucioso, graçolar, hebidomadário, orripilo, iconoclasta, 
idiosincrasia, inocuo, jocoso, juvenilizante, cafkaesco, lancinante, 
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento, 
ósculo, prognóstico, putrefato, quimera, quintessência, recôndito, 
rufião, sectário, sumidade, taciturno, tergiversar, ufanismo, 
urdidura, verossimilhança, vicissitude, vitupério, o orantagem, xaropia, 
xifópago, ianomame, zaragatoa, zeugima, zoomórfico.

Total errors: 66
Accuracy: -60.9756%
Total time: 72895.73 ms

5) Whisper's small q4_1 model

Model: "models/ggml-small-q4_1.bin"
Size: 156 MB

Result:

filantropo, filaucioso, graçolar, hebidomadário, orripilo, iconoclasta, 
idiosincrasia, inocuo, jocoso, juvenilizante, cafcaesco, lancinante, 
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento, 
ósculo, prognóstico, putrefato, quimera, quintessência, recôndito, 
rufião, sectário, sumidade, taciturno, tergiversar,  ufanismo, 
urdidura, verossimilhança, vicissitude, vitupério, o orantagem, xaropia, 
xifópago, ianomame, zaragatoa, zeugima, zoomórfico.

Total errors: 67
Accuracy: -63.4146%
Total time: 37100.50 ms

6) Whisper's tiny q8_0 model

Model: "models/ggml-tiny.q8_0.bin"
Size: 42 MB

Result:

filan tropo, fila o sioso, graçoolar, é bedomadário, oripilo, e conoclasta, 
idiosincrasia, inoco, jocoso, juvenilizante, cafcaesco, lancinante, 
loucoas, mendacioso, modo rento, nitideficar, emumismática, odiento, 
osculo, prognóstico, putrefato, chimera, quintessência, reconditú,
cofiam, sectário, sumidade, taciturno, ter diversar, o fanismo, 
hurdidura, verosimiliança, visi-se tude, vitopério, o orantagem, 
xaropia, xifopago, e anomâmi, xaragatoa, zeugima, zomorfico.

Total errors: 362
Accuracy: -782.927%
Total time: 8692.68 ms
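
As a quick sanity check on the figures above: all four accuracy values are consistent with (words - errors) / words over the 41 reference words, which is why they go negative once the error count exceeds 41. How the errors themselves were counted is not stated, so this sketch (the function name is hypothetical) only reproduces the percentage arithmetic:

```cpp
#include <cassert>

// Accuracy as a percentage: (totalWords - totalErrors) / totalWords * 100.
// The value becomes negative when there are more errors than words.
// Hypothetical helper: it mirrors the reported numbers, not the original tool.
double accuracyPercent(int totalWords, int totalErrors) {
    assert(totalWords > 0);  // avoid dividing by zero
    return 100.0 * (totalWords - totalErrors) / totalWords;
}
```

For example, accuracyPercent(41, 66) gives roughly -60.9756, matching the small model's result.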

Now the real magic...

I put the list of words in a txt file as a reference.
I then used the Levenshtein distance algorithm to correct the inaccuracy, replacing each transcribed word with the closest word in the reference list:

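#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein (edit) distance.
// Note: it compares bytes, so in UTF-8 text an accented letter counts as two positions.
int levenshteinDistance(const std::string& s1, const std::string& s2) {
    const size_t len1 = s1.size(), len2 = s2.size();
    // d[i][j] = distance between the first i bytes of s1 and the first j bytes of s2
    std::vector<std::vector<int>> d(len1 + 1, std::vector<int>(len2 + 1));

    // Turning a prefix into the empty string (or back) costs its length
    for (size_t i = 0; i <= len1; ++i)
        d[i][0] = static_cast<int>(i);
    for (size_t j = 0; j <= len2; ++j)
        d[0][j] = static_cast<int>(j);

    for (size_t i = 1; i <= len1; ++i) {
        for (size_t j = 1; j <= len2; ++j) {
            d[i][j] = std::min({
                d[i - 1][j] + 1,                                     // deletion
                d[i][j - 1] + 1,                                     // insertion
                d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1] ? 1 : 0)   // substitution
            });
        }
    }
    return d[len1][len2];
}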

As a result, this humble 42 MB model achieved 100% accuracy,
adding only the milliseconds the algorithm takes to run.

In other words, the tiny model runs about 8x faster than the small model (8.7 s vs 72.9 s) at 11x smaller size, and about 15x faster than the medium model (8.7 s vs 127.8 s) at 35x smaller size.
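
The correction step itself is not shown in the post, so here is a minimal sketch of how it might look: each transcribed word is replaced by the reference word with the smallest edit distance. The names levenshtein and correctWord are hypothetical, and the tie-breaking rule (first match wins) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Byte-wise Levenshtein distance using two rolling rows instead of a full matrix.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Returns the reference word closest to `heard` (hypothetical helper).
// Ties go to the earliest entry; an empty reference list returns `heard` unchanged.
std::string correctWord(const std::string& heard,
                        const std::vector<std::string>& reference) {
    std::string best = heard;
    std::size_t bestDist = std::numeric_limits<std::size_t>::max();
    for (const std::string& candidate : reference) {
        const std::size_t dist = levenshtein(heard, candidate);
        if (dist < bestDist) {
            bestDist = dist;
            best = candidate;
        }
    }
    return best;
}
```

With a reference list containing "quimera" and "zeugma", the tiny model's "chimera" and "zeugima" would map back to the intended words. In practice the reference list and the transcription should be normalized the same way (for example, both lowercase) before comparing.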

Other use cases

Suppose I have a playlist:

Aquarela do Brasil - Ary Barroso
Garota de Ipanema - Tom Jobim
Construção - Chico Buarque
Águas de Março - Elis Regina e Tom Jobim
Carinhoso - Pixinguinha
Asa Branca - Luiz Gonzaga

For "word search", distance or similarity algorithms between two strings may not be the most appropriate choice.

A better algorithm is "head and tail splitting":

1) Split the text into words on " ", remove periods and accents, convert to lowercase, etc.
2) Divide each word in half:
"aquarela" -> { head: "aqua", tail: "rela" }

3) Compare only against words that are at most 2 characters longer or shorter, to avoid inappropriate comparisons such as "Tom" with "Tomate".

4) Exact match: "aquarela" == "aquarela" -> 3 points
5) Prefix match: "aquarelo" starts with "aqua" -> 1 point
6) Suffix match: "amarela" ends with "rela" -> 1 point

This way, even the wrong word "amarela" in the search is enough to return the intended result: "Aquarela do Brasil - Ary Barroso"
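
The scoring steps above can be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: the function names are hypothetical, input is assumed to be already normalized (lowercase, no accents or punctuation), odd-length words give the extra character to the tail, and a word matching both head and tail scores 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits already-normalized text on whitespace.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Scores one query word against one title word:
// 3 for an exact match, otherwise 1 for a matching head plus 1 for a matching tail.
int scoreWord(const std::string& query, const std::string& target) {
    const std::size_t longer = std::max(query.size(), target.size());
    const std::size_t shorter = std::min(query.size(), target.size());
    if (longer - shorter > 2) return 0;              // length guard (step 3)
    if (query == target) return 3;                   // exact match (step 4)
    const std::size_t half = target.size() / 2;
    const std::string head = target.substr(0, half);
    const std::string tail = target.substr(half);
    int score = 0;
    if (!head.empty() && query.compare(0, head.size(), head) == 0)
        ++score;                                     // prefix match (step 5)
    if (query.size() >= tail.size() &&
        query.compare(query.size() - tail.size(), tail.size(), tail) == 0)
        ++score;                                     // suffix match (step 6)
    return score;
}

// Returns the index of the best-scoring title, or -1 if nothing scores.
int bestTitle(const std::string& query, const std::vector<std::string>& titles) {
    const std::vector<std::string> queryWords = splitWords(query);
    int bestIndex = -1, bestScore = 0;
    for (std::size_t t = 0; t < titles.size(); ++t) {
        int total = 0;
        for (const std::string& target : splitWords(titles[t]))
            for (const std::string& q : queryWords)
                total += scoreWord(q, target);
        if (total > bestScore) {
            bestScore = total;
            bestIndex = static_cast<int>(t);
        }
    }
    return bestIndex;
}
```

With the playlist normalized to entries such as "aquarela do brasil ary barroso", the query "amarela" scores 1 point against "aquarela" (suffix "rela") and 0 against every other title, so the first entry is returned. Very short words can produce noisy matches, so a real implementation might skip them.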

Conclusion

Using a medium, large, etc. model for a voice assistant that only executes "known commands" (do this, do that) can be 10x or even 100x more inefficient, heavy and slow than necessary.

This is even more critical in embedded and IoT applications,
where memory and processing power are even scarcer resources.

Obviously it would be better to train a model to recognize only the necessary words. And even in this case, the model does not need to reach 95% accuracy,
because algorithms that measure the distance or similarity between two strings can correct the inaccuracy much better
when given the list of most-used words.

It's always good to rediscover the wheel.

Models
https://huggingface.co/ggerganov/whisper.cpp/tree/main
Whisper.cpp
https://github.com/ggerganov/whisper.cpp
Algorithms:
Levenshtein Distance, Jaro–Winkler Similarity, Jaccard Similarity, etc

When smart algorithms beat the brute force of artificial intelligence