1) I took a list of 41 difficult words in Portuguese:
Filantropo, Filaucioso, Graçolar, Hebdomadário, Horrípilo, Iconoclasta,
Idiossincrasia, Inócuo, Jocoso, Juvenilizante, Kafkaesco, Lancinante,
Loquaz, Mendacioso, Modorrento, Nitidificar, Numismática, Odiento,
Ósculo, Prognóstico, Putrefato, Quimera, Quintessência, Recôndito,
Rufião, Sectário, Sumidade, Taciturno, Tergiversar, Ufanismo,
Urdidura, Verossimilhança, Vicissitude, Vitupério, Warrantagem, Xaropear,
Xifópago, Yanomami, Zaragatoa, Zeugma, Zoomórfico
https://www.todamateria.com.br/palavras-dificeis
2) I created a 16,000 Hz audio.wav
by recording Google Translate's voice with Audacity
3) Whisper's speech-to-text medium model
Model: "models/ggml-medium.bin"
Size: 1.497 GB
Result:
filantropo, filáucioso, grassolar, ebdomadário, orripilo, iconoclasta,
idiosincrasia, inoco, jocoso, juvenilizante, cafcaesco, lancinante,
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento,
ósculo, prognóstico, putrefato, quimera, quinta essência, recôndito,
rufião, sectário, sumidade, taciturno, tergiversar, ufanismo,
urdidura, verossimilhança, vicissitude, vitupério, uorantagem, charopia,
chifópago, yanomame, zaragatoa, zeugma, zoomórfico.
Total errors: 179
Accuracy: -336.585%
Total time: 127835.84 ms
4) Whisper's small model
Model: "models/ggml-small.bin"
Size: 476 MB
Result:
filantropo, filaucioso, graçolar, hebidomadário, orripilo, iconoclasta,
idiosincrasia, inocuo, jocoso, juvenilizante, cafkaesco, lancinante,
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento,
ósculo, prognóstico, putrefato, quimera, quintessência, recôndito,
rufião, sectário, sumidade, taciturno, tergiversar, ufanismo,
urdidura, verossimilhança, vicissitude, vitupério, o orantagem, xaropia,
xifópago, ianomame, zaragatoa, zeugima, zoomórfico.
Total errors: 66
Accuracy: -60.9756%
Total time: 72895.73 ms
5) Whisper's small q4_1 model
Model: "models/ggml-small-q4_1.bin"
Size: 156 MB
Result:
filantropo, filaucioso, graçolar, hebidomadário, orripilo, iconoclasta,
idiosincrasia, inocuo, jocoso, juvenilizante, cafcaesco, lancinante,
loquaz, mendacioso, modorrento, nitidificar, numismática, odiento,
ósculo, prognóstico, putrefato, quimera, quintessência, recôndito,
rufião, sectário, sumidade, taciturno, tergiversar, ufanismo,
urdidura, verossimilhança, vicissitude, vitupério, o orantagem, xaropia,
xifópago, ianomame, zaragatoa, zeugima, zoomórfico.
Total errors: 67
Accuracy: -63.4146%
Total time: 37100.50 ms
6) Whisper's tiny q8_0 model
Model: "models/ggml-tiny.q8_0.bin"
Size: 42 MB
Result:
filan tropo, fila o sioso, graçoolar, é bedomadário, oripilo, e conoclasta,
idiosincrasia, inoco, jocoso, juvenilizante, cafcaesco, lancinante,
loucoas, mendacioso, modo rento, nitideficar, emumismática, odiento,
osculo, prognóstico, putrefato, chimera, quintessência, reconditú,
cofiam, sectário, sumidade, taciturno, ter diversar, o fanismo,
hurdidura, verosimiliança, visi-se tude, vitopério, o orantagem,
xaropia, xifopago, e anomâmi, xaragatoa, zeugima, zomorfico.
Total errors: 362
Accuracy: -782.927%
Total time: 8692.68 ms
Now the real magic...
I put the list of words in a .txt file as a reference
and used the Levenshtein distance algorithm to correct the inaccuracies:
#include <algorithm>  // std::min
#include <string>
#include <vector>

// Classic dynamic-programming edit distance: the minimum number of
// insertions, deletions, and substitutions to turn s1 into s2.
int levenshteinDistance(const std::string& s1, const std::string& s2) {
    const size_t len1 = s1.size(), len2 = s2.size();
    std::vector<std::vector<int>> d(len1 + 1, std::vector<int>(len2 + 1));
    for (size_t i = 0; i <= len1; ++i)
        d[i][0] = static_cast<int>(i);  // delete i characters
    for (size_t j = 0; j <= len2; ++j)
        d[0][j] = static_cast<int>(j);  // insert j characters
    for (size_t i = 1; i <= len1; ++i) {
        for (size_t j = 1; j <= len2; ++j) {
            d[i][j] = std::min({
                d[i - 1][j] + 1,                            // deletion
                d[i][j - 1] + 1,                            // insertion
                d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])  // substitution
            });
        }
    }
    return d[len1][len2];
}
As a result, this poor 42 MB model achieved 100% accuracy,
adding only a few milliseconds of algorithm execution.
In other words, it runs about 8x faster than the small model, at 1/11 the size, and about 15x faster than the medium model, at 1/35 the size.
Other use cases
If I have a playlist
Aquarela do Brasil - Ary Barroso
Garota de Ipanema - Tom Jobim
Construção - Chico Buarque
Águas de Março - Elis Regina e Tom Jobim
Carinhoso - Pixinguinha
Asa Branca - Luiz Gonzaga
In "word search", distance or similarity algorithms between two whole strings may not be the most appropriate.
A better algorithm: "head and tail splitting"
1) Split the text on " ", remove punctuation and accents, convert to lowercase, etc.
2) Divide the word in half:
"aquarela" == { head: "aqua", tail: "rela"}
3) Compare only against words whose length differs by at most 2 characters, to avoid inappropriate comparisons like "Tom" with "Tomate"
4) Exact match: "aquarela" == "aquarela" -> 3 points
5) Prefix match: "aquarelo" starts with the head "aqua" -> 1 point
6) Suffix match: "amarela" ends with the tail "rela" -> 1 point
This way, even the wrong word "amarela" in the search query is enough to return the intended result: "Aquarela do Brasil - Ary Barroso"
Conclusion
Using a medium, large, etc. model for a voice assistant that only executes known commands ("do this", "do that") can be 10x to 100x less efficient, heavier, and slower.
This is even more critical in embedded and IoT applications,
where memory and processing power are even scarcer resources.
Obviously it would be better to train a model that recognizes only the necessary words. And even then, the model does not need to reach 95% accuracy,
because algorithms that measure the distance or similarity between two strings can correct the inaccuracy much better
when given the list of the most-used words.
It's always good to rediscover the wheel.
Models
https://huggingface.co/ggerganov/whisper.cpp/tree/main
Whisper.cpp
https://github.com/ggerganov/whisper.cpp
Algorithms:
Levenshtein distance, Jaro–Winkler similarity, Jaccard similarity, etc.