Multi-modal transformers are rising fast. A great example is the Audio Spectrogram Transformer, an audio classification model that was just added to the Hugging Face Transformers library. The model first converts an audio clip into a spectrogram image, then classifies that image with a Vision Transformer. The results are impressive: the paper reports 98.1% accuracy on Speech Commands V2.
✅ Spaces demo: https://huggingface.co/spaces/juliensimon/keyword-spotting
✅ Model: https://huggingface.co/MIT/ast-finetuned-speech-commands-v2
✅ Paper: https://arxiv.org/abs/2104.01778
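
To make the "audio clip to spectrogram image to Vision Transformer" idea concrete, here is a rough sketch of the preprocessing side only. It is not the model's actual code. It assumes a plain log-magnitude STFT and non-overlapping 16x16 patches, whereas the paper uses log Mel filterbank features and overlapping patches. The function names `log_spectrogram` and `to_patches` are made up for this illustration.

```python
import numpy as np


def log_spectrogram(wave, frame_len=400, hop=160):
    """Log-magnitude STFT: rows are time frames, columns are frequency bins."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) on silent bins.
    return np.log(magnitude + 1e-6)


def to_patches(image, patch=16):
    """Cut a 2-D array into non-overlapping patch x patch tiles, one flat vector each."""
    h = (image.shape[0] // patch) * patch  # drop rows that do not fill a tile
    w = (image.shape[1] // patch) * patch  # drop columns that do not fill a tile
    tiles = image[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


# One second of a 440 Hz tone at 16 kHz stands in for a real recording.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(wave)   # the "image" of the audio clip
patches = to_patches(spec)     # the token sequence a ViT-style model would embed
print(spec.shape, patches.shape)
```

Each row of `patches` plays the role of one input token; a Vision Transformer would project these vectors, add position embeddings, and classify the sequence. In practice you would not write this by hand: the model linked above can be loaded through the Transformers `pipeline("audio-classification", ...)` API, which handles feature extraction for you.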