https://arxiv.org/abs/2408.13106
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Self-supervised learning has proven to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization, etc. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, spoken language understanding, etc. Code and checkpoints will be publicly available via the NVIDIA NeMo framework.
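The fixed random-projection quantization mentioned in the abstract (in the style of BEST-RQ) can be sketched as below. All dimensions here are illustrative assumptions, not the actual NEST configuration: features are projected through a frozen random matrix and matched to the nearest entry of a frozen random codebook to produce discrete training targets.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, code_dim, codebook_size = 80, 16, 8192  # assumed sizes

# Frozen random projection and codebook: sampled once, never trained.
projection = rng.standard_normal((feat_dim, code_dim))
codebook = rng.standard_normal((codebook_size, code_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map (T, feat_dim) features to (T,) discrete target ids."""
    proj = features @ projection                       # (T, code_dim)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    # Nearest codebook entry by cosine similarity.
    return np.argmax(proj @ codebook.T, axis=1)

targets = quantize(rng.standard_normal((100, feat_dim)))
print(targets.shape)  # (100,)
```

Because nothing here is learned, there is no quantizer collapse to worry about and no clustering step to run, which is the simplicity the abstract refers to.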
New Whisper model large v3 turbo
It's interesting how non-optimal decisions require much more compute. For example, speech events are clearly non-uniformly distributed in time, so one has to model time offsets. Given that, it is strange that modern discrete codecs use large uniform frame shifts. A proper encoder should have blank tokens, or something like that, and a higher sampling rate.
https://github.com/theodorblackbird/lina-speech
lina-speech: linear attention based text-to-speech
Fun fact: the amount of fake content grows. For example, a user just sent me an article which mentions a Vosk Indonesian model and even gives a link to it. The problem is we never had one! The article is clearly autogenerated.
The CHiME Challenge is way more dense than Interspeech. The CHiME-8 workshop just ended:
https://www.chimechallenge.org/current/workshop/index
Congrats to the STC team; as usual they demonstrate top performance on CHiME tasks.
No publications yet, but even the keynote talk is interesting:
Teaching New Skills to Foundation Models: Insights and Experiences
Speaker: Hung-yi Lee
National Taiwan University (NTU)
https://www.chimechallenge.org/current/workshop/CHiME2024_Lee.pdf
CHiME Challenges and Workshops
CHiME 2024 Workshop
September 6, 2024, Kos International Convention Centre (KICC), Kos Island, Greece
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing
https://ieeexplore.ieee.org/document/10446727
William Chen; Takatomo Kano; Atsunori Ogawa; Marc Delcroix; Shinji Watanabe
The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging long-form contexts. A common solution has been to formulate long-form speech processing into a streaming problem, only using limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. We exploit developments in efficient attention mechanisms, such as Flash Attention, and show that Transformer-based models can be easily adapted to document-level processing. We experiment with methods to address the quadratic complexity of attention by replacing it with simpler alternatives. As such, our models can handle up to 30 minutes of speech during both training and testing. We evaluate our models on ASR, ST, and Speech Summarization (SSUM) using How2, TEDLIUM3, and SLUE-TED. With document-level context, our ASR models achieve 33.3% and 6.5% relative improvements in WER on How2 and TEDLIUM3 over prior work. Finally, we use our findings to propose a new attention-free self-supervised model, LongHuBERT, capable of handling long inputs. In doing so, we achieve state-of-the-art performance on SLUE-TED SSUM, outperforming cascaded systems that have dominated the benchmark.
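To see why document-level attention was long considered infeasible, here is a back-of-the-envelope estimate of the memory for a single self-attention matrix over a 30-minute recording. The frame rate and subsampling factor are my assumptions for illustration, not the paper's exact configuration:

```python
# Quadratic attention memory for a 30-minute input (assumed setup).
minutes = 30
frames_per_second = 100   # 10 ms hop, a common feature rate (assumed)
subsampling = 4           # typical encoder subsampling factor (assumed)

seq_len = minutes * 60 * frames_per_second // subsampling
attn_floats = seq_len * seq_len   # one (T, T) attention matrix
bytes_fp32 = attn_floats * 4      # fp32 storage

print(seq_len)                # 45000 frames
print(bytes_fp32 / 2**30)     # ~7.5 GiB, per head and per layer
```

Multiplied across heads and layers, materializing these matrices is hopeless, which is why the paper leans on Flash Attention (which never materializes the full matrix) and on simpler attention alternatives.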
People say this vocoder makes a good point by joining signal processing with neural techniques:
https://ast-astrec.nict.go.jp/demo_samples/firnet_icassp2024/
FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter
Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.
https://ast-astrec.nict.go.jp/release/preprints/preprint_icassp_2024_ohtani.pdf
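The source-filter idea behind FIRNet can be sketched in a few lines: an excitation signal is convolved, frame by frame, with time-variant FIR coefficients. In the real model the coefficients come from neural networks; here they are random stand-ins, and the frame and filter sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len, fir_taps, n_frames = 256, 64, 10  # illustrative sizes

# Stand-in for the mixed excitation signal (noise + pulses in the paper).
excitation = rng.standard_normal(frame_len * n_frames)
# Stand-in for network-predicted, time-variant FIR coefficients.
fir_per_frame = rng.standard_normal((n_frames, fir_taps))

frames = []
for i in range(n_frames):
    seg = excitation[i * frame_len:(i + 1) * frame_len]
    # Each frame is filtered with its own FIR coefficients,
    # which is what makes the filtering time-variant.
    frames.append(np.convolve(seg, fir_per_frame[i], mode="same"))
waveform = np.concatenate(frames)
print(waveform.shape)  # (2560,)
```

Since synthesis reduces to short convolutions rather than deep network inference per sample, the inference speed can approach legacy signal-processing vocoders, which is the paper's main claim.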
I tested Google ASR recently for English - Chirp, Conformer (latest version) and Gemini. Conformer is not good. Chirp is ok, somewhat better than Whisper V3.