https://arxiv.org/abs/2408.13106
NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
He Huang, Taejin Park, Kunal Dhawan, Ivan Medennikov, Krishna C. Puvvada, Nithin Rao Koluguri, Weiqing Wang, Jagadeesh Balam, Boris Ginsburg
Self-supervised learning has proven to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization, etc. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, spoken language understanding, etc. Code and checkpoints will be publicly available via the NVIDIA NeMo framework.
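The fixed random-projection quantization mentioned in the abstract (in the style of BEST-RQ) can be sketched as below. All dimensions here are illustrative assumptions, not the actual NEST configuration: features are projected through a frozen random matrix and matched to the nearest entry of a frozen random codebook to produce discrete training targets.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, code_dim, codebook_size = 80, 16, 8192  # assumed sizes

# Frozen random projection and codebook: sampled once, never trained.
projection = rng.standard_normal((feat_dim, code_dim))
codebook = rng.standard_normal((codebook_size, code_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map (T, feat_dim) features to (T,) discrete target ids."""
    proj = features @ projection                       # (T, code_dim)
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    # Nearest codebook entry by cosine similarity.
    return np.argmax(proj @ codebook.T, axis=1)

targets = quantize(rng.standard_normal((100, feat_dim)))
print(targets.shape)  # (100,)
```

Because nothing here is learned, there is no quantizer collapse to worry about and no clustering step to run, which is the simplicity the abstract refers to.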
New Whisper model large v3 turbo
It's interesting how non-optimal decisions require much more compute. For example, speech events are clearly non-uniformly distributed in time, so one has to model time offsets. Given that, it is strange that modern discrete codecs use large uniform frame shifts. A proper encoder should have blank tokens, or something like that, and a higher sampling rate.
https://github.com/theodorblackbird/lina-speech
lina-speech: linear attention based text-to-speech
Fun fact: the amount of fake content grows. For example, a user just sent me an article which mentions a Vosk Indonesian model and even gives a link to it. The problem is we never had one! The article is clearly autogenerated.
The CHiME Challenge is way more dense than Interspeech. The CHiME-8 workshop just ended:
https://www.chimechallenge.org/current/workshop/index
Congrats to the STC team; as usual they demonstrate top performance on CHiME tasks.
No publications yet, but even the keynote talk is interesting:
Teaching New Skills to Foundation Models: Insights and Experiences
Speaker: Hung-yi Lee
National Taiwan University (NTU)
https://www.chimechallenge.org/current/workshop/CHiME2024_Lee.pdf
CHiME Challenges and Workshops
CHiME 2024 Workshop
September 6, 2024, Kos International Convention Centre (KICC), Kos Island, Greece
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing
https://ieeexplore.ieee.org/document/10446727
William Chen; Takatomo Kano; Atsunori Ogawa; Marc Delcroix; Shinji Watanabe
The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging long-form contexts. A common solution has been to formulate long-form speech processing into a streaming problem, only using limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. We exploit developments in efficient attention mechanisms, such as Flash Attention, and show that Transformer-based models can be easily adapted to document-level processing. We experiment with methods to address the quadratic complexity of attention by replacing it with simpler alternatives. As such, our models can handle up to 30 minutes of speech during both training and testing. We evaluate our models on ASR, ST, and Speech Summarization (SSUM) using How2, TEDLIUM3, and SLUE-TED. With document-level context, our ASR models achieve 33.3% and 6.5% relative improvements in WER on How2 and TEDLIUM3 over prior work. Finally, we use our findings to propose a new attention-free self-supervised model, LongHuBERT, capable of handling long inputs. In doing so, we achieve state-of-the-art performance on SLUE-TED SSUM, outperforming cascaded systems that have dominated the benchmark.
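To see why document-level attention was long considered infeasible, here is a back-of-the-envelope estimate of the memory for a single self-attention matrix over a 30-minute recording. The frame rate and subsampling factor are my assumptions for illustration, not the paper's exact configuration:

```python
# Quadratic attention memory for a 30-minute input (assumed setup).
minutes = 30
frames_per_second = 100   # 10 ms hop, a common feature rate (assumed)
subsampling = 4           # typical encoder subsampling factor (assumed)

seq_len = minutes * 60 * frames_per_second // subsampling
attn_floats = seq_len * seq_len   # one (T, T) attention matrix
bytes_fp32 = attn_floats * 4      # fp32 storage

print(seq_len)                # 45000 frames
print(bytes_fp32 / 2**30)     # ~7.5 GiB, per head and per layer
```

Multiplied across heads and layers, materializing these matrices is hopeless, which is why the paper leans on Flash Attention (which never materializes the full matrix) and on simpler attention alternatives.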
People say this vocoder makes a good point by joining signal processing with neural techniques:
https://ast-astrec.nict.go.jp/demo_samples/firnet_icassp2024/
FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter
Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.
https://ast-astrec.nict.go.jp/release/preprints/preprint_icassp_2024_ohtani.pdf
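The source-filter idea behind FIRNet can be sketched in a few lines: an excitation signal is convolved, frame by frame, with time-variant FIR coefficients. In the real model the coefficients come from neural networks; here they are random stand-ins, and the frame and filter sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len, fir_taps, n_frames = 256, 64, 10  # illustrative sizes

# Stand-in for the mixed excitation signal (noise + pulses in the paper).
excitation = rng.standard_normal(frame_len * n_frames)
# Stand-in for network-predicted, time-variant FIR coefficients.
fir_per_frame = rng.standard_normal((n_frames, fir_taps))

frames = []
for i in range(n_frames):
    seg = excitation[i * frame_len:(i + 1) * frame_len]
    # Each frame is filtered with its own FIR coefficients,
    # which is what makes the filtering time-variant.
    frames.append(np.convolve(seg, fir_per_frame[i], mode="same"))
waveform = np.concatenate(frames)
print(waveform.shape)  # (2560,)
```

Since synthesis reduces to short convolutions rather than deep network inference per sample, the inference speed can approach legacy signal-processing vocoders, which is the paper's main claim.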
I tested Google ASR recently for English - Chirp, Conformer (latest version) and Gemini. Conformer is not good. Chirp is ok, somewhat better than Whisper V3.