
Speech Technology

Locale: en
Subscribers: 1.32K
Category: technology
Description:
Apple's papers are always very practical. This one is also good, with many in-depth experiments and practical cases. Note that the biasing effect is minimal (WER usually only goes down a little, e.g., 17% -> 15%).



Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, Takaaki Hori, Dogan Can, Xiaodan Zhuang

Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval based shortlisting allows the system to efficiently leverage biasing catalogues of several thousands of entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
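A minimal sketch of the general idea (not the paper's exact algorithm): quantize the biasing-entry embeddings into a small codebook, score audio frames against the centroids instead of the full catalogue, and then run exact cross-attention only on the shortlisted entries. All names and sizes below are illustrative assumptions.

```python
# Sketch: VQ-based shortlisting of biasing entries before exact cross-attention.
import torch

def build_codebook(entry_emb, num_codes=256, iters=10):
    """Toy k-means quantizer over biasing-entry embeddings (N, D)."""
    idx = torch.randperm(entry_emb.size(0))[:num_codes]
    codebook = entry_emb[idx].clone()
    for _ in range(iters):
        assign = torch.cdist(entry_emb, codebook).argmin(dim=1)
        for c in range(num_codes):
            members = entry_emb[assign == c]
            if len(members) > 0:
                codebook[c] = members.mean(dim=0)
    return codebook, assign

def shortlist_entries(audio_frames, entry_emb, codebook, assign, top_codes=8):
    """Score frames against centroids only, then expand the best codes to entries."""
    scores = audio_frames @ codebook.T              # (T, num_codes) instead of (T, N)
    best = scores.max(dim=0).values.topk(top_codes).indices
    mask = torch.isin(assign, best)                 # entries whose code was retrieved
    return entry_emb[mask], mask.nonzero(as_tuple=True)[0]
```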
11/7/2024, 9:35:58 PM
It is simply bad



Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke

Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and different speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that the SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations within in-domain data, such as changes in speed or the presence of additive noise, can significantly impact performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
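For reference, the SLAM-ASR recipe boils down to a frozen speech encoder, a frozen LLM, and a trainable linear connector in between. The sketch below assumes illustrative dimensions and frame stacking; it is not the official implementation.

```python
# Hedged sketch of a SLAM-ASR-style linear connector between a speech encoder and an LLM.
import torch
import torch.nn as nn

class LinearConnector(nn.Module):
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=5):
        super().__init__()
        self.stack = stack                          # frame stacking to shorten the sequence
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, speech_feats):                # (B, T, enc_dim) encoder outputs
        B, T, D = speech_feats.shape
        T = T - T % self.stack
        x = speech_feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                         # (B, T', llm_dim), fed as LLM input embeddings

# Training (conceptually): embed the text prompt and reference transcript with the LLM's own
# embedding table, concatenate [prompt; speech; transcript] embeddings, and compute
# next-token cross-entropy only over the transcript positions, updating only the connector.
```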
11/7/2024, 7:26:06 PM
Overall, we find no evidence that multiscale aspects of MR-HuBERT lead to improved acquisition of high-level concepts. The question now is how to build an architecture that does leverage this hierarchy? 🤔 (4/5)



11/3/2024, 2:59:42 PM
Fish Agent V0.1 3B is a groundbreaking Voice-to-Voice model capable of capturing and generating environmental audio information with unprecedented accuracy. What sets it apart is its semantic-token-free architecture, eliminating the need for traditional semantic encoders/decoders like Whisper and CosyVoice.

Additionally, it stands as a state-of-the-art text-to-speech (TTS) model, trained on an extensive dataset of 700,000 hours of multilingual audio content.

This model is a continued pretraining of Qwen-2.5-3B-Instruct on 200B voice & text tokens.

11/3/2024, 8:16:47 AM
Even with our new speech codec, producing a 2-minute dialogue requires generating over 5000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
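The post gives no architectural details, so the sketch below only illustrates one common generic pattern for hierarchical acoustic tokens (a large temporal Transformer over frames plus a small depth Transformer over codebook levels, in the RQ-Transformer style), not the model described above. Dimensions are made up and causal masking is omitted for brevity.

```python
# Generic coarse/fine split for modeling long sequences of multi-level acoustic tokens.
import torch
import torch.nn as nn

class HierarchicalTokenLM(nn.Module):
    def __init__(self, vocab=1024, levels=4, d_model=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.lvl = nn.Embedding(levels, d_model)      # distinguishes codebook levels
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        self.depth = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                        # (B, T, L) code ids per frame
        B, T, L = tokens.shape
        lvl_ids = torch.arange(L, device=tokens.device)
        x = (self.tok(tokens) + self.lvl(lvl_ids)).sum(dim=2)     # one vector per frame
        ctx = self.temporal(x)                                     # long-range frame context
        d_in = self.tok(tokens) + self.lvl(lvl_ids) + ctx.unsqueeze(2)
        logits = self.head(self.depth(d_in.reshape(B * T, L, -1)))
        return logits.reshape(B, T, L, -1)            # per-level code predictions
```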

11/2/2024, 8:14:53 AM
Nice paper with a few interesting details. The extra CTC head for Whisper stabilization is interesting, for example.



Target Speaker ASR with Whisper

Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our target-speaker ASR model can be used for speaker-attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker-attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
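A hedged sketch of the core trick: map each frame's diarization label to a learned bias vector and add it to the encoder activations before the first Transformer block. The label set and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Per-frame diarization-conditioned bias added before the first encoder block.
import torch
import torch.nn as nn

class DiarizationBias(nn.Module):
    def __init__(self, num_labels=4, d_model=1280):   # e.g. target / non-target / overlap / silence
        super().__init__()
        self.bias = nn.Embedding(num_labels, d_model)
        nn.init.zeros_(self.bias.weight)               # start as a no-op on the frozen model

    def forward(self, enc_feats, diar_labels):
        # enc_feats: (B, T, d_model) encoder activations before block 0
        # diar_labels: (B, T) integer label per frame from an external diarization system
        return enc_feats + self.bias(diar_labels)
```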
10/30/2024, 9:45:15 PM


A Survey on Speech Large Language Models

Jing Peng, Yucheng Wang, Yu Xi, Xu Li, Xizhuo Zhang, Kai Yu

Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the broader field of Spoken Language Understanding (SLU). Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as for LLM applications in multimodal contexts.
10/28/2024, 10:14:48 PM
"We don't want 200ms latency, that's just not useful" Will Williams is CTO of Speechmatics in Cambridge. In this sponsored episode - he shares deep technical insights into modern speech recognition technology and system architecture. The episode covers several…
10/25/2024, 8:22:22 AM




Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe

Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings, using both in-domain data and generated synthetic data. Our results show that the proposed method is able to significantly outperform classical multi-speaker generation approaches that use external, non-conversational speech datasets.
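A high-level skeleton of such a pipeline; the `llm` and `tts` objects and their methods are hypothetical placeholders, not the paper's code.

```python
# Skeleton: LLM writes two-speaker transcripts, a conversational multi-speaker TTS
# renders them, and the result becomes fine-tuning data for an ASR model such as Whisper.
def generate_synthetic_corpus(llm, tts, n_dialogues=1000):
    corpus = []
    for _ in range(n_dialogues):
        # 1) content creation: prompt the LLM for a plausible two-speaker conversation
        transcript = llm.generate(
            "Write a short, natural phone conversation between speakers A and B.")
        # 2) speech synthesis: render the turns with a conversational multi-speaker TTS
        audio = tts.synthesize(transcript, speakers=["A", "B"])
        corpus.append({"audio": audio, "text": transcript})
    return corpus

# 3) fine-tune the ASR model on `corpus`, optionally mixed with a small amount of real
#    in-domain data, and evaluate on the target conversational test set.
```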
10/24/2024, 6:01:01 PM
State-space model for real-time TTS



In experiments so far, we've found that we can simultaneously improve model quality, inference speed, throughput, and latency compared to widely used Transformer implementations for audio generation. A parameter-matched Cartesia model trained on Multilingual Librispeech for one epoch achieves 20% lower validation perplexity. On downstream evaluations, this results in a 2x lower word error rate and a 1 point higher quality score (out of 5, as measured on the NISQA evaluation). At inference, it achieves lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and higher throughput (4x).
5/29/2024, 10:49:54 PM
From Apple, a quite in-depth paper on an alternative to LM rescoring. Feels like this is going to be a general direction in the coming years.



Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several key ingredients: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on Librispeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.
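The data recipe from the abstract in a few lines; `tts` and `asr` here are hypothetical placeholder objects, not Apple's pipeline.

```python
# Sketch of DLM training-pair generation: TTS turns text into audio, an ASR system
# produces noisy hypotheses, and (hypothesis, original text) pairs supervise the
# seq2seq error-correction model.
def make_dlm_pairs(texts, tts, asr):
    pairs = []
    for text in texts:
        audio = tts.synthesize(text)          # multi-speaker TTS in the paper
        hypothesis = asr.transcribe(audio)    # noisy ASR output
        pairs.append((hypothesis, text))      # input -> target for the denoising LM
    return pairs
```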
5/27/2024, 2:26:47 PM
LLMs are the frontier in TTS too (true LLMs, not GPT). SLAM can do it too, btw. A Microsoft paper.



RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
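A hedged sketch of the duration-guided attention idea: use the predicted per-phoneme durations to work out which phoneme each speech-token step belongs to, then restrict attention to a small window of phonemes around it. The window size and mask shape are illustrative, not RALL-E's exact formulation.

```python
# Build a boolean speech-to-phoneme attention mask from predicted phoneme durations.
import torch

def duration_guided_mask(durations, window=1):
    """durations: (P,) LongTensor of predicted frames per phoneme -> (T, P) mask."""
    centers = torch.repeat_interleave(torch.arange(len(durations)), durations)  # phoneme id per frame
    phon = torch.arange(len(durations)).unsqueeze(0)        # (1, P)
    return (phon - centers.unsqueeze(1)).abs() <= window    # (T, P), True = allowed to attend

# Example: mask = duration_guided_mask(torch.tensor([3, 2, 4]), window=1)
```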
5/25/2024, 1:01:24 AM
Sometimes the world tells you something. Three unrelated sources on sonification I came across last week:

Photo sonification
(from Russian)

Images that Sound: Composing Images and Sounds on a Single Canvas
(demo)

Sound training platform applied to astronomy


Time to remember Myst and the Pythagoreans.
5/25/2024, 12:55:42 AM
WeNet trained a 1B mixture-of-experts model with good results.



U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF

Xingchen Song, Di Wu, Binbin Zhang, Dinghao Zhou, Zhendong Peng, Bo Dang, Fuping Pan, Chao Yang

Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) has been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary; an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the ASR task. To be more specific, we benchmark our proposed model on a large-scale internal dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
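The recipe from the abstract in code form: keep the Conformer as-is and swap every feed-forward layer for an MoE layer (a router plus a few expert FFNs). This is a generic top-k MoE sketch with illustrative sizes, not the WeNet implementation.

```python
# Drop-in MoE replacement for a dense FFN block.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                                  # (B, T, d_model)
        gate = self.router(x).softmax(dim=-1)              # (B, T, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)       # route each frame to its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                  # frames assigned to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```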
4/29/2024, 1:07:44 AM
Unofficial implementation of the WaveNext neural vocoder (WIP)



WaveNext: ConvNext-Based fast neural vocoder without ISTFT layer

WaveNext proposed replacing the final ISTFT layer of Vocos with a bias-free linear layer followed by a reshape op. As this is a slight modification of Vocos, we just use the official Vocos implementation and add the WaveNext head in wavenext_pytorch/vocos/.
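A sketch of that head under the stated description, assuming a Vocos-style backbone that outputs frame-level features: a bias-free linear layer maps each frame to hop_length samples and a reshape flattens the frames into the waveform. The dimensions are illustrative, not the repo's exact values.

```python
# WaveNext-style head: linear (no bias) + reshape, replacing the ISTFT layer.
import torch
import torch.nn as nn

class WaveNextHead(nn.Module):
    def __init__(self, dim=512, hop_length=256):
        super().__init__()
        self.linear = nn.Linear(dim, hop_length, bias=False)   # no bias, per the description

    def forward(self, features):             # (B, T_frames, dim) from the ConvNeXt backbone
        samples = self.linear(features)      # (B, T_frames, hop_length)
        return samples.reshape(samples.size(0), -1)             # (B, T_frames * hop_length) waveform
```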
4/23/2024, 12:44:05 PM
📣 We are delighted to announce that we will be hosting the Text-dependent Speaker Verification (TdSV) Challenge 2024 in conjunction with SLT 2024. Following the success of two previous Short-duration Speaker Verification Challenges, the TdSV Challenge 2024 aims to focus on the relevance of recent training strategies, such as self-supervised learning.

🏆 The challenge evaluates the TdSV task in two practical scenarios: conventional TdSV using predefined passphrases (Task 1) and TdSV using user-defined passphrases (Task 2). Three cash prizes will be given away for each task ($7,000 in total) based on the results on the evaluation dataset and other qualitative factors.

4/22/2024, 6:49:02 AM
A guy recently shared 4-bit versions of the Whisper V3 models, which made me go back to the Whisper libraries and retest them.



The overall state is that there is still a lot of work to do.



But 4-bit Whisper is recommended; it works pretty well.
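One way to try 4-bit Whisper (not necessarily the setup referenced above) is to load it through Hugging Face transformers with a bitsandbytes 4-bit quantization config:

```python
# Load Whisper large-v3 with 4-bit weights (requires bitsandbytes and a GPU).
import torch
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration, WhisperProcessor

quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3", quantization_config=quant_cfg, device_map="auto")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
# Then run the usual processor -> model.generate() pipeline on 16 kHz audio.
```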
4/20/2024, 11:33:20 PM
Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing



William Chen, Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe

The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging long-form contexts. A common solution has been to formulate long-form speech processing into a streaming problem, only using limited prior context. We propose a new and simple paradigm, encoding entire documents at once, which has been unexplored in Automatic Speech Recognition (ASR) and Speech Translation (ST) due to its technical infeasibility. We exploit developments in efficient attention mechanisms, such as Flash Attention, and show that Transformer-based models can be easily adapted to document-level processing. We experiment with methods to address the quadratic complexity of attention by replacing it with simpler alternatives. As such, our models can handle up to 30 minutes of speech during both training and testing. We evaluate our models on ASR, ST, and Speech Summarization (SSUM) using How2, TEDLIUM3, and SLUE-TED. With document-level context, our ASR models achieve 33.3% and 6.5% relative improvements in WER on How2 and TEDLIUM3 over prior work. Finally, we use our findings to propose a new attention-free self-supervised model, LongHuBERT, capable of handling long inputs. In doing so, we achieve state-of-the-art performance on SLUE-TED SSUM, outperforming cascaded systems that have dominated the benchmark.
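The enabling trick is memory-efficient attention; in PyTorch, scaled_dot_product_attention can dispatch to FlashAttention-style fused kernels, which is what makes 30-minute inputs (tens of thousands of frames) tractable. The sizes below are illustrative and a GPU is assumed.

```python
# Fused attention over a very long sequence without materializing the T x T matrix.
import torch
import torch.nn.functional as F

B, H, T, D = 1, 8, 45000, 64          # roughly 30 min of 25 Hz frames, per-head dim 64
q = torch.randn(B, H, T, D, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v)   # memory-efficient / flash kernel under the hood
```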
4/19/2024, 12:22:27 PM
Pleiasfr releases a massive open corpus of 2 million YouTube videos under Creative Commons (CC-BY) on Huggingface. YouTube-Commons features 30 billion words of audio transcriptions in multiple languages, with other modalities to follow soon.
4/19/2024, 12:10:10 PM
People say this vocoder is onto something, joining signal processing with neural techniques.



FIRNet: Fast and pitch controllable neural vocoder with trainable finite impulse response filter

Some neural vocoders with fundamental frequency (f0) control have succeeded in performing real-time inference on a single CPU while preserving the quality of the synthetic speech. However, compared with legacy vocoders based on signal processing, their inference speeds are still low. This paper proposes a neural vocoder based on the source-filter model with trainable time-variant finite impulse response (FIR) filters, to achieve a similar inference speed to legacy vocoders. In the proposed model, FIRNet, multiple FIR coefficients are predicted using the neural networks, and the speech waveform is then generated by convolving a mixed excitation signal with these FIR coefficients. Experimental results show that FIRNet can achieve an inference speed similar to legacy vocoders while maintaining f0 controllability and natural speech quality.
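A sketch of the filtering step as described (the coefficient-predicting network is omitted): convolve a mixed excitation signal frame by frame with the predicted time-variant FIR coefficients. The hop size and tap count are illustrative assumptions.

```python
# Frame-wise time-variant FIR filtering of an excitation signal (slow reference version).
import torch
import torch.nn.functional as F

def apply_time_variant_fir(excitation, fir_coeffs, hop=256):
    """excitation: (T*hop,) signal; fir_coeffs: (T, taps), one predicted filter per frame."""
    T, taps = fir_coeffs.shape
    out = torch.zeros_like(excitation)
    for t in range(T):                                    # frame-by-frame for clarity
        seg = excitation[t * hop : (t + 1) * hop]
        filt = F.conv1d(seg.view(1, 1, -1),
                        fir_coeffs[t].flip(0).view(1, 1, -1),   # flip for true convolution
                        padding=taps // 2)
        out[t * hop : (t + 1) * hop] = filt.view(-1)[:hop]
    return out
```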

4/16/2024, 11:48:17 AM
