BLOGS

LASPA: Breaking Language Barriers in Speaker Recognition with Prefix-Tuned Cross-Attention

Aditya Srinivas Menon*, Raj Prakash Gohil*, Kumud Tripathi, Pankaj Wasnik

30th September 2024

Raj Prakash Gohil summarizes the paper LASPA: Breaking Language Barriers in Speaker Recognition with Prefix-Tuned Cross-Attention, co-authored by Aditya Srinivas Menon, Raj Prakash Gohil, Kumud Tripathi, and Pankaj Wasnik, accepted at the 26th edition of Interspeech | August 17-21, 2025.

Introduction

Imagine a world where your voice unlocks your devices, regardless of the language you speak. Speaker recognition systems are everywhere—from smart assistants to security systems. But what happens when you switch languages? Traditional systems often stumble, mistaking language-induced changes in your voice for the voice of a different person.

Enter LASPA: a novel approach that disentangles speaker identity from language, enabling robust, language-agnostic speaker recognition—even when you switch between languages.

The Challenge: Language Entanglement in Speaker Recognition

Speaker recognition models extract “embeddings”—compact representations of your voice. But these embeddings often mix up two things:

  • Speaker traits (your unique vocal anatomy, timbre, etc.)
  • Linguistic traits (accent, phonetic structure, intonation)

When you speak a different language, your voice changes—not because you’re a different person, but because the language demands different sounds and rhythms. Traditional models can’t always tell the difference, leading to errors.

The Solution: LASPA’s Disentanglement Strategy

LASPA (Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention) introduces a joint learning strategy that separates speaker and language information.

Key Innovations

  1. Prefix-Tuned Cross-Attention: Instead of updating all model parameters, LASPA uses “prefix vectors” to guide attention, efficiently fusing speaker and language features.
  2. Dual Encoders:
    • Speaker Encoder: Extracts speaker-specific features.
    • Language Encoder: Extracts language-specific features.
  3. Prefix-Tuners:
    • Facilitate focused interaction between speaker and language features.
    • Enable the model to “pay attention” to the right information for each task.
  4. Multi-Task Loss Functions:
    • AAM Softmax: For speaker classification.
    • Negative Log Likelihood (NLL): For language classification.
    • Mean Absolute Pearson’s Correlation (MAPC): To encourage disentanglement.
    • Mean Squared Error (MSE): For reconstructing the input spectrogram.
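To make the multi-task objective concrete, here is a minimal numpy sketch of three of the loss terms and how they might be combined. The AAM-softmax speaker term is omitted for brevity, the tensor shapes and equal weighting are assumptions for illustration, and all names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def mapc(spk_emb, lang_emb):
    """Mean Absolute Pearson Correlation between every speaker-embedding
    dimension and every language-embedding dimension (batch on axis 0).
    Minimizing it pushes the two embeddings toward independence."""
    s = (spk_emb - spk_emb.mean(0)) / (spk_emb.std(0) + 1e-8)
    l = (lang_emb - lang_emb.mean(0)) / (lang_emb.std(0) + 1e-8)
    corr = s.T @ l / len(spk_emb)        # (Ds, Dl) correlation matrix
    return np.abs(corr).mean()

def nll(log_probs, labels):
    """Negative log-likelihood for language classification."""
    return -log_probs[np.arange(len(labels)), labels].mean()

def mse(recon, target):
    """Reconstruction error on the mel-spectrogram."""
    return ((recon - target) ** 2).mean()

# Toy batch: 8 utterances, 192-dim speaker / 64-dim language embeddings.
spk = rng.normal(size=(8, 192))
lang = rng.normal(size=(8, 64))
logp = np.log(rng.dirichlet(np.ones(6), size=8))   # 6 candidate languages
y = rng.integers(0, 6, size=8)
mel = rng.normal(size=(8, 80, 100))
mel_hat = rng.normal(size=(8, 80, 100))

# Equal weights are an assumption; the paper may weight terms differently.
total = nll(logp, y) + mapc(spk, lang) + mse(mel_hat, mel)
print(round(total, 3))
```

The key disentanglement signal is the MAPC term: when speaker and language embeddings stop co-varying, the correlation matrix (and thus the loss) shrinks toward zero.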

How LASPA Works: Architecture Overview


Fig. 1: Block diagram of the proposed LASPA architecture

Step-by-step:

  1. Input: Audio waveform is converted to a mel-spectrogram.
  2. Speaker & Language Encoders: Extract separate embeddings.
  3. Prefix-Tuners:
    • Two cross-feature prefix-tuners blend speaker and language information using multi-head attention.
  4. Decoder:
    • Concatenated embeddings are used to reconstruct the original mel-spectrogram.
  5. Training:
    • Multiple loss functions ensure the model learns to separate speaker and language features.
  6. Inference:
    • Only the speaker encoder is used, ensuring fast and language-agnostic speaker embedding extraction.
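The steps above can be sketched in numpy. This is a simplified single-head version of the cross-feature prefix-tuners (the paper uses multi-head attention); the dimensions, prefix count, and variable names are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64          # embedding dimension (illustrative)
P = 4           # number of learned prefix vectors (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefix_cross_attention(query, context, prefix):
    """Cross-attention where learned prefix vectors are prepended to the
    context: the query attends over [prefix; context] keys/values, so a
    small number of trainable vectors steers the feature fusion."""
    kv = np.concatenate([prefix, context], axis=0)   # (P + T, D)
    scores = query @ kv.T / np.sqrt(D)               # (Tq, P + T)
    return softmax(scores, axis=-1) @ kv             # (Tq, D)

# Frame-level outputs of the two encoders (shapes assumed).
spk_feats = rng.normal(size=(10, D))     # speaker encoder
lang_feats = rng.normal(size=(10, D))    # language encoder

# Two cross-feature prefix-tuners, one per direction.
prefix_s2l = rng.normal(size=(P, D))
prefix_l2s = rng.normal(size=(P, D))

spk_refined = prefix_cross_attention(spk_feats, lang_feats, prefix_s2l)
lang_refined = prefix_cross_attention(lang_feats, spk_feats, prefix_l2s)

# Concatenated embeddings feed the decoder for mel reconstruction.
decoder_input = np.concatenate([spk_refined, lang_refined], axis=-1)
print(decoder_input.shape)   # (10, 128)
```

Note the efficiency angle: only the `prefix_*` matrices (P × D values each) need gradient updates for the fusion step, which is why prefix-tuning adds so few parameters relative to full fine-tuning.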

Datasets Used:

  • VoxCeleb1, VoxCeleb2: Large-scale speaker recognition datasets.
  • VoxSRC 2020/2021: Multilingual speaker recognition challenges.
  • NISP-B: Multilingual dataset with English, Hindi, Kannada, Malayalam, Tamil, Telugu.
  • DISPLACE: For speaker diarization (who spoke when).

Performance Metrics

  • EER (Equal Error Rate): Lower is better.
  • minDCF (Minimum Decision Cost Function): Lower is better.
  • SLR (Spoken Language Recognition) Accuracy: Lower means less language information leaks into the speaker embeddings.
  • Cosine Similarity: Higher means more consistent speaker embeddings across languages.
  • DER (Diarization Error Rate): Lower is better.
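For readers unfamiliar with EER: it is the operating point where the false-acceptance rate (FAR) equals the false-rejection rate (FRR). A small self-contained sketch, with invented toy verification scores:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER via a threshold sweep: find where false-acceptance rate (FAR)
    and false-rejection rate (FRR) cross, and return the rate there.
    labels: 1 = same-speaker trial, 0 = impostor trial."""
    order = np.argsort(scores)[::-1]            # high score = "same speaker"
    labels = np.asarray(labels)[order]
    # Accepting the top i trials at each candidate threshold:
    fa = np.cumsum(labels == 0)                 # impostors accepted
    fr = labels.sum() - np.cumsum(labels == 1)  # targets rejected
    far = fa / max((labels == 0).sum(), 1)
    frr = fr / max(labels.sum(), 1)
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy trials: a system that mostly, but not perfectly, ranks targets higher.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
print(equal_error_rate(scores, labels))   # 0.25
```

A language-agnostic system should keep this number low even when enrollment and test utterances are in different languages, which is exactly the cross-lingual condition LASPA targets.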

Results

[Result tables from the paper]

Why Does LASPA Matter?

  • Language Agnostic: Recognizes you, not your language.
  • Efficient: Prefix-tuners add just 1.16% to model parameters.
  • Generalizes Well: Works on unseen languages and challenging multilingual datasets.
  • Better Diarization: Outperforms baselines in “who spoke when” tasks.

Conclusion & Future Directions

LASPA is a leap forward for speaker recognition in our multilingual world. By disentangling speaker and language information using prefix-tuned cross-attention, it delivers robust, efficient, and language-agnostic performance.

What’s next? The authors plan to further refine prefix-tuning and explore its applications in speaker verification and related tasks.

Citation

@misc{menon2025laspalanguageagnosticspeaker,
      title={LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention},
      author={Aditya Srinivas Menon and Raj Prakash Gohil and Kumud Tripathi and Pankaj Wasnik},
      year={2025},
      eprint={2506.02083},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.02083},
}

