Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Submitted to Interspeech 2026
ABX error rate (%) by layer for all 17 S3Ms and 2 baselines. Lower is better. Labeled dots mark each model's best layer. The human baseline is the dashed vermillion line. Click legend entries to isolate or hide individual models.
Best-layer ABX error rate (%), ranked best to worst. The human baseline is the dashed vertical line.
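The ABX error rate reported in these figures can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes each item is a (T, D) array of frame representations and uses mean-pooled cosine distance as the dissimilarity (the actual method may instead align frames, e.g. with DTW). In each (A, B, X) triple, A and X share the prosodic category and B differs; an error is counted when X lands closer to B.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two pooled representation vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error(triples):
    """Percentage of (A, B, X) triples where X is closer to B than to A.

    Each element of a triple is a (T, D) array of frame representations,
    mean-pooled over time here as a simplification (assumption: the
    paper's distance may use frame-level alignment instead).
    """
    errors = 0
    for a, b, x in triples:
        a_vec, b_vec, x_vec = (m.mean(axis=0) for m in (a, b, x))
        if cosine_dist(x_vec, b_vec) < cosine_dist(x_vec, a_vec):
            errors += 1
    return 100.0 * errors / len(triples)
```

Averaging this error over many triples per layer yields the layer-wise curves shown above; the best layer per model is the one minimizing this rate.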
Layer-wise performance on natural vs. synthesized speech for a selected model. For English, both G-TTS (solid) and Kokoro (dotted) are shown.
Models ordered best to worst by natural-speech error rate.
Out-of-context ABX clips the target word from the audio before encoding. In-context ABX encodes the full utterance and then compares frames at the target word's position. Only the Japanese pitch-accent dataset uses identical carrier sentences, which makes this comparison possible.
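The two extraction modes can be sketched as below. This is a hedged illustration with assumed constants: a 16 kHz sample rate and a 50 Hz frame rate (typical for S3Ms with a 20 ms stride; the actual models' rates may differ), and `encode` stands for any hypothetical waveform-to-frames encoder.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz (assumption)
FRAME_RATE = 50      # frames/s, i.e. a 20 ms stride (assumption)

def out_of_context(encode, wav, start_s, end_s):
    """Clip the target word from the waveform, then encode only the clip."""
    clip = wav[int(start_s * SAMPLE_RATE):int(end_s * SAMPLE_RATE)]
    return encode(clip)

def in_context(encode, wav, start_s, end_s):
    """Encode the full utterance, then keep the frames at the word position."""
    frames = encode(wav)
    return frames[int(start_s * FRAME_RATE):int(end_s * FRAME_RATE)]
```

With a context-free encoder (each frame a function of its own window only), the two modes produce identical frames; they diverge precisely when the encoder, like an S3M, conditions each frame on the surrounding utterance.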
Models ordered best to worst by out-of-context error rate.