Prosodic ABX — Supplementary Material

Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Submitted to Interspeech 2026

Layer-wise Error Curves

ABX error rate (%) by layer for all 17 S3Ms and 2 baselines. Lower is better. Labeled dots mark each model's best layer. The human baseline is the dashed vermillion line.

Line color indicates pretraining language. Figure controls: Size, Pretraining language, Architecture.
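
For readers unfamiliar with the metric, the error rate plotted above can be sketched as follows. This is a minimal illustration and not necessarily the procedure used in the paper: it assumes each item is a list of frame vectors taken from one layer, and it compares items by cosine distance between mean-pooled frames, which is a simplifying assumption (the paper may use a different distance). The names `mean_pool`, `cosine_distance`, and `abx_error_rate` are hypothetical.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def abx_error_rate(triplets):
    """Percent of (A, B, X) triplets in which X is closer to B than to A.

    X belongs to the same category as A, so d(X, B) < d(X, A) is an error.
    Ties count as half an error.
    """
    errors = 0.0
    for a, b, x in triplets:
        d_a = cosine_distance(mean_pool(x), mean_pool(a))
        d_b = cosine_distance(mean_pool(x), mean_pool(b))
        if d_b < d_a:
            errors += 1.0
        elif d_b == d_a:
            errors += 0.5
    return 100.0 * errors / len(triplets)
```

Computing this once per layer, on that layer's frames, yields one point on a model's curve. Under this definition a representation that carries no information about the contrast scores about 50%.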

Per-Task Model Rankings

Best-layer ABX error rate (%), ranked best to worst. The human baseline is the dashed vertical line.

Bar color indicates pretraining language.
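
The ranking described above reduces each model's layer-wise curve to its minimum and then sorts models by that value. A minimal sketch, assuming each curve is a list of error rates indexed by layer starting at 0; the function names are hypothetical and no real results are shown:

```python
def best_layer(curve):
    """Return (layer_index, error) for the lowest error in a layer-wise curve.

    Layers are zero-indexed here; on ties the earliest layer wins.
    """
    layer = min(range(len(curve)), key=lambda i: curve[i])
    return layer, curve[layer]

def rank_models(curves):
    """Rank models best to worst (lowest error first) by best-layer error.

    `curves` maps a model name to its list of per-layer error rates.
    Returns a list of (name, best_layer_index, best_error) tuples.
    """
    rows = [(name, *best_layer(curve)) for name, curve in curves.items()]
    return sorted(rows, key=lambda row: row[2])
```

Note that choosing the best layer on the same data used for the final comparison favors models with more layers slightly, since they get more chances at a low value; this sketch makes no attempt to correct for that.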

Paired Layer-wise Error Curves on Synthesized and Natural Speech

Layer-wise performance on natural vs. synthesized speech for a selected model. For English, both G-TTS (solid) and Kokoro (dotted) are shown.

Models ordered best to worst by natural-speech error rate.

In-Context Effects (Japanese Pitch Accent)

Out-of-context ABX clips the target word from the audio before encoding. In-context ABX encodes the full utterance, then compares only the frames at the target word's position. Only the Japanese pitch accent dataset uses identical carrier sentences, which is what makes this comparison possible.

Models ordered best to worst by out-of-context error rate.
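
The difference between the two conditions is where the slicing happens: on the waveform before encoding, or on the frames after encoding. The sketch below illustrates this under stated assumptions. `toy_encode` is a hypothetical stand-in for a speech model, and `HOP` (320 samples per frame, i.e. 20 ms at 16 kHz) is an assumed frame rate that varies by model. Word boundaries are given in samples.

```python
HOP = 320  # assumed samples per output frame (20 ms at 16 kHz); model-specific

def toy_encode(samples):
    """Hypothetical stand-in for a speech model: one 1-D frame per HOP samples.

    Each frame adds the global input mean to its local window mean, so the
    output depends on surrounding context, as a contextual encoder's would.
    Assumes `samples` is non-empty.
    """
    n_frames = len(samples) // HOP
    global_mean = sum(samples) / len(samples)
    frames = []
    for i in range(n_frames):
        window = samples[i * HOP:(i + 1) * HOP]
        frames.append([sum(window) / HOP + global_mean])
    return frames

def out_of_context(samples, start, end, encode=toy_encode):
    """Clip the target word from the waveform, then encode the clip alone."""
    return encode(samples[start:end])

def in_context(samples, start, end, encode=toy_encode):
    """Encode the full utterance, then keep the frames at the word position."""
    frames = encode(samples)
    return frames[start // HOP:end // HOP]
```

Both functions return the same number of frames for the same word span when the boundaries fall on multiples of `HOP`, but the in-context frames carry information from the rest of the utterance. With identical carrier sentences that surrounding material is the same across items, so differences between in-context representations can be attributed to the target word.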