Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Submitted to Interspeech 2026
ABX error rate (%) by layer for all 17 S3Ms and 2 baselines. Lower is better. Labeled dots mark each model's best layer. The human baseline is the dashed vermillion line. Click legend entries to isolate or hide individual models.
Best-layer ABX error rate (%), ranked best to worst. The human baseline is the dashed vertical line.
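The ABX error rate reported in these figures can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: it assumes each item is a (T, D) array of frame representations and uses mean-pooled cosine distance as the dissimilarity (the actual method may instead align frames, e.g. with DTW). In each (A, B, X) triple, A and X share the prosodic category and B differs; an error is counted when X lands closer to B.

```python
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two pooled representation vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error(triples):
    """Percentage of (A, B, X) triples where X is closer to B than to A.

    Each element of a triple is a (T, D) array of frame representations,
    mean-pooled over time here as a simplification (assumption: the
    paper's distance may use frame-level alignment instead).
    """
    errors = 0
    for a, b, x in triples:
        a_vec, b_vec, x_vec = (m.mean(axis=0) for m in (a, b, x))
        if cosine_dist(x_vec, b_vec) < cosine_dist(x_vec, a_vec):
            errors += 1
    return 100.0 * errors / len(triples)
```

Averaging this error over many triples per layer yields the layer-wise curves shown above; the best layer per model is the one minimizing this rate.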
Layer-wise performance on natural vs. synthesized speech for a selected model. For English, both G-TTS (solid) and Kokoro (dotted) are shown.
Models ordered best to worst by natural-speech error rate.
Out-of-context ABX clips the target word from the audio before encoding. In-context ABX encodes the full utterance and then compares frames at the target word's position. Only the Japanese pitch-accent dataset uses identical carrier sentences, which makes this comparison possible.
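The two extraction modes can be sketched as below. This is a hedged illustration with assumed constants: a 16 kHz sample rate and a 50 Hz frame rate (typical for S3Ms with a 20 ms stride; the actual models' rates may differ), and `encode` stands for any hypothetical waveform-to-frames encoder.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz (assumption)
FRAME_RATE = 50      # frames/s, i.e. a 20 ms stride (assumption)

def out_of_context(encode, wav, start_s, end_s):
    """Clip the target word from the waveform, then encode only the clip."""
    clip = wav[int(start_s * SAMPLE_RATE):int(end_s * SAMPLE_RATE)]
    return encode(clip)

def in_context(encode, wav, start_s, end_s):
    """Encode the full utterance, then keep the frames at the word position."""
    frames = encode(wav)
    return frames[int(start_s * FRAME_RATE):int(end_s * FRAME_RATE)]
```

With a context-free encoder (each frame a function of its own window only), the two modes produce identical frames; they diverge precisely when the encoder, like an S3M, conditions each frame on the surrounding utterance.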
Models ordered best to worst by out-of-context error rate.