Symbal AI Research Spotlight: CORN and a More Reliable Way to Score Speech Quality, With or Without a Reference

San Francisco, CA — As Symbal AI continues expanding its enterprise-grade voice intelligence capabilities, our product roadmap increasingly depends on one foundational capability: measuring speech quality reliably at scale—across real-world calls, variable microphones, and inconsistent environments. 

That’s why we’re highlighting a recent research contribution from Symbal AI CTO Pranay Manocha: CORN (Co-trained Full- and No-Reference Speech Quality Assessment), a framework designed to produce two independent speech-quality models from one training regime:

  • a full-reference (FR) model, which compares a degraded signal to a clean reference, and
  • a no-reference (NR) model, which scores quality when no reference is available.

This matters for modern voice systems (especially in recruiting) because we often have no “clean” reference for an interview call, but we still need to understand whether poor downstream performance is coming from the model or simply from poor audio.

The Core Problem: FR and NR Each Solve Half the Reality

Speech-quality assessment is typically split into two camps:

  • Full-reference metrics (intrusive/similarity-based) require a clean reference. They can correlate well with perception, but they break down when the available “clean” reference was captured under different conditions, and they tend to over-penalize tiny, perceptually irrelevant differences.
  • No-reference metrics score audio without any reference, but training them is notoriously noisy when labels are subjective (e.g., MOS variance across raters), and their accuracy has historically lagged behind FR approaches.

CORN’s thesis is straightforward and practical: train both together so each task regularizes the other through a shared representation, and then deploy either branch independently depending on whether a reference is available. 

What CORN Actually Does: Co-Training Through a Shared Bottleneck

At a high level, CORN consists of:

  • A shared base model block (B) that produces an embedding for an input waveform. 
  • Two lightweight, task-specific output heads:
    • Hf (FR head): takes the embedding of the degraded recording and concatenates it with the embedding of the reference before scoring. 
    • Hn (NR head): takes only the degraded recording embedding and scores directly. 

Critically, the FR and NR models share weights in the base block, so gradients from both tasks shape a single representation that is encouraged to discard non-essential information (especially content) and keep what matters for quality. 
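
To make this concrete, here is a minimal PyTorch-style sketch of the co-trained layout, assuming a generic encoder as the shared base; the module names, layer sizes, and embedding dimension are illustrative choices of ours, not the paper’s code.

    import torch
    import torch.nn as nn

    EMB_DIM = 256  # illustrative embedding size, not taken from the paper


    class SharedBase(nn.Module):
        """Stand-in for the shared base block B: waveform -> fixed-size embedding."""

        def __init__(self, emb_dim=EMB_DIM):
            super().__init__()
            # Placeholder encoder; the paper's waveform pipeline is sketched in the next section.
            self.net = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(64, emb_dim),
            )

        def forward(self, wav):                      # wav: (batch, samples)
            return self.net(wav.unsqueeze(1))        # -> (batch, emb_dim)


    class CORN(nn.Module):
        """One shared base B feeding two lightweight heads: Hf (FR) and Hn (NR)."""

        def __init__(self, emb_dim=EMB_DIM):
            super().__init__()
            self.base = SharedBase(emb_dim)
            # Hf scores the concatenation [degraded embedding ; reference embedding].
            self.head_fr = nn.Sequential(nn.Linear(2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            # Hn scores the degraded embedding alone.
            self.head_nr = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, degraded, reference=None):
            e_deg = self.base(degraded)
            score_nr = self.head_nr(e_deg).squeeze(-1)
            if reference is None:                    # NR-only deployment path
                return score_nr
            e_ref = self.base(reference)             # the same weights embed the reference
            score_fr = self.head_fr(torch.cat([e_deg, e_ref], dim=-1)).squeeze(-1)
            return score_fr, score_nr

Note that the NR path never touches a reference at inference time, which is what makes each branch independently deployable once training is done.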

A Technical Snapshot: The Base Architecture Choices

One of the paper’s strengths is that it is not “a trick that only works in one architecture.” In the reference implementation, the base model uses a waveform pipeline with:

  • a trainable µ-law front-end (initialized to µ=4), 
  • convolutional pooling blocks with BN/ReLU and BlurPool, 
  • three residual blocks with parametric residual mixing,
  • timewise statistics (channel-wise mean/std) pooled into a fixed-length vector, followed by an MLP (roughly sketched below).
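
For a feel of how those pieces could fit together, here is a rough sketch of such a waveform encoder. The kernel sizes, channel counts, the fixed low-pass filter standing in for BlurPool, and the exact form of the parametric residual mixing are all our assumptions, not the paper’s configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class TrainableMuLaw(nn.Module):
        """Trainable mu-law companding front-end; mu is a learned scalar initialized to 4."""

        def __init__(self, mu_init=4.0):
            super().__init__()
            self.mu = nn.Parameter(torch.tensor(mu_init))

        def forward(self, x):
            mu = torch.clamp(self.mu, min=1e-3)
            return torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(mu)


    class ConvPool(nn.Module):
        """Conv + BN + ReLU, then a fixed low-pass 'blur' with stride-2 downsampling
        (a stand-in for BlurPool; the paper's anti-aliased pooling may differ)."""

        def __init__(self, c_in, c_out):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=7, padding=3),
                nn.BatchNorm1d(c_out), nn.ReLU())
            blur = torch.tensor([1.0, 2.0, 1.0]) / 4.0           # binomial low-pass kernel
            self.register_buffer("blur", blur.view(1, 1, 3).repeat(c_out, 1, 1))
            self.c_out = c_out

        def forward(self, x):
            x = self.conv(x)
            return F.conv1d(x, self.blur, stride=2, padding=1, groups=self.c_out)


    class ResBlock(nn.Module):
        """Residual block with a learned (parametric) mix of the branch output and its input."""

        def __init__(self, c):
            super().__init__()
            self.branch = nn.Sequential(
                nn.Conv1d(c, c, kernel_size=7, padding=3), nn.BatchNorm1d(c), nn.ReLU(),
                nn.Conv1d(c, c, kernel_size=7, padding=3), nn.BatchNorm1d(c))
            self.alpha = nn.Parameter(torch.tensor(0.5))         # assumed form of the mixing

        def forward(self, x):
            return torch.relu(self.alpha * self.branch(x) + (1 - self.alpha) * x)


    class WaveformBase(nn.Module):
        """Illustrative base block: mu-law -> conv pooling -> residual blocks -> stats -> MLP."""

        def __init__(self, emb_dim=256):
            super().__init__()
            self.front = TrainableMuLaw()
            self.pool = nn.Sequential(ConvPool(1, 32), ConvPool(32, 64), ConvPool(64, 128))
            self.res = nn.Sequential(ResBlock(128), ResBlock(128), ResBlock(128))
            self.mlp = nn.Sequential(nn.Linear(2 * 128, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

        def forward(self, wav):                                  # wav: (batch, samples)
            x = self.res(self.pool(self.front(wav).unsqueeze(1)))
            stats = torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)  # channel-wise mean/std over time
            return self.mlp(stats)                               # fixed-size embedding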

They also test swapping in an alternate architecture (using magnitude + phase spectra) and still observe consistent gains, supporting the framework’s portability. 

Training Without Human Labels: Objective Metrics as a Scalable Proxy

CORN is trained without perceptual MOS labels. Instead, it learns to predict objective measures like:

  • scale-invariant signal-to-distortion ratio (SI-SDR), the primary training target in the core setup,
  • plus experiments showing the framework remains effective with SNR and PESQ as alternative targets.

Why this matters: objective metrics allow training at scale on unlimited programmatically generated degradations, while avoiding label noise common in human ratings. 
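
Under those assumptions, a single co-training step could look roughly like the sketch below, with SI-SDR between the degraded signal and its clean counterpart serving as the shared regression target. The L1 loss, the equal loss weighting, and the helper names are illustrative rather than the paper’s exact recipe.

    import torch
    import torch.nn.functional as F


    def si_sdr(est, ref, eps=1e-8):
        """Scale-invariant SDR in dB of `est` against a clean reference `ref`
        (zero-mean normalization omitted for brevity)."""
        proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
        noise = est - proj
        return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


    def corn_training_step(model, clean, degraded, optimizer, nr_weight=1.0):
        """One co-training step: both heads regress the same objective target.

        `model` is assumed to return (score_fr, score_nr) when given (degraded, reference),
        as in the layout sketched earlier. The target is derived programmatically, so no
        human labels are needed."""
        target = si_sdr(degraded, clean)                     # objective label for this pair
        score_fr, score_nr = model(degraded, clean)
        loss_fr = F.l1_loss(score_fr, target)                # FR head sees degraded + reference
        loss_nr = F.l1_loss(score_nr, target)                # NR head sees only the degraded signal
        loss = loss_fr + nr_weight * loss_nr                 # gradients from both losses shape the base
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss_fr.item(), loss_nr.item()

In practice, the clean/degraded pairs would come from programmatically generated degradations (added noise, reverberation, compression artifacts), which is what lets this approach scale without human raters.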

Results: Both Branches Get Better, Even the FR Model

The expected outcome is that the NR model improves when it “borrows” stability from the FR task during training, and CORN confirms this.

The more surprising outcome is that the FR model also improves, even though it uses the same architecture and training data as its standalone FR baseline. The paper attributes this to the NR loss preventing the FR branch from overfitting to content-specific cues, which increases content invariance.

Quantitatively, CORN reports:

  • ~16% improvement for NR over an independently trained NR model, and
  • ~11% improvement for FR over an independently trained FR model. 

They also evaluate embedding quality directly (e.g., content invariance and sensitivity to small signal shifts), with CORN showing stronger robustness overall. 
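
To give a sense of what “content invariance” means operationally, here is a toy probe of our own devising (not the paper’s evaluation protocol): a content-invariant quality embedding should place two different utterances passed through the same degradation closer together than the same utterance passed through two very different degradations.

    import torch
    import torch.nn.functional as F


    def content_invariance_gap(base, utt_a, utt_b, degrade, heavy_degrade):
        """Toy probe. Returns (distance across content at equal quality,
        distance across quality for the same content); the first should be smaller
        if the embedding tracks acoustics rather than what was said."""
        def emb(wav):
            return F.normalize(base(wav), dim=-1)            # unit-norm embeddings

        with torch.no_grad():
            same_quality = 1 - (emb(degrade(utt_a)) * emb(degrade(utt_b))).sum(-1)        # content differs
            same_content = 1 - (emb(degrade(utt_a)) * emb(heavy_degrade(utt_a))).sum(-1)  # quality differs
        return same_quality.mean().item(), same_content.mean().item()

    # Example with additive noise at two very different levels:
    # content_invariance_gap(model.base, utt_a, utt_b,
    #                        degrade=lambda w: w + 0.01 * torch.randn_like(w),
    #                        heavy_degrade=lambda w: w + 0.5 * torch.randn_like(w))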

Why This Matters for Our Voice Intelligence Stack

In enterprise voice workflows, recruiting included, audio is messy:

  • inconsistent mics,
  • background noise,
  • remote environments,
  • compression artifacts,
  • and variance by device and platform.

CORN’s practical value for Symbal is that it gives us:

  • A deployable NR quality score when no reference exists (the common case). 
  • An FR option for controlled evaluations (benchmarks, synthetic tests, regression harnesses) when we do have a reference (see the usage sketch after this list).
  • A shared embedding space that is more content-invariant, which is particularly important when we want quality assessment to reflect acoustics, not what someone said. 
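
In deployment terms, that boils down to one trained artifact serving both paths. A minimal sketch of how such a wrapper could look follows; the function name, the return format, and the use of a batch-mean score are illustrative and do not describe Symbal’s production code.

    import torch


    def score_call_audio(model, degraded, reference=None):
        """Score a recording with whichever branch the available inputs allow.

        `model` follows the co-trained layout sketched earlier: the NR branch handles the
        common case of a live call with no reference, and the FR branch handles controlled
        evaluations where a clean reference exists."""
        model.eval()
        with torch.no_grad():
            if reference is None:
                return {"branch": "NR", "score": model(degraded).mean().item()}
            score_fr, _ = model(degraded, reference)
            return {"branch": "FR", "score": score_fr.mean().item()}

    # Live interview call, no reference:       score_call_audio(model, call_waveform)
    # Regression harness, reference available: score_call_audio(model, processed, original)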

This aligns directly with Symbal’s larger objective: delivering voice intelligence that improves accuracy, speed, and workflow automation for communication-intensive teams at scale.