San Francisco, CA — As Symbal AI continues expanding its enterprise-grade voice intelligence capabilities, our product roadmap increasingly depends on one foundational capability: measuring speech quality reliably at scale—across real-world calls, variable microphones, and inconsistent environments.
That’s why we’re highlighting a recent research contribution from Symbal AI CTO Pranay Manocha: CORN (Co-trained Full- and No-Reference Speech Quality Assessment), a framework designed to produce two independent speech-quality models from one training regime:

- a full-reference (FR) model, which scores a degraded signal by comparing it against a clean reference, and
- a no-reference (NR) model, which scores the degraded signal entirely on its own.
This matters for modern voice systems (especially in recruiting) because we often have no “clean” reference for an interview call, but we still need to understand whether poor downstream performance is coming from the model or simply from poor audio.
The Core Problem: FR and NR Each Solve Half the Reality
Speech-quality assessment is typically split into two camps:

- Full-reference (FR) methods compare a degraded signal against a clean reference. They tend to be stable and discriminative, but they require a reference that rarely exists in production.
- No-reference (NR) methods judge quality from the degraded signal alone. They work anywhere, but they are notoriously harder to train reliably.
CORN’s thesis is straightforward and practical: train both together so each task regularizes the other through a shared representation, and then deploy either branch independently depending on whether a reference is available.
What CORN Actually Does: Co-Training Through a Shared Bottleneck
At a high level, CORN consists of:

- a shared base block that maps audio into a quality-focused embedding;
- an FR branch that consumes the embeddings of both the reference and the test signal; and
- an NR branch that consumes only the test-signal embedding.
Critically, the FR and NR models share weights in the base block, so gradients from both tasks shape a single representation that is encouraged to discard non-essential information (especially content) and keep what matters for quality.
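To make the shared-bottleneck idea concrete, here is a minimal PyTorch sketch of the co-training setup. The layer sizes, module names (`CORNSketch`, `fr_head`, `nr_head`), and equal loss weighting are illustrative assumptions rather than the paper’s exact configuration; the essential structure is one shared encoder feeding two heads whose losses are backpropagated together.

```python
import torch
import torch.nn as nn

class CORNSketch(nn.Module):
    """Illustrative co-trained FR/NR quality model (not the paper's exact architecture)."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        # Shared base block: gradients from BOTH tasks flow through these weights.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),  # -> (batch, emb_dim)
        )
        # FR head: sees embeddings of reference AND test signals.
        self.fr_head = nn.Sequential(nn.Linear(2 * emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # NR head: sees only the test-signal embedding.
        self.nr_head = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, test_wav, ref_wav=None):
        z_test = self.encoder(test_wav.unsqueeze(1))
        nr_score = self.nr_head(z_test)
        if ref_wav is None:
            return nr_score, None  # NR-only deployment path: no reference needed
        z_ref = self.encoder(ref_wav.unsqueeze(1))
        fr_score = self.fr_head(torch.cat([z_ref, z_test], dim=-1))
        return nr_score, fr_score

# Joint training step: both losses shape the shared encoder.
model = CORNSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
test_wav, ref_wav = torch.randn(8, 16000), torch.randn(8, 16000)
target = torch.rand(8, 1)  # objective quality target (see the training section below)
nr, fr = model(test_wav, ref_wav)
loss = nn.functional.mse_loss(nr, target) + nn.functional.mse_loss(fr, target)
opt.zero_grad()
loss.backward()
opt.step()
```

Because both heads pull gradients through the same encoder, the shared representation has to serve reference-based and reference-free scoring simultaneously, which is the regularization effect the paper describes.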
A Technical Snapshot: The Base Architecture Choices
One of the paper’s strengths is that it is not “a trick that only works in one architecture.” In the reference implementation, the base model uses a pipeline that operates directly on raw waveforms.
They also test swapping in an alternate architecture (using magnitude + phase spectra) and still observe consistent gains, supporting the framework’s portability.
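As a rough illustration of that portability, a magnitude-plus-phase front end can stand in for the waveform encoder in the sketch above. Everything below (shapes, layer choices, the `SpectralEncoder` name) is assumed for demonstration; the point is that the FR and NR heads are agnostic to how the embedding is produced.

```python
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    """Illustrative magnitude+phase front end, interchangeable with a waveform encoder."""

    def __init__(self, emb_dim: int = 256, n_fft: int = 512, hop: int = 128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # 2 input channels: magnitude and phase spectrograms.
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, emb_dim, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, emb_dim)
        )

    def forward(self, wav):
        spec = torch.stft(wav, self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True)
        feats = torch.stack([spec.abs(), spec.angle()], dim=1)  # (batch, 2, freq, time)
        return self.net(feats)
```

Because both encoders emit an embedding of the same size, the FR and NR heads from the earlier sketch can be reused unchanged.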
Training Without Human Labels: Objective Metrics as a Scalable Proxy
CORN is trained without perceptual MOS labels. Instead, it learns to predict established objective quality metrics that can be computed automatically from paired reference and degraded signals.
Why this matters: objective metrics allow training at scale on unlimited programmatically generated degradations, while avoiding label noise common in human ratings.
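As a sketch of that pipeline, the snippet below degrades clean speech programmatically and computes SI-SDR as the training target. SI-SDR is used here only as a stand-in for whichever objective metrics the paper predicts, and the degradation (additive noise at a random SNR) is likewise a minimal example.

```python
import numpy as np

def si_sdr(ref: np.ndarray, est: np.ndarray) -> float:
    """Scale-invariant SDR in dB: a standard, label-free objective quality measure."""
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = (np.dot(est, ref) / np.dot(ref, ref)) * ref  # projection of est onto ref
    noise = est - proj
    return 10 * np.log10(np.dot(proj, proj) / np.dot(noise, noise))

def make_training_pair(clean: np.ndarray, rng: np.random.Generator):
    """Programmatically degrade clean speech and compute an objective target.

    Additive noise at a random SNR is one simple degradation; a real pipeline
    would also simulate reverb, codecs, clipping, etc.
    """
    snr_db = rng.uniform(-5, 30)
    noise = rng.standard_normal(len(clean))
    noise *= np.sqrt(np.mean(clean**2) / (10 ** (snr_db / 10) * np.mean(noise**2)))
    degraded = clean + noise
    target = si_sdr(clean, degraded)  # objective score stands in for a MOS label
    return clean, degraded, target

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # placeholder "speech"
_, degraded, target = make_training_pair(clean, rng)
print(f"objective target (SI-SDR): {target:.2f} dB")
```

Because the targets are computed rather than collected, every clean utterance can yield arbitrarily many (degraded, score) pairs at negligible cost.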
Results: Both Branches Get Better, Even the FR Model
The expected outcome is that the NR branch improves when it “borrows” stability from FR supervision during training, and CORN confirms this.
The more surprising outcome is that the FR model also improves, even though it uses the same architecture and training data as its standalone FR baseline. The paper attributes this to the NR loss discouraging the FR model from overfitting to content-specific cues, which increases content invariance.
Quantitatively, CORN reports consistent improvements for both branches over their independently trained baselines; we refer readers to the paper for the exact figures.
They also evaluate embedding quality directly (e.g., content invariance and sensitivity to small signal shifts), with CORN showing stronger robustness overall.
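One simple way to probe content invariance of this kind is to compare embedding similarities across utterances that share a degradation but differ in content. The probe below is our own illustrative check, not the paper’s evaluation protocol; `encoder` is assumed to be the shared base block from the earlier sketch, and `degrade` is any waveform-to-waveform corruption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def content_invariance_gap(encoder, wavs_a, wavs_b, degrade):
    """Compare embedding similarity for same-degradation/different-content pairs
    versus same-content/different-degradation pairs.

    A content-invariant quality embedding should score the first group as more
    similar than the second. `degrade` is any callable waveform -> waveform.
    """
    z_a = encoder(degrade(wavs_a).unsqueeze(1))   # content A, degraded
    z_b = encoder(degrade(wavs_b).unsqueeze(1))   # content B, same degradation
    z_a_clean = encoder(wavs_a.unsqueeze(1))      # content A, no degradation
    same_degradation = F.cosine_similarity(z_a, z_b).mean()
    same_content = F.cosine_similarity(z_a, z_a_clean).mean()
    return (same_degradation - same_content).item()  # > 0 suggests content invariance

# Example with the earlier sketch's encoder and additive noise as the degradation:
# gap = content_invariance_gap(model.encoder, torch.randn(8, 16000),
#                              torch.randn(8, 16000),
#                              lambda w: w + 0.1 * torch.randn_like(w))
```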
Why This Matters for Our Voice Intelligence Stack
In enterprise voice workflows, recruiting included, audio is messy:

- microphones and headsets vary from call to call;
- network conditions and codecs introduce artifacts unpredictably;
- background environments are inconsistent; and
- there is almost never a clean reference recording to compare against.
CORN’s practical value for Symbal is that it gives us:

- an NR model we can run on live calls, where no clean reference exists;
- an FR model for controlled benchmarking and regression testing, where references are available; and
- a single training pipeline that produces both, keeping the two scores consistent with each other (a minimal routing sketch of this dual deployment follows below).
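In deployment terms, the routing is simple: score with the FR branch when a reference exists and fall back to the NR branch otherwise. This snippet reuses the hypothetical `CORNSketch` model from earlier and is a pattern sketch, not production code.

```python
from typing import Optional

import torch

def score_call(model, test_wav: torch.Tensor,
               ref_wav: Optional[torch.Tensor] = None) -> float:
    """Route to the FR branch when a clean reference exists, else use NR."""
    model.eval()
    with torch.no_grad():
        nr, fr = model(test_wav, ref_wav)
    return (fr if fr is not None else nr).mean().item()

# Live interview call: no clean reference, so the NR branch scores it.
live_score = score_call(model, torch.randn(1, 16000))
# Offline benchmark: a reference exists, so the FR branch scores it.
bench_score = score_call(model, torch.randn(1, 16000), torch.randn(1, 16000))
```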
This aligns directly with Symbal’s larger objective: delivering voice intelligence that improves accuracy, speed, and workflow automation for communication-intensive teams at scale.