2.1 System Architecture Overview
IVoice follows a modular dual-process architecture:
Unity (Controller, UI, Feedback, Logging)
|─ Captures live microphone audio
|─ Streams PCM frames to Python
|─ Receives real-time metrics
|─ Computes deviations + panel stats
|─ Runs alert policy
|─ Renders adaptive feedback
Python (Calibration, Estimation)
|─ Parses incoming audio frames
|─ Performs pitch/loudness estimation
|─ Performs VAD gating
|─ Computes jitter/shimmer/HNR/CPP/CPPS
|─ Emits one JSON result per frame
Communication consists of two unidirectional streams:
- Unity -> Python: binary 30 ms PCM frames
- Python -> Unity: per-frame JSON with metrics
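A minimal sketch of this framing in Python. The float32 little-endian sample layout and the metric field name are illustrative assumptions; the actual packet format is defined by real_time_stream.py.

```python
import json
import struct

SAMPLE_RATE = 16_000                              # mono 16 kHz capture contract
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per 30 ms frame

def pack_frame(samples):
    """Unity -> Python direction: one 30 ms frame as raw little-endian float32 PCM."""
    assert len(samples) == FRAME_SAMPLES
    return struct.pack(f"<{FRAME_SAMPLES}f", *samples)

def unpack_frame(payload):
    """Python side: recover the float samples from one binary packet."""
    return list(struct.unpack(f"<{FRAME_SAMPLES}f", payload))

def pack_result(metrics):
    """Python -> Unity direction: one JSON object per analyzed frame."""
    return json.dumps(metrics)
```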
All components in Unity are wired through well-defined interfaces inside Abstractions.cs, allowing developers to replace internal modules without touching the rest of the system.
2.2 Unity-Side Architecture
Unity implements real-time streaming, calibration, visualization, and alert policy. Key components live in:
Assets/Scripts/Unity/
|─ Core/ # Voice capturing, streaming, policy determination, logging
|─ Interfaces/ # Core abstractions
|─ UI/ # Visualization
2.2.1 Core/
2.2.1.1 RealTimeVoiceAnalyzer.cs (Primary Controller)
Responsibilities:
- Initializes IVoiceCapture
- Forwards each VoiceFrame -> IVoiceAnalysisEngine
- Receives AnalysisFrame events
- Computes:
  - relative deviations
  - cumulative time-in-target
  - averaged metrics
  - panel-level Inside/Outside ratios
- Invokes the current IAlertPolicy
- Dispatches resulting alerts to IFeedbackPresenter
- Emits logs to IRunLogger
This is the central orchestrator of the Unity pipeline.
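The deviation and time-in-target bookkeeping can be sketched as follows. This is a Python analogue of the C# logic; the class and parameter names are illustrative, not the project's actual API.

```python
def relative_deviation(value, baseline):
    """Deviation of a live metric from the user's calibrated baseline."""
    return (value - baseline) / baseline

class TargetTracker:
    """Accumulates time-in-target, a running average, and the Inside ratio,
    updated once per ~30 ms frame (an illustrative sketch)."""
    def __init__(self, low, high, frame_ms=30):
        self.low, self.high = low, high
        self.frame_ms = frame_ms
        self.in_target_ms = 0
        self.total_ms = 0
        self._sum = 0.0

    def update(self, value):
        self.total_ms += self.frame_ms
        self._sum += value
        if self.low <= value <= self.high:
            self.in_target_ms += self.frame_ms

    @property
    def inside_ratio(self):
        return self.in_target_ms / self.total_ms if self.total_ms else 0.0

    @property
    def average(self):
        n = self.total_ms // self.frame_ms
        return self._sum / n if n else 0.0
```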
2.2.1.2 VoiceCalibration.cs
Responsibilities:
- Records three modes:
  - background noise
  - sustained vowel
  - sentence
- Runs the Python calibration_analysis.py script
- Produces:
  - WAV files (raw + clean)
  - calibration_data.json, consumed by the engine
Calibration ensures that pitch/loudness deviations reflect user-specific norms.
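As an illustration, calibration_data.json might hold per-mode baselines along the following lines. The field names below are assumptions for illustration only; the actual schema is defined by calibration_analysis.py.

```json
{
  "noise_floor_lufs": -58.2,
  "vowel":    { "pitch_hz_median": 182.5, "loudness_lufs_mean": -21.3 },
  "sentence": { "pitch_hz_median": 175.0, "loudness_lufs_mean": -22.8 }
}
```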
2.2.1.3 VoiceSessionLogger.cs
Writes per-frame metrics for each session.
2.2.2 Interfaces/
2.2.2.1 Abstractions.cs
This file defines:
IVoiceCapture
Abstraction of microphone input:
- Emits VoiceFrame (float[] samples, ~30 ms)
- Guarantees a steady rate and mono 16 kHz output
RealTimeVoiceAnalyzer uses only this interface.
→ Developers can plug in a USB mic, WebRTC capture, prerecorded clips, synthetic sources, etc.
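As an example of a pluggable source, a synthetic capture honoring the mono / 16 kHz / 30 ms contract might look like the following Python sketch (the real interface is C#; this only illustrates the frame contract):

```python
import math

SAMPLE_RATE = 16_000
FRAME_SAMPLES = 480  # 16 kHz * 30 ms

def synthetic_frames(freq_hz=220.0, n_frames=10):
    """Yield fixed-size 30 ms sine-wave frames, mimicking what an
    IVoiceCapture implementation would emit as VoiceFrame payloads."""
    t = 0
    for _ in range(n_frames):
        frame = [math.sin(2 * math.pi * freq_hz * (t + i) / SAMPLE_RATE)
                 for i in range(FRAME_SAMPLES)]
        t += FRAME_SAMPLES
        yield frame
```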
IVoiceAnalysisEngine
Backend-agnostic estimator:
- Accepts VoiceFrame
- Returns AnalysisFrame
- Default implementation uses real_time_stream.py
→ Developers can replace the Python engine with C++, TensorFlow, cloud APIs, or neural models.
ICalibrationProcessor
Defines calibration steps:
- Takes AudioClip
- Produces CalibrationData (JSON + WAV)
→ Current implementation wraps calibration_analysis.py
IAlertPolicy
Encapsulates decision logic:
- Consumes panel-level statistics
- Outputs: {None, TooHigh, TooLow, Praise}
→ Developers can insert RL-based policies, personalized adaptation, robot-friendly timing, etc.
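A minimal threshold-based policy satisfying this contract could look like the Python sketch below (a C# interface analogue; the cutoff values and names are assumptions for illustration):

```python
from enum import Enum

class Alert(Enum):
    NONE = "None"
    TOO_HIGH = "TooHigh"
    TOO_LOW = "TooLow"
    PRAISE = "Praise"

def threshold_policy(inside_ratio, above_ratio, below_ratio,
                     praise_at=0.8, alert_at=0.5):
    """Map panel-level Inside/Outside ratios to one of the four
    alert states, in the spirit of IAlertPolicy."""
    if inside_ratio >= praise_at:
        return Alert.PRAISE
    if above_ratio >= alert_at:
        return Alert.TOO_HIGH
    if below_ratio >= alert_at:
        return Alert.TOO_LOW
    return Alert.NONE
```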
IFeedbackPresenter
Defines how alerts appear:
- Shows high/low/praise states
→ Default is Unity UI; can be replaced with robot gestures.
IRunLogger
For experiment data logging:
- CSV rows for every AnalysisFrame
→ Supports reproducibility & ML training datasets.
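A per-frame CSV logger in the spirit of IRunLogger can be sketched as follows (the column names are illustrative, not the project's actual schema):

```python
import csv
import io

FIELDS = ["t_ms", "pitch_hz", "loudness_lufs", "hnr_db", "jitter", "shimmer"]

def write_rows(frames, out):
    """Write one CSV row per AnalysisFrame-like dict to a text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for frame in frames:
        writer.writerow(frame)
```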
2.3 Python-Side Architecture
Performs all the signal processing and heavy feature extraction.
Assets/Scripts/Python/
2.3.1 real_time_stream.py
Handles:
- Binary packet parsing
- Silero VAD gating
- Low-latency SWIPE pitch detection
- K-weighted loudness estimation
- HNR/Jitter/Shimmer
- CPP/CPPS via PowerCepstrogram
- JSON output (one per frame)
It also throttles the heavier metrics to limit per-frame cost:
- HNR/Jitter/Shimmer: every 150 ms
- CPP/CPPS: every 450 ms
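The throttling scheme can be sketched as a small helper that recomputes a metric only once its period has elapsed and reuses the cached value in between (a sketch, not the script's actual implementation):

```python
FRAME_MS = 30  # one incoming PCM frame per call

class Throttle:
    """Run an expensive metric every `period_ms` (e.g. 150 ms for
    HNR/Jitter/Shimmer, 450 ms for CPP/CPPS), caching the last value."""
    def __init__(self, period_ms):
        self.period_ms = period_ms
        self.elapsed_ms = period_ms  # fire on the very first frame
        self.last = None

    def maybe(self, compute):
        self.elapsed_ms += FRAME_MS
        if self.elapsed_ms >= self.period_ms:
            self.elapsed_ms = 0
            self.last = compute()
        return self.last
```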
2.3.2 calibration_analysis.py
Provides:
- background noise estimation
- sustained vowel trimming
- sentence-level VAD via multi-pass Silero
- SWIPE pitch + loudness estimation over calibrated segments
- Used to populate calibration_data.json
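As an illustration of the trimming step, a naive energy-based version might look like this (the real pipeline uses Silero VAD, not RMS gating; the threshold is an assumption):

```python
def trim_silence(samples, frame=480, threshold=0.01):
    """Drop leading/trailing frames whose RMS falls below `threshold`,
    keeping only the voiced middle of a recording."""
    def rms(chunk):
        return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    voiced = [rms(f) >= threshold for f in frames]
    if not any(voiced):
        return []
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    out = []
    for f in frames[first:last + 1]:
        out.extend(f)
    return out
```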