2.1 System Architecture Overview
IVoice follows a modular dual-process architecture:
Unity (Controller, UI, Feedback, Logging)
|─ Captures live microphone audio
|─ Streams PCM frames to Python
|─ Receives real-time metrics
|─ Computes deviations + panel stats
|─ Runs alert policy
|─ Renders adaptive feedback
Python (Calibration, Estimation)
|─ Parses incoming audio frames
|─ Performs pitch/loudness estimation
|─ Performs VAD gating
|─ Computes jitter/shimmer/HNR/CPP/CPPS
|─ Emits one JSON result per frame
Communication consists of two unidirectional streams:
- Unity -> Python: binary 30 ms PCM frames
- Python -> Unity: per-frame JSON with metrics
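A minimal sketch of this framing in Python. The float32 little-endian sample layout and the metric field name are illustrative assumptions; the actual packet format is defined by real_time_stream.py.

```python
import json
import struct

SAMPLE_RATE = 16_000                              # mono 16 kHz capture contract
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per 30 ms frame

def pack_frame(samples):
    """Unity -> Python direction: one 30 ms frame as raw little-endian float32 PCM."""
    assert len(samples) == FRAME_SAMPLES
    return struct.pack(f"<{FRAME_SAMPLES}f", *samples)

def unpack_frame(payload):
    """Python side: recover the float samples from one binary packet."""
    return list(struct.unpack(f"<{FRAME_SAMPLES}f", payload))

def pack_result(metrics):
    """Python -> Unity direction: one JSON object per analyzed frame."""
    return json.dumps(metrics)
```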
All components in Unity are wired through well-defined interfaces inside Abstractions.cs, allowing developers to replace internal modules without touching the rest of the system.
2.2 Unity-Side Architecture
Unity implements real-time streaming, calibration, visualization, and alert policy. Key components live in:
Assets/Scripts/Unity/
|─ Core/ # Voice capturing, streaming, policy determination, logging
|─ Interfaces/ # Core abstractions
|─ UI/ # Visualization
2.2.1 Core/
2.2.1.1 RealTimeVoiceAnalyzer.cs (Primary Controller)
Responsibilities:
- Initializes IVoiceCapture
- Forwards each VoiceFrame -> IVoiceAnalysisEngine
- Receives AnalysisFrame events
- Computes:
  - relative deviations
  - cumulative time-in-target
  - averaged metrics
  - panel-level Inside/Outside ratios
- Invokes the current IAlertPolicy
- Dispatches resulting alerts to IFeedbackPresenter
- Emits logs to IRunLogger
This is the central orchestrator of the Unity pipeline.
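The deviation and time-in-target bookkeeping can be sketched as follows. This is a Python analogue of the C# logic; the class and parameter names are illustrative, not the project's actual API.

```python
def relative_deviation(value, baseline):
    """Deviation of a live metric from the user's calibrated baseline."""
    return (value - baseline) / baseline

class TargetTracker:
    """Accumulates time-in-target, a running average, and the Inside ratio,
    updated once per ~30 ms frame (an illustrative sketch)."""
    def __init__(self, low, high, frame_ms=30):
        self.low, self.high = low, high
        self.frame_ms = frame_ms
        self.in_target_ms = 0
        self.total_ms = 0
        self._sum = 0.0

    def update(self, value):
        self.total_ms += self.frame_ms
        self._sum += value
        if self.low <= value <= self.high:
            self.in_target_ms += self.frame_ms

    @property
    def inside_ratio(self):
        return self.in_target_ms / self.total_ms if self.total_ms else 0.0

    @property
    def average(self):
        n = self.total_ms // self.frame_ms
        return self._sum / n if n else 0.0
```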
2.2.1.2 VoiceCalibration.cs
Responsibilities:
- Records three modes:
  - background noise
  - sustained vowel
  - sentence
- Runs the Python calibration_analysis.py script
- Produces:
  - WAV files (raw + clean)
  - calibration_data.json, consumed by the engine
Calibration ensures that pitch/loudness deviations reflect user-specific norms.
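As an illustration, calibration_data.json might hold per-mode baselines along the following lines. The field names below are assumptions for illustration only; the actual schema is defined by calibration_analysis.py.

```json
{
  "noise_floor_lufs": -58.2,
  "vowel":    { "pitch_hz_median": 182.5, "loudness_lufs_mean": -21.3 },
  "sentence": { "pitch_hz_median": 175.0, "loudness_lufs_mean": -22.8 }
}
```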
2.2.1.3 VoiceSessionLogger.cs
Writes per-frame metrics for each session.
2.2.2 Interfaces/
2.2.2.1 Abstractions.cs
This file defines:
IVoiceCapture
Abstraction of microphone input:
- Emits VoiceFrame (float[] samples, ~30 ms)
- Guarantees a steady rate and mono 16 kHz output
RealTimeVoiceAnalyzer uses only this interface.
→ Developers can plug in a USB mic, WebRTC capture, prerecorded clips, synthetic sources, etc.
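As an example of a pluggable source, a synthetic capture honoring the mono / 16 kHz / 30 ms contract might look like the following Python sketch (the real interface is C#; this only illustrates the frame contract):

```python
import math

SAMPLE_RATE = 16_000
FRAME_SAMPLES = 480  # 16 kHz * 30 ms

def synthetic_frames(freq_hz=220.0, n_frames=10):
    """Yield fixed-size 30 ms sine-wave frames, mimicking what an
    IVoiceCapture implementation would emit as VoiceFrame payloads."""
    t = 0
    for _ in range(n_frames):
        frame = [math.sin(2 * math.pi * freq_hz * (t + i) / SAMPLE_RATE)
                 for i in range(FRAME_SAMPLES)]
        t += FRAME_SAMPLES
        yield frame
```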
IVoiceAnalysisEngine
Backend-agnostic estimator:
- Accepts VoiceFrame
- Returns AnalysisFrame
- Default implementation uses real_time_stream.py
→ Developers can replace the Python engine with C++, TensorFlow, cloud APIs, or neural models.
ICalibrationProcessor
Defines calibration steps:
- Takes AudioClip
- Produces CalibrationData (JSON + WAV)
→ Current implementation wraps calibration_analysis.py
IAlertPolicy
Encapsulates decision logic:
- Consumes panel-level statistics
- Outputs: {None, TooHigh, TooLow, Praise}
→ Developers can insert RL-based policies, personalized adaptation, robot-friendly timing, etc.
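A minimal threshold-based policy satisfying this contract could look like the Python sketch below (a C# interface analogue; the cutoff values and names are assumptions for illustration):

```python
from enum import Enum

class Alert(Enum):
    NONE = "None"
    TOO_HIGH = "TooHigh"
    TOO_LOW = "TooLow"
    PRAISE = "Praise"

def threshold_policy(inside_ratio, above_ratio, below_ratio,
                     praise_at=0.8, alert_at=0.5):
    """Map panel-level Inside/Outside ratios to one of the four
    alert states, in the spirit of IAlertPolicy."""
    if inside_ratio >= praise_at:
        return Alert.PRAISE
    if above_ratio >= alert_at:
        return Alert.TOO_HIGH
    if below_ratio >= alert_at:
        return Alert.TOO_LOW
    return Alert.NONE
```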
IFeedbackPresenter
Defines how alerts appear:
- Shows high/low/praise states
→ Default is Unity UI; can be replaced with robot gestures.
IRunLogger
For experiment data logging:
- CSV rows for every AnalysisFrame
→ Supports reproducibility & ML training datasets.
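A per-frame CSV logger in the spirit of IRunLogger can be sketched as follows (the column names are illustrative, not the project's actual schema):

```python
import csv
import io

FIELDS = ["t_ms", "pitch_hz", "loudness_lufs", "hnr_db", "jitter", "shimmer"]

def write_rows(frames, out):
    """Write one CSV row per AnalysisFrame-like dict to a text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for frame in frames:
        writer.writerow(frame)
```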
2.3 Python-Side Architecture
Performs all the signal processing and heavy feature extraction.
Assets/Scripts/Python/
2.3.1 real_time_stream.py
Handles:
- Binary packet parsing
- Silero VAD gating
- Low-latency SWIPE pitch detection
- K-weighted loudness estimation
- HNR/Jitter/Shimmer
- CPP/CPPS via PowerCepstrogram
- JSON output (one per frame)
It also throttles the heavier metrics to limit per-frame cost:
- HNR/Jitter/Shimmer: every 150 ms
- CPP/CPPS: every 450 ms
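The throttling scheme can be sketched as a small helper that recomputes a metric only once its period has elapsed and reuses the cached value in between (a sketch, not the script's actual implementation):

```python
FRAME_MS = 30  # one incoming PCM frame per call

class Throttle:
    """Run an expensive metric every `period_ms` (e.g. 150 ms for
    HNR/Jitter/Shimmer, 450 ms for CPP/CPPS), caching the last value."""
    def __init__(self, period_ms):
        self.period_ms = period_ms
        self.elapsed_ms = period_ms  # fire on the very first frame
        self.last = None

    def maybe(self, compute):
        self.elapsed_ms += FRAME_MS
        if self.elapsed_ms >= self.period_ms:
            self.elapsed_ms = 0
            self.last = compute()
        return self.last
```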
2.3.2 calibration_analysis.py
Provides:
- background noise estimation
- sustained vowel trimming
- sentence-level VAD via multi-pass Silero
- SWIPE pitch + loudness estimation over calibrated segments
- Used to populate calibration_data.json
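As an illustration of the trimming step, a naive energy-based version might look like this (the real pipeline uses Silero VAD, not RMS gating; the threshold is an assumption):

```python
def trim_silence(samples, frame=480, threshold=0.01):
    """Drop leading/trailing frames whose RMS falls below `threshold`,
    keeping only the voiced middle of a recording."""
    def rms(chunk):
        return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    voiced = [rms(f) >= threshold for f in frames]
    if not any(voiced):
        return []
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    out = []
    for f in frames[first:last + 1]:
        out.extend(f)
    return out
```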