A real-time cascading speech-to-speech chatbot that combines advanced speech recognition, AI reasoning, and neural text-to-speech. Built for seamless voice interactions, with web integration and an extensible tool system.
- 🎙️ Real-time Speech Recognition - Powered by Whisper + Silero VAD for accurate voice input
- 🤖 Intelligent AI Reasoning - Multimodal reasoning with Llama 3.1 8B through an Agno agent
- 🌐 Web Integration - Access to Google Search, Wikipedia, and arXiv for real-time information
- 🗣️ Natural Voice Synthesis - High-quality voice output using Kokoro-82M ONNX
- ⚡ Low-latency Processing - Optimized audio pipeline for responsive interactions
- 🔧 Extensible Tool System - Easy to add new capabilities to the agent
- 🛠️ Cross-platform Support - Works on macOS, Linux, and Windows
| Component | Technology | Description |
|---|---|---|
| Speech-to-Text | Whisper (large-v1) + Silero VAD | Real-time transcription with voice activity detection |
| Language Model | Llama 3.1 8B via Ollama | Local AI reasoning and conversation |
| Text-to-Speech | Kokoro-82M ONNX | Natural voice synthesis |
| Agent Framework | Agno LLM Agent | Extensible tool-calling capabilities |
| Audio Processing | SoundDevice + SoundFile | Real-time audio I/O |
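End to end, these components form a simple cascade: microphone audio is transcribed, the text is reasoned over, and the reply is synthesized back to audio. A minimal sketch of that flow (not the project's actual code; the three stage functions are hypothetical stand-ins for the components in the table above):

```python
# Cascade sketch: record -> transcribe -> reason -> synthesize -> play.
# The stage functions are placeholders; main.py wires up the real components.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000

def transcribe(audio: np.ndarray) -> str:
    """Stand-in for Whisper + Silero VAD speech-to-text."""
    ...

def reason(text: str) -> str:
    """Stand-in for the Agno agent backed by Llama 3.1 8B via Ollama."""
    ...

def synthesize(text: str) -> np.ndarray:
    """Stand-in for Kokoro-82M ONNX text-to-speech."""
    ...

audio = sd.rec(int(5 * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the fixed-length recording finishes
reply = synthesize(reason(transcribe(audio)))
sd.play(reply, samplerate=SAMPLE_RATE)
sd.wait()
```

The real pipeline streams audio and uses VAD to decide when a phrase ends, rather than recording a fixed-length clip.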
- Python 3.9+
- Ollama - Local LLM server
- espeak-ng - Text-to-speech engine
- Microphone and Speakers - For voice interaction
macOS:

```bash
# Download from https://ollama.com/download/mac
# Or use Homebrew
brew install ollama
```

Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Windows:

- Download the installer from the Ollama Windows download page (https://ollama.com/download/windows)
```bash
# Clone the repository
git clone https://github.com/tarun7r/Vocal-Agent.git
cd Vocal-Agent

# Install Python dependencies
pip install -r requirements.txt

# Install Kokoro TTS (separate installation)
pip install --no-deps kokoro-onnx==0.4.7
```
macOS:

```bash
brew install espeak-ng
```

Linux:

```bash
sudo apt-get install espeak-ng
```
Windows:

- Visit the eSpeak NG releases page
- Download the latest `.msi` file (e.g., `espeak-ng-20191129-b702b03-x64.msi`)
- Run the installer
- Add it to your PATH if needed
Llama 3.1 8B:

```bash
ollama pull llama3.1:8b
```
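With the Ollama server running (see `ollama serve` below), you can smoke-test the pulled model against Ollama's local REST API; a minimal check using the `requests` package:

```python
# Smoke test: ask the pulled model for a one-line reply via Ollama's REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Reply with one short sentence.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```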
Kokoro TTS Models:

- Download `kokoro-v1.0.onnx` and `voices-v1.0.bin` from the kokoro-onnx releases
- Place them in the project root directory
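With both files in the project root, a short synthesis test confirms they load. This sketch follows the usage pattern from the kokoro-onnx README; verify the call signature against the version you installed:

```python
# Smoke test: load both Kokoro model files and speak a short line.
import sounddevice as sd
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
samples, sample_rate = kokoro.create(
    "Vocal Agent is ready.", voice="af_heart", speed=1.0, lang="en-us"
)
sd.play(samples, samplerate=sample_rate)
sd.wait()
```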
Start Ollama:

```bash
ollama serve
```

In a new terminal, run the agent:

```bash
python main.py
```
- Start the application - Run `python main.py`
- Wait for initialization - The system will load the models and start listening
- Speak naturally - Ask questions, request information, or have conversations
- Listen to responses - The AI will respond with synthesized speech
```
Listening... Press Ctrl+C to exit
speak now - Recording started
recording - Recording stopped
Transcribed: Who won the 2022 FIFA World Cup?
LLM Tool calls...
Response from the knowledge agent: The 2022 FIFA World Cup was won by Argentina, led by Lionel Messi. They defeated France in the final on December 18, 2022.
[Audio starts playing]
```
Key settings in `main.py`:

```python
# Audio processing
SAMPLE_RATE = 16000
MAX_PHONEME_LENGTH = 500

# Voice synthesis
SPEED = 1.2                 # Adjust speech rate (0.5-2.0)
VOICE_PROFILE = "af_heart"  # Choose from voices-v1.0.bin

# Agent settings
MAX_THREADS = 2             # Parallel processing threads
```
The `voices-v1.0.bin` file contains multiple voice profiles. You can change the `VOICE_PROFILE` setting to use different voices.
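To see exactly which profiles your copy of `voices-v1.0.bin` contains, you can ask kokoro-onnx directly; a minimal sketch, assuming the `get_voices()` helper available in recent kokoro-onnx releases:

```python
# List the voice profiles bundled in voices-v1.0.bin.
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")
print(sorted(kokoro.get_voices()))  # should include "af_heart"
```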
```
Vocal-Agent/
├── main.py               # Core application logic
├── agent_client.py       # LLM agent integration
├── kokoro-v1.0.onnx      # TTS model file
├── voices-v1.0.bin       # Voice profiles
├── requirements.txt      # Python dependencies
├── vocal_agent_mac.sh    # macOS setup script
├── demo.png              # Demo screenshot
└── README.md             # This file
```
For macOS users, we provide an automated setup script:
```bash
# Make the script executable
chmod +x vocal_agent_mac.sh

# Run the setup script
./vocal_agent_mac.sh
```
The script will:
- Install Homebrew dependencies
- Download Kokoro models
- Set up the environment
- Start Ollama service
- Launch the application
- Use a GPU for faster LLM inference
- Adjust `MAX_THREADS` based on your CPU cores (see the snippet below)
- Modify the `SPEED` setting for your preferred speech rate
- Close other audio applications to avoid conflicts
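As a starting point for `MAX_THREADS`, check your logical core count and leave headroom for Whisper, Ollama, and audio I/O:

```python
# Logical core count as an upper bound for MAX_THREADS in main.py.
import os

print(os.cpu_count())  # keep MAX_THREADS well below this value
```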
The agent uses the Agno framework, which supports extensible tool calling. To add new capabilities:
- Check the Agno Toolkits documentation
- Implement your tool following the Agno framework conventions
- Register the tool with the agent in `agent_client.py` (a rough sketch follows below)
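As an illustration of the last step (Agno accepts plain Python functions as tools; the tool here is hypothetical, and `agent_client.py` may structure its registration differently):

```python
# Hypothetical example: exposing a plain Python function as an agent tool.
from datetime import datetime, timezone

from agno.agent import Agent
from agno.models.ollama import Ollama

def current_utc_time() -> str:
    """Return the current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

# Plain functions passed via `tools` become callable by the model.
agent = Agent(
    model=Ollama(id="llama3.1:8b"),
    tools=[current_utc_time],
)
agent.print_response("What time is it in UTC?")
```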
This project is licensed under the MIT License - see the LICENSE file for details.
- RealtimeSTT - Real-time speech recognition and VAD
- Kokoro-ONNX - Efficient text-to-speech synthesis
- Agno - LLM agent framework
- Ollama - Local LLM serving
- Weebo - Project inspiration
Made with ❤️ for the open-source community