tarun7r/Vocal-Agent


Real-Time Cascading Speech-to-Speech Chatbot 🤖

A real-time cascading speech-to-speech chatbot that chains speech recognition, LLM reasoning, and neural text-to-speech. It is built for hands-free voice interaction, with web search integration and an extensible tool system.

✨ Features

  • 🎙️ Real-time Speech Recognition - Powered by Whisper + Silero VAD for accurate voice input
  • 🤖 Intelligent AI Reasoning - Tool-augmented reasoning with Llama 3.1 8B through an Agno agent
  • 🌐 Web Integration - Access to Google Search, Wikipedia, and arXiv for up-to-date information
  • 🗣️ Natural Voice Synthesis - High-quality voice output using Kokoro-82M ONNX
  • ⚡ Low-latency Processing - Optimized audio pipeline for responsive interactions
  • 🔧 Extensible Tool System - New capabilities are easy to add to the agent
  • 🛠️ Cross-platform Support - Works on macOS, Linux, and Windows

🛠️ Tech Stack

| Component        | Technology                      | Description                                            |
|------------------|---------------------------------|--------------------------------------------------------|
| Speech-to-Text   | Whisper (large-v1) + Silero VAD | Real-time transcription with voice activity detection  |
| Language Model   | Llama 3.1 8B via Ollama         | Local AI reasoning and conversation                    |
| Text-to-Speech   | Kokoro-82M ONNX                 | Natural voice synthesis                                |
| Agent Framework  | Agno LLM Agent                  | Extensible tool-calling capabilities                   |
| Audio Processing | SoundDevice + SoundFile         | Real-time audio I/O                                    |

📋 Prerequisites

  • Python 3.9+
  • Ollama - Local LLM server
  • espeak-ng - Text-to-speech engine
  • Microphone and Speakers - For voice interaction
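
The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes that both `ollama` and `espeak-ng` are installed as commands on your PATH.

```python
import shutil
import sys

# Assumption: both tools expose a command of this name on PATH once installed.
REQUIRED_COMMANDS = ("ollama", "espeak-ng")
MIN_PYTHON = (3, 9)


def check_prerequisites(commands=REQUIRED_COMMANDS, min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    for command in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(command) is None:
            problems.append(f"'{command}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing:", problem)
```

The script prints nothing when everything is in place. It does not test the microphone or speakers.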

🚀 Quick Start

1. Install Ollama

macOS:

# Download from https://ollama.com/download/mac
# Or use Homebrew
brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

# Download the installer from https://ollama.com/download/windows

2. Clone and Setup

# Clone the repository
git clone https://github.com/tarun7r/Vocal-Agent.git
cd Vocal-Agent

# Install Python dependencies
pip install -r requirements.txt

# Install Kokoro TTS (separate installation)
pip install --no-deps kokoro-onnx==0.4.7

3. Install System Dependencies

macOS:

brew install espeak-ng

Linux:

sudo apt-get install espeak-ng

Windows:

  1. Visit eSpeak NG Releases
  2. Download the latest .msi file (e.g., espeak-ng-20191129-b702b03-x64.msi)
  3. Run the installer
  4. Add to PATH if needed

4. Download Models

Llama 3.1 8B:

ollama pull llama3.1:8b

Kokoro TTS Models:

  • Download kokoro-v1.0.onnx and voices-v1.0.bin from kokoro-onnx releases
  • Place them in the project root directory
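
To confirm the model files landed in the right place, a small check like the one below can help. It is a sketch only: the helper name `missing_model_files` is hypothetical, and it assumes you run it from the project root (or pass the root explicitly).

```python
from pathlib import Path

# File names taken from the download step above.
MODEL_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def missing_model_files(project_root=".", names=MODEL_FILES):
    """Return the model files that are not present in project_root."""
    root = Path(project_root)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All Kokoro model files found.")
```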

5. Run the Application

Start Ollama:

ollama serve

In a new terminal, run the agent:

python main.py
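
If the agent cannot reach the language model, it may help to verify that Ollama is up and that the model was pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists local models. It assumes the server listens on the default address `http://localhost:11434`; the function names are hypothetical and not part of this repository.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama is running on its default host and port.
OLLAMA_URL = "http://localhost:11434"


def list_ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return names of locally available models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply.
        return None
    return [model.get("name", "") for model in payload.get("models", [])]


def model_available(models, wanted="llama3.1:8b"):
    """True when the server answered and the wanted model is in its list."""
    return models is not None and wanted in models


if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not reachable; is 'ollama serve' running?")
    elif not model_available(models):
        print("llama3.1:8b not found; run 'ollama pull llama3.1:8b'")
    else:
        print("Ollama is ready.")
```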

🎯 Usage

  1. Start the application - Run python main.py
  2. Wait for initialization - The system will load models and start listening
  3. Speak naturally - Ask questions, request information, or have conversations
  4. Listen to responses - The AI will respond with synthesized speech

Example Interaction Flow:

Listening... Press Ctrl+C to exit ⠋
speak now - Recording started ⠸
recording - Recording stopped

Transcribed: Who won the 2022 FIFA World Cup?
LLM Tool calls...

Response from the knowledge agent: The 2022 FIFA World Cup was won by Argentina, led by Lionel Messi. They defeated France in the final on December 18, 2022.

[Audio starts playing]
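
The flow above is a cascade: each stage consumes the output of the previous one. The sketch below illustrates that structure only. The function `run_turn` and the three stage callables are hypothetical stand-ins, not the code in main.py; in the real application they would wrap Whisper, the Agno agent, and Kokoro.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one cascaded turn: speech -> text -> reply text -> speech."""
    text = transcribe(audio)
    print(f"Transcribed: {text}")
    reply = ask_agent(text)
    print(f"Response from the knowledge agent: {reply}")
    return synthesize(reply)


if __name__ == "__main__":
    # Stub stages so the sketch runs without any models installed.
    spoken = run_turn(
        b"\x00\x01",
        transcribe=lambda _audio: "Who won the 2022 FIFA World Cup?",
        ask_agent=lambda _question: "Argentina won the 2022 FIFA World Cup.",
        synthesize=lambda reply: reply.encode("utf-8"),
    )
    print(f"{len(spoken)} bytes of (fake) audio ready to play")
```

Because the stages run one after another, total latency is roughly the sum of the three stage latencies, which is why each stage needs to be fast.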

[Screenshot: Vocal Agent demo (demo.png)]

⚙️ Configuration

Key settings in main.py:

# Audio processing
SAMPLE_RATE = 16000
MAX_PHONEME_LENGTH = 500

# Voice synthesis
SPEED = 1.2  # Adjust speech rate (0.5-2.0)
VOICE_PROFILE = "af_heart"  # Choose from voices-v1.0.bin

# Agent settings
MAX_THREADS = 2  # Parallel processing threads
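
When experimenting with these values, it can be useful to validate them up front instead of discovering a bad setting at synthesis time. The following is a sketch under assumptions: the `Settings` dataclass is hypothetical (main.py uses plain module-level constants), and the only range enforced is the 0.5-2.0 speed range stated in the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the constants in main.py."""

    sample_rate: int = 16000
    max_phoneme_length: int = 500
    speed: float = 1.2
    voice_profile: str = "af_heart"
    max_threads: int = 2

    def __post_init__(self):
        # Speed range taken from the comment next to SPEED.
        if not 0.5 <= self.speed <= 2.0:
            raise ValueError(f"speed must be within 0.5-2.0, got {self.speed}")
        if self.max_threads < 1:
            raise ValueError(f"max_threads must be at least 1, got {self.max_threads}")


if __name__ == "__main__":
    print(Settings())
    try:
        Settings(speed=3.0)
    except ValueError as error:
        print("Rejected:", error)
```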

Available Voice Profiles

The voices-v1.0.bin file contains multiple voice profiles. You can change the VOICE_PROFILE setting to use different voices.
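
To see which profile names are available, one option is to ask the kokoro-onnx library itself. The sketch below assumes the `Kokoro` class and its `get_voices()` method behave as in kokoro-onnx 0.4.x; the wrapper function `list_voice_profiles` is hypothetical. The library is imported lazily so the file check runs even when kokoro-onnx is not installed.

```python
from pathlib import Path


def list_voice_profiles(model_path="kokoro-v1.0.onnx", voices_path="voices-v1.0.bin"):
    """Return the sorted voice names stored in the voices file."""
    for path in (model_path, voices_path):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Model file not found: {path}")
    # Imported here so the check above works without kokoro-onnx installed.
    from kokoro_onnx import Kokoro

    kokoro = Kokoro(model_path, voices_path)
    return sorted(kokoro.get_voices())


if __name__ == "__main__":
    try:
        for name in list_voice_profiles():
            print(name)
    except FileNotFoundError as error:
        print(error)
```

Any name printed by this script should be usable as the VOICE_PROFILE value.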

📁 Project Structure

Vocal-Agent/
├── main.py                 # Core application logic
├── agent_client.py         # LLM agent integration
├── kokoro-v1.0.onnx        # TTS model file
├── voices-v1.0.bin         # Voice profiles
├── requirements.txt        # Python dependencies
├── vocal_agent_mac.sh      # macOS setup script
├── demo.png                # Demo screenshot
└── README.md               # This file

macOS Setup Script

For macOS users, we provide an automated setup script:

# Make the script executable
chmod +x vocal_agent_mac.sh

# Run the setup script
./vocal_agent_mac.sh

The script will:

  • Install Homebrew dependencies
  • Download Kokoro models
  • Set up the environment
  • Start Ollama service
  • Launch the application

Performance Tips

  • Use a GPU for faster LLM inference
  • Adjust MAX_THREADS based on your CPU cores
  • Modify SPEED setting for preferred speech rate
  • Close other audio applications to avoid conflicts
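
As a starting point for MAX_THREADS, the number of CPU cores can be read with `os.cpu_count()`. The heuristic below (leave one core free for audio I/O) is an assumption for illustration, not a recommendation from the project, and `suggest_max_threads` is a hypothetical helper.

```python
import os


def suggest_max_threads(reserve=1, cpu_count=None):
    """Suggest a thread count: all cores minus `reserve`, but at least 1."""
    # os.cpu_count() can return None on some platforms, so fall back to 1.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, cores - reserve)


if __name__ == "__main__":
    print("Suggested MAX_THREADS:", suggest_max_threads())
```

Treat the result as an upper bound to try, then measure response latency with a few different values.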

Adding New Tools

The agent uses the Agno framework, which supports extensible tool calling. To add new capabilities:

  1. Check the Agno Toolkits documentation
  2. Implement your tool following the Agno framework
  3. Register the tool with the agent in agent_client.py
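
The steps above can be sketched as follows. This is an illustration under assumptions, not the code in agent_client.py: it assumes Agno accepts plain Python functions in the `tools` list and that `agno.agent.Agent` and `agno.models.ollama.Ollama` are importable as shown. The tool `get_current_time` and the factory `build_agent` are hypothetical names.

```python
from datetime import datetime, timezone


def get_current_time() -> str:
    """Return the current UTC time as an ISO 8601 string.

    Assumption: the agent framework uses this docstring to describe
    the tool to the model.
    """
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def build_agent():
    """Create an agent with the custom tool registered (requires agno)."""
    # Imported lazily so the tool itself can be tested without agno installed.
    from agno.agent import Agent
    from agno.models.ollama import Ollama

    return Agent(model=Ollama(id="llama3.1:8b"), tools=[get_current_time])


if __name__ == "__main__":
    # Exercise the tool on its own before wiring it into the agent.
    print(get_current_time())
```

Keeping the tool a plain function makes it easy to unit-test before the model ever calls it.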

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements


Made with ❤️ for the open-source community
