Sanskrit Speech-to-Text Translation System

Advanced real-time pipeline for Sanskrit audio processing using DeepFilterNet, Faster Whisper, and specialized text-to-speech synthesis

Research Project • Real-time Processing • Sanskrit Language Technology

Abstract

This research project presents a real-time pipeline for processing Sanskrit speech through automated transcription, translation, and synthesis. Our current approach integrates DeepFilterNet for noise reduction, Faster Whisper Large-v3 for real-time speech recognition, the Gemini API for text correction, and a specialized Sanskrit TTS stage for synthesis. The system is designed to handle Sanskrit linguistic complexities while maintaining real-time performance.

The pipeline addresses unique challenges in Sanskrit speech processing by implementing state-of-the-art noise filtering and leveraging large language models for contextual correction of transcription errors. This work contributes to the development of practical Sanskrit language technology tools for education, research, and cultural preservation.

Current Methodology

Real-time Processing Pipeline

Our current approach focuses on real-time processing capabilities with enhanced accuracy through multi-stage correction and specialized Sanskrit synthesis.

Noise Reduction

DeepFilterNet provides real-time noise suppression specifically optimized for speech clarity and processing efficiency.

Implemented
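As a rough illustration of the noise-reduction stage, the sketch below denoises a single file using the Python usage pattern documented in the DeepFilterNet README. It is file-based rather than streaming, and the helper names `enhanced_path` and `denoise_file` are hypothetical; the project's actual real-time integration is not shown on this page.

```python
from pathlib import Path


def enhanced_path(src: str, suffix: str = "_enhanced") -> str:
    """Return an output path next to `src`, e.g. verse.wav -> verse_enhanced.wav."""
    p = Path(src)
    return str(p.with_name(p.stem + suffix + p.suffix))


def denoise_file(src: str) -> str:
    """Denoise one audio file with DeepFilterNet and return the output path."""
    # Imported lazily so the path helper works without the package installed.
    from df.enhance import enhance, init_df, load_audio, save_audio

    model, df_state, _ = init_df()                # load the default pretrained model
    audio, _ = load_audio(src, sr=df_state.sr())  # resample to the model's sample rate
    enhanced = enhance(model, df_state, audio)
    dst = enhanced_path(src)
    save_audio(dst, enhanced, df_state.sr())
    return dst
```

Calling `denoise_file("verse.wav")` would write `verse_enhanced.wav` alongside the input, assuming the `deepfilternet` package and its model weights are available.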

Speech Recognition

Faster Whisper Large-v3 model enables real-time transcription with improved Sanskrit language support.

Implemented
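A minimal transcription sketch with the `faster-whisper` package might look like the following. The beam size, the compute types, and the helper names (`join_segments`, `transcribe_sanskrit`) are assumptions made for illustration; `"sa"` is Whisper's language code for Sanskrit. It transcribes a whole file, so it does not reproduce the streaming setup described here.

```python
from typing import Iterable, Protocol


class HasText(Protocol):
    text: str


def join_segments(segments: Iterable[HasText]) -> str:
    """Concatenate non-empty segment texts into one space-separated string."""
    parts = (seg.text.strip() for seg in segments)
    return " ".join(part for part in parts if part)


def transcribe_sanskrit(path: str, device: str = "cuda") -> str:
    """Transcribe one audio file as Sanskrit with Faster Whisper Large-v3."""
    # Imported lazily so join_segments can be used without the package installed.
    from faster_whisper import WhisperModel

    # float16 suits a GPU such as the T4; int8 is a common CPU fallback (assumption).
    compute_type = "float16" if device == "cuda" else "int8"
    model = WhisperModel("large-v3", device=device, compute_type=compute_type)
    # transcribe() returns a lazy generator of segments plus metadata.
    segments, _info = model.transcribe(path, language="sa", beam_size=5)
    return join_segments(segments)
```

Because the segments are produced lazily, decoding only happens while `join_segments` iterates over them.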

Text Correction

Gemini API provides intelligent post-processing correction for Sanskrit transcription accuracy.

Implemented
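The correction step could be sketched as a prompt builder plus a pluggable text-generation function. The prompt wording, the `GEMINI_API_KEY` environment variable, the model name `gemini-2.0-flash`, and the choice of the `google-genai` SDK are all assumptions; the page does not say which SDK, model, or prompt the project uses.

```python
from typing import Callable


def build_correction_prompt(transcript: str) -> str:
    """Build an instruction prompt for correcting a Sanskrit ASR transcript."""
    return (
        "You are correcting automatic speech recognition output for Sanskrit.\n"
        "Fix misrecognized words and spelling errors, keep the Devanagari script, "
        "and do not translate or add commentary.\n"
        "Return only the corrected text.\n\n"
        f"Transcript: {transcript}"
    )


def correct_transcript(transcript: str, generate: Callable[[str], str]) -> str:
    """Run the prompt through `generate`; fall back to the input on an empty reply."""
    reply = generate(build_correction_prompt(transcript)).strip()
    return reply or transcript


def gemini_generate(prompt: str) -> str:
    """One possible `generate` backend, using the google-genai SDK (assumption)."""
    import os
    from google import genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return response.text or ""
```

Passing the backend in as a function keeps the correction logic testable offline: `correct_transcript(text, gemini_generate)` calls the API, while a stub function can stand in during tests.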

Sanskrit TTS

Specialized Sanskrit text-to-speech using the sanskrit-tts package, with fine-tuning of Coqui xTTS-v2 on Hindi voices planned.

In Development

Processing Pipeline

  1. Audio Preprocessing

    DeepFilterNet performs real-time noise reduction and speech enhancement, removing background noise while preserving Sanskrit phonetic characteristics essential for accurate recognition.

  2. Real-time Speech Recognition

    Faster Whisper Large-v3 model processes the filtered audio stream, providing low-latency transcription with improved Sanskrit language understanding compared to standard Whisper models.

  3. Intelligent Text Correction

    Gemini API analyzes the transcribed text for Sanskrit linguistic accuracy, correcting common ASR errors and ensuring proper Sanskrit grammar and vocabulary usage.

  4. Sanskrit Speech Synthesis

    The current implementation uses the sanskrit-tts package to generate Sanskrit audio output. Work is ongoing to fine-tune Coqui xTTS-v2 on Hindi voice models for improved naturalness.
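The four steps above can be sketched as a chain of stage functions with per-stage timing. Everything in this block is illustrative: the `Pipeline` class and the stand-in stages are hypothetical placeholders, not the project's code, and the real stages would wrap DeepFilterNet, Faster Whisper, the Gemini API, and the TTS backend.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Pipeline:
    """Run named stages in order and record how long each one took."""
    stages: list[tuple[str, Callable[[Any], Any]]]
    timings_ms: dict[str, float] = field(default_factory=dict)

    def run(self, chunk: Any) -> Any:
        data = chunk
        for name, stage in self.stages:
            start = time.perf_counter()
            data = stage(data)  # output of one stage feeds the next
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0
        return data


# Stand-in stages (hypothetical): each mimics only the input/output type of a real stage.
def denoise(audio: bytes) -> bytes:
    return audio

def transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")

def correct(text: str) -> str:
    return text.strip()

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = Pipeline([
    ("denoise", denoise),
    ("asr", transcribe),
    ("correct", correct),
    ("tts", synthesize),
])
```

After `pipeline.run(chunk)`, `pipeline.timings_ms` holds one latency figure per stage, which is one way to collect the kind of per-stage numbers reported in the performance section.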

Performance Analysis

Pipeline Performance Evaluation

यदा यदा हि धर्मस्य ग्लानिर्भवति भारत

Bhagavad Gita, Chapter 4, Verse 7 - Classical Sanskrit verse used for system evaluation
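The page does not state which metric was used for evaluation. One common choice for ASR output is character error rate (CER), sketched below under that assumption with the verse as reference text. The function names are hypothetical, and the distance is computed over Unicode code points rather than grapheme clusters, so a wrong vowel sign counts as one error.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Normalize Unicode form and whitespace so equivalent strings compare equal.
    ref = unicodedata.normalize("NFC", " ".join(reference.split()))
    hyp = unicodedata.normalize("NFC", " ".join(hypothesis.split()))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


VERSE = "यदा यदा हि धर्मस्य ग्लानिर्भवति भारत"
```

A call such as `cer(VERSE, transcript)` returns 0.0 for a perfect transcript and grows with each inserted, deleted, or substituted character.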

Processing Pipeline Comparison

| Processing Stage | Output Quality | Real-time Performance (tested on Colab T4) |
|---|---|---|
| DeepFilterNet Filtering | Excellent noise reduction with preserved speech clarity | 100-200 ms latency for real-time processing |
| Faster Whisper Large-v3 | Improved Sanskrit recognition accuracy | Real-time transcription with streaming capability |
| Gemini Text Correction | Contextual Sanskrit grammar and vocabulary correction | ~200 ms processing time per correction batch |
| Sanskrit TTS | Functional Sanskrit pronunciation | Moderate quality, planned improvements with xTTS-v2 |

Key Improvements

Real-time Capability: The current pipeline achieves end-to-end processing latency under 500 ms for 100-200 ms of audio, making it suitable for interactive applications and live Sanskrit conversation systems.
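The reported figures can be combined into a rough latency budget. The sketch below takes the upper bound for DeepFilterNet and the approximate Gemini figure from the comparison table and subtracts them from the 500 ms target. It assumes the stages run sequentially and that correction runs once per chunk, neither of which the page states.

```python
BUDGET_MS = 500  # end-to-end target stated in the text

# Worst-case per-stage figures taken from the comparison table (Colab T4).
reported_ms = {"DeepFilterNet": 200, "Gemini correction": 200}

used_ms = sum(reported_ms.values())
remaining_ms = BUDGET_MS - used_ms  # what is left for ASR, synthesis, and I/O

print(f"reported stages: {used_ms} ms, remaining: {remaining_ms} ms")
# prints "reported stages: 400 ms, remaining: 100 ms"
```

Under these assumptions only about 100 ms remains for speech recognition and everything else, which suggests the stages either overlap in practice or usually run below their worst-case figures.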

Enhanced Accuracy: DeepFilterNet preprocessing combined with Faster Whisper Large-v3 and Gemini correction provides significantly improved transcription accuracy compared to previous approaches using basic vocal separation.

Scalability: The modular design allows for easy replacement and upgrading of individual components as better models become available.
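One way to get this kind of replaceability is to have each stage depend on a small interface and select the backend by name. The sketch below does this for synthesis. The `Synthesizer` protocol, both backend classes, and the registry are hypothetical, and the backends return tagged bytes instead of real audio.

```python
from typing import Protocol


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class SanskritTTSBackend:
    """Placeholder for a wrapper around the current sanskrit-tts package."""
    def synthesize(self, text: str) -> bytes:
        return b"sanskrit-tts:" + text.encode("utf-8")


class XTTSBackend:
    """Placeholder for the planned fine-tuned xTTS-v2 backend."""
    def synthesize(self, text: str) -> bytes:
        return b"xtts-v2:" + text.encode("utf-8")


# Registry mapping a configuration name to a backend class.
BACKENDS: dict[str, type] = {
    "sanskrit-tts": SanskritTTSBackend,
    "xtts-v2": XTTSBackend,
}


def make_synthesizer(name: str) -> Synthesizer:
    """Instantiate the backend registered under `name`."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown TTS backend: {name!r}") from None
```

With this shape, switching from the current TTS to a fine-tuned model is a one-word configuration change, for example `make_synthesizer("xtts-v2")`, and the rest of the pipeline is untouched.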

Implementation Details

Repository Structure

Current Technical Stack

# Core Dependencies
- DeepFilterNet (Real-time noise reduction)
- Faster Whisper Large-v3 (Real-time ASR)
- Google Gemini API (Text correction)
- sanskrit-tts (Sanskrit text-to-speech)
- Coqui xTTS-v2 (Planned Hindi fine-tuning)

# Key Features
- Real-time processing capability
- Sanskrit-specific language corrections
- Modular pipeline architecture
- Streaming audio support

Sanskrit TTS Implementation

# Current TTS Solution
Repository: https://github.com/SameeraMurthy/sanskrit-tts.git
- Node.js package for Sanskrit pronunciation
- Phonetic mapping for Sanskrit characters
- Audio generation for Sanskrit text

# Planned Enhancement
- Coqui xTTS-v2 fine-tuning on Hindi voices
- Improved naturalness and pronunciation
- Better Sanskrit phonetic handling

Installation and Usage

    git clone https://github.com/Rstar-910/SamskritaBharati
    cd SamskritaBharati
    pip install -r requirements.txt

    # Install sanskrit-tts
    git clone https://github.com/SameeraMurthy/sanskrit-tts.git
    cd sanskrit-tts
    npm install

Future Development

  • Coqui xTTS-v2 Integration: Fine-tune xTTS-v2 model on Hindi voice datasets to improve Sanskrit TTS naturalness and pronunciation accuracy.
  • Real-time Optimization: Further reduce processing latency through model optimization and hardware acceleration for mobile and edge deployment.
  • Sanskrit-specific ASR Fine-tuning: Fine-tune Faster Whisper on Sanskrit-specific datasets to improve phonetic recognition accuracy.
  • Voice Cloning Capabilities: Implement personalized voice synthesis for specific Sanskrit speakers or traditional recitation styles.

References & Resources