Abstract
This research project presents an advanced real-time pipeline for processing Sanskrit speech through automated transcription, translation, and synthesis. Our current approach integrates DeepFilterNet for noise reduction, Faster Whisper Large-v3 for real-time speech recognition, Gemini API for intelligent text correction, and specialized Sanskrit TTS synthesis. The system is designed to handle Sanskrit linguistic complexities while maintaining real-time performance capabilities.
The pipeline addresses unique challenges in Sanskrit speech processing by implementing state-of-the-art noise filtering and leveraging large language models for contextual correction of transcription errors. This work contributes to the development of practical Sanskrit language technology tools for education, research, and cultural preservation.
Current Methodology
Real-time Processing Pipeline
Our current approach focuses on real-time processing capabilities with enhanced accuracy through multi-stage correction and specialized Sanskrit synthesis.
Noise Reduction (Implemented)
DeepFilterNet provides real-time noise suppression specifically optimized for speech clarity and processing efficiency.
Speech Recognition (Implemented)
The Faster Whisper Large-v3 model enables real-time transcription with improved Sanskrit language support.
Text Correction (Implemented)
The Gemini API provides intelligent post-processing correction for Sanskrit transcription accuracy.
Sanskrit TTS (In Development)
Specialized Sanskrit text-to-speech using the sanskrit-tts package, with future Coqui xTTS-v2 Hindi fine-tuning.
Processing Pipeline
1. Audio Preprocessing: DeepFilterNet performs real-time noise reduction and speech enhancement, removing background noise while preserving the Sanskrit phonetic characteristics essential for accurate recognition.
2. Real-time Speech Recognition: The Faster Whisper Large-v3 model processes the filtered audio stream, providing low-latency transcription with improved Sanskrit language understanding compared to standard Whisper models.
3. Intelligent Text Correction: The Gemini API analyzes the transcribed text for Sanskrit linguistic accuracy, correcting common ASR errors and ensuring proper Sanskrit grammar and vocabulary usage.
4. Sanskrit Speech Synthesis: The current implementation uses the sanskrit-tts package to generate Sanskrit audio output, with ongoing development of Coqui xTTS-v2 fine-tuned on Hindi voice models for improved naturalness.
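The first three stages above can be sketched in Python. The `df.enhance` and `faster_whisper` calls follow those packages' documented APIs; the Gemini prompt wording, the `gemini-1.5-flash` model choice, and the glue functions (`denoise`, `transcribe`, `correct`) are illustrative assumptions, not the project's exact code.

```python
def denoise(noisy_path: str, clean_path: str) -> None:
    """Stage 1: DeepFilterNet noise reduction (pip install deepfilternet)."""
    from df.enhance import enhance, init_df, load_audio, save_audio
    model, df_state, _ = init_df()  # loads the default DeepFilterNet weights
    audio, _ = load_audio(noisy_path, sr=df_state.sr())
    save_audio(clean_path, enhance(model, df_state, audio), df_state.sr())

def transcribe(clean_path: str) -> str:
    """Stage 2: Faster Whisper Large-v3 (pip install faster-whisper)."""
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    # "sa" is Whisper's language code for Sanskrit
    segments, _info = model.transcribe(clean_path, language="sa", beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)

def build_correction_prompt(raw_text: str) -> str:
    """Stage 3 helper: the prompt sent to Gemini (wording is an assumption)."""
    return (
        "Correct the following Sanskrit ASR transcript. Fix sandhi, spelling, "
        "and Devanagari errors, and return only the corrected text:\n" + raw_text
    )

def correct(raw_text: str, api_key: str) -> str:
    """Stage 3: contextual correction via the Gemini API (pip install google-generativeai)."""
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption
    return model.generate_content(build_correction_prompt(raw_text)).text
```

Keeping each stage behind its own function mirrors the modular design noted below: any one component can be swapped (e.g. a Sanskrit-fine-tuned ASR model) without touching the others.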
Performance Analysis
Pipeline Performance Evaluation
Bhagavad Gita, Chapter 4, Verse 7 - Classical Sanskrit verse used for system evaluation
Processing Pipeline Comparison
Processing Stage | Output Quality | Real-time Performance (tested on Colab T4) |
---|---|---|
DeepFilterNet filtering | Excellent noise reduction with preserved speech clarity | 100-200 ms latency for real-time processing |
Faster Whisper Large-v3 | Improved Sanskrit recognition accuracy | Real-time transcription with streaming capability |
Gemini text correction | Contextual Sanskrit grammar and vocabulary correction | ~200 ms processing time per correction batch |
Sanskrit TTS | Functional Sanskrit pronunciation | Moderate quality; improvements planned with xTTS-v2 |
Key Improvements
Real-time Capability: The current pipeline achieves end-to-end processing latency under 500 ms per 100-200 ms audio chunk, making it suitable for interactive applications and live Sanskrit conversation systems.
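The 500 ms figure can be sanity-checked against the per-stage numbers reported above: taking the worst-case 200 ms for DeepFilterNet and ~200 ms for a Gemini correction batch leaves roughly 100 ms for streaming ASR output. A minimal arithmetic sketch (the ASR allowance is an assumption, not a measured value):

```python
# Rough latency budget using the per-stage figures reported in the table.
BUDGET_MS = 500

stage_latency_ms = {
    "deepfilternet": 200,      # worst case of the reported 100-200 ms
    "asr_streaming": 100,      # assumed allowance for Faster Whisper streaming
    "gemini_correction": 200,  # ~200 ms per correction batch (reported)
}

total_ms = sum(stage_latency_ms.values())
within_budget = total_ms <= BUDGET_MS
print(total_ms, within_budget)  # 500 True
```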
Enhanced Accuracy: DeepFilterNet preprocessing combined with Faster Whisper Large-v3 and Gemini correction provides significantly improved transcription accuracy compared to previous approaches using basic vocal separation.
Scalability: The modular design allows for easy replacement and upgrading of individual components as better models become available.
Implementation Details
Repository Structure
- realtime_pipeline.ipynb - Main real-time processing pipeline
- deepfilter_preprocessing.ipynb - DeepFilterNet integration and testing
- audio/ - Test datasets and processed audio samples
- README.md - Complete setup and usage documentation
Current Technical Stack
- DeepFilterNet for real-time noise suppression
- Faster Whisper Large-v3 (CTranslate2) for streaming speech recognition
- Gemini API for post-ASR text correction
- sanskrit-tts (Node.js) for Sanskrit speech synthesis
- Jupyter/Colab notebooks, evaluated on a T4 GPU
Sanskrit TTS Implementation
The synthesis stage currently calls the sanskrit-tts Node.js package to render corrected text to Sanskrit audio; this remains the weakest stage and is slated for replacement by a Coqui xTTS-v2 model fine-tuned on Hindi voices.
Installation and Usage
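A minimal installation sketch, assuming the public PyPI and npm package names for each component; pin exact versions for a reproducible deployment.

```shell
# Python components: noise reduction, ASR, and Gemini correction
pip install deepfilternet faster-whisper google-generativeai

# Node.js component: Sanskrit speech synthesis
npm install sanskrit-tts
```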
Future Development
- Coqui xTTS-v2 Integration: Fine-tune xTTS-v2 model on Hindi voice datasets to improve Sanskrit TTS naturalness and pronunciation accuracy.
- Real-time Optimization: Further reduce processing latency through model optimization and hardware acceleration for mobile and edge deployment.
- Sanskrit-specific ASR Fine-tuning: Fine-tune Faster Whisper on Sanskrit-specific datasets to improve phonetic recognition accuracy.
- Voice Cloning Capabilities: Implement personalized voice synthesis for specific Sanskrit speakers or traditional recitation styles.
References & Resources
- Schröter, H., et al. (2022). DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio. ICASSP 2022.
- Klein, G. (2023). Faster Whisper: Faster implementation of OpenAI's Whisper model using CTranslate2.
- Murthy, S. (2023). Sanskrit TTS: A Node.js package for Sanskrit text-to-speech synthesis.
- Coqui TTS Team. (2023). Coqui xTTS-v2: Multilingual Text-to-Speech with Voice Cloning.
- Google DeepMind. (2023). Gemini API: Large Language Model for Text Generation and Analysis.