Abstract
This research project presents an advanced real-time pipeline for processing Sanskrit speech through automated transcription, translation, and synthesis. Our current approach integrates DeepFilterNet for noise reduction, Faster Whisper Large-v3 for real-time speech recognition, Gemini API for intelligent text correction, and specialized Sanskrit TTS synthesis. The system is designed to handle Sanskrit linguistic complexities while maintaining real-time performance capabilities.
The pipeline addresses unique challenges in Sanskrit speech processing by implementing state-of-the-art noise filtering and leveraging large language models for contextual correction of transcription errors. This work contributes to the development of practical Sanskrit language technology tools for education, research, and cultural preservation.
Current Methodology
Real-time Processing Pipeline
Our current approach focuses on real-time processing capabilities with enhanced accuracy through multi-stage correction and specialized Sanskrit synthesis.
Noise Reduction (Implemented)
DeepFilterNet provides real-time noise suppression specifically optimized for speech clarity and processing efficiency.
Speech Recognition (Implemented)
The Faster Whisper Large-v3 model enables real-time transcription with improved Sanskrit language support.
Text Correction (Implemented)
The Gemini API provides intelligent post-processing correction for Sanskrit transcription accuracy.
Sanskrit TTS (In Development)
Specialized Sanskrit text-to-speech using the sanskrit-tts package, with future Coqui xTTS-v2 Hindi fine-tuning.
Processing Pipeline
1. Audio Preprocessing: DeepFilterNet performs real-time noise reduction and speech enhancement, removing background noise while preserving the Sanskrit phonetic characteristics essential for accurate recognition.
2. Real-time Speech Recognition: The Faster Whisper Large-v3 model processes the filtered audio stream, providing low-latency transcription with improved Sanskrit language understanding compared to standard Whisper models.
3. Intelligent Text Correction: The Gemini API analyzes the transcribed text for Sanskrit linguistic accuracy, correcting common ASR errors and ensuring proper Sanskrit grammar and vocabulary usage.
4. Sanskrit Speech Synthesis: The current implementation uses the sanskrit-tts package to generate Sanskrit audio output, with ongoing development of Coqui xTTS-v2 fine-tuned on Hindi voice models for improved naturalness.
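The first three stages above can be sketched in Python. The `df.enhance` and `faster_whisper` calls follow those packages' documented APIs; the Gemini prompt wording, the `gemini-1.5-flash` model choice, and the glue functions (`denoise`, `transcribe`, `correct`) are illustrative assumptions, not the project's exact code.

```python
def denoise(noisy_path: str, clean_path: str) -> None:
    """Stage 1: DeepFilterNet noise reduction (pip install deepfilternet)."""
    from df.enhance import enhance, init_df, load_audio, save_audio
    model, df_state, _ = init_df()  # loads the default DeepFilterNet weights
    audio, _ = load_audio(noisy_path, sr=df_state.sr())
    save_audio(clean_path, enhance(model, df_state, audio), df_state.sr())

def transcribe(clean_path: str) -> str:
    """Stage 2: Faster Whisper Large-v3 (pip install faster-whisper)."""
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    # "sa" is Whisper's language code for Sanskrit
    segments, _info = model.transcribe(clean_path, language="sa", beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)

def build_correction_prompt(raw_text: str) -> str:
    """Stage 3 helper: the prompt sent to Gemini (wording is an assumption)."""
    return (
        "Correct the following Sanskrit ASR transcript. Fix sandhi, spelling, "
        "and Devanagari errors, and return only the corrected text:\n" + raw_text
    )

def correct(raw_text: str, api_key: str) -> str:
    """Stage 3: contextual correction via the Gemini API (pip install google-generativeai)."""
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption
    return model.generate_content(build_correction_prompt(raw_text)).text
```

Keeping each stage behind its own function mirrors the modular design noted below: any one component can be swapped (e.g. a Sanskrit-fine-tuned ASR model) without touching the others.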
Performance Analysis
Pipeline Performance Evaluation
Bhagavad Gita, Chapter 4, Verse 7 - Classical Sanskrit verse used for system evaluation
Processing Pipeline Comparison
Processing Stage | Output Quality | Real-time Performance (tested on Colab T4) |
---|---|---|
DeepFilterNet filtering | Excellent noise reduction with preserved speech clarity | 100-200 ms latency for real-time processing |
Faster Whisper Large-v3 | Improved Sanskrit recognition accuracy | Real-time transcription with streaming capability |
Gemini text correction | Contextual Sanskrit grammar and vocabulary correction | ~200 ms processing time per correction batch |
Sanskrit TTS | Functional Sanskrit pronunciation | Moderate quality; improvements planned with xTTS-v2 |
Key Improvements
Real-time Capability: The current pipeline achieves end-to-end processing latency under 500 ms per 100-200 ms audio chunk, making it suitable for interactive applications and live Sanskrit conversation systems.
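The 500 ms figure can be sanity-checked against the per-stage numbers reported above: taking the worst-case 200 ms for DeepFilterNet and ~200 ms for a Gemini correction batch leaves roughly 100 ms for streaming ASR output. A minimal arithmetic sketch (the ASR allowance is an assumption, not a measured value):

```python
# Rough latency budget using the per-stage figures reported in the table.
BUDGET_MS = 500

stage_latency_ms = {
    "deepfilternet": 200,      # worst case of the reported 100-200 ms
    "asr_streaming": 100,      # assumed allowance for Faster Whisper streaming
    "gemini_correction": 200,  # ~200 ms per correction batch (reported)
}

total_ms = sum(stage_latency_ms.values())
within_budget = total_ms <= BUDGET_MS
print(total_ms, within_budget)  # 500 True
```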
Enhanced Accuracy: DeepFilterNet preprocessing combined with Faster Whisper Large-v3 and Gemini correction provides significantly improved transcription accuracy compared to previous approaches using basic vocal separation.
Scalability: The modular design allows for easy replacement and upgrading of individual components as better models become available.
Implementation Details
Repository Structure
- realtime_pipeline.ipynb - Main real-time processing pipeline
- deepfilter_preprocessing.ipynb - DeepFilterNet integration and testing
- audio/ - Test datasets and processed audio samples
- README.md - Complete setup and usage documentation
Current Technical Stack
- DeepFilterNet for real-time noise suppression
- Faster Whisper Large-v3 (CTranslate2) for streaming speech recognition
- Gemini API for post-ASR text correction
- sanskrit-tts (Node.js) for Sanskrit speech synthesis
- Jupyter/Colab notebooks, evaluated on a T4 GPU
Sanskrit TTS Implementation
The synthesis stage currently calls the sanskrit-tts Node.js package to render corrected text to Sanskrit audio; this remains the weakest stage and is slated for replacement by a Coqui xTTS-v2 model fine-tuned on Hindi voices.
Installation and Usage
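A minimal installation sketch, assuming the public PyPI and npm package names for each component; pin exact versions for a reproducible deployment.

```shell
# Python components: noise reduction, ASR, and Gemini correction
pip install deepfilternet faster-whisper google-generativeai

# Node.js component: Sanskrit speech synthesis
npm install sanskrit-tts
```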
Future Development
- Coqui xTTS-v2 Integration: Fine-tune xTTS-v2 model on Hindi voice datasets to improve Sanskrit TTS naturalness and pronunciation accuracy.
- Real-time Optimization: Further reduce processing latency through model optimization and hardware acceleration for mobile and edge deployment.
- Sanskrit-specific ASR Fine-tuning: Fine-tune Faster Whisper on Sanskrit-specific datasets to improve phonetic recognition accuracy.
- Voice Cloning Capabilities: Implement personalized voice synthesis for specific Sanskrit speakers or traditional recitation styles.
References & Resources
- Schröter, H., et al. (2022). DeepFilterNet: A Low Complexity Speech Enhancement Framework for Full-Band Audio. ICASSP 2022.
- Klein, G. (2023). Faster Whisper: Faster implementation of OpenAI's Whisper model using CTranslate2.
- Murthy, S. (2023). Sanskrit TTS: A Node.js package for Sanskrit text-to-speech synthesis.
- Coqui TTS Team. (2023). Coqui xTTS-v2: Multilingual Text-to-Speech with Voice Cloning.
- Google DeepMind. (2023). Gemini API: Large Language Model for Text Generation and Analysis.