Project Details

Project ID BITS-SRIP/F409FC/2026
Project Title Project Vaani: Agentic Hindi Audiobook Creator
Project Description

Project Overview

Current text-to-speech (TTS) tools often sound robotic, especially when reading Hindi literature, which calls for emotional delivery and specific recitation styles. Project Vaani aims to build a software tool that automatically converts Hindi books (PDF/Text) into high-quality audiobooks narrated in a specific author’s voice.

The core engineering challenge is Voice Cloning and Emotion Control. You will build a system in which an AI agent reads the text, decides whether each sentence should be spoken with anger, sadness, or joy, and then generates the audio using a cloned voice model.

Scope of Work

Engineering Modules

1. Technology Evaluation (The Experiment):

A key part of this internship is the Build vs. Buy evaluation. You will set up two parallel pipelines to test voice cloning quality for Hindi:

Pipeline A (Open Source): Deploy models like F5-TTS (excellent for zero-shot cloning), Coqui XTTS-v2 (strong multilingual support), or Kokoro (lightweight/fast) to test self-hosted performance.

Pipeline B (Commercial APIs): Integrate APIs like ElevenLabs (industry standard for emotion) or Azure AI Speech to set a quality benchmark.

Goal: Produce a technical report comparing them on Hindi pronunciation, emotional range, speed, and cost.
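
The speed and cost columns of such a report could be gathered with a small harness like the sketch below. It is only an illustration: `benchmark`, `BenchmarkResult`, and the `synthesize` callable (assumed to wrap whichever engine is under test and to return the duration of the generated audio in seconds) are hypothetical names, and per-character pricing is an assumed cost model that may not match a given provider.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkResult:
    engine: str
    audio_seconds: float   # duration of the generated audio
    wall_seconds: float    # time spent generating it
    cost_usd: float        # estimated cost for this text

    @property
    def real_time_factor(self) -> float:
        # RTF below 1.0 means synthesis runs faster than playback.
        if self.audio_seconds <= 0:
            return float("inf")
        return self.wall_seconds / self.audio_seconds


def benchmark(
    engine: str,
    synthesize: Callable[[str], float],  # hypothetical wrapper around a TTS engine
    text: str,
    usd_per_char: float = 0.0,           # assumed pricing model; 0.0 for self-hosted
) -> BenchmarkResult:
    """Time one synthesis call and estimate its cost."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    wall_seconds = time.perf_counter() - start
    return BenchmarkResult(engine, audio_seconds, wall_seconds, usd_per_char * len(text))
```

Running the same Hindi passages through each pipeline with this kind of wrapper gives comparable numbers; pronunciation and emotional range would still need human listening tests.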

2. Intelligent Script Processing (The Logic):

Build a Python backend where an LLM (like Llama 3) analyses the Hindi text before it is converted to audio. The LLM will insert direction tags into the text (e.g., [Pause for 2 seconds], [Tone: Sorrowful]) to guide the voice generation.
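
One possible way for the backend to consume the LLM's output is to split the tagged text into speech and pause segments. The sketch below assumes the two tag shapes from the example above (`[Pause for N seconds]` and `[Tone: X]`); the `Segment` class and `parse_tagged_text` function are hypothetical names, and the real tag grammar may differ.

```python
import re
from dataclasses import dataclass
from typing import List

TAG_RE = re.compile(r"\[([^\[\]]+)\]")
PAUSE_RE = re.compile(r"Pause for (\d+(?:\.\d+)?) seconds?", re.IGNORECASE)
TONE_RE = re.compile(r"Tone:\s*(.+)", re.IGNORECASE)


@dataclass
class Segment:
    kind: str                   # "speech" or "pause"
    text: str = ""
    tone: str = "neutral"
    pause_seconds: float = 0.0


def parse_tagged_text(tagged: str) -> List[Segment]:
    """Split tagged text into segments; a tone tag applies until the next tone tag."""
    segments: List[Segment] = []
    tone = "neutral"
    pos = 0
    for match in TAG_RE.finditer(tagged):
        spoken = tagged[pos:match.start()].strip()
        if spoken:
            segments.append(Segment("speech", text=spoken, tone=tone))
        body = match.group(1).strip()
        pause = PAUSE_RE.fullmatch(body)
        tone_match = TONE_RE.fullmatch(body)
        if pause:
            segments.append(Segment("pause", pause_seconds=float(pause.group(1))))
        elif tone_match:
            tone = tone_match.group(1).strip().lower()
        # Any other bracketed text is dropped in this sketch.
        pos = match.end()
    tail = tagged[pos:].strip()
    if tail:
        segments.append(Segment("speech", text=tail, tone=tone))
    return segments
```

Each speech segment can then be sent to the voice model with its tone, and each pause segment rendered as silence of the given length.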

3. Audio Synthesis Pipeline (The Engine):

Develop the code that takes the tagged text and the author’s reference audio sample and generates the final speech. You will implement Forced Alignment (using OpenAI Whisper) to verify that the generated audio matches the text, so that hallucinated or skipped words can be detected and corrected.
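
As a simpler stand-in for full forced alignment, the check can start as a word-level comparison between the source text and a transcript of the generated audio. The sketch below assumes the transcript has already been produced by a speech recognition model such as Whisper (it does not call Whisper itself) and uses Python's standard `difflib`; `tokenize` and `find_mismatches` are hypothetical names, and the punctuation list is an assumption that a real pipeline would likely extend.

```python
import difflib
import re
from typing import List, Tuple

# Punctuation to ignore when comparing, including the Devanagari danda.
PUNCTUATION = "।॥,.!?;:\"'()"
_STRIP = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokenize(text: str) -> List[str]:
    """Drop [direction tags] and punctuation, then split on whitespace."""
    text = re.sub(r"\[[^\[\]]*\]", " ", text)
    return text.translate(_STRIP).split()


def find_mismatches(expected: str, transcript: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return (operation, expected_words, heard_words) for every difference.

    "delete"  -> words in the text that are missing from the audio (skipped)
    "insert"  -> words in the audio that are not in the text (hallucinated)
    "replace" -> words that were heard differently
    """
    exp, got = tokenize(expected), tokenize(transcript)
    matcher = difflib.SequenceMatcher(a=exp, b=got, autojunk=False)
    return [
        (op, exp[i1:i2], got[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

Segments with a non-empty result could be flagged for regeneration. Because a recognizer may spell words or numerals differently from the book, some text normalization would probably be needed before treating every difference as an error.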

4. Web Application (The Interface):

Create a simple dashboard that allows users to upload a PDF and an MP3 sample. The app should display the text, allow the user to edit the emotion tags manually if needed, and let them download the final audiobook chapter.

Expected Tangible Outcomes

A comparative study report: Open Source vs. Commercial Voice Cloning for Hindi.

A working Python application that generates an audiobook chapter from a PDF.

A fine-tuned open-source voice model capable of mimicking the target author.

A browser-based interface to demonstrate the tool to non-technical users.
Project Discipline Computer Science, Artificial Intelligence, Machine Learning, Signal Processing, Software Engineering
Faculty Name Ashutosh Bhatia
Department Department of Computer Science & Information Systems