Automating audio generation

From text, generate audio files and publish them to a web app

Using a Text-to-Audio Conversion Service to Publish Audio Content to a Web App

In this post, we'll walk through an approach to convert text into audio using Azure AI Speech Service and serve the generated audio files from a web application.


Method Provided by LLM

The following approach was generated with the assistance of an LLM (GitHub Copilot); a reference implementation based on it appears at the end of this post.

Architecture

Text Input → Azure AI Speech Service → .wav/.mp3 file → Azure Blob Storage → Web App

Approach — Step by Step

1. Provision an Azure AI Speech Resource

  • Go to the Azure Portal → Create a resource → search Speech → select Speech by Microsoft
  • Choose a Resource group (or create one, e.g. rg-audio-gen)
  • Pick a Region close to you (e.g. East US)
  • Select Pricing tier: Free F0 for testing, or Standard S0 for production
  • After deployment, navigate to the resource and note down the Key and Region from the Keys and Endpoint blade
  • Alternatively, use the Azure CLI: az cognitiveservices account create with --kind SpeechServices

2. Set Up Your Python Environment

  • Install the Azure Speech SDK package: azure-cognitiveservices-speech
  • Store your Speech key and region as environment variables (AZURE_SPEECH_KEY, AZURE_SPEECH_REGION) — never hard-code secrets
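
A quick sanity check for this step, as a minimal sketch (it assumes only the two variable names above):

# env_check.py: fail fast if the Speech credentials are missing
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],   # KeyError if unset
    region=os.environ["AZURE_SPEECH_REGION"],
)
print("Speech config created for region:", os.environ["AZURE_SPEECH_REGION"])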

3. Write the Text-to-Audio Conversion Logic

  • Create a SpeechConfig object using your key and region
  • Set the voice using speech_synthesis_voice_name — Azure supports 400+ neural voices across 140+ languages (see voice gallery)
  • Create an AudioOutputConfig pointing to a local .wav file
  • Instantiate a SpeechSynthesizer and call speak_text_async() with your input text
  • Handle both the success case (SynthesizingAudioCompleted) and the cancellation/error case
  • The SDK writes the audio directly to the specified output file
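
A minimal sketch of this step; the file name and sample text are illustrative, and fuller error handling lives in the Reference Implementation at the end of this post:

# synth_sketch.py: plain text to a local .wav file
import os
import azure.cognitiveservices.speech as speechsdk

cfg = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)
cfg.speech_synthesis_voice_name = "en-US-JennyNeural"

audio_cfg = speechsdk.audio.AudioOutputConfig(filename="output.wav")
synth = speechsdk.SpeechSynthesizer(speech_config=cfg, audio_config=audio_cfg)

result = synth.speak_text_async("Hello from Azure Speech.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("wrote output.wav")
elif result.reason == speechsdk.ResultReason.Canceled:
    print("canceled:", result.cancellation_details.error_details)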

4. Provision Azure Blob Storage

  • Create a Storage Account in the same resource group (e.g. via az storage account create)
  • Create a Blob Container (e.g. audio-files) with public blob-level read access so audio URLs are directly accessible
  • Install the azure-storage-blob Python package
  • Store the storage connection string as an environment variable (AZURE_STORAGE_CONNECTION_STRING)
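
To bootstrap the container from code rather than the Portal, a sketch (the container name matches the example above; "blob" public access is what makes individual audio URLs directly readable):

# container_sketch.py: create the audio container if it doesn't exist
import os
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
try:
    svc.create_container("audio-files", public_access="blob")
except ResourceExistsError:
    pass  # container already provisioned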

5. Upload the Generated Audio File

  • Use BlobServiceClient.from_connection_string() to connect
  • Get a BlobClient for your container and target blob name
  • Open the local .wav file in binary mode and call upload_blob() with overwrite=True
  • The blob client exposes a .url property — this is the public URL of your audio file
  • Optionally delete the local file after upload to save disk space
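
An upload sketch matching the bullets above (file and blob names are illustrative):

# upload_sketch.py: push a local file to Blob Storage and print its URL
import os
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
blob = svc.get_blob_client(container="audio-files", blob="output.wav")

with open("output.wav", "rb") as f:
    blob.upload_blob(f, overwrite=True)

print(blob.url)  # directly playable if the container allows blob-level read
os.remove("output.wav")  # optional local cleanup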

6. Serve the Audio in Your Web App

  • The uploaded blob will have a public URL like: https://<account>.blob.core.windows.net/audio-files/output.wav
  • Embed it in HTML using the <audio controls> element with a <source> tag
  • In Flask (this blog's stack), you can create a route that renders the audio player with the blob URL passed as a template variable (see the Flask Route section at the end of this post)

7. End-to-End Flow

Combine everything into a single function that:

  1. Takes text input as a parameter
  2. Synthesizes it to a local .wav file via Azure Speech
  3. Uploads the file to Blob Storage
  4. Cleans up the local file
  5. Returns the public URL

This is exactly what the Reference Implementation at the end of this post does.

Key Azure Services Used

Service              Purpose
Azure AI Speech      Converts text to natural-sounding audio using neural voices
Azure Blob Storage   Hosts the generated audio files with public URL access

Cost Estimate

Service              Free Tier                 Pay-as-you-go
Azure AI Speech      500K chars/month (F0)     ~$1 per 1M chars (S0)
Azure Blob Storage   5 GB free for 12 months   ~$0.02/GB/month

Important Notes

  • Security: Use environment variables for all keys and connection strings. Consider Azure Key Vault for production.
  • Voice selection: Experiment with different speech_synthesis_voice_name values. Multi-lingual voices (e.g. en-US-JennyMultilingualNeural) can handle mixed-language text.
  • Output format: The SDK defaults to WAV. For smaller files, configure MP3 output via speech_config.set_speech_synthesis_output_format().
  • SSML: For finer control over pronunciation, pauses, pitch, and speed, use SSML markup instead of plain text with speak_ssml_async().
  • Batch processing: For large volumes of text, consider splitting into chunks and processing in parallel (a sketch follows this list).
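
A parallel batch sketch, assuming the synthesize() function from the Reference Implementation at the end of this post; splitting on blank lines is just one illustrative chunking strategy:

# batch_sketch.py: parallel synthesis of a long document
from concurrent.futures import ThreadPoolExecutor

from tts_pipeline import synthesize  # from the Reference Implementation

def synthesize_document(text: str, max_workers: int = 4) -> list[str]:
    """Split text into paragraph chunks and synthesize them in parallel."""
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # each chunk becomes its own blob; one public URL per chunk
        return list(pool.map(synthesize, chunks))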

Cleanup

Delete the resource group when done to avoid charges:

az group delete --name rg-audio-gen --yes --no-wait

Background & Prerequisites — What You Need to Know Before Writing This Blog

To build a text-to-audio pipeline on Azure, you need to understand speech synthesis, cloud storage, and web serving. Below are the foundational topics.


1. Text-to-Speech (TTS) Fundamentals

Why: The core of this project is converting text to audio — you need to understand how modern TTS works.

  • Concatenative TTS — Older approach: record a voice actor speaking thousands of phoneme combinations, then concatenate matching segments at runtime. Sounds robotic at boundaries.
  • Neural TTS — Modern approach: deep neural networks (WaveNet, Tacotron, VITS) generate speech waveforms directly from text. Produces natural, human-like speech with proper intonation and prosody.
  • Azure Neural Voices — Microsoft's neural TTS offering. 400+ voices across 140+ languages. Voices are trained on hours of recorded speech from voice actors. Styles include "cheerful," "sad," "newscast," "customer service."
  • Phonemes & Prosody — Phonemes are the individual units of sound. Prosody covers rhythm, stress, and intonation. Neural TTS models learn prosody from training data but can be controlled via SSML.
  • Vocoder — Converts the model's intermediate representation (mel spectrogram) into an audio waveform. HiFi-GAN and WaveRNN are common vocoders.

2. SSML (Speech Synthesis Markup Language)

Why: For production-quality audio, plain text is not enough — SSML gives fine-grained control.

  • What is SSML — An XML-based markup language that controls how text is spoken. Supported by all major TTS engines (Azure, Google, AWS Polly).
  • Key tags:
      • <speak> — Root element
      • <voice> — Select a specific voice
      • <prosody> — Control rate, pitch, volume (e.g., <prosody rate="slow" pitch="+10%">)
      • <break> — Insert pauses (e.g., <break time="500ms"/>)
      • <emphasis> — Stress certain words
      • <say-as> — Control interpretation (dates, numbers, abbreviations: <say-as interpret-as="date">2026-02-22</say-as>)
      • <phoneme> — Specify exact pronunciation using IPA (International Phonetic Alphabet)
  • Multi-voice conversations — Use multiple <voice> tags within one SSML document to create dialogue-style audio (see the sketch below).
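
A dialogue sketch using two voices in one SSML document; the voice names are real Azure neural voices, the script itself is illustrative:

# ssml_dialogue.py: a two-voice SSML document
dialogue = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    Did the upload finish? <break time='300ms'/>
  </voice>
  <voice name='en-US-GuyNeural'>
    <emphasis>Almost.</emphasis> <prosody rate='slow'>One file left.</prosody>
  </voice>
</speak>
"""
# pass `dialogue` to speak_ssml_async(), as in the Reference Implementation below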

3. Audio Formats & Encoding

Why: Choosing the right audio format affects file size, quality, and browser compatibility.

  • WAV (Waveform Audio) — Uncompressed, lossless. Large files (~10MB per minute of speech). Best quality but impractical for web serving.
  • MP3 — Compressed, lossy. ~1MB per minute at 128kbps. Universal browser support. Best for web delivery.
  • OGG/Opus — Open-source, excellent compression. Slightly better quality than MP3 at same bitrate. Not universally supported (no Safari on iOS).
  • Azure SDK output formats — Configurable via SpeechSynthesisOutputFormat enum. Options include Audio16Khz32KBitRateMonoMp3, Riff24Khz16BitMonoPcm, etc. Choose based on quality vs size tradeoff (see the sketch after this list).
  • Sample rate — 16kHz is fine for speech. 24kHz or 48kHz for higher quality. Higher sample rate = larger file.
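
A small sketch of the quality-versus-size decision; both enum values below are real SDK members, and the size math is back-of-envelope:

# format_sketch.py: picking an output format and estimating file size
import azure.cognitiveservices.speech as speechsdk

# compact MP3, fine for speech:
fmt = speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
# lossless WAV, only if you need to post-process the audio:
# fmt = speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm

# rough size math: 32 kbps = 32_000 / 8 bytes per second
bytes_per_minute = (32_000 / 8) * 60   # 240,000 bytes, ~0.24 MB per minute
print(f"~{bytes_per_minute / 1_000_000:.2f} MB per minute of speech")

# apply to a SpeechConfig before creating the synthesizer:
# speech_config.set_speech_synthesis_output_format(fmt)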

4. Azure Blob Storage Fundamentals

Why: Audio files need to be stored and served — Blob Storage is the hosting layer.

  • Storage account types — General-purpose v2 (recommended), BlobStorage (legacy). Always use GPv2.
  • Blob types — Block blobs (for files like audio), Append blobs (logs), Page blobs (VM disks). Audio files are block blobs.
  • Access tiers — Hot (frequent access, higher storage cost, lower access cost), Cool (infrequent, 30-day minimum), Archive (rare, hours to rehydrate). Audio served on a web app should be Hot.
  • Access levels — Private (default, requires SAS token or auth), Blob (public read for blobs only), Container (public read for container + blobs). For a simple web app, Blob-level access enables direct URL access.
  • SAS (Shared Access Signature) — Time-limited, scoped tokens for accessing private blobs. Better than making blobs public in production (see the sketch below).
  • CDN integration — Azure CDN can cache blob content at edge locations worldwide. Reduces latency for audio playback. Configure a CDN profile pointing to the storage endpoint.
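
A SAS sketch for serving private blobs; the account, container, and blob names are illustrative, and AZURE_STORAGE_ACCOUNT_KEY is an assumed environment variable:

# sas_sketch.py: generate a time-limited read URL for a private blob
import os
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas = generate_blob_sas(
    account_name="myaccount",                             # illustrative
    container_name="audio-files",
    blob_name="output.mp3",
    account_key=os.environ["AZURE_STORAGE_ACCOUNT_KEY"],  # assumed env var
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
url = f"https://myaccount.blob.core.windows.net/audio-files/output.mp3?{sas}"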

5. Azure AI Speech SDK

Why: The SDK is the primary tool for implementing text-to-audio conversion.

  • SpeechConfig — Configuration object holding the subscription key, region, voice name, and output format. Created once, reused across synthesis calls.
  • SpeechSynthesizer — The main class. Methods: speak_text_async() (plain text), speak_ssml_async() (SSML input). Can output to a file, audio stream, or speaker.
  • AudioOutputConfig — Controls where audio goes: file path, audio stream, or default speaker. For server-side processing, output to a file or memory stream.
  • Event-driven architecture — The SDK fires events: synthesis_started, synthesizing (partial audio chunks), synthesis_completed, synthesis_canceled. Use these for progress tracking and error handling (see the sketch below).
  • Error handling — Check SpeechSynthesisResult.reason. Values: ResultReason.SynthesizingAudioCompleted (success), ResultReason.Canceled (failure). On cancellation, inspect CancellationDetails for error code and message.
  • Batch synthesis — For large documents, split into paragraphs, synthesize each, then concatenate audio files (using pydub or ffmpeg).
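
A sketch wiring the four synthesis events for progress tracking (handler bodies are illustrative):

# events_sketch.py: attach progress handlers to a SpeechSynthesizer
import azure.cognitiveservices.speech as speechsdk

def attach_progress(synth: speechsdk.SpeechSynthesizer) -> None:
    """Log each stage of synthesis as the SDK reports it."""
    synth.synthesis_started.connect(lambda evt: print("synthesis started"))
    synth.synthesizing.connect(
        lambda evt: print(f"received {len(evt.result.audio_data)} bytes"))
    synth.synthesis_completed.connect(lambda evt: print("synthesis completed"))
    synth.synthesis_canceled.connect(
        lambda evt: print("canceled:", evt.result.cancellation_details.error_details))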

6. Web App Integration

Why: The generated audio needs to be served and played in a browser.

  • HTML5 <audio> element — Native browser audio player. Supports MP3, WAV, OGG. Use controls attribute for play/pause/seek UI. Use preload="metadata" to load duration without downloading full file.
  • Streaming vs Download — For short clips (<5 min), direct blob URL works. For longer audio, consider Azure Media Services or progressive download with range requests.
  • Flask integration — Create a route that accepts text input (POST), calls the synthesis pipeline, uploads to Blob Storage, and returns the audio URL. Render the audio player in a Jinja2 template.
  • Responsive design — The audio player should work on mobile. HTML5 <audio> is responsive by default but style with CSS for consistency.


TODO / Remaining Work

  • [ ] Implement the text-to-audio conversion script using Azure Speech SDK
  • [ ] Test with different neural voices and compare quality
  • [ ] Implement SSML support for fine-grained control
  • [ ] Set up Azure Blob Storage and implement upload logic
  • [ ] Build the Flask route and audio player page
  • [ ] Add batch processing for long documents
  • [ ] Document cost analysis with real usage numbers
  • [ ] Add architecture diagram (Mermaid) of the full pipeline
  • [ ] Add screenshots of the working web app
  • [ ] Change status from workinprogress to published

Reference Implementation

A self-contained Python module that implements steps 3-7 of the plan above. Drop it into a Flask route or call from a background worker.

# tts_pipeline.py
import os, uuid, tempfile
import azure.cognitiveservices.speech as speechsdk
from azure.storage.blob import BlobServiceClient, ContentSettings

SPEECH_KEY    = os.environ["AZURE_SPEECH_KEY"]
SPEECH_REGION = os.environ["AZURE_SPEECH_REGION"]
STORAGE_CONN  = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
CONTAINER     = os.environ.get("AUDIO_CONTAINER", "audio-files")

def synthesize(text: str, voice: str = "en-US-JennyNeural") -> str:
    """Synthesize text to MP3 and upload to Blob Storage. Returns the public URL."""
    cfg = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
    cfg.speech_synthesis_voice_name = voice
    cfg.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Audio24Khz96KBitRateMonoMp3
    )

    # create a temp path, then close our handle so the Speech SDK can open it
    # (keeping a NamedTemporaryFile open while the SDK writes fails on Windows)
    fd, tmp_path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)
    audio_cfg = speechsdk.audio.AudioOutputConfig(filename=tmp_path)
    synth = speechsdk.SpeechSynthesizer(speech_config=cfg, audio_config=audio_cfg)
    result = synth.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        os.unlink(tmp_path)  # don't leak the temp file on failure
        raise RuntimeError(f"TTS failed: {result.cancellation_details.reason}")

    blob_name = f"{uuid.uuid4().hex}.mp3"
    svc = BlobServiceClient.from_connection_string(STORAGE_CONN)
    blob = svc.get_blob_client(container=CONTAINER, blob=blob_name)
    with open(tmp_path, "rb") as f:
        blob.upload_blob(
            f, overwrite=True,
            content_settings=ContentSettings(content_type="audio/mpeg")
        )
    os.unlink(tmp_path)
    return blob.url

def synthesize_ssml(ssml: str) -> str:
    """Same as synthesize() but for SSML input — use for fine prosody control."""
    cfg = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
    cfg.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Audio24Khz96KBitRateMonoMp3
    )
    fd, tmp_path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)
    synth = speechsdk.SpeechSynthesizer(
        speech_config=cfg,
        audio_config=speechsdk.audio.AudioOutputConfig(filename=tmp_path),
    )
    result = synth.speak_ssml_async(ssml).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        os.unlink(tmp_path)
        raise RuntimeError(f"TTS failed: {result.cancellation_details.reason}")

    # upload mirrors synthesize() above
    blob_name = f"{uuid.uuid4().hex}.mp3"
    svc = BlobServiceClient.from_connection_string(STORAGE_CONN)
    blob = svc.get_blob_client(container=CONTAINER, blob=blob_name)
    with open(tmp_path, "rb") as f:
        blob.upload_blob(
            f, overwrite=True,
            content_settings=ContentSettings(content_type="audio/mpeg"),
        )
    os.unlink(tmp_path)
    return blob.url

Flask Route

from flask import Blueprint, request, render_template
from tts_pipeline import synthesize

tts_bp = Blueprint("tts", __name__)

@tts_bp.post("/tts")
def tts():
    text = request.form["text"][:5000]  # cap input
    voice = request.form.get("voice", "en-US-JennyNeural")
    url = synthesize(text, voice)
    return render_template("audio_player.html", audio_url=url)

Template Snippet

<audio controls preload="metadata" src="{{ audio_url }}">
  Your browser doesn't support HTML5 audio.
</audio>

When every TODO is ticked and the pipeline reliably produces audio for 10 representative inputs, flip this post to status: published.