>Borrowed with love from Serge Retkowsky at https://github.com/retkowsky/Azure-OpenAI-demos

# Text to Speech avatar

### From Azure Speech Service
Custom text to speech avatar allows you to create a customized, one-of-a-kind synthetic talking avatar for your application. With custom text to speech avatar, you can build a unique and natural-looking avatar for your product or brand by providing video recording data of your selected actors. If you also create a custom neural voice for the same actor and use it as the avatar's voice, the avatar will be even more realistic.

<img src="https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/media/custom-avatar-workflow.png" alt="Custom text to speech avatar workflow">

* https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/batch-synthesis-avatar
* https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-custom-text-to-speech-avatar 
* https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448

In [1]:
#%pip install requests python-dotenv

In [None]:
# Assumes you have an .env file in the current directory
# It should look like this:
#
# SPEECH_KEY=266xxxxxxxxxxxxxxxxe4f
# SPEECH_REGION=southeastasia

%reload_ext dotenv
%dotenv

In [2]:
import datetime
import json
import requests
import sys
import time
import logging
import os

from IPython.display import display, Video
from pathlib import Path

In [3]:
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="[%(asctime)s] %(message)s",
    datefmt="%m/%d/%Y %I:%M:%S %p %Z",
)
logger = logging.getLogger(__name__)

In [4]:
# Azure Speech config
azure_speech_key = os.getenv('SPEECH_KEY')
azure_speech_region = os.getenv('SPEECH_REGION')
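
`os.getenv` returns `None` when a variable is missing, and the problem then only shows up later as a malformed URL or a rejected request. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of the original notebook, and the example passes an explicit mapping so it runs without a real `.env` file.

```python
import os


def require_env(names, environ=None):
    """Return the values for `names`, raising if any is missing or empty."""
    # Hypothetical helper: `environ` defaults to the real process environment
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Example with an explicit mapping instead of the real environment
config = require_env(
    ["SPEECH_KEY", "SPEECH_REGION"],
    environ={"SPEECH_KEY": "dummy-key", "SPEECH_REGION": "southeastasia"},
)
print(config["SPEECH_REGION"])
```

In the notebook itself, calling `require_env(["SPEECH_KEY", "SPEECH_REGION"])` right after `%dotenv` would fail fast if the `.env` file was not picked up.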

In [5]:
service_host = "customvoice.api.speech.microsoft.com"  # Do not change
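
The functions in the next section each rebuild the same endpoint URL from the region, the host, and the API version. One way to keep that in a single place is sketched here; `avatar_endpoint` is a hypothetical helper name, and the URL shape is copied from the notebook's own f-strings rather than from any other source.

```python
def avatar_endpoint(region, host, job_id=None, api_version="3.1-preview1"):
    """Build the batch avatar synthesis URL used throughout this notebook."""
    base = (
        f"https://{region}.{host}"
        f"/api/texttospeech/{api_version}/batchsynthesis/talkingavatar"
    )
    # Without a job ID the URL addresses the collection (submit or list);
    # with one it addresses a single job (status lookup).
    return base if job_id is None else f"{base}/{job_id}"


collection_url = avatar_endpoint("southeastasia", "customvoice.api.speech.microsoft.com")
job_url = avatar_endpoint(
    "southeastasia", "customvoice.api.speech.microsoft.com", job_id="my-job-id"
)
print(collection_url)
print(job_url)
```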

## Functions

In [6]:
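def submit_synthesis(prompt):
    """Submit a batch avatar synthesis job for `prompt`.

    Returns the job ID on success, or None if the request failed.
    """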
    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar"

    header = {
        "Ocp-Apim-Subscription-Key": azure_speech_key,
        "Content-Type": "application/json",
    }

    payload = {
        "displayName": "Simple avatar synthesis",
        "description": "Simple avatar synthesis description",
        "textType": "PlainText",
        "synthesisConfig": {
            "voice": "ro-RO-AlinaNeural",
        },
        "customVoices": {
            # "YOUR_CUSTOM_VOICE_NAME": "YOUR_CUSTOM_VOICE_ID"
        },
        "inputs": [
            {
                "text": prompt,
            },
        ],
        "properties": {
            "customized": False,  # set to True if you want to use customized avatar
            "talkingAvatarCharacter": "lisa",  # talking avatar character
            "talkingAvatarStyle": "technical-standing",  # talking avatar style, required for prebuilt avatar, optional for custom avatar
            "videoFormat": "webm",  # mp4 or webm, webm is required for transparent background
            "videoCodec": "vp9",  # hevc, h264 or vp9, vp9 is required for transparent background; default is hevc
            "subtitleType": "soft_embedded",
            "backgroundColor": "transparent",
        },
    }

    response = requests.post(url, data=json.dumps(payload), headers=header)

    if response.status_code < 400:
        job_id = response.json()["id"]
        logger.info("Batch avatar synthesis job submitted successfully")
        logger.info(f"Job ID: {job_id}")
        return job_id

    # Any 4xx/5xx response: log the body and signal failure to the caller
    logger.error(f"Failed to submit batch avatar synthesis job: {response.text}")
    return None
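
The comments in the payload note that a transparent background requires `webm` and `vp9`. A small sketch of how that constraint could be checked before a job is submitted follows; `build_avatar_payload` is a hypothetical helper, its field names are copied from the payload above, and the check reflects only what those comments state.

```python
def build_avatar_payload(
    text,
    voice="ro-RO-AlinaNeural",
    character="lisa",
    style="technical-standing",
    video_format="webm",
    video_codec="vp9",
    background="transparent",
):
    """Build the request body used by submit_synthesis, with a basic sanity check."""
    # Per the comments in submit_synthesis, transparency requires webm + vp9
    if background == "transparent" and (video_format, video_codec) != ("webm", "vp9"):
        raise ValueError(
            "transparent background requires videoFormat 'webm' and videoCodec 'vp9'"
        )
    return {
        "displayName": "Simple avatar synthesis",
        "description": "Simple avatar synthesis description",
        "textType": "PlainText",
        "synthesisConfig": {"voice": voice},
        "customVoices": {},
        "inputs": [{"text": text}],
        "properties": {
            "customized": False,
            "talkingAvatarCharacter": character,
            "talkingAvatarStyle": style,
            "videoFormat": video_format,
            "videoCodec": video_codec,
            "subtitleType": "soft_embedded",
            "backgroundColor": background,
        },
    }


payload = build_avatar_payload("Hello from the avatar")
print(payload["properties"]["videoFormat"])
```

Keeping the payload in its own function also makes it possible to inspect or unit-test the request body without calling the service.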

In [7]:
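def get_synthesis(job_id):
    """Return the status of job `job_id`, or None if the request failed.

    When the job has succeeded, the download URL is stored in the global `avatar_url`.
    """
    global avatar_url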
    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar/{job_id}"

    header = {"Ocp-Apim-Subscription-Key": azure_speech_key}

    response = requests.get(url, headers=header)

    if response.status_code < 400:
        logger.debug("Get batch synthesis job successfully")
        logger.debug(response.json())

        status = response.json()["status"]

        if status == "Succeeded":
            avatar_url = response.json()["outputs"]["result"]
            logger.info(f"Batch synthesis job succeeded, download URL: {avatar_url}")

        return status
    else:
        logger.error(f"Failed to get batch synthesis job: {response.text}")
        return None
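
`get_synthesis` passes the download URL back through a global variable. As an alternative, the status and URL can be returned together. The sketch below works on an already-decoded response body so it runs without the service; `parse_job_status` is a hypothetical name, the sample dictionaries are made up for illustration, and only the `status` and `outputs.result` keys used above are assumed.

```python
def parse_job_status(body):
    """Return (status, download_url) from a decoded job response.

    download_url is None unless the job has succeeded.
    """
    status = body["status"]
    url = body["outputs"]["result"] if status == "Succeeded" else None
    return status, url


# Illustrative sample bodies, not real service responses
running = {"status": "Running"}
done = {"status": "Succeeded", "outputs": {"result": "https://example.invalid/0001.webm"}}

print(parse_job_status(running))
print(parse_job_status(done))
```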

In [8]:
def list_synthesis_jobs(skip: int = 0, top: int = 100):
    """Log one page of batch synthesis jobs: up to `top` jobs, starting at offset `skip`"""

    url = f"https://{azure_speech_region}.{service_host}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar?skip={skip}&top={top}"

    header = {"Ocp-Apim-Subscription-Key": azure_speech_key}

    response = requests.get(url, headers=header)

    if response.status_code < 400:
        logger.info(
            f'List batch synthesis jobs successfully, got {len(response.json()["values"])} jobs'
        )
        logger.info(response.json())
    else:
        logger.error(f"Failed to list batch synthesis jobs: {response.text}")
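
`list_synthesis_jobs` logs the whole response, which becomes hard to read with many jobs. A minimal sketch of a per-status summary follows. It assumes that each entry under `values` carries a `status` field like the single-job response does; that is an assumption, not something the notebook confirms, and `summarize_jobs` and the sample listing are hypothetical.

```python
from collections import Counter


def summarize_jobs(listing):
    """Count jobs per status in a decoded list response."""
    # Assumption: each entry in "values" has a "status" key
    return Counter(job.get("status", "Unknown") for job in listing.get("values", []))


# Illustrative sample, not a real service response
sample_listing = {
    "values": [
        {"id": "a", "status": "Succeeded"},
        {"id": "b", "status": "Running"},
        {"id": "c", "status": "Succeeded"},
    ]
}

summary = summarize_jobs(sample_listing)
print(dict(summary))
```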

## Test

In [15]:
from datetime import date

# Define the day names in Romanian
day_names_ro = ["Luni", "Marți", "Miercuri", "Joi", "Vineri", "Sâmbătă", "Duminică"]

# Get the current day of the week (0 = Monday, 1 = Tuesday, ..., 6 = Sunday)
today = date.today()
day_index = today.weekday()

# Get the corresponding day name in Romanian
day_name_ro = day_names_ro[day_index]

prompt = f"""
Buna, eu sunt Lisa, un avatar al serviciului Azure Speech.
Astăzi este {day_name_ro}.

Avatarul Text to Speech convertește un text într-un videoclip digital al unui om fotorealist (fie un avatar preconstruit, fie un avatar personalizat) vorbind cu o voce naturală.
Videoclipul avatar text to speech poate fi sintetizat asincron sau în timp real.
Dezvoltatorii pot construi aplicații integrate cu avatarul text to speech printr-un Api sau pot utiliza Speech Studio pentru a crea conținut video fără a scrie cod.

Cu modelele avansate de rețele neuronale ale avatarului text to speech, funcția permite utilizatorilor să ofere videoclipuri avatar de vorbire sintetica de înaltă calitate pentru diverse aplicații, aplicând practici AI responsabile.

Avatar text to speech este disponibil numai în următoarele regiuni Azure:
West US 2, Europa de Vest și Asia de Sud-Est.
"""

In [16]:
print(prompt)


Buna, eu sunt Lisa, un avatar al serviciului Azure Speech.
Astăzi este Marți.

Avatarul Text to Speech convertește un text într-un videoclip digital al unui om fotorealist (fie un avatar preconstruit, fie un avatar personalizat) vorbind cu o voce naturală.
Videoclipul avatar text to speech poate fi sintetizat asincron sau în timp real.
Dezvoltatorii pot construi aplicații integrate cu avatarul text to speech printr-un Api sau pot utiliza Speech Studio pentru a crea conținut video fără a scrie cod.

Cu modelele avansate de rețele neuronale ale avatarului text to speech, funcția permite utilizatorilor să ofere videoclipuri avatar de vorbire sintetica de înaltă calitate pentru diverse aplicații, aplicând practici AI responsabile.

Avatar text to speech este disponibil numai în următoarele regiuni Azure:
West US 2, Europa de Vest și Asia de Sud-Est.
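
The day name in the prompt comes from indexing a Monday-first list with `date.weekday()`. The sketch below checks that lookup against fixed dates rather than `date.today()`, so the result is reproducible; `day_name_ro_for` is a hypothetical wrapper around the same list.

```python
from datetime import date

DAY_NAMES_RO = ["Luni", "Marți", "Miercuri", "Joi", "Vineri", "Sâmbătă", "Duminică"]


def day_name_ro_for(day):
    """Return the Romanian name of the weekday for `day` (weekday() is 0 for Monday)."""
    return DAY_NAMES_RO[day.weekday()]


# 23 April 2024, the date of the run shown in this notebook, was a Tuesday
print(day_name_ro_for(date(2024, 4, 23)))
```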



## Avatar batch generation

In [17]:
start = time.time()

job_id = submit_synthesis(prompt)

if job_id is not None:
    while True:
        status = get_synthesis(job_id)
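        if status == "Succeeded":
            logger.info("Done! Azure batch avatar synthesis job succeeded.")
            elapsed = time.time() - start
            # Format the elapsed time as HH:MM:SS.ffffff
            hours, remainder = divmod(elapsed, 3600)
            minutes, seconds = divmod(remainder, 60)
            print(f"Elapsed time: {int(hours):02d}:{int(minutes):02d}:{seconds:09.6f}")
            break
        elif status == "Failed" or status is None:
            # None means the status request itself failed; stop instead of polling forever
            logger.error(f"Stopping. Status: [{status}]")
            break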
        else:
            logger.info(f"Please wait. Status: [{status}]")
            time.sleep(30)

[04/23/2024 01:15:57 AM EEST] Batch avatar synthesis job submitted successfully
[04/23/2024 01:15:57 AM EEST] Job ID: 454bbf7f-a797-4d62-876b-fedac15fa366
[04/23/2024 01:15:58 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:16:29 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:17:01 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:17:32 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:18:04 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:18:35 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:19:06 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:19:38 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:20:09 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:20:41 AM EEST] Please wait. Status: [Running]
[04/23/2024 01:21:12 AM EEST] Batch synthesis job succeeded, download URL: https://cvoiceprodsea.blob.core.windows.net/batch-synthesis-output/454bbf7f-a797-4d62-876b-fedac15fa366/0001.webm?skoid=85130dbe-2390-4897-a9e9-5c88bb59daff&skt
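
The polling loop above waits 30 seconds between checks and has no upper bound on total waiting time. A sketch of the same pattern with a timeout follows. `poll_until_done` is a hypothetical helper; the status callable, the sleep function, and the clock are injected so the example runs instantly with fake statuses instead of calling the service.

```python
import time


def poll_until_done(get_status, interval=30.0, timeout=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns "Succeeded" or "Failed".

    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after {timeout} seconds")
        sleep(interval)


# Fake status source standing in for get_synthesis(job_id)
fake_statuses = iter(["Running", "Running", "Succeeded"])
result = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(result)
```

In the notebook this could be called as `poll_until_done(lambda: get_synthesis(job_id))`, keeping the 30-second interval used above.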

## Avatar video file

In [18]:
print(f"\033[1;34mThis is the prompt to speak:\n {prompt}\033[0m")  # bold blue, then reset

This is the prompt to speak:
 
Buna, eu sunt Lisa, un avatar al serviciului Azure Speech.
Astăzi este Marți.

Avatarul Text to Speech convertește un text într-un videoclip digital al unui om fotorealist (fie un avatar preconstruit, fie un avatar personalizat) vorbind cu o voce naturală.
Videoclipul avatar text to speech poate fi sintetizat asincron sau în timp real.
Dezvoltatorii pot construi aplicații integrate cu avatarul text to speech printr-un Api sau pot utiliza Speech Studio pentru a crea conținut video fără a scrie cod.

Cu modelele avansate de rețele neuronale ale avatarului text to speech, funcția permite utilizatorilor să ofere videoclipuri avatar de vorbire sintetica de înaltă calitate pentru diverse aplicații, aplicând practici AI responsabile.

Avatar text to speech este disponibil numai în următoarele regiuni Azure:
West US 2, Europa de Vest și Asia de Sud-Est.



## Download avatar video

In [19]:
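# Note: the job requested videoFormat "webm", so the downloaded file is a WebM
# container even though it is saved here with an .mp4 extension.
!wget "{avatar_url}" -O avatar.mp4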

--2024-04-23 01:21:12--  https://cvoiceprodsea.blob.core.windows.net/batch-synthesis-output/454bbf7f-a797-4d62-876b-fedac15fa366/0001.webm?skoid=85130dbe-2390-4897-a9e9-5c88bb59daff&sktid=33e01921-4d64-4f8c-a055-5bdaffd5e33d&skt=2024-04-22T22%3A16%3A12Z&ske=2024-04-28T22%3A21%3A12Z&sks=b&skv=2023-11-03&sv=2023-11-03&st=2024-04-22T22%3A16%3A12Z&se=2024-04-23T22%3A21%3A12Z&sr=b&sp=rl&sig=bfWiGGKrdwgV%2FjCMiSWnX%2BEW3QHkqW2DqyuDgEArHg0%3D
Resolving cvoiceprodsea.blob.core.windows.net (cvoiceprodsea.blob.core.windows.net)... 20.60.139.161, 20.60.136.97, 20.150.127.161
Connecting to cvoiceprodsea.blob.core.windows.net (cvoiceprodsea.blob.core.windows.net)|20.60.139.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19709830 (19M) [application/octet-stream]
Saving to: ‘avatar.mp4’


2024-04-23 01:21:34 (940 KB/s) - ‘avatar.mp4’ saved [19709830/19709830]
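
The download URL in the log ends in `0001.webm`, while the file is saved as `avatar.mp4`. A sketch of a pure-Python download that takes the extension from the URL instead is shown below. `local_name_for` and `download` are hypothetical helpers built on the standard library; only the naming helper is exercised here, with a made-up URL, so the example needs no network access.

```python
import shutil
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def local_name_for(url, stem="avatar"):
    """Return a local file name whose extension matches the URL path."""
    # The query string (SAS token) is ignored; only the path is inspected
    suffix = PurePosixPath(urlparse(url).path).suffix
    return stem + suffix


def download(url, stem="avatar"):
    """Stream `url` to a local file named after its extension and return the name."""
    target = local_name_for(url, stem)
    with urlopen(url, timeout=60) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Made-up URL with the same shape as the one in the log above
name = local_name_for("https://example.invalid/batch-synthesis-output/job-id/0001.webm?sig=abc")
print(name)
```

In the notebook, `download(avatar_url)` would replace the `wget` call and return the name to pass on to `Video(...)`.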



## Play video

In [20]:
Video('avatar.mp4', width=960)
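
If the video does not play, it can help to check what container the file actually holds, since the job requested `webm` but the file carries an `.mp4` name. The sketch below inspects the leading bytes: Matroska and WebM files begin with the EBML magic bytes `1A 45 DF A3`, and MP4 files normally carry `ftyp` at byte offset 4. `sniff_container` is a hypothetical helper and only distinguishes these two cases.

```python
def sniff_container(first_bytes):
    """Guess the container type from the first bytes of a video file."""
    if first_bytes[:4] == b"\x1a\x45\xdf\xa3":
        return "webm/matroska"  # EBML header, shared by WebM and Matroska
    if first_bytes[4:8] == b"ftyp":
        return "mp4"
    return "unknown"


# Synthetic headers for illustration
print(sniff_container(b"\x1a\x45\xdf\xa3" + b"\x00" * 8))
print(sniff_container(b"\x00\x00\x00\x20ftypisom"))
```

For the downloaded file, reading the first 12 bytes with `open("avatar.mp4", "rb").read(12)` and passing them to `sniff_container` is enough.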