VTML in Audition text to speech?

Report · Apr 16, 2018

Just read about vtml tags that can add emphasis and pauses to make TTS sound less like a robot.

I've tried it out in Audition CC and it seems to have no effect.

Example:

<vtml_pitch value ="150>Fact</vtml_pitch>: <vtml_pause time="500"/>Employees who are auditory learners, train faster and more effectively, by listening to iSpeech text to speech, inside of e-learning courses.

Am I missing something to make this work? If the feature isn't available in Audition, what other tools would provide the service?

Report · Apr 16, 2018

Corrected to <vtml_pitch value ="150"> but still no effect

Report · Apr 16, 2018

TTS is very primitive in Audition and relies mainly on your operating system for what can be done. It uses the built-in TTS provided by the OS to generate speech, so the functionality differs depending on whether you are on a Mac or a PC. I don't know if inputting text into Audition's TTS will actually pass on any VTML tags to the OS speech generator.

Report · Apr 16, 2018

VTML probably won't work, but as far as I'm aware, VoiceXML does... see Speech Synthesis Markup Language (SSML) Version 1.1

Report · Apr 17, 2018

I tried a VoiceXML text (this is in Audition CC on Windows 10) and it has no effect on the output. Below is the text I entered (copied from emphasis | VoiceXML Language Reference—Part 1 | InformIT). It sounds identical to just "The most important thing. The least important thing.":

<?xml version="1.0" encoding="iso-8859-1"?>

<block>

The <emphasis level="strong"> most </emphasis> important thing.<break/>The <emphasis level="reduced"> least </emphasis> important thing. </prompt>

</block>

</form>

</vxml>

Report · Apr 17, 2018

You may have to alter the 'narrator' settings in the OS - voices can be set not to respond. I should say though that when I tried this before, it took quite a while to get it to work correctly, because the OS seems to be rather intransigent when it comes to changing settings...

Report · Apr 17, 2018

Hi @SteveG,

Did you manage to get vtml or voiceXML working correctly? Was it in Windows 10? Can you give some indication as to where to look? Would an iPhone work better for this task?

Report · Apr 17, 2018

I tried this ages ago (actually when it was first introduced) and had a lot of trouble trying to get it to recognise anything other than the two voices that appear as defaults, and those ones I don't seem to be able to alter. On the present release, there seems to be something of a disconnect between what Audition is prepared to work with, and what the OS thinks is available. This has never been an entirely successful feature, and I suspect that it's dropped a little lower in the 'must fix' pile for several reasons; not the least of those being that you can't necessarily legally release anything you create with it.

What should happen is that in the OS 'settings' page there's an option called 'narrator' and this is where you can select a default voice and control aspects of its behaviour. At present, Audition doesn't appear to be responding to any changes made here at all. Further investigation required...

Report · Sep 30, 2018

I have been playing with Generate Speech in Audition CC 11.1 for the Mac. I am running macOS 10.12.6 on a Mac Pro. Many of the Embedded Speech Commands described in the Apple Speech Synthesis Programming Guide don’t seem to work, or I don’t know enough about the coding to get them to work. But there are a few that do work:

slnc: does work well to insert silence. The format is [[slnc xxx]] where xxx represents the length of the pause in milliseconds. I find [[slnc 120]] works as a slight pause while [[slnc 300]] is good to separate sentences. I’ve gone as high as [[slnc 650]] in some dialog.
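For anyone scripting their text before pasting it into Generate Speech, here is a minimal Python sketch of the slnc idea above. The helper name `join_with_silence` and the 300 ms default are just illustrative choices, not anything from Audition or Apple; it only builds the string.

```python
def join_with_silence(sentences, pause_ms=300):
    """Join sentences with an embedded [[slnc N]] command between them.

    pause_ms is the pause length in milliseconds, as described in the post.
    """
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    separator = f" [[slnc {int(pause_ms)}]] "
    return separator.join(s.strip() for s in sentences)


text = join_with_silence(
    ["The most important thing.", "The least important thing."]
)
print(text)
# The most important thing. [[slnc 300]] The least important thing.
```

The result is plain text, so it can be pasted straight into the Generate Speech text box.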

rate: is very good for slowing down a single word or a few words for emphasis. The format is [[rate xxx]] where xxx is words per minute. Here is an example: [[rate 120]] He [[rate 158]] is the victim. This elongates and emphasizes “He.” Once it is used it must be cancelled by a second rate command to return to the normal wpm. In this case the rate in the Generate Speech tab was set to 158 wpm.
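Since the rate command has to be cancelled by hand, a small wrapper can make sure the reset is never forgotten. This is only a sketch: `emphasize_by_rate` is a made-up helper, and the 120 and 158 wpm defaults are simply the values from the example above, so adjust them to whatever your Generate Speech tab is set to.

```python
def emphasize_by_rate(word, slow_wpm=120, normal_wpm=158):
    """Slow one word (or phrase) down, then restore the normal rate.

    normal_wpm should match the rate set in the Generate Speech tab.
    """
    if slow_wpm <= 0 or normal_wpm <= 0:
        raise ValueError("rates must be positive words-per-minute values")
    return f"[[rate {slow_wpm}]] {word} [[rate {normal_wpm}]]"


line = emphasize_by_rate("He") + " is the victim."
print(line)
# [[rate 120]] He [[rate 158]] is the victim.
```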

volm: The initial volume is determined by the slider in the Generate Speech tab, and text without any volm command is spoken at that percentage. The command moves the volume up and down between 10% and 100%. The form is [[volm 0.x]] where x is an integer between 1 and 9; 1.0 also works. The code must have the 0 before the decimal point, as in 0.7. Two-decimal values such as 0.75 are not recognized. The setting 1.0 is higher than the 100% set in the Generate Speech tab. Depending on the voice, a 100% setting in the tab corresponds to about [[volm 0.9]]. The volume setting is not just for the next word but stays in place until another [[volm 0.x]] command is given. I have not found that [[volm + 0.1]] or just [[volm +]] works on my machine. It is supposed to increase or decrease the volume relative to its current value, but no go for me.
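The formatting quirks described above (leading 0, one decimal place only, 0.1 to 1.0) are easy to get wrong by hand, so here is a hedged Python sketch that enforces them. The function name `volm` and its checks reflect only what was observed on this one machine; other voices or OS versions may accept more.

```python
def volm(level):
    """Format an absolute volume command such as [[volm 0.7]].

    Accepts 0.1 through 1.0 in steps of 0.1; anything else is rejected
    because finer values such as 0.75 were not recognized in testing.
    """
    tenths = round(level * 10)
    if abs(level * 10 - tenths) > 1e-9:
        raise ValueError("use one decimal place only, e.g. 0.7")
    if not 1 <= tenths <= 10:
        raise ValueError("level must be between 0.1 and 1.0")
    return f"[[volm {tenths / 10:.1f}]]"


print(volm(0.7), "louder here", volm(0.4), "quieter here")
# [[volm 0.7]] louder here [[volm 0.4]] quieter here
```

Because the setting stays in place until the next volm command, it is worth ending a passage with a second call that restores the level you started from.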

pbas: (pitch baseline) works but varies depending on the voice chosen. The form is [[pbas xxx]] where xxx can have a low of about 45 and a high of about 350. I find that the value needed to return to the normal pitch varies by voice but is generally within the 100 to 150 region. Since I am using this in Audition, I find it easier to use Pitch Bender on the actual file.
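If you do want to try pbas in the text rather than Pitch Bender, a sketch like the following keeps the change and the reset together. The helper `with_pitch` is hypothetical, and the `baseline=120` default is purely a guess inside the 100 to 150 region mentioned above; you would need to find the right value for your voice by ear.

```python
def with_pitch(phrase, pitch, baseline=120):
    """Speak phrase at a different pitch, then return to the baseline.

    baseline is an assumed "normal" value; it varies by voice.
    The 45-350 limits are the approximate working range reported above.
    """
    if not 45 <= pitch <= 350:
        raise ValueError("pitch outside the roughly 45-350 working range")
    return f"[[pbas {pitch}]] {phrase} [[pbas {baseline}]]"


print(with_pitch("Really?", 200))
# [[pbas 200]] Really? [[pbas 120]]
```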

emph: The format here is [[emph +]] or [[emph -]], but it does nothing on my computer. I have used other means of emphasis instead, such as volm, rate, and punctuation.

punctuation: depending on the text, changing the conventional punctuation often helps enormously. For example, add a period in the middle of a sentence, or try commas, semicolons, colons, exclamation points and question marks.

I’d be delighted to hear: how to implement the other listed, OS X embedded [[slnc 150]] speech commands.

Report · Sep 01, 2021

Thanks for sharing. This rare bit of info has helped me a lot in an experiment I'm trying for text to speech.

Kevin

Kevin Monahan - Sr. Community & Engagement Strategist – Pro Video and Audio