Feature Focus: Transcript-based Lip Sync

Report · Jul 20, 2021

Get better lip sync with improved Adobe Sensei machine-learning technology. Use a transcript to produce a more accurate result.

Using a transcript to improve computed lip sync

Open Character Animator (Beta).
Create a scene from an example puppet on the Home screen (e.g., Chloe (Photoshop)) or open a scene containing one of your puppets.
Choose File > Import, then select the Toothsome Meme.wav audio file (from the Toothsome Meme.zip archive) to import it into the project.
With the audio selected in the Project panel, the Properties panel shows the Transcript text area where you can import or type in the text matching the spoken words and phrases in the audio. For this example, click Import in the Properties panel, then select the Toothsome Meme.txt text file (from the Toothsome Meme.zip archive).
The audio file’s icon in the Project panel changes to indicate it has a text transcript associated with it. The Type column in the Project panel shows Audio+Transcript for this file.
Drag the Toothsome Meme.wav file from the Project panel into the Timeline panel to add it to the scene.
Select the puppet track in the Timeline panel, hold down the Shift key as you select the audio track so that both tracks are selected, then choose the Timeline > Compute Lip Sync Take from Scene Audio and Transcript menu command.

Character Animator analyzes the audio and, using the associated transcript text, should produce more accurate visemes for the Lip Sync take than if no transcript was used.

If you need to make corrections to the transcript, update the text in the Transcript text area, and then choose the Compute Lip Sync Take from Scene Audio and Transcript command again.

Troubleshooting

If transcript-based lip sync fails:

Check your transcript for typos, missing words, or other mismatch errors.
Add timecodes to the transcript to allow the process to skip over sections with errors. For example, see the Toothsome Meme.srt file (from the Toothsome Meme.zip archive). You can then run standard audio-only lip sync to fill in the gaps.

You can type the timecodes manually or use a transcription program to generate an SRT file (.srt extension) with timecodes. For an .srt file, change its extension to .txt to select it for import, or copy and paste the text directly into the Transcript text area in the Properties panel.
Splice your audio file and transcript into shorter clips. This essentially does the same thing as the timecode approach above, allowing the process to fail for a limited section. You can either:

Splice the file in an audio editing program, import the tracks as separate files, and then import or paste your matching transcript sections in the Properties panel.
Import multiple copies of the same audio file and trim each one within the Character Animator scene. You need to import multiple audio files because the transcripts are linked at the file level, but your transcript text in the Properties panel should match the trimmed track (not the entire audio file).
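
The rename step for an .srt file can also be scripted. The following is a minimal Python sketch and is not part of Character Animator. The function name srt_to_transcript_txt is hypothetical, and the sketch assumes the source file is already UTF-8 (with or without a byte order mark). It writes a UTF-8 .txt copy next to the original so that the copy can be selected for import.

```python
from pathlib import Path


def srt_to_transcript_txt(srt_path, source_encoding="utf-8-sig"):
    """Write a UTF-8 .txt copy of an .srt file next to the original.

    Hypothetical helper. source_encoding is an assumption; "utf-8-sig"
    reads UTF-8 and drops a leading byte order mark if one is present.
    """
    src = Path(srt_path)
    text = src.read_text(encoding=source_encoding)
    dst = src.with_suffix(".txt")  # same name, .txt extension
    dst.write_text(text, encoding="utf-8")
    return dst
```

For example, srt_to_transcript_txt("Toothsome Meme.srt") would write Toothsome Meme.txt in the same folder and leave the .srt file untouched.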

Known issues and limitations

Transcript-based Lip Sync is still in development.

In this first public Beta version (v22.0.0.31), please note the following:

For audio files longer than about two minutes, add timecode to the transcript at least every two minutes or so, or use an SRT file.
Currently, only English is supported, though if another language was transcribed into text made up of words with typical English phonetics (even if they are not really words), you might be able to get reasonable results.
Avoid abbreviations and spell out symbols and acronyms if they are going to be spoken in the audio file. For example: dollar ($), four point five (4.5), Graphics (GFX).
There is a per-audio-segment progress bar, but it only shows progress for the process of resampling the audio, not for the actual phoneme alignment processing step, so lip sync computation might appear stuck for a bit at the end, particularly with clips in the 2 to 3 minute range.
While the transcript is associated with audio clips, the processing is performed on the rendered scene audio, so if audio overlaps clips being processed, they might interfere with getting good phoneme alignment results.

Transcript text files must use UTF-8 encoding.
Currently, Lip Sync preferences, and specifically the Viseme Detection setting, are not supported for transcript-based lip sync.
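
Before importing a transcript, it may help to scan it for text that will not be read the way it is written. The sketch below is a rough Python check, not an Adobe tool. The function name flag_unspoken_text and the three patterns are assumptions chosen to match the examples above (symbols, numbers, acronyms), and they will both miss cases and raise false alarms.

```python
import re

# Illustrative patterns only; adjust them for your own transcripts.
CHECKS = [
    ("digit", re.compile(r"\d+(?:[.,]\d+)*")),   # e.g. 4.5
    ("symbol", re.compile(r"[$%&@#+=]")),        # e.g. $
    ("acronym", re.compile(r"\b[A-Z]{2,}\b")),   # e.g. GFX
]


def flag_unspoken_text(transcript):
    """Return (line_number, label, matched_text) for each suspicious token."""
    findings = []
    for line_no, line in enumerate(transcript.splitlines(), start=1):
        stripped = line.strip()
        # Skip SRT timecode lines and bare cue numbers.
        if "-->" in stripped or stripped.isdigit():
            continue
        for label, pattern in CHECKS:
            for match in pattern.finditer(line):
                findings.append((line_no, label, match.group()))
    return findings
```

Each finding is a candidate to spell out by hand, for example replacing 4.5 with four point five. Note that a spoken line consisting only of a number is skipped, because it looks like an SRT cue number.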

What we want to know

We want to hear about your experience with Transcript-based Lip Sync:

What are your overall impressions?
Are you able to get more accurate lip sync results with a transcript?
Are there specific words or phrases that the computed lip sync is failing on?
How can we improve Transcript-based Lip Sync?

Also, we’d love to see what you create with Transcript-based Lip Sync. Share your animations on social media with the #CharacterAnimator hashtag.

Thank you! We’re looking forward to your feedback.

(Use this Beta forum thread to discuss Transcript-based Lip Sync and share your feedback with the Character Animator team and other Beta users. If you encounter a bug, let us know by posting a reply here or choosing Report a bug from the Provide feedback icon in the top-right corner of the app.)

Report · Aug 28, 2021

Has anyone had success with this? It seems to work on 5% of the audio, no matter the length/volume/stereo/mono etc...

Just always fails for me apart from on the demo file.

Report · Aug 30, 2021

Hi, thanks for giving it a try. Sorry it doesn't seem to be working on your audio. Does a marker get created in the timeline? What does the error message in it say?

Do you have an audio and transcript file you'd be willing to share? If so, zip the audio and transcript up and private message me a download link. I'd be happy to give it a try and see if I can figure out why it is failing.

Hopefully we can get to the bottom of the error you're hitting. Thanks for reporting back!

Dan Tull

Adobe Character Animator Team

Report · Sep 27, 2021

I have had the same issue: it didn't work well for me with an SRT file either.
What I ended up doing was removing the numbers but leaving the timecodes, then breaking the file up into pieces.

I.e., I copied only the first 5-6 seconds of text, converted it to visemes, then did the next 5-6 seconds, and so on until I got the entire audio working! Not ideal, but it got me what I needed.

I am not sure if there was something in the transcript causing the issue, or something else.
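
The "remove the numbers, keep the timecodes" step from this workaround can be sketched in Python. This is only an illustration of the manual edit described above. The function name strip_cue_numbers is hypothetical, and it assumes standard SRT timecode lines of the form 00:00:00,000 --> 00:00:02,000.

```python
import re

# A standard SRT timecode line, e.g. "00:00:00,000 --> 00:00:02,000".
TIMECODE = re.compile(
    r"\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)


def strip_cue_numbers(srt_text):
    """Drop SRT cue-number lines but keep timecodes and spoken text."""
    lines = srt_text.splitlines()
    kept = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # A cue number is a digits-only line directly above a timecode line.
        if line.strip().isdigit() and TIMECODE.match(next_line.strip()):
            continue
        kept.append(line)
    return "\n".join(kept)
```

A digits-only line that is part of the spoken text is kept, because it is not followed by a timecode line.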

Report · Sep 27, 2021

I've seen some cases where transcript lip sync can "drift" a bit for longer segments of text. The usual failure mode is a gap toward the end of the segment that doesn't get any visemes because the alignment was too aggressive and ran out of text before it ran out of audio. This is really common if there are other sounds mixed in with the audio, but it can happen even with just speech.

The current implementation has some initialization that happens per segment so it batches segments up to process more at once (by default it tries to make segments about 45-60 seconds long) to strike a balance between performance and precision.

When I lift that init code out so it runs once per invocation (or maybe even once per app session) it should be faster and based on your explanation above, it should get a better result for the SRT case, too. Another advantage is that if there's a piece that it struggles to align, it'll only lose the one timecode range (which for SRT can be just one short phrase).
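
To make the batching idea concrete, here is a simplified Python sketch of grouping timecoded cues into batches of bounded length. It is an illustration only and not the Character Animator implementation. The function name batch_cues, the (start, end, text) tuple layout, and the 60-second cap are all assumptions.

```python
def batch_cues(cues, max_seconds=60.0):
    """Group (start, end, text) cues into batches spanning at most max_seconds.

    Times are in seconds. A single cue longer than max_seconds still
    gets a batch of its own.
    """
    batches = []
    current = []
    for cue in cues:
        _start, end, _text = cue
        # Close the current batch if adding this cue would make it too long.
        if current and end - current[0][0] > max_seconds:
            batches.append(current)
            current = []
        current.append(cue)
    if current:
        batches.append(current)
    return batches
```

With larger batches, one hard-to-align phrase can cost the whole batch. Calling batch_cues(cues, max_seconds=0) puts every cue in its own batch, which mirrors the one-timecode-range-at-a-time behavior described above.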

That was probably too much technical detail, but hey it's a beta program and I figured I'd be as transparent about what's probably happening as possible. Thanks for the feedback!

Dan Tull

Adobe Character Animator Team

Report · Sep 30, 2021

It often misses W mouth shapes when they are at the start of a word. When there is silence, the first detected sound it picks up from the next word fills the space where the silence should be before that word. It seems to pick up accents quite well: for the word "warm" it went "ah r b" (again missing the first W), but it did get the "ah" in "warm" that the accent had. I find this happens often, which is very nice.

It FOR SURE saved time. A friend and I used to animate side by side, and we would often debate which was faster: editing a generated lip sync or laying one out fresh on a blank file as you hear it. They often took about the same length of time. So given that it fixes many issues, I feel it is much faster than before.

For three-letter words like "and", "the", "its", etc., I find it does 3 mouth shapes. It could just be my puppet, but 2 mouth shapes for 3-letter words is perfect. Any more and you get a muppet-style flappy mouth.

Report · Sep 30, 2021

Thanks for the feedback. I'll have to look at the starting W characters and short words. One thing we have definitely run into is that this way of generating lipsync can be a little too literal/exact. We've looked a little bit at filtering the result to try to make it less "chattery", but those methods still need work. They just reduce the maximum frequency of viseme changes, but aren't very smart about exactly which visemes are superfluous.
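
As an illustration of the kind of filter described above, the Python sketch below caps how often visemes can change by enforcing a minimum gap between kept visemes. It is a naive example and not the filtering used in Character Animator. The function name limit_viseme_rate, the (time, name) tuple layout, and the 0.1-second default are assumptions.

```python
def limit_viseme_rate(visemes, min_gap=0.1):
    """Keep a viseme only if at least min_gap seconds passed since the last kept one.

    visemes is a list of (time_in_seconds, name) tuples sorted by time.
    The filter looks only at timing, so it cannot tell an important
    viseme from a superfluous one.
    """
    kept = []
    for time, name in visemes:
        if kept and time - kept[-1][0] < min_gap:
            continue  # too soon after the previous kept viseme
        kept.append((time, name))
    return kept
```

Because the choice is purely by timing, a short but visually important shape (such as a W at the start of a word) is as likely to be dropped as a redundant one, which is the weakness mentioned above.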

Glad to know it is at least helpful in the time saving sense, that's a start. :o)

Dan Tull

Report · Oct 03, 2021

I am currently using this new feature with 2 puppets. One has a mouth made of cycle layers for each mouth shape, with 3 or 4 layers in each cycle. The "chattery-ness" seems far worse with this style of puppet. My other puppet has one mouth shape per sound, and there is a lot less clean-up involved. It would be nice to have a slider to control how many visemes show up per syllable/word. (By the way, thanks for all the hard work everyone puts into this software. You guys continue to blow my mind.)

Report · Oct 04, 2021

We have a simple implementation of supporting a preference for how many visemes are produced. It isn't very smart yet about which visemes it skips, but it might help for a case like this.

Thanks for the feedback and kind words. :o)

Dan Tull

Adobe Character Animator Team

Report · Oct 02, 2021

Yeah - it's a bit hit and miss. I got it to work yesterday but not today with the same puppet but longer audio. Hmmm.

Report · Oct 04, 2021

Out of curiosity, did it fail entirely on some parts (it'll usually put a marker on segments that failed), did it produce lower quality visemes, or was it maybe the issue a few folks have cited, where it aligns too aggressively and stops abruptly toward the end of the audio because it runs out of transcript? Just curious.

Thanks for giving it a try and reporting back.

Dan Tull

Adobe Character Animator Team

Report · Oct 19, 2021

Just downloaded the Beta and am giving it a try. I recorded a short audio clip (34 seconds) in Premiere, did the captions and transcription, created an SRT file, etc. The compute lip sync with transcript failed for me. See attached screenshot. I also tried removing the numbers from the SRT file, as another user did, but I still got the "Compute Lip Sync Failed" error message.

Screen Shot 2021-10-19 at 9.46.12 AM.png

Report · Oct 19, 2021

Version 22.1 Build 27

Report · Oct 19, 2021

If you're comfortable sharing the audio/srt with me (via a download link in a private message if you prefer), I can take a look and see if I can figure out why it is failing.

Dan Tull

Report · Oct 19, 2021

Happy to. I work at a school and use Character Animator for weekly advisory announcements. If you shoot me a private message, I'm happy to get you the files.

Report · Oct 19, 2021

Found it. My development version (with an enhancement I hope to release soon) does the SRT processing in smaller segments which makes it faster to home in on issues.

The problem seems to be that the last SRT segment is truncated. It says 00:00:34,368, but that is in the middle of the word you and cuts off the name at the end entirely. When I change that to 00:00:34,668, it doesn't fail. :o)
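
A quick way to spot this kind of mismatch is to compare the end time of the last SRT cue with the length of the audio. The Python sketch below does that for WAV files using the standard library. It is not an Adobe tool, the function names are hypothetical, and a leftover tail is only a hint: trailing silence is normal, so a small positive value does not prove the cue is truncated.

```python
import re
import wave

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def last_cue_end(srt_text):
    """Return the latest cue end time in seconds, or None if no cues are found."""
    ends = []
    for line in srt_text.splitlines():
        stamps = TIMESTAMP.findall(line)
        if "-->" in line and len(stamps) == 2:
            h, m, s, ms = (int(part) for part in stamps[1])
            ends.append(h * 3600 + m * 60 + s + ms / 1000)
    return max(ends) if ends else None


def wav_duration(wav_path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def uncovered_tail(srt_text, wav_path):
    """Seconds of audio after the last cue ends (negative if cues overrun)."""
    end = last_cue_end(srt_text)
    if end is None:
        return None
    return wav_duration(wav_path) - end
```

If the tail is longer than expected and speech is still audible there, extending the last cue's end timecode, as in the fix above, is worth trying.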

When I get this new version released, an error like that would only lose that last line, so it should be a lot easier to figure out what's going wrong. Coming soon!

DT

Report · Oct 19, 2021

Thank you! I appreciate the hard work and all of the new features.

Report · Oct 19, 2021

Just tried it, worked like a charm! I'll keep this in mind if I run into a similar issue prior to next version. Thank you again!

Report · Oct 19, 2021

Great! Glad it helped. Seeing more examples is a big help, so thanks for the report!

DT

Report · Oct 22, 2021

An update to transcript lipsync is in build 31 (pushed earlier today; you might have to poke at the CC app to get it to recognize that there's an update). Basically, if you are using an SRT transcript, it will now process each timecode-delimited part of the transcript separately. This means a few things:
• if it fails, it should tend to only lose a few words and it'll be more obvious which line tripped it up

• it should "drift" less because it has more timepoints to keep it lined up

• for really long audio+SRT, it should be a bit faster (I found a 13 minute public domain file with MLK's I have a dream speech and it was about 25% faster: 77 vs 106 seconds)

• the progress dialog will look a little weird though, it wasn't really meant for showing progress for a series of very small items, but that's cosmetic (the number that counts up in the dialog will still give you an idea of progress)

Hopefully that helps a bit. More to come.

DT

Report · Oct 30, 2021

Dan,

Quick question. I just sent a friend my Ch. Anim. file of a puppet I built plus the Ch Data and Ch Media folders and when they open the Ch Anim comp, they get the "missing file" color bars in the viewport. However, I am running a PC and they are on a Mac, so are they not compatible?

Report · Oct 31, 2021

Projects are stored in a platform-neutral format, so it should open on Mac or Windows. However, if artwork files for a puppet are not gathered into the project file (via the "Copy Media Files into Project Folder" command in the File menu), it may be unable to locate them on a different machine. Select the puppet (in the Project panel) that is showing up with "color bar" content in the scene and see if the puppet's artwork file is shown in orange. If so, the file is not in the expected location.

The resolution is either to run that "Copy Media" command before zipping up the project, or to make sure the artwork is provided as well and click the orange artwork path in the Properties panel to point Character Animator to where the artwork files reside.

Hope that helps!

Dan Tull

Character Animator Team
