Participant
June 9, 2023
Open for Voting

Text Base Editing Multiple Speakers

  • 2 replies
  • 255 views

So I'm using the text-based editing AI, which needs improvement for multiple speakers. I'm working on a documentary, and it keeps putting two speakers under one speaker category. It's also not really detecting the different speakers: I have 6 people, and it put them down as only 2 speakers.

 

If anyone has any solution for the text-based AI grouping multiple speakers under one category, let me know.

2 replies

TDDC
Participant
January 1, 2024

Yeah, I'm trying to use it for long audio tracks with 4 speakers, and on every attempt so far, Premiere has transcribed it with only 2 different speakers. I can assure you the voices sound quite different. This same problem is why I stopped using Descript for transcriptions; they also often get the number of speakers wrong, at which point the time savings were badly diminished.

 

Now, Descript DOES have an option to let the user help label the speakers, by giving you a bunch of voice samples and having you label each one. That's a great idea, except their program chooses those samples automatically... and is TERRIBLE at it; often their set of samples misses one or more speakers entirely, while including brilliant clips such as a person eating chips during a break (no dialogue), or the old "20 seconds of silence and then the door opening". Their process would work great IF the user could manually select their own clips (which we can't in Descript), and IF you could include one or more speakers from a previous file so the program has additional data and gets better each time. Recognizing repeat speakers would be useful in many common scenarios, such as identifying the 4 regulars on a podcast, or identifying an interviewer who is the same in every source clip (and therefore more easily knowing which lines come from a guest, who may be a new voice each time).

 

But Adobe could blow the competition out of the water by incorporating those features:

A) [bare minimum] Let the user (optionally) provide their own short clips of each speaker to help 'prime' the software to recognize those voices. This would also let the user test on a smaller subsection to get all the labels right first, rather than transcribing the whole thing only to find misidentified or completely missed speakers (as happens now).

 

and ideally

B) [shoot for the moon] Have the option to save voice data for one or more recurring speakers, so that each user's transcriptions get better over time.
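For anyone curious what the 'priming' idea in (A) looks like under the hood, here's a minimal sketch of speaker enrollment: you keep one reference voice embedding per known speaker and label each new segment by its closest match. This assumes you already have fixed-length embeddings from some voice model; the names, vectors, and threshold below are made up purely for illustration, not how Premiere or Descript actually work.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_segment(segment_emb, enrolled, threshold=0.6):
    """Label a segment by its closest enrolled speaker embedding.

    enrolled: dict mapping speaker name -> reference embedding.
    Returns the best-matching name, or "Unknown" if nothing clears
    the similarity threshold (likely a new voice, e.g. a guest).
    """
    best_name, best_sim = "Unknown", threshold
    for name, ref in enrolled.items():
        sim = cosine(segment_emb, ref)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

# Toy 3-dim "embeddings" standing in for real model output.
enrolled = {
    "Host":  np.array([1.0, 0.1, 0.0]),
    "Guest": np.array([0.0, 1.0, 0.2]),
}
print(label_segment(np.array([0.9, 0.2, 0.0]), enrolled))  # close to Host
print(label_segment(np.array([0.1, 0.1, 1.0]), enrolled))  # matches nobody -> Unknown
```

The nice property of this scheme is exactly what (B) asks for: the enrolled dict can be saved and reused across projects, so a recurring interviewer or podcast regular is recognized automatically and only genuinely new voices fall out as "Unknown".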

Stan Jones
Community Expert
June 9, 2023

SockBox,

 

Is this a video/audio file with one stereo track?

 

Can you provide a 2-3 minute sample where this happens?

 

@TeresaDemel @Kevin-Monahan 

 

Stan