In the broadcast world, our captions need to denote when music is playing, as well as ambient sounds such as laughter and applause. It would be wonderful if we could run the transcription feature on a finished show and have it note music and SFX in the transcript, which could then carry over to the captions we create from that transcript.
It would also be great if the system could learn speakers who appear regularly. For example, if our show hosts are the same every week, the AI could learn to recognize each voice from show to show and reflect that in the transcript and captions.
Thanks for your consideration.