I have a small problem with the new "Speech to Text" module. In and of itself, it works much better than expected, but I have run into trouble at one point. I wanted to apply the module to a longer conversation between four people. Beforehand, I had separated the audio and created one track per participant, each containing only what that person says. I thought this would make it easier for the new feature to tell the speakers apart. Unfortunately, there is no option to create a transcript across several tracks at the same time and to distinguish the speakers on that basis. As a workaround, I tried transcribing the tracks one by one and then merging them. This did not work well either, because it simply placed the text of each new track underneath the existing text in the transcript, which made the transcript messy. Furthermore, I would like to know whether this feature is also planned for Audition. I would like to use it for my podcast productions directly in that programme.
Another point: if I want to change something in the transcript and double-click on the passage in question, I completely lose my place in the text, because the words are no longer displayed the same way as before. So I have to look for the passage again in order to change it. That is rather inconvenient.
I would experiment by adding the 4 tracks to a sequence and nesting the audio in the sequence you want the captions in. I know there has been an issue with nested audio, so I don't know if that will work for other reasons.
Are you on a version of the beta that gives you the option of identifying speakers? It used to, then that changed in some relatively recent version, and the transcription did not seem nearly as good to me afterwards. But I haven't tested it enough.
The new caption workflow, and accordingly transcription, is not multi-track friendly.
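If merging the per-track transcripts outside the application is an option, the interleaving step from the workaround in the first post could look roughly like the sketch below. It does not use any Premiere Pro or Audition API. It assumes each track's transcript has already been turned into a list of (start time in seconds, text) pairs, and the speaker names and sample lines are made up for illustration. The function names `merge_tracks` and `format_transcript` are hypothetical.

```python
import heapq
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # seconds from the start of the recording
    speaker: str   # label taken from the track name (assumed)
    text: str


def merge_tracks(tracks: dict[str, list[tuple[float, str]]]) -> list[Segment]:
    """Interleave per-speaker segments into one list ordered by start time."""
    # Label every segment with its speaker and sort each track on its own,
    # so that heapq.merge receives already-sorted inputs.
    labelled = (
        sorted(Segment(start, speaker, text) for start, text in segments)
        for speaker, segments in tracks.items()
    )
    return list(heapq.merge(*labelled))


def format_transcript(segments: list[Segment]) -> str:
    """Render merged segments as '[mm:ss] Speaker: text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


# Made-up sample data standing in for two per-track transcripts.
tracks = {
    "Anna": [(0.0, "Welcome to the show."), (12.5, "Let's start with the news.")],
    "Ben": [(4.2, "Thanks for having me.")],
}

print(format_transcript(merge_tracks(tracks)))
```

Ordering by start time is what keeps the result from ending up as one speaker's text stacked under another's. How to get from the exported transcript file to the (start time, text) pairs is left open here, since it depends on the export format.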