Some people prefer lots of short single-line captions or subtitles with a bigger font; others prefer two lines with more text. At the moment I can't see any control over where the break points fall for a new caption when someone is speaking continuously, which means you would have to do a lot of work merging or splitting captions to fit your own style.
A reference is otter.ai, which transcribes and then spits out SRT captions. When it does so it asks you for:
- max number of lines (usually 2, but for some people it's 1)
- max characters per line (e.g. in the region of 20-30)
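
To make the request concrete, here is a rough sketch of how those two settings could drive caption breaks. It is only an illustration, not how otter.ai or any particular tool does it: the function name `split_into_captions` and its parameters `max_lines` and `max_chars` are made up for this example, and it ignores timestamps and sentence boundaries entirely.

```python
import textwrap


def split_into_captions(text, max_lines=2, max_chars=30):
    """Split continuous speech into caption blocks (hypothetical helper).

    Assumption: breaking only on whitespace is acceptable. A single word
    longer than max_chars is kept whole and will overflow its line.
    """
    # Greedily wrap the text into lines of at most max_chars characters.
    lines = textwrap.wrap(text, width=max_chars, break_long_words=False)
    # Group consecutive lines into captions of at most max_lines lines each.
    return [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]


if __name__ == "__main__":
    speech = (
        "Some people prefer lots of short single line captions "
        "with a bigger font, others prefer two lines with more text."
    )
    for caption in split_into_captions(speech, max_lines=2, max_chars=25):
        print(caption)
        print("---")
```

With `max_lines=1` the same text comes out as many short captions, and with `max_lines=2` as fewer, denser ones, which is the kind of style choice I'd like to be able to make up front instead of merging and splitting by hand.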