When you Create Captions, check the values you entered. (You can Create Captions from the same transcript multiple times with different values.) Try a larger value for the number of characters per line. To get the result you're seeing, you probably entered something like 10. Try 42 or larger. Also, you may have used a value like 2 for the minimum duration in seconds. Try at least 3.
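To see why the characters-per-line value matters so much, here is a rough sketch in Python. It is not how the captioning tool actually splits text; it just uses the standard `textwrap` module to show the general effect of a small versus a large limit. The phrase and the `split_caption` helper are made up for illustration.

```python
import textwrap

def split_caption(text, max_chars):
    """Break text into lines no longer than max_chars (hypothetical helper,
    only an approximation of what a captioning tool might do)."""
    return textwrap.wrap(text, width=max_chars)

# Made-up phrase standing in for the real lyric.
phrase = "Come along and sing with you"

for limit in (10, 42):
    lines = split_caption(phrase, limit)
    print(f"limit={limit}: {len(lines)} line(s) -> {lines}")
```

With the small limit the phrase is chopped into several short pieces; with the larger one it stays on a single line.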
The phrase you want to keep together ("Come" to "You") spans about 7 seconds. So if changing those values doesn't work, part of the problem may be that the words (which I assume are being sung) are spread out enough that the transcription isn't keeping them together. If that is the problem, you may need to merge transcript sections or captions.
Give us more information, and we can provide more thoughts.
Stan