How can I generate complex scenes in Firefly without the system blocking or failing?

You can't create videos with Kling in Adobe!
I’m trying to generate a short (5-second) video of a group of warriors gathered around a central fire inside a stone hall. The scene includes only subtle movement (for example, small steps, slight body motion, or simple reactions), but the system either fails to generate or oversimplifies the result.
I’ve already simplified the prompts, reduced movement, and avoided complex actions, but generation still struggles when multiple characters share the same space.
My questions are:
What are the current limitations when generating multi-character scenes with interaction, and how can I improve stability and control in these cases?
Are there specific prompt structures or constraints that work better for this type of scene?
Any practical insight from users who have achieved stable multi-character results would be appreciated.
