Thank you for your message. I am sorry you are having this problem.
Your prompt is too long and too conversational; it reads like a story, and the model will not understand it well. Put the most important object as early (as far left) in the prompt as possible, and group everything related to that object together.
"From inside the living room of a middle-class house, we see a couple and two children entering the house carrying suitcases; exterior direction towards the interior of the house"
Remove the extraneous fluff from the prompt. No one will recognize or identify a "living room of a middle-class house". Simplify it to: "Family entering front door of house with two children; carrying suitcases; point of view from inside house; subjects facing front towards camera"
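
If you build prompts in a script, the same ordering can be enforced mechanically. The following is only a rough sketch in Python: the function name build_prompt and its parameters are made up for illustration and are not part of any image-generation tool. It puts the main subject first, then the details that belong to it, then the camera/viewpoint notes, and joins them with semicolons.

```python
def build_prompt(subject, details, viewpoint):
    """Assemble a prompt with the main subject first (hypothetical helper)."""
    # Order matters: subject, then its details, then viewpoint notes.
    parts = [subject, *details, *viewpoint]
    # Drop empty fragments and trim stray whitespace.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return "; ".join(cleaned)


prompt = build_prompt(
    subject="Family entering front door of house with two children",
    details=["carrying suitcases"],
    viewpoint=[
        "point of view from inside house",
        "subjects facing front towards camera",
    ],
)
print(prompt)
```

Keeping the fragments in separate arguments makes it harder to slip back into story-style sentences, since each piece has to stand alone as a short phrase.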