Develop a low-resolution 3D representation of the scene so users can craft the camera motion with a camera rig.
The AI studies the start (and end) reference images and creates a rough, low-res 3D representation of the scene (Gaussian splats?). Adobe provides a viewer and tools with which you can guide the camera movement and timing: ease in and ease out, speed, arc, etc. The AI then executes the render based on those precise instructions. Text prompts alone aren’t enough. Currently the AI is very good at misunderstanding clear instructions and/or making artistically bad choices with the camera movement, which wastes credits.
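
To make the request concrete, here is a minimal sketch of the kind of explicit camera instruction a rig could hand to the renderer instead of a text prompt. Everything in it is hypothetical and for illustration only: `Keyframe`, `ease_in_out`, `camera_position`, and `arc_height` are made-up names, not an Adobe API, and smoothstep easing with a sine-shaped arc is just one simple choice among many.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical camera keyframe: a time in seconds and an (x, y, z) position."""
    time: float
    position: tuple


def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow at the start and end, fastest in the middle."""
    t = max(0.0, min(1.0, t))  # clamp so times outside the move hold the end poses
    return t * t * (3.0 - 2.0 * t)


def camera_position(start: Keyframe, end: Keyframe, time: float,
                    arc_height: float = 0.0) -> tuple:
    """Eased camera position between two keyframes, with an optional vertical arc."""
    u = ease_in_out((time - start.time) / (end.time - start.time))
    x, y, z = (a + (b - a) * u for a, b in zip(start.position, end.position))
    # sin(pi * u) is 0 at both ends and 1 at the midpoint, so the path
    # rises into an arc and settles back onto the end keyframe.
    y += arc_height * math.sin(math.pi * u)
    return (x, y, z)


# Example: a 4-second move with a gentle half-unit rise in the middle.
start = Keyframe(0.0, (0.0, 1.0, 5.0))
end = Keyframe(4.0, (2.0, 1.0, 3.0))
for t in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(t, camera_position(start, end, t, arc_height=0.5))
```

With the path defined this way, timing, easing, and arc are numbers the user sets in the viewer, so nothing about the camera move is left for the model to guess.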