Think of each section where the camera pauses as a separate piece of artwork. Each of those pieces would be a separate comp. You can then arrange the individual pieces of artwork in 3D space and animate the camera moves between them. The last step is to go into each section and animate the graphics and text.
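To make the "arrange in 3D space, then move the camera" idea concrete, here is a minimal planning sketch in Python. It is not an After Effects script. The section coordinates and the 2000 px hold distance are made-up values, and `camera_holds` is a hypothetical helper. It assumes the camera sits straight in front of each piece of artwork, pulled back along negative Z, and looks at the artwork's center.

```python
def camera_holds(section_centers, distance=2000.0):
    """For each section center (x, y, z), return where the camera would
    pause: pulled back `distance` pixels along -Z, looking at the center.

    `distance` is an illustrative value, not a recommended setting.
    """
    holds = []
    for (x, y, z) in section_centers:
        holds.append({
            "position": (x, y, z - distance),   # camera location for the hold
            "point_of_interest": (x, y, z),     # what the camera looks at
        })
    return holds


# Hypothetical layout: three pieces of artwork spread out in 3D space.
centers = [(960, 540, 0), (3200, 540, 500), (3200, 2400, -300)]
for hold in camera_holds(centers):
    print(hold)
```

Each entry is one place where the camera pauses. The camera moves are just the animation between consecutive entries.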
The trick for this animation style is to organize the artwork on a larger canvas, usually in Illustrator, as a multi-layered file. Then copy the layers that will move from one camera position to the next into another file, so you end up with large layered files with the artwork aligned in the hero position.
Each of the AI files is imported as a composition with the Retain Layer Sizes option. Then you make the layers 3D if needed and start building your animations.
When each of those comps works as it should, you either stack them up and sequence them in your main comp, or render them, load them into Premiere Pro, and start editing the moves between the animated sections.
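If it helps to plan the sequencing before you build it, the timing can be sketched in a few lines of Python. Again, this is only a planning aid, not something After Effects or Premiere Pro runs. The section names, durations, and the one-second move length are made-up values, and `Section` and `build_timeline` are hypothetical names. It assumes a fixed-length camera move sits between consecutive sections.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str        # hypothetical comp name
    duration: float  # seconds of animation while the camera holds


def build_timeline(sections, move_duration=1.0):
    """Return (name, start, end) in seconds for each section, leaving
    `move_duration` seconds between sections for the camera move."""
    timeline = []
    t = 0.0
    for i, section in enumerate(sections):
        if i > 0:
            t += move_duration  # gap reserved for the move into this section
        start = t
        t += section.duration
        timeline.append((section.name, start, t))
    return timeline


sections = [Section("Intro", 4.0), Section("Chorus", 6.0), Section("Outro", 3.0)]
for name, start, end in build_timeline(sections):
    print(f"{name}: {start:.1f}s to {end:.1f}s")
```

The gaps between one section's end and the next section's start are where the camera moves go, whether you build them in the main comp or cut them in Premiere Pro.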
I've done a couple hundred projects like your sample, for everything from lyric videos (dynamic text animations) to safety and training videos, and they all follow the same basic workflow: animate the elements that make the shot, then cut the shots together and add the camera moves between shots. Making a storyboard, or animating just the camera move on a bunch of nested comps with 3D layers before you start animating the text and graphics, will get you started. Working with a larger canvas (comp size) for the segments will help you avoid the problem of lining up layers (comps) when the same lines of text or graphic elements continue between shots. If the layers all need to be 3D, I edit the camera moves between sections in a main comp, with Collapse Transformations turned on for the nested comps. If I can use position and scale animations to stitch the shots together, I do the editing in Premiere Pro.