The most common way to do phoneme animation is to create a short image sequence with one mouth position per frame, enable time remapping on it, and then enter the frame number of the mouth shape you want at each point in the audio. If you make up a chart of sounds and frame numbers and set the time display to frames, it is very easy.
a = 1
o = 2
and so on.
You typically set all of the keyframes to hold keyframes. Everything is on one layer, and you enter one number for each sound.
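To make the chart idea concrete, here is a rough sketch in plain JavaScript of what the chart and the frame-to-time conversion amount to. Only "a" and "o" come from the chart above; the "e" entry and the names PHONEME_FRAMES, frameToSeconds and phonemeToSeconds are made up for illustration. With the time display set to frames, After Effects does this conversion for you when you type the frame number.

```javascript
// Hypothetical phoneme chart: sound -> frame number in the mouth sequence.
// "a" and "o" are from the post; "e" is a placeholder for "and so on".
var PHONEME_FRAMES = { a: 1, o: 2, e: 3 };

// Time Remap values are stored in seconds, so a frame number
// is divided by the frame rate to get the keyframe value.
function frameToSeconds(frame, fps) {
  return frame / fps;
}

// Look up a sound in the chart and return the Time Remap value for it.
function phonemeToSeconds(phoneme, fps) {
  var frame = PHONEME_FRAMES[phoneme];
  if (frame === undefined) {
    throw new Error("No mouth frame for phoneme: " + phoneme);
  }
  return frameToSeconds(frame, fps);
}
```

Each hold keyframe on Time Remap then carries one of these values until the next sound.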
Here is a fairly decent tutorial on the process. He leaves out a few shortcuts, like using the J and K keys to jump between markers, and I would use a slightly different approach with the expression and the slider, but it should get you started.
If you really want to go crazy, you can number the markers on the audio track and have an expression use the marker number as the frame number. This eliminates all of the manual keyframing, so you only have to do the work of syncing numbers to the audio once.
If you do not have access to Audition, you can add markers to the audio track and name them in After Effects. I do it all the time. I use markers almost every time I edit anything to audio. I use this expression to turn layer marker names (comments) into frame numbers.
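// Read the number typed into the nearest marker on the layer just below this one.
var mrkrName = thisComp.layer(index + 1).marker.nearestKey(time).comment;
// Convert that frame number to a time in seconds for Time Remap.
framesToTime(parseInt(mrkrName, 10), 1.0 / thisComp.frameDuration)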
Put the audio track with markers just below the phoneme animation layer, and all you have to do is add the markers and number them by double-clicking. To fine-tune the timing, all you have to do is slide the markers.
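One thing to be aware of: nearestKey(time) picks whichever marker is closest, so the mouth changes about halfway between two markers rather than exactly on the marker. If you would rather have each mouth shape hold from its marker until the next one, something along these lines should work. This is a sketch, not the expression I posted above: frameAtTime and the fallback frame of 0 are my own assumptions, and the last block only runs inside After Effects (it relies on thisComp, index, time, numKeys, key() and framesToTime from the expression language).

```javascript
// Pick the frame number from the latest marker at or before time t.
// markers: array of { time: seconds, comment: string }, sorted by time.
// fallbackFrame is used before the first numbered marker (an assumption).
function frameAtTime(markers, t, fallbackFrame) {
  var frame = fallbackFrame;
  for (var i = 0; i < markers.length; i++) {
    if (markers[i].time > t) break;          // later markers do not apply yet
    var n = parseInt(markers[i].comment, 10);
    if (!isNaN(n)) frame = n;                // skip markers that are not numbers
  }
  return frame;
}

// After Effects glue: collect the markers from the layer just below
// and return the Time Remap value. Skipped when run outside After Effects.
if (typeof thisComp !== "undefined") {
  var m = thisComp.layer(index + 1).marker;
  var list = [];
  for (var k = 1; k <= m.numKeys; k++) {
    list.push({ time: m.key(k).time, comment: m.key(k).comment });
  }
  framesToTime(frameAtTime(list, time, 0));
}
```

It also avoids the error you get from nearestKey when the audio layer has no markers yet, because an empty marker list just returns the fallback frame.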
I have a similar expression that I use to add subtitles. The last line, framesToTime..., is simply removed to generate the text. A little additional code is added to an opacity text animator to automatically fade the words up and fade them out, using the description in the layer markers as the driving force for all of the animations.
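I am not pasting my actual subtitle expressions here, but the logic is roughly the following, written as plain JavaScript. Treat it as a sketch under assumptions: subtitleAt, the shape of the markers array and the fadeSeconds fade-in length are all invented for illustration, and it only shows the fade up, not the fade out.

```javascript
// Text and opacity for the subtitle showing at time t.
// markers: array of { time: seconds, comment: string }, sorted by time.
// fadeSeconds: assumed fade-in length in seconds (not a value from the post).
function subtitleAt(markers, t, fadeSeconds) {
  var current = null;
  for (var i = 0; i < markers.length; i++) {
    if (markers[i].time > t) break;   // later markers do not apply yet
    current = markers[i];
  }
  // Nothing to show before the first marker.
  if (current === null) return { text: "", opacity: 0 };

  // Ramp opacity from 0 to 100 over fadeSeconds after the marker.
  var progress = fadeSeconds > 0 ? (t - current.time) / fadeSeconds : 1;
  var clamped = Math.min(1, Math.max(0, progress));
  return { text: current.comment, opacity: 100 * clamped };
}
```

In After Effects the text part would go on Source Text and the opacity part on the text animator, each reading the markers from the audio layer the same way the phoneme expression does.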