there's a bit of a misconception here, which has some significant implications.
AE has numerous threads in use. we all know the main UI thread and the render threads. we need to tell the "main" render threads apart from the threads used by the iteration suite. the main render threads are the ones where the render call from AE occurs (one render thread on older versions and numerous render threads on newer versions). the iteration suite uses some utility threads for general purposes.
the main difference between the main threads (ui and render) and the utility threads is that interaction with AE is only allowed from the main threads and not from any other thread. trying to talk to AE from any thread other than the main threads will result in the strangest bugs and crashes.
very few callbacks in the API are safe to call from non-main threads, and those are denoted as such in the docs. for example, the subpixel sample callbacks are safe to call from non-main threads, but the suites need to be acquired in advance from the main threads, and must not be acquired from the non-main ones.
now let's talk specifics about transform_world. that callback internally uses multiple threads. indeed, one call to it will not top off the cpu usage for all threads, but it IS multithreaded internally.
i have never tried calling transform_world from parallel utility threads, but to the best of my knowledge, transform_world is not built for that use, so you can expect anything from bugs to crashes... not seeing a speed bump is the least of your problems here...
having said that, when i was just starting off with the AE SDK, i used a pixel iteration callback, and in it i acquired the subpixel sampling suite fresh for EVERY PIXEL. needless to say, it was VERY sluggish (though it didn't crash as it should have). once i pre-acquired the suite and used just a pointer to the sampling callback, it ran orders of magnitude faster.
so if you insist on trying to call transform_world from an iterate_generic function, perhaps the acquisition of the transform suite gives you an overhead that cancels out the gain from parallel execution. (once again, i really don't think transform_world is safe to use from a utility thread...)
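to illustrate the "acquire once, then only pass a pointer" idea, here's a rough sketch in plain C++. nothing in it is the AE SDK: `SamplerSuite`, `acquire_sampler_suite` and the counter are hypothetical stand-ins that only mimic the shape of the problem (a lookup that is costly when repeated per pixel). in a real plugin the acquisition would happen on the main thread before the iteration starts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a suite: a struct holding a callback pointer.
// This is NOT an AE SDK type.
struct SamplerSuite {
    float (*sample)(const std::vector<float>& img, std::size_t index);
};

// Counts how often the (pretend-expensive) acquisition runs.
static int g_acquire_count = 0;

static float sample_impl(const std::vector<float>& img, std::size_t index) {
    return img[index];
}

// Hypothetical stand-in for acquiring a suite from the host.
const SamplerSuite* acquire_sampler_suite() {
    static const SamplerSuite suite{ sample_impl };
    ++g_acquire_count;
    return &suite;
}

// The slow pattern: acquire inside the per-pixel loop.
float sum_acquire_per_pixel(const std::vector<float>& img) {
    float total = 0.0f;
    for (std::size_t i = 0; i < img.size(); ++i) {
        const SamplerSuite* suite = acquire_sampler_suite();
        total += suite->sample(img, i);
    }
    return total;
}

// The fast pattern: acquire once up front, reuse the pointer per pixel.
float sum_acquire_once(const std::vector<float>& img) {
    const SamplerSuite* suite = acquire_sampler_suite();
    assert(suite != nullptr);
    float total = 0.0f;
    for (std::size_t i = 0; i < img.size(); ++i) {
        total += suite->sample(img, i);
    }
    return total;
}
```

both functions return the same result; the only difference is how many times the acquisition runs (once per pixel versus once per call).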
i would suggest writing a simple version of transform_world that you could run on multiple threads. a basic implementation that does nearest neighbor sampling is very easy to write. and although it would not be as high a quality as a function that does subpixel sampling for upscaling and surface averaging for downscaling, you'll get a sense of how fast this method of rendering (multiple independent image transformations as opposed to scanline rendering or GPU usage) can get and if it's good enough for your purposes.
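a minimal sketch of such a nearest neighbor transform, assuming a plain packed-pixel buffer and an inverse affine matrix (destination to source). `Image`, `Affine`, `transform_rows` and `transform_nearest` are names made up for this example and are not part of the AE SDK; the code uses `std::thread` on its own buffers and doesn't talk to AE at all, so in a plugin you would copy or wrap the pixel data yourself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical image type: row-major, one packed pixel per entry.
struct Image {
    int width;
    int height;
    std::vector<std::uint32_t> pixels;
    Image(int w, int h)
        : width(w), height(h),
          pixels(static_cast<std::size_t>(w) * static_cast<std::size_t>(h), 0) {}
};

// Inverse affine transform: maps a destination (x, y) to a source position.
struct Affine {
    float a, b, tx;  // src_x = a * x + b * y + tx
    float c, d, ty;  // src_y = c * x + d * y + ty
};

// Fills destination rows [y0, y1). Each destination pixel is written by
// exactly one thread, so no locking is needed.
void transform_rows(const Image& src, Image& dst, const Affine& inv,
                    int y0, int y1) {
    for (int y = y0; y < y1; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            // Round to the nearest source pixel.
            const long sx = std::lround(inv.a * x + inv.b * y + inv.tx);
            const long sy = std::lround(inv.c * x + inv.d * y + inv.ty);
            if (sx < 0 || sy < 0 || sx >= src.width || sy >= src.height) {
                continue;  // outside the source: leave the pixel as it is
            }
            const std::size_t si = static_cast<std::size_t>(sy) * src.width + sx;
            const std::size_t di = static_cast<std::size_t>(y) * dst.width + x;
            dst.pixels[di] = src.pixels[si];
        }
    }
}

// Splits the destination into horizontal bands, one band per thread.
void transform_nearest(const Image& src, Image& dst, const Affine& inv,
                       int num_threads) {
    assert(&src != &dst);  // in-place transforms are not supported
    num_threads = std::max(1, num_threads);
    const int rows_per_thread = (dst.height + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int y0 = t * rows_per_thread;
        const int y1 = std::min(dst.height, y0 + rows_per_thread);
        if (y0 >= y1) break;
        workers.emplace_back([&src, &dst, &inv, y0, y1] {
            transform_rows(src, dst, inv, y0, y1);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

to run several independent transforms in parallel instead of splitting one image, you could call `transform_rows` over the full height of each image from its own thread. timing either variant should give a rough idea of whether this approach is fast enough before investing in proper subpixel sampling.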