Is there a way to compare and segment two audio files based on content similarity?

Report · Jun 22, 2021

Hi.

I have two audio tracks, let's say audio track a and audio track b where track a has the length Ta and track b has the length Tb.

Track a is composed of segments like Sa1Ta1 + Da2Ta2 + Sa3Ta3 + Da4Ta4…

Track b is composed of segments like Sb1Tb1 + Db2Tb2 + Sb3Tb3 + Db4Tb4…

Here segment Sa1Ta1 has the duration Ta1, segment Sb1Tb1 has the duration Tb1, segment Da2Ta2 has the duration Ta2, and so on.

Segment Sa1Ta1 sounds similar in content to segment Sb1Tb1 to a human listener, but Ta1 is not equal to Tb1; in other words, the two segments differ in speed.

Segment Da2Ta2 differs in content from segment Db2Tb2, and the duration Ta2 differs from Tb2 too.

(Abbreviation: S for similar, D for different, T for time, a for track a and b for track b)

And so on.

Now I want to compare and split the 2 audio tracks into segments. Let's say Sa1Ta1, Da2Ta2, Sa3Ta3… for track a and Sb1Tb1, Db2Tb2, Sb3Tb3… for track b.

After that I will build a third track, track c, compiled from segments Sb1Tb1 + Da2Tb2 + Sb3Tb3 + Da4Tb4 …, where Da2Tb2 is segment Da2Ta2 stretched to the length Tb2.

I will then have track c, with audio content similar to track a but synced in time with track b.
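To make the assembly rule concrete, here is a minimal Python sketch of the edit plan for track c. It only computes which source each piece comes from and how much to stretch it; it does not touch any audio. The `Segment` class, the `plan_track_c` function and the example durations are all hypothetical names and numbers made up for illustration, and the sketch assumes both tracks have already been split into the same number of S/D segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str        # "S" (similar) or "D" (different)
    start: float     # start time in seconds within its own track
    duration: float  # length in seconds


def plan_track_c(segments_a, segments_b):
    """Build an edit plan for track c from paired segments of tracks a and b.

    S segments are taken from track b unchanged; D segments are taken from
    track a and stretched so that they last as long as their partner in b.
    """
    if len(segments_a) != len(segments_b):
        raise ValueError("both tracks must have the same number of segments")
    plan = []
    for seg_a, seg_b in zip(segments_a, segments_b):
        if seg_a.kind != seg_b.kind:
            raise ValueError("paired segments must have the same kind")
        if seg_a.kind == "S":
            plan.append({"source": "b", "start": seg_b.start,
                         "duration": seg_b.duration, "stretch": 1.0})
        else:
            if seg_a.duration <= 0:
                raise ValueError("cannot stretch an empty segment")
            plan.append({"source": "a", "start": seg_a.start,
                         "duration": seg_a.duration,
                         "stretch": seg_b.duration / seg_a.duration})
    return plan


# Made-up example: three segments per track (S, D, S).
track_a = [Segment("S", 0.0, 10.0), Segment("D", 10.0, 4.0), Segment("S", 14.0, 20.0)]
track_b = [Segment("S", 0.0, 12.0), Segment("D", 12.0, 5.0), Segment("S", 17.0, 24.0)]

plan = plan_track_c(track_a, track_b)
# Length of track c after stretching; it should equal the length of track b.
length_c = sum(step["duration"] * step["stretch"] for step in plan)
```

In this example the D segment from track a (4 s) is stretched by 5 / 4 = 1.25 to fill the 5 s slot in track b, so track c ends up exactly as long as track b.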

Here are the two audio files for track a and track b for you to test. The first file is the audio description track of the movie. The second is the audio from the movie video. The two tracks differ greatly in timing. I want to build a third track from the audio description track so that it is synced in time with the movie video.

Could you add some functions to the Adobe software to do just that automatically? I'm tired of manually marking, cutting, stretching and joining in my audio editor.

Thank you for your time.

Link to files

Track a

https://bit.ly/3rIPPAF

Track b

https://bit.ly/3rHLI84

Report · Jun 22, 2021

I think that it is very unlikely that Adobe would find your request to be commercially viable. And Audition already has a dialog replacement tool - perhaps you should try experimenting with that?

Report · Jun 22, 2021

Well, you have the speech recognition and the sound and instrument identification technology. As a human being, I do the task as follows, but it takes me the whole afternoon to fix one audio file.

First I import the two audio tracks, the reference movie audio and the edited descriptive audio, into my audio editor. It shows the waveforms with their peaks. Then I listen to the audio and label the matching points: I mark reference points on the reference track every 1.5 minutes and find the matching points on the processing track. Then I calculate the speed ratio of each matching segment, that is, each segment delimited by those matching points. Because the processing track has been given extra edits such as speed changes, silence removal, and director's-cut differences between release versions, I gather all the speed ratios, keep the factors that vary little, discard the ones that differ wildly, and calculate the average speed ratio. Then I change the speed of the processing track based on that average factor, which gives me a new audio track at the same speed as the reference track. Finally I compare the new track with the reference track to find the segments that differ, delete the redundant segments, and paste in the missing ones.

That's all.

So my workflow is: label the matching points -> calculate the speed change factor -> change the speed of the processing track based on that factor -> label the similar and different segments of the new track compared to the reference track -> manually paste in the missing segments and remove the redundant ones. That is it, but it takes the whole morning. It is an attention- and time-consuming job without much joy.
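The "calculate the speed change factor" step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the 2% tolerance and the marker times are invented for the example, the ratio is taken as processing duration divided by reference duration (the post does not say which direction is used), and "factors that vary little" is interpreted here as "within a small tolerance of the median".

```python
from statistics import mean, median


def segment_ratios(ref_points, proc_points):
    """Speed ratio of each segment between consecutive matching points.

    Ratio = processing-track duration / reference-track duration (assumed
    direction). A ratio below 1 means the processing track plays faster.
    """
    if len(ref_points) != len(proc_points) or len(ref_points) < 2:
        raise ValueError("need at least two matching points on each track")
    ratios = []
    for i in range(1, len(ref_points)):
        ref_len = ref_points[i] - ref_points[i - 1]
        proc_len = proc_points[i] - proc_points[i - 1]
        if ref_len <= 0 or proc_len <= 0:
            raise ValueError("matching points must be strictly increasing")
        ratios.append(proc_len / ref_len)
    return ratios


def average_speed_factor(ratios, tolerance=0.02):
    """Average the ratios that lie within `tolerance` (relative) of the median.

    Segments containing cuts or inserted material give wildly different
    ratios and are discarded before averaging.
    """
    mid = median(ratios)
    kept = [r for r in ratios if abs(r - mid) <= tolerance * mid]
    return mean(kept), kept


# Made-up marker times in seconds: reference points every 90 s (1.5 minutes).
ref_points = [0.0, 90.0, 180.0, 270.0, 360.0]
# Matching points on the processing track; the third segment contains extra
# material, so its ratio is an outlier.
proc_points = [0.0, 86.4, 172.8, 300.0, 386.4]

ratios = segment_ratios(ref_points, proc_points)
factor, kept = average_speed_factor(ratios)
# To bring the processing track back to reference speed, its duration would
# be multiplied by 1 / factor.
```

With these numbers three of the four segments agree on a ratio of about 0.96 and the third is thrown out, so the processing track would be stretched by roughly 1 / 0.96 before the segment-by-segment comparison.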

I hope you could give me a tool to automate this, especially the marking of matching points and of differing segments.

Thank you very much.

Report · Jun 22, 2021

Well, I accidentally marked my reply as the correct answer while looking for a way to edit my post, because I didn't know how to use the community. What should I do now?

Report · Jun 22, 2021

I've unmarked the reply. Also I've had a listen to your files, and I have to tell you that you stand a zero chance of getting that process automated at all - there's far too much background interference. It's the sort of process that humans can achieve pretty well, but machine learning simply can't do - it requires discrimination. And even if you could train an AI-based system to do it, it would cost you a lot to purchase.

The other reason that it won't happen is that significant additions to Audition are, as I mentioned earlier, driven commercially - from user requests where the numbers of seats they've purchased run into 6 or 7 figures (yes, really). Those are the requests that are acted upon if it's possible - simply because there will be a return on them in terms of ongoing rental. Fundamentally, those are the only requests that the developers are allowed to work on by Adobe corporate.

Report · Jun 22, 2021

Well, why not do it with the technology you already have? I thought it would just be a small feature addition called "Audio content comparing" and "Speed normalizing". You just let it find the matching points between two audio tracks based on the human voice and musical instrument sounds. I don't know much about it, but I heard on TV about something called audio fingerprints. I hope you could help automate the speed normalizing task: calculate the speed factor by some mathematical analysis, restore the modified audio to its original speed based on a comparison track, then do the content comparison based on fingerprints, mark the totally different segments between the two audio tracks with colors, and let the user export the segments' timings to a report text file for later use with other programs. I say "totally different" segments because when some voice-over is added to an audio segment, the two should still be considered the same for audio sync. I'm not sure, but I hope my request can reach the Adobe development team. I would write the tool myself if I had the programming expertise, but I don't. And I really need this feature for my job. I hope you'll consider my request.

Report · Oct 06, 2025

Hello THIEN PHU LE,

I have the exact same question. And I need to use it for hundreds of pairs of audio recordings.

Did you find software for your use?

In my search, I ran into https://sonicvisualiser.org/videos.html#2 , i.e. Sonic Visualiser with the MATCH Vamp Plugin. That nearly does what I want.
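For anyone wondering how this kind of automatic alignment can work in principle: as far as I understand it, MATCH is based on a form of dynamic time warping (DTW). The sketch below is not the plugin's code, just a textbook DTW in plain Python on toy one-dimensional "feature" sequences. The function name `dtw_path` and the data are made up for illustration; a real tool would compare spectral features per audio frame rather than single numbers.

```python
def dtw_path(x, y):
    """Classic dynamic time warping between two 1-D feature sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    saying that x[i] is matched with y[j].
    """
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance x only
                                 cost[i][j - 1])      # advance y only
    # Walk back from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy data: y is the same "content" as x played at half speed.
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]

total_cost, path = dtw_path(x, y)
```

The resulting path pairs every frame of one track with the matching frames of the other, which is the automatic version of the "label the matching points" step. Stretches where the local mismatch along the path stays high would be candidates for the "different" segments, though choosing that threshold is a separate problem this sketch does not address.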