Export highlights from a PDF file

Report · Apr 10, 2023

Is there really no way to export or extract the highlighted annotations of a PDF file as a simple text file? I find it absolutely ridiculous that there is no simple way to do this despite PDFs being around for decades! I am using Acrobat Pro and still can't find an easy way to do this.

Report · Apr 10, 2023

Correct. You can create a comment summary in Acrobat, but that's the best it offers.

Bluebeam Revu will let you export comments as a CSV file.

Report · Apr 10, 2023

Hi,

Last year, I wrote a script that you might be interested in!

Change the .txt extension of the attached file to .js, place the file in the JavaScript folder of your Acrobat installation, then restart the application.

You will get a new "* b2Tools *" item in your "Edit" menu.

Select "Comments Summary"...

Choose what you want, then "OK".

Try it and let me know...
@+

Report · Apr 10, 2023

Thank you heaps, but sorry, I just don't understand where or how I should run this script.

Report · Apr 11, 2023

Hi,

After changing the file extension from .txt to .js (from b2T-Comments report.txt to b2T-Comments report.js), you must place this file into the JavaScript folder of your Acrobat application.

If you don't know where this folder is, you can use the attached "Show_me_the_path.pdf" file, which will help you find it.

Then restart your Acrobat application and follow the previous instructions, which should meet your needs.
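
If the helper PDF is not at hand, the folder can probably also be found from Acrobat's JavaScript console. The sketch below is only an illustration: it assumes the `app.getPath(category, folder)` call from the Acrobat JavaScript API, and the helper name `listJavaScriptFolders` is hypothetical. Outside Acrobat the block only defines the helper and prints nothing.

```javascript
// Hypothetical helper: asks an Acrobat-like `app` object for its JavaScript folders.
// Assumption: app.getPath(cCategory, cFolder) behaves as described in the Acrobat JavaScript API.
function listJavaScriptFolders(acroApp) {
	var categories = ["user", "app"];
	var lines = [];
	for (var i = 0; i < categories.length; i++) {
		try {
			lines.push(categories[i] + ": " + acroApp.getPath(categories[i], "javascript"));
		} catch (e) {
			// The call may throw, for example when the folder does not exist.
			lines.push(categories[i] + ": not available");
		}
	}
	return lines;
}

// Inside Acrobat, `app` is a global and `console.println` exists; elsewhere this is skipped.
if (typeof app !== "undefined" && typeof console.println === "function") {
	console.println(listJavaScriptFolders(app).join("\n"));
}
```

Pasting this into the console and running it should list one line per category, which is the information needed to know where to drop the .js file.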

Let me know.

@+

Report · Apr 11, 2023

Thanks very much for freely sharing your script. I was able to run the script displayed in Adobe Acrobat DC which is great. However, there are still a couple of issues:

Issue 1 (minor) - It does NOT generate a simple text file with the highlighted text. It generates only a PDF or a console window with the highlighted text.

Issue 2 (major) - The highlighted text that is extracted to the PDF or the console window is wrapped in extra, unwanted information such as date, time, page, paragraph, username, and colour of the highlight. I had 21 highlighted comments, and each comment is sandwiched between this unwanted information. So I have to manually copy and paste each extracted comment or go around manually deleting the unwanted information. This takes the same amount of time as manually copy-pasting each comment directly from the original PDF.

I am simply after a way to extract all the highlighted text into a text file, clean and tidy, with no extra information.

Report · Apr 12, 2023

Hi,

I'm sorry if my utility is not exactly what you expected, but it was developed for another request and it took hours of programming.
If you only need to extract 21 comments, I think doing that manually will take less time than developing a similar utility adapted to your request.

FYI, I don't think this utility generates the PDF file and the console output without also generating the .txt file. You probably just don't know where to find it. You should find it in the Attachments panel.

@+

Report · Apr 12, 2023

The screenshot below shows the error message I get if I deselect the PDF and console options and leave only the text file option selected.

The screenshot below was taken after generating the extraction in the console. The 'attachment' section in Acrobat simply does not display a text file at my end, unfortunately.

Oh no, this is not just for 1 PDF file! I have just started a PhD and I will have at least 300 PDF files, each with up to 30 highlights. It would be really useful to have a utility that extracts just the highlights as text, with no metadata.

I do appreciate the time you've invested in making this programme and for your detailed responses; thank you very much!

Report · Apr 12, 2023

That's indeed a bug... I will have a look at my script, then I'll come back to you!

@+

Report · Apr 12, 2023

Thanks!

Report · Apr 12, 2023

In fact, that was not a bug: the original request was to attach the .txt file only when the new PDF summary file is generated.

I've just made a revision that allows the .txt file to be attached to the current PDF file, with or without saving.

But this revision (attached) still gives all the previous information for each comment.

Otherwise, I've also just written the script below, which you can run as an Action Wizard action and which extracts only the highlighted text.

var version="04/23";
// Start timing
d0=new Date();
debut=util.printd("dd/mm/yyyy à HH:MM",d0);
// Here we go!
console.show();
console.clear();
var lesTirets="––––––––––––––";
var lesProprietes=["quads","contents"];
var possible=1;
var highlightedPage=new Array(this.numPages);
this.syncAnnotScan();
var annots=this.getAnnots();
if (annots!=null) {
	var cT=0;
	for (var i=0; i<annots.length; i++) {
		if (annots[i].type=="Highlight" || annots[i].type=="Underline" || annots[i].type=="Squiggly" || annots[i].type=="StrikeOut" || annots[i].type=="Redact") {
			if (annots[i].type!="StrikeOut" && !possible) possible=1;
			var laPage=annots[i].page;
			if (typeof highlightedPage[laPage]==="undefined") highlightedPage[laPage]=new Array();
			highlightedPage[laPage].push(i.toString());
			for (var prop=0; prop<lesProprietes.length; prop++) {
				if (typeof eval("annots[i]."+lesProprietes[prop])=="string" || lesProprietes[prop]=="quads") {
					highlightedPage[laPage].push(eval("annots[i]."+lesProprietes[prop]));
				}
			}
			highlightedPage[laPage].push("-");
		}
	}
	var incr=lesProprietes.length+2; // 1 for the page number + 1 for before/after
	for (var i=highlightedPage.length-1; i>=0; i--) {
		if (typeof highlightedPage[i]==="undefined") {
			highlightedPage.splice(i,1);
		} else {
			highlightedPage[i].unshift(i);
		}
	}
	reponses=highlightedPage.slice(0);
	for (var j=0; j<reponses.length; j++) {
		reponses[j]=highlightedPage[j].slice(0);
		for (k=2; k<reponses[j].length; k++) reponses[j][k]=highlightedPage[j][k].slice(0);
	}
	for (var j=0; j<reponses.length; j++) {
		for (k=2; k<reponses[j].length; k+=incr) reponses[j][k]=[];
	}
	//
	for (var j=0; j<highlightedPage.length; j++) {
		var p=highlightedPage[j][0];
		console.clear();
		console.println("Process starting: "+debut);
		console.println(lesTirets);
		console.println("Processing page "+(p+1));
		// Max and min Y on the page
		var max=[];
		var min=[];
		for (k=2; k<highlightedPage[j].length; k+=incr) {
			r=highlightedPage[j][k][0];
			r=r.toString();
			r=r.split(",");
			max.push(r[1]);
			min.push(r[7]);
		}
		max.sort(function(a,b){return b-a});
		min.sort(function(a,b){return a-b});
		var yMax=Number(max[0]);
		var yMin=Number(min[0]);
		// Check each word on the page
		var nbMots=this.getPageNumWords(p);
		var mT=0;
		for (var i=0; i<nbMots; i++) {
			var leMot=this.getPageNthWord(p,i,true);
			var q=this.getPageNthWordQuads(p,i);
			m=(new Matrix2D).fromRotated(this,p);
			mInv=m.invert();
			r=mInv.transform(q);
			r=r.toString();
			r=r.split(",");
			var xGmot=Number(r[0]);
			var yGmot=Number(r[1]);
			var xDmot=Number(r[6]);
			var yDmot=Number(r[7]);
			if (yGmot>yMax+1) continue;
			else if (yGmot<yMin-1 && mT) break;
			else {
				for (k=2; k<highlightedPage[j].length; k+=incr) {
					for (m=0; m<highlightedPage[j][k].length; m++) {
						r=highlightedPage[j][k][m];
						r=r.toString();
						r=r.split(",");
						var xG=Number(r[0]);
						var yG=Number(r[1]);
						var xD=Number(r[6]);
						var yD=Number(r[7]);
						if (xGmot>xG-1 && yGmot<yG+1 && xGmot<xD && yDmot>yD-1) {
							mT++;
							reponses[j][k].push(this.getPageNthWord(p,i,false));
						}
					}
				}
			}
		}
	}
	console.clear();
	console.println("Process starting: "+debut);
	console.println(lesTirets);
	console.println("Building the result");
	var leTexte="";
	for (var j=0; j<reponses.length; j++) {
		var surPage=Math.floor((reponses[j].length-1)/incr)+cT;
		var texteChamp="";
		// Page
		if (leTexte!="") {
			leTexte+="\r";
			texteChamp+="\r";
		}
		for (k=2; k<reponses[j].length; k+=incr) {
			var lesMots=reponses[j][k].toString();
			// Trim both ends (the g flag is needed to strip leading AND trailing whitespace)
			lesMots=lesMots.replace(/^\s+|\s+$/g,"");
			// Remove the commas inserted by Array.toString() between words
			lesMots=lesMots.replace(/ ,/g," ");
			lesMots=lesMots.replace(/-,/g,"-");
			lesMots=lesMots.replace(/\(,/g,"\(");
			lesMots=lesMots.replace(/\",/g,"\"");
			lesMots=lesMots.replace(/\[,/g,"\[");
			lesMots=lesMots.replace(/\n,/g,"\n");
			lesMots=lesMots.replace(/¡,/g,"¡");
			lesMots=lesMots.replace(/¿,/g,"¿");
			var adjectif=""; // Redact
			// Text
			leTexte+="\r"+lesMots+"";
			// Comment
			var laReponse=reponses[j][k+1];
			leTexte+="\r";
		}
	}
	// End timing
	console.clear();
	console.println("Process starting: "+debut);
	df=new Date();
	fin=util.printd("dd/mm/yyyy à HH:MM",df);
	console.println("Process ending: "+fin);
	temps=(df.valueOf()-d0.valueOf())/1000/60;
	var lesMinutes=parseInt(temps);
	var lesSecondes=(temps-lesMinutes)*60;
	var lesSecondes=parseInt(lesSecondes*10)/10;
	var leTemps="";
	if (lesMinutes>0) {
		if (lesMinutes==1) {
			var leTemps="1 minute";
		} else {
			var leTemps=lesMinutes+" minutes";
		}
	}
	if (lesSecondes>0) {
		if (lesSecondes<2) {
			var leTemps=leTemps+" "+lesSecondes+" second";
		} else {
			var leTemps=leTemps+" "+lesSecondes+" seconds";
		}
	}
	var leTemps=leTemps.replace(/^\s+|\s+$/gm,"");
	if (leTemps.length>0) {
		console.println("Process duration: "+leTemps+"\r\r");
	}
	console.println(leTexte);
	var leFichier="Comments of "+util.printd("dd-mm-yy - HH:MM", new Date()).replace(/:/,"h");
	var leRapport=leFichier+".txt";
	this.createDataObject(leRapport, "©™Σ","text/html; charset=utf-16"); //
	var oFile=util.streamFromString(leTexte);
	this.setDataObjectContents(leRapport, oFile);
	// Final message
	var ouverture="You can import the attached .txt file into a spreadsheet using Unicode UTF-8 format.";
	if (annots.length-cT==1) app.alert("One comment has been detailed.\r\r"+ouverture,3);
	else app.alert((annots.length-cT)+" comments have been detailed.\r\r"+ouverture,3);
}
if (annots==null) app.alert("There are no comments in this document.",3)

This script is extracted from the utility, so some lines may not be needed...

Let me know if you don't know how to use the Action Wizard.

@+
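
For readers trying to follow the script, its core seems to be the comparison made for each word: a word is kept when its quad starts inside the quad of a highlight, with one unit of tolerance. Here is a minimal stand-alone sketch of that comparison, runnable outside Acrobat with made-up coordinates. The names `wordIsHighlighted`, `wordsUnderHighlight` and the `tolerance` parameter are hypothetical, and the quad layout (eight numbers: upper-left, upper-right, lower-left, lower-right) is assumed from the way the script reads indices 0, 1, 6 and 7.

```javascript
// A quad is assumed to be [x1,y1, x2,y2, x3,y3, x4,y4]:
// upper-left, upper-right, lower-left, lower-right, with y growing upward.
function wordIsHighlighted(wordQuad, highlightQuad, tolerance) {
	var tol = (typeof tolerance === "number") ? tolerance : 1;
	var wordLeft = Number(wordQuad[0]);
	var wordTop = Number(wordQuad[1]);
	var wordBottom = Number(wordQuad[7]);
	var hlLeft = Number(highlightQuad[0]);
	var hlTop = Number(highlightQuad[1]);
	var hlRight = Number(highlightQuad[6]);
	var hlBottom = Number(highlightQuad[7]);
	// Same comparison as in the script: the word must start inside the highlight
	// horizontally and sit between its top and bottom edges.
	return wordLeft > hlLeft - tol && wordTop < hlTop + tol &&
		wordLeft < hlRight && wordBottom > hlBottom - tol;
}

// Hypothetical helper: keep the words whose quad matches any of the highlight quads.
function wordsUnderHighlight(words, highlightQuads) {
	var kept = [];
	for (var i = 0; i < words.length; i++) {
		for (var h = 0; h < highlightQuads.length; h++) {
			if (wordIsHighlighted(words[i].quad, highlightQuads[h])) {
				kept.push(words[i].text);
				break; // unlike the script, never add the same word twice
			}
		}
	}
	return kept;
}

// Made-up sample data: one highlight and three words around it.
var sampleHighlight = [100, 712, 300, 712, 100, 700, 300, 700];
var sampleWords = [
	{ text: "before ", quad: [60, 711, 95, 711, 60, 701, 95, 701] },
	{ text: "inside ", quad: [101, 711, 140, 711, 101, 701, 140, 701] },
	{ text: "below ", quad: [101, 697, 140, 697, 101, 687, 140, 687] }
];
var picked = wordsUnderHighlight(sampleWords, [sampleHighlight]);
```

With this sample data only the word "inside " is kept. In Acrobat the word quads would come from `getPageNthWordQuads` and the highlight quads from the annotation's `quads` property, as in the script.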

Report · Apr 10, 2023

It can be done using a script, like this (paid-for) tool I developed many years ago, exactly for this purpose: http://try67.blogspot.com/2008/11/acrobat-create-comments-summary-txt-pdf.html

Report · Apr 10, 2023

Cheers for that.

To be honest, after obtaining the professional Adobe Acrobat DC version I am not really inclined to pay more. This is a very simple function that I'd expect Adobe to provide to its clients - literally tens of thousands of people (e.g., everyone in academia) would benefit from it.

Report · Apr 10, 2023

What exactly would you want to be in this simple text file? Can you give an example of what it might look like?

Report · Apr 10, 2023

If I highlight three different sentences in the text of the PDF (as a comment or annotation), I just want to export those highlights so I can save them elsewhere as study notes, instead of having to open every single PDF file and look for the highlighted parts.

Report · Apr 11, 2023

Can you elaborate? Are you asking to have the text under the highlight exported or the content of the annotation? Do you need page numbers? Your request isn't clear enough to take action.

Report · Apr 11, 2023

I only want the highlighted text to be extracted into a clean textfile. Nothing else.
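
To illustrate what that output could look like, here is a small sketch that turns lists of words into plain text, one highlight per line and nothing else. It assumes the words under each highlight have already been collected (for instance by a script like the one posted in this thread) and that each word carries its own trailing space or punctuation. `highlightsToPlainText` is a hypothetical name, and joining with an empty string is a swapped-in alternative to the `toString()` plus comma clean-up used in that script.

```javascript
// Hypothetical helper: one array of words per highlight in, one line of text per highlight out.
function highlightsToPlainText(highlights) {
	var lines = [];
	for (var i = 0; i < highlights.length; i++) {
		var text = highlights[i].join("")  // words keep their own trailing spaces
			.replace(/\s+/g, " ")          // collapse line breaks and repeated spaces
			.replace(/^\s+|\s+$/g, "");    // trim both ends
		if (text.length > 0) lines.push(text); // skip highlights with no words
	}
	return lines.join("\r\n");
}

// Made-up sample input: two highlights with words and one empty highlight.
var sample = [
	["Highlights ", "are ", "plain ", "text.\n"],
	["  A ", "second ", "one, ", "with ", "a ", "comma. "],
	[]
];
var plain = highlightsToPlainText(sample);
```

For the sample above, `plain` holds two lines: "Highlights are plain text." and "A second one, with a comma.", with no page, date or author information.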

Report · May 06, 2024

Extracting highlighted text from PDF files was a nightmare for me. I tried doing it with code, but later discovered readoku.com, where you can export highlights to Word, Excel, JSON and CSV formats. In case anyone is still searching for a time-saving way.