Known Participant
January 19, 2017
Question

Removing formatting from selected text

Hello,

I have written a plugin that extracts the highlighted text from a PDF document.

However, the text I get back keeps its original formatting, for example line breaks.

I want the text without any formatting.

Here is my code for getting the text under a highlight annotation:

    ACCB1 ASBool ACCB2 pdTextSelectEnumTextProc(void* procObj, PDFont font, ASFixed size, PDColorValue color, char* text, ASInt32 textLen)
    {
        char stringBuffer[200];
        strcpy(stringBuffer, text);
        // Append this chunk of text to the string stream
        ss << stringBuffer;
        return true;
    }

    if (ASAtomFromString("Highlight") == PDAnnotGetSubtype(annot))
    {
        // Get the annotation's rect
        PDAnnotGetRect(annot, &boundingRect);
        // Create a text selection from the annotation's rect
        PDTextSelect textSelect = PDDocCreateTextSelect(pdDoc, pageNum, &boundingRect);
        // Enumerate the selected text; the callback receives each chunk from the highlighted bounding box
        PDTextSelectEnumText(textSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, &pdTextSelectEnumTextProc), &annBuf);

        MessageBox(NULL, ss.str().c_str(), NULL, NULL);
    }


The string stream ends up containing the text with all of its formatting.

How can I get the text without any formatting?

Thanks

3 replies

Legend
January 19, 2017

I think you mentioned an ellipsis (...) in another post? If so, consider that in the Windows-1252 (WinAnsi) encoding, which is often loosely called Latin-1, 0x85 is an ellipsis: a single character for the three dots. It is not a layout character.

Legend
January 19, 2017

What are the hex values?

Known Participant
January 19, 2017

Sorry, I did not understand. Which hex values do you mean?
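
The hex values asked about are presumably the byte values of the characters the callback receives, especially at the places where the unwanted breaks show up. A minimal sketch for inspecting them follows; `toHex` is a hypothetical helper, not part of the Acrobat SDK, and it assumes the text arrives as single bytes with the length given by `textLen`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not an Acrobat SDK call): renders each byte of a
// text chunk as two uppercase hex digits, separated by spaces.
inline std::string toHex(const char* text, int textLen) {
    std::string out;
    char buf[4];  // two hex digits + space + terminating NUL
    for (int i = 0; i < textLen; ++i) {
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        const unsigned int b = static_cast<unsigned char>(text[i]);
        std::snprintf(buf, sizeof buf, "%02X ", b);
        out += buf;
    }
    if (!out.empty()) {
        out.pop_back();  // drop the trailing space
    }
    return out;
}
```

Calling something like `toHex(text, textLen)` inside the enumeration callback and logging the result would show exactly which byte sits where a line break appears.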

Known Participant
January 19, 2017

Hello,

I did some debugging. The hex value is 0x85.

Legend
January 19, 2017

1. You do not check the string length before strcpy. This is a serious bug.

2. If you don't want newlines you can strip them or change them to spaces.
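
As a rough illustration of point 1, here is a minimal sketch that appends a chunk without a fixed-size buffer. It assumes the `textLen` argument of the callback is the number of bytes in `text`; `appendChunk` is a hypothetical helper name, and a `std::string` stands in for the string stream used in the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends exactly textLen bytes of one enumerated chunk.
// Relying on the length passed to the callback avoids both the 200-byte
// buffer and any dependence on a terminating NUL.
inline void appendChunk(std::string& out, const char* text, int textLen) {
    if (text == nullptr || textLen <= 0) {
        return;  // nothing to append
    }
    out.append(text, static_cast<std::size_t>(textLen));
}
```

Inside the callback, the `strcpy` and `ss << stringBuffer` pair could then be replaced by one call such as `appendChunk(result, text, textLen)`, where `result` is whatever string the plugin accumulates into.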

Known Participant
January 19, 2017

I am not able to replace the newline with any other character.

I am comparing each word against \n and \r, but it seems that the PDF uses a different formatting character for a new line.

I can't see \n or \r in the text when I compare.

Could you please tell me how I can strip them?

Thanks.
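
One possible way to strip the breaks, sketched under assumptions: the accumulated text is a single-byte string, and anything at or below 0x20 (space, tab, \r, \n and other control bytes) counts as a separator. `flattenText` is a hypothetical helper, not an SDK function. Bytes are compared as `unsigned char`, because where plain `char` is signed a byte such as 0x85 holds a negative value and a test like `c == 0x85` never matches. The 0x85 byte itself is left untouched here, since an earlier reply suggests it is an ellipsis rather than layout.

```cpp
#include <string>

// Hypothetical helper: collapses every run of whitespace or control bytes
// into a single space and trims such runs from both ends.
inline std::string flattenText(const std::string& in) {
    std::string out;
    bool pendingSpace = false;
    for (char ch : in) {
        // Read the byte as unsigned so values above 0x7F are 0x80..0xFF,
        // not negative numbers.
        const unsigned char b = static_cast<unsigned char>(ch);
        const bool isBreak = (b <= 0x20) || (b == 0x7F);
        if (isBreak) {
            // Remember the separator, but never emit one at the very start.
            pendingSpace = !out.empty();
        } else {
            if (pendingSpace) {
                out.push_back(' ');
            }
            pendingSpace = false;
            out.push_back(ch);
        }
    }
    return out;
}
```

With this in place, the message box could show `flattenText(ss.str())` instead of the raw stream contents. If the hex dump shows some other byte at the break positions, that value would need to be added to the `isBreak` test.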