Removing formatting from selected text

New Here, Jan 19, 2017

Hello,

I have written a plugin that extracts highlighted text from a PDF document.

However, the text I get back still carries the document's formatting, for example new lines.

I want the text without any formatting.

Here is my code for getting the highlighted (annotated) text:

  ACCB1 ASBool ACCB2 pdTextSelectEnumTextProc(void* procObj, PDFont font, ASFixed size, PDColorValue color, char* text, ASInt32 textLen)
  {
    char stringBuffer[200];
    strcpy(stringBuffer, text);
    ss << stringBuffer;
    return true;
  }

  if (ASAtomFromString("Highlight") == PDAnnotGetSubtype(annot))
  {
    // Gets the annotation's rect
    PDAnnotGetRect(annot, &boundingRect);
    // Gets the text selection from the annotation's rect
    PDTextSelect textSelect = PDDocCreateTextSelect(pdDoc, pageNum, &boundingRect);
    // Creates a callback to get the text from the highlighted bounding box
    PDTextSelectEnumText(textSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, &pdTextSelectEnumTextProc), &annBuf);

    MessageBox(NULL, ss.str().c_str(), NULL, NULL);
  }


The string stream ends up containing the text with all of its formatting.

How can I get the text without any formatting?

Thanks

TOPICS
Acrobat SDK and JavaScript
LEGEND, Jan 19, 2017

1. You do not check the string length before the strcpy into the 200-byte stringBuffer. This is a serious bug: any longer text overflows the buffer.

2. If you don't want newlines, you can strip them or change them to spaces.
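
Both points can be sketched in plain C++, independent of the Acrobat SDK. This is only a minimal sketch: `appendChunk` and `flattenLineBreaks` are hypothetical helper names, and it assumes the `textLen` argument passed to the callback is the byte count of `text`. It appends by length instead of copying into a fixed buffer, then collapses each run of CR, LF, tab, or space into a single space.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Acrobat SDK): appends exactly
// textLen bytes of text to out, so no fixed-size buffer can overflow.
// Assumption: textLen is the number of valid bytes in text.
void appendChunk(std::string& out, const char* text, int textLen) {
    if (text != nullptr && textLen > 0) {
        out.append(text, static_cast<std::size_t>(textLen));
    }
}

// Hypothetical helper: returns a copy of in where every run of CR, LF,
// tab, or space becomes one space. Leading whitespace is dropped.
std::string flattenLineBreaks(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    bool lastWasSpace = false;
    for (char c : in) {
        bool isBlank = (c == '\n' || c == '\r' || c == '\t' || c == ' ');
        if (isBlank) {
            if (!lastWasSpace && !out.empty()) {
                out.push_back(' ');
            }
            lastWasSpace = true;
        } else {
            out.push_back(c);
            lastWasSpace = false;
        }
    }
    return out;
}
```

Inside the enumeration callback, something like `appendChunk(collected, text, textLen);` could stand in for the `strcpy` and `ss <<` pair, with `flattenLineBreaks(collected)` applied once after the enumeration finishes. This only helps if the unwanted characters really are CR, LF, or tab, which is worth checking first.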

New Here, Jan 19, 2017

I am not able to replace the new line with any other character.

I am comparing each word against \n and \r, but it seems the PDF uses a different character for new lines.

I never see \n or \r in the text when I compare.

Please tell me how I can strip them.

Thanks.

LEGEND, Jan 19, 2017

What are the hex values?
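
One way to find out is to dump every byte of the extracted text as hex and look at what sits where the line break appears. The sketch below is plain C++ with a hypothetical helper name, `toHex`; it assumes the text has already been collected into a `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical debugging helper: returns the bytes of data as
// space-separated two-digit uppercase hex values.
std::string toHex(const std::string& data) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char b = static_cast<unsigned char>(data[i]);
        if (i > 0) {
            out.push_back(' ');
        }
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}
```

For example, `toHex("A\r\n")` gives `41 0D 0A`, so a CR or LF in the text would show up as `0D` or `0A`. Any other value in that position points to a different character.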

New Here, Jan 19, 2017

Sorry, I did not understand. Which hex values do you mean?

New Here, Jan 19, 2017

Hello,

I did some debugging. The hex value is 0x85.

LEGEND, Jan 19, 2017

I think you mentioned an ellipsis (...) in another post? If so, consider that in the Windows Latin-1 encoding (Windows-1252), 0x85 is an ellipsis: a single character that stands for three dots. It is not a layout character.
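
If the 0x85 bytes do turn out to be ellipsis characters, they can be expanded to three ASCII dots rather than stripped. A minimal sketch follows, assuming the extracted text is single-byte Windows-1252; `expandEllipsis` is a hypothetical helper name, and the encoding that the callback actually delivers should be confirmed before relying on it.

```cpp
#include <string>

// Hypothetical helper: replaces each 0x85 byte with three ASCII dots.
// Assumption: the input is Windows-1252, where 0x85 is the ellipsis.
std::string expandEllipsis(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (static_cast<unsigned char>(c) == 0x85) {
            out += "...";
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

Under a multi-byte encoding such as UTF-8, a 0x85 byte can be part of a longer sequence, so this byte-wise replacement would not be safe there.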
