
How can I index PDFs that reside on a Unix server?

Participant, Apr 21, 2014

I was amazed last week to discover that our website doesn't support PDF searchability. Supposedly, we'll be getting it soon. But, when that time comes, I believe we'll need to index our PDFs. I know I can do that for a Windows server. But, how can I do it for a Unix server?

Thanks,

Peter


Correct answer: LEGEND, Apr 21, 2014 (the reply beginning "No, not at all", which appears in full further down the thread)

LEGEND, Apr 21, 2014

Is Google suitable?

Participant, Apr 21, 2014

I guess, but, don't the PDFs still have to be indexed, via Acrobat?

LEGEND, Apr 21, 2014

No, not at all. "Indexing" is a strange and inconsistently used term in this context. What I assume you really want is a search engine that includes PDF as well as HTML content. The term "indexing" can be taken to mean "taking all of the text in a file and adding it to the words available for searching".

Acrobat uses "indexing" to mean specifically creating a set of files used by Acrobat itself for searching multiple PDFs. But this is a local file thing, not a web thing. Indexing PDF files this way when they are going to be used on the web does no harm, but equally does no good.

There are two main approaches to web search engines.

1. Local. Software runs on the local machine (the web server) and reads the files that make up the web site. It makes some kind of file, perhaps called an index. Local software on the web server then uses this information to show you results.

2. Remote. Google is an obvious case of this. It visits web sites ("spiders" them) to read files, makes its own "index", and searches across sites.
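
To make the idea of an "index" concrete, here is a minimal sketch in Python of what a local search tool does in principle. It is not how any particular product works. The names `build_index` and `search` and the sample documents are made up for illustration, and the sketch assumes the text has already been pulled out of each PDF by some separate extraction step.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data. On a real site the text would come from the HTML
# pages and from text extracted out of each PDF (assumed to be done elsewhere).
documents = {
    "manual.pdf": "Installing the widget on a Unix server",
    "index.html": "Welcome to the widget site",
}
index = build_index(documents)
print(sorted(search(index, "widget")))
print(sorted(search(index, "unix server")))
```

Because PDF and HTML text land in the same index, one query covers both kinds of file.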

Google (and other search engines) are often thought of only as a way of searching the whole web. But they can also be restricted to a single site out of everything they have crawled, as in https://www.google.co.uk/search?q=indexing+Pdfs+site%3Aadobe.com
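
The link above is just an ordinary search with `site:adobe.com` added to the query. As a small sketch, a URL of that shape can be built in Python with the standard library. The function name `site_search_url` is made up for this example, and the base address is simply the one used in the link above.

```python
from urllib.parse import urlencode


def site_search_url(terms, site, base="https://www.google.co.uk/search"):
    """Build a search URL restricted to one site with the site: operator."""
    return base + "?" + urlencode({"q": f"{terms} site:{site}"})


print(site_search_url("indexing Pdfs", "adobe.com"))
```

Swapping in your own domain for `adobe.com` is a quick way to see whether anything from your site, PDFs included, is already showing up.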

An interesting variation of this, which was on sale a few years ago, is the "Google appliance" for use on private networks (intranets). It sat on the local network and did Google-like things, but only with the local network. That gave you a "local Google" for all the local machines.

One more thing: your customers aren't likely to find it satisfactory if they have to search PDFs and search HTMLs with a different engine.

Participant, Apr 21, 2014

Hmmm. You're probably right. When I search around for PDF indexing, I get pretty nebulous hits, especially with the latest Acrobat, XI.

I was specifically told, though, last week, that PDF indexing/searchability is NOT available on our web server. And, supposedly it will be in a new release. So, I'm kind of the PDF guru here, but, I do most of my work on Mac and Windows. And, years ago, I did do research on "indexing" PDFs. And, that, frankly, seems kind of funky in this day and age. Having to open up hundreds of PDFs to index them, or, to pile them all into one directory to index them.

So, are you suggesting that just googling alone should be able to find data in our PDFs? Meaning, we shouldn't really have to do anything on our end?

Thanks.

LEGEND, Apr 21, 2014

If Google visits your server and can find the PDFs, yes. Finding the PDFs is the important part: there has to be an HTML page linking to them.
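
One way to check the "has to be an HTML page linking to them" part is to list the PDF links a page actually contains. The following is a rough sketch using only Python's standard library. The class and function names, the sample HTML, and the example.com address are all made up for illustration, and a real spider does a good deal more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkFinder(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Compare the path only, so query strings and letter case do not matter.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF links found in one page of HTML."""
    finder = PdfLinkFinder(base_url)
    finder.feed(html)
    return finder.pdf_links


# Hypothetical page content and address, for illustration only.
sample_html = (
    '<p><a href="docs/guide.pdf">Guide</a> '
    '<a href="/about.html">About</a> '
    '<a href="/files/Report.PDF?v=2">Report</a></p>'
)
print(find_pdf_links(sample_html, "https://example.com/help/"))
```

A PDF that never turns up in a list like this, on any page of the site, is one a spider has no obvious route to.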

Participant, Apr 21, 2014

Yes. The HTML is all there. An editor was asking about this last week and the web guy said that we don't presently have searchability for PDFs, meaning searching inside the PDFs themselves. So, I don't know what they're planning on doing when they supposedly do have searchability.

New Here, Oct 27, 2014

I also took Mr. Bailey's question in this way - obviously search engines index the fact that there is "foo-bar.pdf" which has an anchor link in an HTML file.

The key and very important question is "Can search engines search for and index content inside a PDF file which is linked inside an HTML file?" I'm surprised the answer above is listed as the correct one. It's certainly accurate, but does not answer Mr. Bailey's (and now my) question.

LEGEND, Oct 28, 2014

Well, search engines aren't generic. You'd have to look at the features of each one. Features could include:

- no PDF searching

- PDF searching by spidering (like Google)

- PDF searching by reading lists of files (no need for HTML links)

- PDF searching by add-ons

Typically per-machine searching doesn't use spidering, which introduces delays and uncertainty, but it's the only game in town if you are Google.
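
To illustrate the "reading lists of files" option: a per-machine tool can simply walk the directory tree on the server, so it finds every PDF whether or not any HTML page links to it. This is a minimal sketch with Python's standard library. The function name `list_pdfs` is made up, and the starting directory is a placeholder for wherever the site's files actually live.

```python
from pathlib import Path


def list_pdfs(root):
    """Return every PDF file under root, found by walking the directory tree."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# "." is a placeholder; point this at the web server's document directory.
for pdf in list_pdfs("."):
    print(pdf)
```

There is no waiting for a spider to come round, which is the delay and uncertainty mentioned above.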

New Here, Oct 28, 2014

I appreciate your response, however it did not answer the question I asked, nor the one Mr. Bailey asked. Since I'm here, please allow me to point something out.

You are obviously knowledgeable. However, not everyone needs to know the entire spectrum of possibilities. If you read my question, I made it very specific. It's a cornerstone question, necessary before you can build to more advanced options. Most people asking this specific question are not looking to obtain specialized networking gear. They want to know if the contents of a PDF file they put on their website can be indexed by the search engine with 84% market share. After that, they start to look at more specialized scenarios. Why spend time researching advanced options before you know whether they are needed?


You can choose to respond, or not, but there's no value to me unless you begin at the beginning. Also, I put this on an Adobe forum in the hope that an Adobe employee could answer it. No offense, but MVP or not, I have no way to determine whether your answer is authoritative in this case.

LEGEND, Oct 28, 2014

This is a community forum, not a way to get support from Adobe. I didn't mention specialised networking gear in my last reply. I don't know what Unix search engine has 84% market share, but why not ask in their forum? It certainly won't be an Adobe product. I would have something to say involving an Adobe product if it were Windows, but clearly it isn't: the original poster is happy to know it can be done for Windows.

New Here, Oct 28, 2014

Sorry to bother you.
