Known Participant
August 9, 2011
Question

Indexing large PDFs


I'm using CFINDEX to feed Solr with PDF files.

While monitoring disk activity, I see that ColdFusion's CFINDEX reads only about 200 KB from disk while indexing a 1 MB PDF (102 pages).

I see similar patterns for other large PDF files.

I can use Solr to find anything on the first 38 pages, but for anything after that I get 0 results.

Are there any size limitations in CFINDEX? Is there anything I can tweak?

(I have already tried changing maxfieldsize in solrconfig.)

Any ideas?

ColdFusion 9.0.1 Standard edition.

    2 replies

    Known Participant
    August 19, 2011

    Do you need special permission to vote?  I just tried using the user name and password I have for the forum, and I can't get in.

    We plan to start a project soon where we will have hundreds of PDFs online for our SOP system. Several of them are over 50 pages, and we need them to be fully indexed.

    Thanks

    Known Participant
    August 19, 2011

    No, I don't think so. I use the same credentials as everywhere else at Adobe, but I'm not a frequent user.

    It's very good that you are voting for this; it brings some attention to the matter.

    Known Participant
    August 19, 2011

    I'd vote if I could. I get "Invalid credentials" when I try to use my Adobe login.

    Inspiring
    August 14, 2011

    Does this happen for any large PDF file, or just a specific one?  Perhaps if it's just the one, there's some sort of corruption or something "unexpected" at the boundary you're seeing?

    --

    Adam

    Known Participant
    August 15, 2011

    Adam, it happens for all PDF files, regardless of PDF version and how they were generated (e.g. Distiller, MS Word).

    (I have tried more than 10 large PDFs from different sources.)

    Inspiring
    August 18, 2011

    Sorry to take a while to get back to you: I've been a bit busy in the evenings this week.

    Um... yeah... I get the same thing.  It seems to only index the first 38 or so pages for me.

    I've knocked together some stand-alone code that replicates this, in case anyone else can test it too:

    <!--- createCollection.cfm --->

    <!--- Drop any existing collection first; ignore the error if it does not exist yet --->
    <cftry>
        <cfcollection action="delete" collection="scratch">
        <cfcatch>
        </cfcatch>
    </cftry>
    <cfcollection
        action        = "create"
        collection    = "scratch"
        path          = "#server.coldfusion.rootDir#\collections"
        engine        = "solr"
    >

    <!--- indexCollection.cfm --->

    <cfindex
        action        = "refresh"
        collection    = "scratch"
        key           = "#expandPath('.')#"
        type          = "path"
        extensions    = ".pdf"
    >

    <!--- createPdf.cfm --->

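    <!--- All three URL parameters are required: target file name, search term, and amount of padding text in kB --->
    <cfparam name="URL.file">
    <cfparam name="URL.text">
    <cfparam name="URL.size">
    <cfdocument format="PDF" filename="#expandPath('./')##URL.file#" overwrite="true">
        <cfset sPadding = "padding">
        <cfset iSize = 0>
        <!--- Length of one padding chunk: the word, a space, and a UUID --->
        <cfset iPaddingLen = len(sPadding) + 1 + len(createUuid())>
        <!--- Emit unique padding until roughly URL.size kB of text has been written --->
        <cfloop condition="true">
            <cfset sThisPadding = sPadding & " " & createUuid()>
            <cfoutput>#sThisPadding#</cfoutput>
            <cfset iSize += iPaddingLen>
            <cfif iSize GT URL.size * 1024>
                <cfbreak>
            </cfif>
        </cfloop>
        <!--- The search term goes last, so it lands at the very end of the PDF --->
        <cfoutput>#URL.text#</cfoutput>
    </cfdocument>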

    <!--- search.cfm --->

    <cfparam name="URL.search">
    <cfsearch collection="scratch" name="q" criteria="#URL.search#">
    <cfdump var="#q#">

    Save all those into a directory, then run:

    createCollection.cfm

    createPdf.cfm?size=75&file=large1.pdf&text=locate

    createPdf.cfm?size=85&file=large2.pdf&text=locate

    indexCollection.cfm

    search.cfm?search=locate

    The PDFs created are only around 120-130kB apiece, but are 34 and 39 pages respectively.  Neither in size nor in length are they very big.

    I only get a match in large1.pdf

    If I peg large2.pdf back to size=83, its page count drops back to within 38, and it starts coming back in the search results too.
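
    For anyone repeating the test, the page counts can be checked from ColdFusion itself rather than by opening each file. This is only a rough sketch: checkPages.cfm is a hypothetical extra file that is not part of the test rig above, and it assumes the two PDFs generated by the steps above sit in the same directory.

    ```cfml
    <!--- checkPages.cfm (hypothetical helper, not part of the original test rig) --->
    <!--- Assumes large1.pdf and large2.pdf were created in this directory by createPdf.cfm --->
    <cfloop list="large1.pdf,large2.pdf" index="sFile">
        <!--- getInfo returns a struct of PDF metadata, including the page count --->
        <cfpdf action="getInfo" source="#expandPath('./')##sFile#" name="stInfo">
        <cfoutput>#sFile#: #stInfo.TotalPages# pages<br></cfoutput>
    </cfloop>
    ```

    Running it after each createPdf.cfm call makes it easy to see which size values tip a file over the 38-page mark.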

    I dunno if this is a limitation of the dev edition of CF, or it's a fairly horrible bug...

    Were you running on a dev edition, or a licensed one?

    --

    Adam