May 13, 2014
Question

KeywordTokenizerFactory splits the string at the exclamation mark


Hi All

I have the following field settings in my Solr schema:

<field name="Exact_Word" omitPositions="true" termVectors="false" omitTermFreqAndPositions="true" compressed="true" type="string_ci" multiValued="false" indexed="true" stored="true" required="false" omitNorms="true"/>

<field name="Word" compressed="true" type="email_text_ptn" multiValued="false" indexed="true" stored="true" required="false" omitNorms="true"/>

<fieldtype name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true"><analyzer><tokenizer class="solr.KeywordTokenizerFactory"/><filter class="solr.LowerCaseFilterFactory"/></analyzer></fieldtype>

<copyField source="Word" dest="Exact_Word"/>

As you can see, Exact_Word uses KeywordTokenizerFactory, which should treat the string as it is.

The following is my responseHeader. As you can see, I am searching for my string only in the field Exact_Word and expecting it to return the Word field and the score:

"responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "explainOther":"",
      "fl":"Word,score",
      "debugQuery":"on",
      "indent":"on",
      "start":"0",
      "q":"d!sdasdsdwasd!asd@dsadsadas.edu",
      "qf":"Exact_Word",
      "wt":"json",
      "fq":"",
      "version":"2.2",
      "rows":"10"}},

But when I enter an email address such as "d!sdasdsdwasdasd@dsadsadas.edu", the string is split in two. I was under the impression that KeywordTokenizerFactory would treat the string as it is.

The following is the query debug result, which shows that the string has been split:

"parsedquery":"+((DisjunctionMaxQuery((Exact_Word:d)) -DisjunctionMaxQuery((Exact_Word:sdasdsdwasdasd@dsadsadas.edu)))~1)",

Can someone please tell me why it produces this parsed query?

If I use a string without the "!" character, as below, the parsed query is:

"parsedquery":"+DisjunctionMaxQuery((Exact_Word:d_sdasdsdwasd_asd@dsadsadas.edu))",

This is what I expected Solr to produce even with the "!" character. With the "_" character it does not split the string and treats it as it is.

I thought that if KeywordTokenizerFactory is applied, the exact string should be kept as it is.

Please help me understand what is going wrong here.
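
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the split may happen in the query parser, before the field's analyzer ever sees the text, because "!" is a negation operator in the Lucene/Solr query syntax. The "-" in front of the second clause of the parsed query is consistent with that reading. If so, escaping the special characters on the client side before sending the query might avoid the split. The sketch below is illustrative only; `escape_query_chars` and `SPECIAL_CHARS` are hypothetical names, and the character list is an assumption based on the commonly documented query-syntax characters.

```python
# Characters assumed to carry special meaning in the Lucene/Solr query syntax.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text: str) -> str:
    """Backslash-escape query-syntax characters and whitespace (hypothetical helper)."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )


# The "!" is escaped, so a query parser should no longer read it as an operator.
print(escape_query_chars("d!sdasdsdwasdasd@dsadsadas.edu"))
# The "_" variant contains nothing to escape and passes through unchanged.
print(escape_query_chars("d_sdasdsdwasd_asd@dsadsadas.edu"))
```

The escaped string would then be sent as the value of the q parameter. Whether this resolves the behaviour depends on the query parser actually in use, which the settings above do not show.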
