Saturday, March 19, 2016

Endeca Basics: How to Add or Delete a Word in the Stemming Dictionary

Oracle Endeca provides a stemming dictionary for each supported language. These dictionaries are part of the MDEX Engine installation and can be found under the \MDEX\<<version>>\conf\stemming directory.

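Why would you need to delete a word from the stemming dictionary? Here is one use case:

A search for "short" also returns results for "shorts", and vice versa, because stemming treats the two words as forms of the same word. If that behavior is not wanted, the entry can be deleted from the stemming dictionary.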

There are two ways to add or delete a word in the stemming dictionary:

1. Open the language-specific file under the \MDEX\<<version>>\conf\stemming folder and edit it directly. This is not the recommended approach.

2. Modify the default stemming dictionaries by running Dgidx with the --stemming-updates flag and specifying an XML file that contains the updates to the dictionary that you want to make. The update file can include both additions and deletions. Dgidx processes the file by adding and deleting entries in the static stemming dictionary file.

Adding entries to a static stemming dictionary

Each entry in a static stemming dictionary is defined by an <ADD_WORD_FORMS> element and its sub-element <WORD_FORMS_COLLECTION>. For example, the following entry adds apple and its plural form apples to the static stemming dictionary:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <ADD_WORD_FORMS>
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>apple</WORD_FORM>
        <WORD_FORM>apples</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
  </ADD_WORD_FORMS>
</WORD_FORMS_COLLECTION_UPDATES>


Deleting entries from a static stemming dictionary

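To delete an entry, list the word in a <WORD_FORM> element inside a <REMOVE_WORD_FORMS_KEYS> element. For example, the following XML removes aalborg from the static stemming dictionary:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <REMOVE_WORD_FORMS_KEYS>
    <WORD_FORM>aalborg</WORD_FORM>
  </REMOVE_WORD_FORMS_KEYS>
</WORD_FORMS_COLLECTION_UPDATES>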


Combining deletes and adds

You can also combine deletes and adds in a single update file. Deletes are processed before adds. For example, the following XML removes aachen and then adds it back along with several stemmed variants:
<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <REMOVE_WORD_FORMS_KEYS>
    <WORD_FORM>aachen</WORD_FORM>
  </REMOVE_WORD_FORMS_KEYS>
  <ADD_WORD_FORMS>
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>aachen</WORD_FORM>
        <WORD_FORM>aachens</WORD_FORM>
        <WORD_FORM>aachenes</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
  </ADD_WORD_FORMS>
</WORD_FORMS_COLLECTION_UPDATES>
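
Hand-written update files make it easy to mismatch or misplace a tag. As a minimal sketch, the Python helper below generates an update file with the same structure as the examples above. The function name build_stemming_updates and its parameters are hypothetical and not part of any Endeca tooling. Putting several <WORD_FORMS> groups inside one <WORD_FORMS_COLLECTION> is an assumption based on the examples and has not been checked against the DTD.

```python
from xml.sax.saxutils import escape

DOCTYPE = ('<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM '
           '"word_forms_collection_updates.dtd">')


def build_stemming_updates(removes, adds):
    """Return stemming-update XML as a string (hypothetical helper).

    removes: iterable of words to delete, e.g. ["aachen"]
    adds:    iterable of word-form groups, e.g. [["apple", "apples"]]
    """
    lines = [DOCTYPE, "<WORD_FORMS_COLLECTION_UPDATES>"]

    # Removes are written first, matching the order used in the examples.
    removes = list(removes)
    if removes:
        lines.append("  <REMOVE_WORD_FORMS_KEYS>")
        lines.extend(f"    <WORD_FORM>{escape(w)}</WORD_FORM>" for w in removes)
        lines.append("  </REMOVE_WORD_FORMS_KEYS>")

    # Each group of related word forms becomes one <WORD_FORMS> element
    # (assumption: one collection may hold several groups).
    groups = [list(g) for g in adds]
    if groups:
        lines.append("  <ADD_WORD_FORMS>")
        lines.append("    <WORD_FORMS_COLLECTION>")
        for group in groups:
            lines.append("      <WORD_FORMS>")
            lines.extend(f"        <WORD_FORM>{escape(w)}</WORD_FORM>" for w in group)
            lines.append("      </WORD_FORMS>")
        lines.append("    </WORD_FORMS_COLLECTION>")
        lines.append("  </ADD_WORD_FORMS>")

    lines.append("</WORD_FORMS_COLLECTION_UPDATES>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Rebuilds the combined example: remove aachen, then add it with variants.
    print(build_stemming_updates(["aachen"], [["aachen", "aachens", "aachenes"]]))
```

The returned string can be written to a file such as myAppStemmingChanges.en.xml and then passed to Dgidx.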


Processing the update file

To process the stemming update file, run Dgidx with the --stemming-updates flag and specify the XML file that contains the stemming updates.
For example:
dgidx --stemming-updates myAppStemmingChanges.en.xml
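
Before running Dgidx, it can help to confirm that the update file is well-formed and contains the changes you expect. The Python sketch below parses the text of an update file and counts its removes and adds. The function name count_stemming_updates is hypothetical, and counting one add per <WORD_FORMS> group is an assumption, not documented Dgidx behavior. The sketch only checks well-formedness and does not validate against word_forms_collection_updates.dtd.

```python
import xml.etree.ElementTree as ET


def count_stemming_updates(xml_text):
    """Return (removes, adds) found in stemming-update XML text.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed,
    and ValueError if the root element is not the expected one.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "WORD_FORMS_COLLECTION_UPDATES":
        raise ValueError(f"unexpected root element: {root.tag}")

    # One remove per <WORD_FORM> listed under <REMOVE_WORD_FORMS_KEYS>.
    removes = len(root.findall("REMOVE_WORD_FORMS_KEYS/WORD_FORM"))
    # One add per <WORD_FORMS> group (assumption about how adds are counted).
    adds = len(root.findall("ADD_WORD_FORMS/WORD_FORMS_COLLECTION/WORD_FORMS"))
    return removes, adds


if __name__ == "__main__":
    sample = """<!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
<WORD_FORMS_COLLECTION_UPDATES>
  <ADD_WORD_FORMS>
    <WORD_FORMS_COLLECTION>
      <WORD_FORMS>
        <WORD_FORM>apple</WORD_FORM>
        <WORD_FORM>apples</WORD_FORM>
      </WORD_FORMS>
    </WORD_FORMS_COLLECTION>
  </ADD_WORD_FORMS>
</WORD_FORMS_COLLECTION_UPDATES>"""
    print(count_stemming_updates(sample))
```

If the counts differ from what you intended, fix the update file before indexing.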


8 comments

vishal said...

HI Ajay,

First of all, thanks for writing this blog. I added a new stemming entry as described above. I also added a Dgidx arg in DataIngest.xml and ran the two commands below:
- runcommand.bat Dgidx run
- runcommand.bat DistributeIndexAndApply run

I could see from the dgidx.log that the entry has been added,
XMLParser: Done reading word form updates from file "D:\Endeca\MDEX\6.5.2\bin\TrimarkStemmingChanges.en.xml". There are 0 removes and 1 adds.

I have made an entry for Smartphone and Smartphones. This is not available in default stemming file. I have a record with name Smartphone. When I search for Smartphone, I get a single result.
But when I search for Smartphones, I am not getting that result in jspref.

Please help.
Vishal

Ajay Agrawal said...

Thanks, Vishal, for referring to the blog. Can you share the XML and the entry you made in dgidx.xml?

Thanks,
Ajay Agrawal

Unknown said...

Hi Ajay,

Is stemming enabled by default when Endeca is installed? In production, stemming seems to be enabled (book and books return the same result count). In a lower environment (staging), however, it does not seem to be enabled (book and books return different result counts). Can you tell us how to check whether stemming is enabled or disabled in a particular environment (in any file or flag)?
We don't have Dev Studio.

Thank you
Mukesh Singh

Ajay Agrawal said...

Thanks, Mukesh, for referring to my blog.

You can find <>.stemming.xml under <>/config/mdex.

I hope this will help.

Thanks,
Ajay Agrawal

Omk@r Patil said...

Hi Ajay,
I want singular and plural forms to give the same results. I have made changes in stemming.xml for this, but I am still getting different results. The changes in stemming.xml are mentioned below.







Ajay Agrawal said...

Did you run indexing again after making the entry?

Omk@r Patil said...

Yes, I did.

Raj said...

Hi Ajay and Omkar Patil,

Please help me. I would like to know the fix for the following query:

I want singular and plural forms to give the same results. I have made changes in stemming.xml for this, but I am still getting different results. The changes in stemming.xml are mentioned below.
