Sunday, April 3, 2016

Oracle Endeca 11.x: How Is the Last-Mile Crawl Used?

The Endeca baseline update process invokes the last-mile crawl to create Dgidx-compatible data and passes it to Dgidx, which generates the binary files for the MDEX Engine.

The last-mile crawl performs the following operations:

  • Merges the index-config.json settings from the system user and the ATG user.
  • Merges the multiple record stores defined in the crawl configuration XML file.
  • Processes product/article/store records. Data can be manipulated with custom manipulators if required.
  • Writes the schema and records in the MDEX-compatible format.

How the Last-Mile Crawl Gets Created

An Endeca application instance must first be created using the deployment template. The following command then creates the last-mile crawl:

<<Endeca_App>>/control/initialize_services.sh
In turn, initialize_services.sh runs the following command:
${CAS_ROOT}/bin/cas-cmd.sh createCrawls -h ${CAS_HOST} -p ${CAS_PORT} -f ${WORKING_DIR}/../config/cas/last-mile-crawl.xml 

Features

1.  Record Store Joins
The file <<Endeca_App>>/config/cas/last-mile-crawl.xml sets up the CAS crawl with the names of the CAS record stores:
<moduleProperties>
<moduleProperty>
<key>dataRecordStores</key>
<value>CRS-data</value>
<value>CRS-External-data</value>
</moduleProperty>
<moduleProperty>
<key>dimensionValueRecordStores</key>
<value>CRS-dimvals</value>
</moduleProperty>
</moduleProperties> 

As the XML snippet above shows, multiple record stores can be listed for further processing. CAS-based indexing supports only a switch join between multiple record stores.
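To illustrate what a switch join means, here is a minimal Java sketch. It assumes the usual Endeca meaning of the term: the output is a union of all records from all sources, and records are never compared or combined. The class SwitchJoinSketch, the map-based record representation, and the sample property values are hypothetical stand-ins for illustration only; this is not the CAS API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; CAS performs this step internally.
final class SwitchJoinSketch {

    // Appends every record of every store to the output, in order.
    // Records are passed through untouched: no lookup by key, no merging.
    static List<Map<String, String>> switchJoin(List<List<Map<String, String>>> recordStores) {
        List<Map<String, String>> out = new ArrayList<>();
        for (List<Map<String, String>> store : recordStores) {
            out.addAll(store);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-ins for the CRS-data and CRS-External-data record stores (sample values are made up).
        List<Map<String, String>> crsData = List.of(
                Map.of("record.id", "prod-1", "name", "Shirt"),
                Map.of("record.id", "prod-2", "name", "Hat"));
        List<Map<String, String>> crsExternalData = List.of(
                Map.of("record.id", "article-1", "title", "Care guide"));

        for (Map<String, String> record : switchJoin(List.of(crsData, crsExternalData))) {
            System.out.println(record.get("record.id"));
        }
    }
}
```

Under this assumption, the number of output records is simply the sum of the record counts of the listed stores.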

2. Add Manipulators
Java manipulators can be added to the last-mile crawl when data manipulation is required, e.g. to remove commas and create multi-valued properties. The configuration below registers the sample SubstringManipulator:
<manipulatorConfig>
<moduleId>
<id>com.endeca.cas.extension.sample.manipulator.substring.SubstringManipulator</id>
</moduleId>
<moduleProperties>
<moduleProperty>
<key>sourceProperty</key>
<value>Endeca.Document.Text</value>
</moduleProperty>
<moduleProperty>
<key>targetProperty</key>
<value>Short.Truncated.Text</value>
</moduleProperty>
<moduleProperty>
<key>length</key>
<value>20</value>
</moduleProperty>
</moduleProperties>
<id>Create short truncated text property</id>
</manipulatorConfig>
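The comma-removal use case mentioned above could look roughly like the following Java sketch. It only shows the transformation a custom manipulator would apply to one record; it deliberately does not use the CAS extension API. The class CommaSplitSketch, the map-of-lists record model, and the property name "color" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a real manipulator would be written against the CAS extension API.
final class CommaSplitSketch {

    // Replaces each comma-separated value of the given property with
    // one value per item, trimming whitespace and dropping empty items.
    static void splitOnComma(Map<String, List<String>> record, String property) {
        List<String> current = record.get(property);
        if (current == null) {
            return; // property absent: leave the record unchanged
        }
        List<String> split = new ArrayList<>();
        for (String raw : current) {
            for (String part : raw.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    split.add(trimmed);
                }
            }
        }
        record.put(property, split);
    }

    public static void main(String[] args) {
        Map<String, List<String>> record = new HashMap<>();
        record.put("color", new ArrayList<>(List.of("red, green,blue")));
        splitOnComma(record, "color");
        System.out.println(record.get("color"));
    }
}
```

The single source value becomes three separate values of the same property, which is the shape a multi-valued property needs.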



3. Merges index-config
initialize_services.sh runs the following command to update the Endeca Configuration Repository with the properties and dimensions defined in index-config.json:
"${WORKING_DIR}/index_config_cmd.sh" set-config -f "${WORKING_DIR}/../config/index_config/index-config.json" -o all


It's Your Turn

Was this blog helpful for you? What do you think about this post? Are there any other topics you would like covered in detail?

Please leave your comments or responses below.

8 comments

  1. This comment has been removed by the author.

    Replies
    1. I will review the articles one by one and take the necessary action immediately. Thanks for notifying me.

    2. This comment has been removed by the author.

    3. Sure. I have deleted all the posts and images related to Oracle documentation and support. Let me know if you still see anything and I will remove it as well. I also emailed copyright_us@oracle.com last week to apologize for this incident.

  2. Hi Ajay, I have one question. I am creating record stores from different sources. The -data record store has record.id, but the new one does not; it has Endeca.Id instead. The two are not getting merged. Can you please help me join these two record stores?

    Replies
    1. Hi Sumit, all records must have record.id so that the records in all record stores can be joined.

      Thanks,
      Ajay Agrawal

    2. Yes, correct, but the issue is that I used a crawl of type Endeca Record File, and whenever I try to explicitly set record.id in the configuration, it throws an error saying it expected Endeca.Id but found record.id.
      I tried using a modifier manipulator to add record.id as a new property, but I am not sure whether that is the right thing to do.

      Apart from that, how can I ensure that my records are getting indexed?

      Regards,
      Sumit Saurabh

    3. Adding a modifying script manipulator to the crawl and adding a new property named record.id resolved my issue.

      idPropertyValue = record.getPropertySingleValue("Endeca.Id");
      record.addPropertyValue(new PropertyValue("record.id", idPropertyValue.value));
      logger.info("Processed Record:" + idPropertyValue.value);

      Thanks

