T125500 [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch

This task is for adding labels & descriptions into ElasticSearch and enabling prefix search to use them. Current progress plan:

  • Implement the code for creating & indexing label fields
  • Reindex testwiki and check that index looks sane
  • Reindex wikidata and check that index looks sane (T162292)
  • Add code that allows wbsearchentities to use the search engine depending on a query flag
  • Set up a test page comparing the two searches and make an announcement on the list to gather user feedback.
  • Collect feedback and bikeshed about search profiles, weights and result ranking, hopefully arriving at a workable weights profile. (T172467)
  • Set up the config above in production and enable CirrusSearch on wbsearchentities by default (T175741)
  • Discuss & resolve question of how to display entity & title search together
  • Refactor the code further so that opensearch and other code using completionSearch() can use the same code as wbsearchentities, in service of the results of the discussion above.
  • Implement the GUI part of the two items above.
  • Figure out how to properly index descriptions (T176903)
  • Make code to allow fulltext search to use entity search when appropriate (T178851).
  • Enable Cirrus searching for Special:Search

More detailed plan: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search

Event Timeline

Issues found while testing on http://elastic-wikidata.wmflabs.org:

  1. A minor display issue: when an alias is matched, it is correctly shown in addition to the preferred label. However, if a label in a fallback language is matched (e.g. a search in de-ch matches the en label), that matched label is not shown in addition to the preferred (perhaps de) label. Example: on http://elastic-wikidata.wmflabs.org/index.php/ElasticRepo:Main_Page?uselang=de-ch, enter "H." into the quick search box. This will suggest Q5628592, because the English label is "H. T. Kung", but it does not show that "H. T. Kung" was matched.
  2. Another issue Katie found when playing with elastic-wikidata.wmflabs.org: We'll need a way to boost exact matches vs. partial matches, or a way to boost label matches vs alias matches. Ideally, we would be able to tweak both. Example: "Poppy" brings up George H.W. Bush as the first match.
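A minimal sketch of the display behaviour that issue (1) asks for, assuming a simple fallback chain and a plain label map. The `FALLBACKS` table, the function names and the German label are hypothetical and only serve the illustration; this is not how Wikibase implements it.

```python
from typing import Dict, List, Optional

# Hypothetical fallback chain for illustration only; the real chains
# are defined by MediaWiki, not by this table.
FALLBACKS: Dict[str, List[str]] = {"de-ch": ["de-ch", "de", "en"]}


def preferred_label(labels: Dict[str, str], lang: str) -> Optional[str]:
    """Return the first label available along the fallback chain."""
    for code in FALLBACKS.get(lang, [lang, "en"]):
        if code in labels:
            return labels[code]
    return None


def display_text(labels: Dict[str, str], lang: str, matched: str) -> str:
    """Show the matched text next to the preferred label when they differ."""
    label = preferred_label(labels, lang)
    if label is None:
        return matched
    if label == matched:
        return label
    return f"{label} ({matched})"


# "Beispiel" is a made-up German label, not the real label of Q5628592.
labels = {"de": "Beispiel", "en": "H. T. Kung"}
print(display_text(labels, "de-ch", "H. T. Kung"))
```

With this approach a match on a fallback-language label is treated the same way as a match on an alias: the matched text is appended whenever it is not already the text being displayed.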

Regarding issue (2): if I read EntitySearchHelper correctly, the old system does not boost labels vs. aliases, it treats them exactly the same. It does however strongly prefer exact matches over partial (prefix) matches.
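To make the two knobs concrete (exact vs. prefix, label vs. alias), here is a rough sketch of a scoring function. The weights, entity IDs and data layout are invented for illustration and are not the profile that was eventually tuned in T172467.

```python
from typing import Dict, List

# Illustrative weights only. Setting label_exact == alias_exact and
# label_prefix == alias_prefix mimics the old behaviour described above.
WEIGHTS: Dict[str, float] = {
    "label_exact": 4.0,
    "alias_exact": 3.0,
    "label_prefix": 2.0,
    "alias_prefix": 1.0,
}


def score(query: str, label: str, aliases: List[str],
          weights: Dict[str, float] = WEIGHTS) -> float:
    """Return the best weight among all label/alias matches, 0.0 if none."""
    q = query.casefold()
    candidates = [("label", label)] + [("alias", a) for a in aliases]
    best = 0.0
    for kind, text in candidates:
        t = text.casefold()
        if t == q:
            best = max(best, weights[f"{kind}_exact"])
        elif t.startswith(q):
            best = max(best, weights[f"{kind}_prefix"])
    return best


def rank(query: str, entities: List[dict]) -> List[str]:
    """Return IDs of matching entities, best score first (stable on ties)."""
    scored = [(score(query, e["label"], e["aliases"]), e["id"]) for e in entities]
    return [eid for s, eid in sorted(scored, key=lambda p: -p[0]) if s > 0]


# Placeholder IDs; the aliases mirror the "Poppy" example from issue (2).
entities = [
    {"id": "Q_bush", "label": "George H. W. Bush", "aliases": ["Poppy"]},
    {"id": "Q_flower", "label": "poppy", "aliases": []},
]
print(rank("Poppy", entities))
```

With these weights the item whose label is exactly "poppy" is ranked ahead of the item that only has "Poppy" as an alias.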

The query seems to use a nonexistent field, labels_all.near_match; changing it to labels_all.near_match_folded will fix the highlighting issue.
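As a rough illustration, a request body that matches and highlights on the same field could look like the sketch below. Only the field name comes from the comment above; the helper name and the overall shape are assumptions, and the query Wikibase actually builds is presumably more involved.

```python
import json


def near_match_query(term: str,
                     field: str = "labels_all.near_match_folded") -> dict:
    """Build a minimal Elasticsearch body that matches and highlights one field."""
    return {
        "query": {"match": {field: term}},
        # Highlighting is requested for the same field that is queried, so a
        # field name that does not exist in the mapping yields no fragments.
        "highlight": {"fields": {field: {}}},
    }


body = near_match_query("H.")
print(json.dumps(body, indent=2))
```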

I plan to make a page that lets users search with both and compare the results. We did something like that with the completion suggester, and I think it shouldn't be too hard to modify it.

I've made a page for search comparison: http://elastic-wikidata.wmflabs.org/wb.html
(despite it being hosted on elastic-wikidata, the data comes from www.wikidata.org).
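For anyone who wants to script the same kind of comparison, here is a small sketch. It builds a wbsearchentities request URL and diffs two lists of result IDs. The helper names are made up, and the flag that selects the search backend is not named in this thread, so it is left out.

```python
from typing import Dict, List
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def search_url(term: str, language: str = "en", limit: int = 7) -> str:
    """Build a wbsearchentities request URL (no request is sent here)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def compare(old_ids: List[str], new_ids: List[str]) -> Dict[str, List[str]]:
    """Split two ranked ID lists into shared and backend-specific results."""
    return {
        "both": [i for i in old_ids if i in new_ids],
        "only_old": [i for i in old_ids if i not in new_ids],
        "only_new": [i for i in new_ids if i not in old_ids],
    }


print(search_url("Poppy"))
print(compare(["Q1", "Q2"], ["Q2", "Q3"]))
```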

Note that descriptions in the Elastic search are currently broken due to T162292. As soon as that is done, descriptions should be fine.

Thanks! I've run some first tests and I think improvements are still needed for abbreviations and disambiguation pages.

debt renamed this task from Index Wikidata labels and descriptions as separate fields in ElasticSearch to [Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch. Aug 1 2017, 5:19 PM

I've moved the "combined search" part of this ticket to T190454, and I think the rest of it is done, so I'll resolve it soon unless there are objections.

I am resolving this, as the "combined search" part now has its own task.
