Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Deskana | T143585 Initial BM25 A/B Test
Resolved | | mpopov | T143589 Analyze results of BM25 AB test
Event Timeline
Initialized the repo on GH: https://github.com/wikimedia-research/Discovery-Search-Test-BM25
The plan is to test 5 buckets:
- bm25:control
Identical to what we serve to our users today, plus an artificial latency to compensate for the fact that we run the other buckets on another datacenter.
Discernatron nDCG@5 score: 0.2772
- bm25:allfield
Here we use the same query builder as bm25:control, but we switched the similarity function to BM25 and use a weighted sum for the incoming-links query-independent factor. We expect this bucket to behave poorly in terms of click-through compared to the control group.
We test this to confirm our assumption that the current query builder and the allfield approach are not designed for the BM25 similarity.
Discernatron nDCG@5 score: 0.2689
- bm25:inclinks
Here we switch to a per-field query builder using only incoming links as a query-independent factor. This is the best contender according to Discernatron. We expect an increase in click-through because it tends to rank obvious matches in the top 3.
Discernatron nDCG@5 score: 0.3362
- bm25:inclinks_pv
Similar to bm25:inclinks, but with pageviews added as an additional query-independent factor; the weight for pageviews is still very low compared to incoming links. This test is mostly to see how pageviews could affect the ranking. We expect a very minimal difference in behavior compared to bm25:inclinks.
Discernatron nDCG@5 score: 0.3359
- bm25:inclinks_pv_reverse
Similar to bm25:inclinks_pv, with an additional field to catch typos in the first 2 characters. Today the "did you mean" suggestion engine is unable to suggest a fix for the query "qlbert einstein". We expect a slight decrease in the zero results rate and hopefully an increase in click-through rate. This test is added to measure the benefit of such a field; "did you mean" suggestions are not great, and the question here is: will this increase noise and provide more annoying suggestions, or will it help our users?
Discernatron nDCG@5 score: 0.3359 (this feature can't really be tested with Discernatron today)
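As a rough illustration of why a reversed copy of the text can help here (a sketch of the general idea, not the actual CirrusSearch suggester configuration): "did you mean" suggesters typically require the first character or two of a term to match, so a typo at the very start of a word defeats them; reversing the term moves that typo to the end, where the prefix still matches.

```python
# Sketch only: shows why a leading typo defeats a prefix-anchored suggester
# on the forward field but not on a reversed copy of the field.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

typed, target = "qlbert", "albert"

# Forward: the very first character differs, so a suggester that requires a
# matching prefix never considers "albert" as a correction.
print(common_prefix_len(typed, target))              # 0

# Reversed: the typo is now at the end of the term, leaving a long shared
# prefix that the suggester can anchor on before applying edit-distance.
print(common_prefix_len(typed[::-1], target[::-1]))  # 5
```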
Overall we should see a slight decrease in the zero results rate (ZRR) for buckets 3/4/5 because of the new query builder. ZRR should be almost identical between buckets 1 and 2; if that's not the case, it's either a sampling issue or an inconsistency between the two clusters.
We hope to see an increase in click-through for buckets 3/4/5 due to the per-field scoring approach; if that's not the case, it probably means that the tuning done with Discernatron is not appropriate when applied to real-world usage.
Finally, we'd like to confirm that we can trust the nDCG scores as a measure for offline testing by seeing the same variation in click-through rate between buckets.
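For reference, a minimal sketch of how an nDCG@5 score like the ones quoted above can be computed from graded relevance judgments (the grades below are made up, and this uses one common linear-gain formulation; Discernatron's exact grading scale and scoring script may differ):

```python
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k results (log2 discount)."""
    return sum(rel / math.log2(rank + 2)           # rank 0 -> discount log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of an ideally ordered result list."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up relevance grades for the top results returned for one query:
print(round(ndcg_at_k([3, 0, 2, 1, 0, 2], k=5), 4))
```

The per-bucket numbers above would then be averages of this per-query score over the Discernatron query set.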
Detailed features
bucket | cluster | similarity | builder | QI factors | QI method | boost templates | title+redirects ngrams | DYM reverse field
---|---|---|---|---|---|---|---|---
bm25:control | eqiad | Lucene tf/idf | QueryString allfield | incoming links | (similarity+phraseboost)*log10(qi+2) | yes | no | no
bm25:allfield | codfw | BM25 | QueryString allfield | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | no | no
bm25:inclinks | codfw | BM25 | per field | incoming links | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no
bm25:inclinks_pv | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | no
bm25:inclinks_pv_rev | codfw | BM25 | per field | incoming links, pageviews | (similarity+phraseboost) + ∑(weight*satu(qi factor)) | no | yes | yes
satu is: satu(value) = valueᵃ / (valueᵃ + kᵃ)
a and k are constants:
bucket | inclinks weight | inclinks k | inclinks a | pageviews weight | pageviews k | pageviews a
---|---|---|---|---|---|---
bm25:control | N/A | N/A | N/A | N/A | N/A | N/A
bm25:allfield | 1.3 | 30 | 0.7 | 0 | N/A | N/A
bm25:inclinks | 6.5 | 30 | 0.7 | 0 | N/A | N/A
bm25:inclinks_pv | 5.0 | 30 | 0.7 | 1.5 | 8e-6 | 0.8
bm25:inclinks_pv_rev | 5.0 | 30 | 0.7 | 1.5 | 8e-6 | 0.8
Pageviews are stored in the index as weekly pageviews / sum(pageviews for the project); the values are very low, which is why the pageviews k is so low.
The weights cannot be compared with each other if the query builder is different: bm25:allfield works on a single field while bm25:inclinks is a per-field approach, thus scores from text features are higher.
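A minimal sketch of how the two QI methods from the tables combine with the text score (here text_score stands for similarity+phraseboost; this is an approximation of the rescore formulas, not the exact CirrusSearch implementation):

```python
import math

def satu(value, k, a):
    """Saturation function: value^a / (value^a + k^a)."""
    return value ** a / (value ** a + k ** a)

def control_score(text_score, incoming_links):
    # bm25:control: (similarity + phraseboost) * log10(qi + 2)
    return text_score * math.log10(incoming_links + 2)

def bm25_score(text_score, incoming_links, pageviews=0.0,
               il_weight=5.0, il_k=30, il_a=0.7,
               pv_weight=1.5, pv_k=8e-6, pv_a=0.8):
    # BM25 buckets: (similarity + phraseboost) + sum(weight * satu(qi factor))
    boost = il_weight * satu(incoming_links, il_k, il_a)
    if pageviews:
        boost += pv_weight * satu(pageviews, pv_k, pv_a)
    return text_score + boost

# Example: a page with 500 incoming links and 2e-5 of the project's weekly pageviews.
print(control_score(10.0, 500))
print(bm25_score(10.0, 500, 2e-5))   # defaults match the bm25:inclinks_pv constants
```

The multiplicative log10 boost scales with the text score, while the additive saturated boost caps out near the configured weight, which is part of why the weights are not directly comparable across builders.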
After I mentioned that I've been using string edit distance and hierarchical clustering to group queries together, Erik suggested that I also look at search results in CirrusSearchRequestSet in Hive. The idea is that queries that share a result are probably reformulations.
To that end, I've imported TSS2 data into Hive and have the following JOIN: https://phabricator.wikimedia.org/P4095 (Noting this for future reference.)
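For context, a toy sketch of the edit-distance + hierarchical-clustering grouping mentioned above (the query strings are made up and this is not the actual analysis code):

```python
# Group near-duplicate queries by Levenshtein distance + average-linkage clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

queries = ["albert einstein", "qlbert einstein", "albert einstien", "special relativity"]
n = len(queries)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = edit_distance(queries[i], queries[j])

# Cut the dendrogram at distance 3 so near-duplicate reformulations group together.
groups = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="distance")
print(dict(zip(queries, groups)))
```

The shared-result signal from CirrusSearchRequestSet would then be a second, complementary way to connect reformulations that edit distance misses.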
First draft up: https://wikimedia-research.github.io/Discovery-Search-Test-BM25/
@TJones, @EBernhardson, @chelsyx, @debt: Could you please review? (After the weekend, of course!)
Second draft up on https://wikimedia-research.github.io/Discovery-Search-Test-BM25/ (thanks @TJones for the color-coding-in-text idea!)
(Trying something new this time. I put all the feedback into a - [x] task-list format. Changes between 1st and 2nd drafts are documented in https://github.com/wikimedia-research/Discovery-Search-Test-BM25/blob/master/docs/CHANGELOG.md)
@dcausse Could you please explain the difference between the all-field query builder and the per-field query builder?
@mpopov sure,
The allfield approach combines raw term frequency and field weights at index time: it creates an artificial field where the content of each field is copied n times:
- A word in the title is copied 20 times
- A word in a redirect, for example, is copied 15 times
At query time we search this single weighted "all" field.
The per-field builder approach combines the scores of the individual fields at query time.
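A toy sketch of the difference (field names and weights are illustrative, and real scoring uses the full similarity function rather than raw term frequency; this is not the actual CirrusSearch code):

```python
doc = {"title": "Albert Einstein", "text": "Albert Einstein was a physicist"}
weights = {"title": 20, "text": 1}

def tf(term, text):
    return text.lower().split().count(term.lower())

# allfield: build one artificial field at index time by copying each field
# `weight` times, then score the query against that single field.
all_field = " ".join(" ".join([content] * weights[field])
                     for field, content in doc.items())
allfield_score = tf("einstein", all_field)

# per field: score each field separately at query time and combine the
# weighted per-field scores.
per_field_score = sum(weights[field] * tf("einstein", content)
                      for field, content in doc.items())

print(allfield_score, per_field_score)   # 21 21 -- equivalent with raw tf

# With a saturating term-frequency component (as in BM25), copying a field
# n times at index time no longer multiplies its contribution by n, so the
# two approaches diverge:
def bm25_tf(freq, k1=1.2):
    return freq * (k1 + 1) / (freq + k1)   # length normalization omitted

allfield_bm25 = bm25_tf(tf("einstein", all_field))
per_field_bm25 = sum(weights[f] * bm25_tf(tf("einstein", c)) for f, c in doc.items())
print(round(allfield_bm25, 2), per_field_bm25)   # ~2.08 vs 21.0
```

This is why the allfield builder, which relies on index-time copying, is expected to behave poorly once the similarity saturates term frequency.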
Comments on the report:
BM25 is a similarity function of the TF-IDF family; it'd be more exact to state that we ran a test to measure the difference between Okapi BM25 and the Lucene classic similarity.
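To make the comparison concrete, a rough sketch of the two per-term scoring formulas (simplified: boosts, coord, query norms and field norms are left out; constants are the usual Lucene defaults):

```python
import math

def classic_tfidf(freq, doc_freq, num_docs):
    """Lucene classic similarity (simplified): sqrt(tf) * idf^2."""
    idf = 1 + math.log(num_docs / (doc_freq + 1))
    return math.sqrt(freq) * idf ** 2

def okapi_bm25(freq, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 (simplified): idf * saturated, length-normalized tf."""
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))  # Lucene's non-negative idf variant
    norm_tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term occurring 1x vs 10x in an average-length document: classic keeps
# growing with sqrt(tf), while BM25's tf component saturates.
print(classic_tfidf(1, 1000, 1_000_000), classic_tfidf(10, 1000, 1_000_000))
print(okapi_bm25(1, 1000, 1_000_000, 500, 500), okapi_bm25(10, 1000, 1_000_000, 500, 500))
```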
Excellent use of color—even beyond what I was suggesting. Love it!
One minor technical glitch: under Background | PaulScore, the link to "PaulScore Definition" doesn't work. I tried it in Chrome, Safari, and Firefox. The TOC link works, though, which is weird.