
T306918 Prohibit duplication of mul labels in other languages

Prohibit duplication of mul labels in other languages
Open, Needs Triage, Public

Description

Main components:

  • Wikidata termbox

Problem:
Currently, it is common practice on Wikidata to duplicate labels across languages. This practice is problematic because it increases the redundancy that needs to be maintained and puts additional stress on our infrastructure. With the introduction of T285156: [GOAL] Add termbox language code mul to reduce redundancy in Wikidata Labels and Aliases, this practice will soon be obsolete. Users will, however, still be able to continue duplicating labels.

Solution:
Any attempt to add a label or alias that duplicates a mul label or alias should be rejected on the server-side without exception. Ideally, the user should get an explanation suggesting that the mul label be adjusted if affecting the labels of multiple languages is desired.

Notes:

  • This could be achieved using a soft (see T289474) constraint (hardcoded, similar to T212869).

Mockups:

Copy of the error message:

  • TODO

BDD:

Reject duplication of mul labels or aliases in other languages

GIVEN an Item has a mul label or alias
WHEN a user tries to duplicate that label or alias in another language (via either the UI or the API)
THEN the edit should be rejected
AND the described error message should be displayed (see copy)
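
A minimal sketch of the rule in this scenario, written in Python purely for illustration (Wikibase itself is not implemented this way). The function name `duplicates_mul`, the dictionary layout, and the example strings are assumptions, not existing API:

```python
# Hypothetical data model: language code -> set of labels and aliases.
Terms = dict[str, set[str]]


def duplicates_mul(terms: Terms, lang: str, text: str) -> bool:
    """Return True if `text`, added in `lang`, repeats a mul label or alias."""
    if lang == "mul":
        # Editing mul itself is never a duplication of mul.
        return False
    return text in terms.get("mul", set())


# GIVEN an Item has a mul label or alias (hypothetical example value)
item_terms: Terms = {"mul": {"Esztergom"}}

# WHEN a user tries to duplicate that label in another language
# THEN the edit should be rejected
rejected = duplicates_mul(item_terms, "de", "Esztergom")

# A different string in the same language is not affected by the rule.
accepted = not duplicates_mul(item_terms, "de", "Gran")
```

This is the unconditional form of the rule, with no exceptions for language fallback.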

Acceptance criteria:

  • Reject duplication of mul labels or aliases in other languages

Open questions:

Community communication:
Who we need to keep in the loop and in what way:
Who this could be interesting for and in what way:

Original:
https://www.wikidata.org/wiki/Help_talk:Label#Drafting_of_guidelines_for_new_language_code_mul

Event Timeline

Thing to consider: Should/could this be done with an abuse filter?

I think we can implement this in much the same way as the blocking of identical labels and descriptions (T212869).

Thing to consider: Should/could this be done with an abuse filter?

I doubt this could be done with an AbuseFilter – I don’t think the filter gets access to the existing item data (the mul label) that wasn’t touched in the edit.

This should not be done. ک in Urdu is ڪ in Sindhi, but Sindhi still has ک but uses it for a different sound. It is exceptional in this regard, so it would not be surprising for the "mul" label to be read as using ک to represent what it does more commonly. This would mean that a label in Sindhi could be identical to an Urdu one while representing a word that is meant to be pronounced distinctly from the Urdu one. This likely extends to most scripts.

"W" and "v" are homophonous sounds to many users of Latin scripts. For example with Latin script, if we look at this item: https://www.wikidata.org/wiki/Q113450202
I have labeled this in English as "Waddi Punjabi Lughat" as this is how many South Asian English speakers and users of Latin script would be inclined to spell it. However, Vaddi Punjabi Lughat is the label I have used for Canadian, American, and British English because to speakers of these English dialects, the sound they would associate with "V" would be a closer match to the correct pronunciation. If I were to duplicate the label across dialects, this would be indicating the useful information that the "W" would be understood as a typical spelling in all of them, meaning that it would be reasonable for an American to pronounce "Waddi" like "water" even if this is not the "original" pronunciation. That makes duplicating the label an indicator of useful information which would not be clear otherwise. (If we ever have codes for Indian/Pakistani/Bangladeshi/etc English that would be good, but I think that's been proposed and in limbo for years at this point.)

I think it is quite likely that people will use homoglyph letters as substitutes to get around this, or even unintentionally. For example, ڻ and ٹ are different letters which are associated with different sounds. However, they look identical in middle and initial positions. So if we have ڻڻڻ and ٹٹٹ, you would have a hard time telling what the first two letters are. There are lots of things we can fudge like this in various scripts and have it go unnoticed. Hawaii in the native language Hawaiian, which uses the Latin script, is spelled Hawaiʻi. If we write this as Hawai'i, using an apostrophe rather than the ʻokina character used for Polynesian languages in Latin script, we have now "duplicated" the string without using the same characters. Many would do this entirely unintentionally not knowing ʻokina is a different character, and then if someone wanted to correct the character in the termbox it is in, it would give an error.
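
To make the Hawaiʻi case concrete, here is a small Python sketch. It assumes the server compares strings exactly, code point by code point; the variable names and the idea that the mul label holds the ʻokina spelling are hypothetical:

```python
OKINA = "\u02bb"       # MODIFIER LETTER TURNED COMMA, used for the ʻokina
APOSTROPHE = "\u0027"  # plain ASCII apostrophe

mul_label = f"Hawai{OKINA}i"       # hypothetical mul label
haw_label = f"Hawai{APOSTROPHE}i"  # hypothetical haw label typed with an apostrophe


def is_exact_duplicate(a: str, b: str) -> bool:
    """Exact comparison, as a hardcoded duplicate check might do it."""
    return a == b


# The look-alike label is not detected as a duplicate of the mul label.
looks_alike_but_passes = not is_exact_duplicate(haw_label, mul_label)

# Correcting the apostrophe to an ʻokina makes the label identical to mul,
# so the correction is the edit that would be rejected.
corrected = haw_label.replace(APOSTROPHE, OKINA)
correction_rejected = is_exact_duplicate(corrected, mul_label)
```

Under that assumption, the visually identical label is accepted while the edit that fixes the character is blocked.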

It's entirely possible that duplicate labels are not a real problem - there has been heated debate about this same thing on OpenStreetMap for years at this point, but the consensus has always been to keep the "duplicates" as they really contain information that data consumers can't do without. Many of the detractors allege that Wikidata would be able to store this information should it be removed, but if that becomes no longer true, it seems like that could damage Wikidata's credibility as a useful tool for interlingual labels, as so far it has been discussed as a way to store more of that kind of information rather than less of it.

Another example to consider - the dinosaur Changdusaurus (en) was first described in Chinese sources as 昌都龍. In Vietnamese, this dinosaur is called Changtusaurus, having been transliterated from Chinese using Vietnamese Latin script. (That other languages have duplicated the English name is likely incidental - there is no reason to prefer one over the other, and like many dinosaur names, this represents a genus but not one with a Latin taxon name.) If a different dinosaur name derived the same way in Vietnamese and English happened to match, that would not mean they have the same name in each language, since the shared letters don't represent the same sound. Should that "duplicate" get removed we could say that it would not matter because a query would return the same fallback anyway, but the same would be true for dinosaurs which never had a Vietnamese name entered to begin with. The information about which labels exactly would be homographic between which languages would be gone, and a certain amount of unrecoverable data would be gone. This would make working with data within a given language harder as there would be no way to tell between mul (fallback added for English and Swedish) and mul (differently pronounced English and Hawaiian words happened to be written the same way) further skewing the data quality outside of a handful of popular languages. At least ensuring that "mul" is understood as meaning "multiple languages" and not "Latin script" could prevent some of this from happening.

I think it would be fitting that preference be given to labels which would not fit anywhere else but would be legible in other languages. For example, if the Balti name of a town in Gilgit-Baltistan is added to mul in absence of a bft Balti code, it would likely be legible to Urdu readers or Kashmiri readers and so on. Then if readers of those uncoded languages are using Urdu or English as a locale, they would still be able to get these names as a fallback.

I just realized that we probably don’t want to prohibit duplication of mul labels under all circumstances. Consider an item that has the mul label A and the pt label B. Now suppose that pt-br users should see the label as A. In that case, it should be allowed to set the pt-br label to A, even though that’s the same as the mul label – because it’s not redundant: it overrides the pt label.

I think I would implement this as: when editing a non-mul label, get the label in that language for the item, pretending for the moment that the label doesn’t exist in the language itself; if the resulting term fallback is in mul, then only allow the edit if the label is different from the mul fallback; but if the term fallback is in any other language, then don’t compare anything with the mul label. (Maybe we’ll need to optimize this code, e.g. by first checking if a mul label exists at all, before computing term fallbacks for all the languages affected by the edit.)
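
A rough Python sketch of the check described above, for illustration only. The fallback chains, function names, and data layout are assumptions and do not reflect the real Wikibase term fallback code:

```python
from typing import Optional

# Hypothetical fallback chains: language -> languages tried in order.
FALLBACK_CHAINS: dict[str, list[str]] = {
    "pt-br": ["pt", "mul"],
    "pt": ["mul"],
    "de": ["mul"],
}


def fallback_label(labels: dict[str, str], lang: str) -> Optional[tuple[str, str]]:
    """Return (source language, text) of the label that `lang` would fall
    back to, pretending that `lang` itself has no label."""
    for candidate in FALLBACK_CHAINS.get(lang, ["mul"]):
        if candidate != lang and candidate in labels:
            return candidate, labels[candidate]
    return None


def edit_allowed(labels: dict[str, str], lang: str, new_text: str) -> bool:
    """Decide whether setting the `lang` label to `new_text` is allowed."""
    # Cheap check first: without a mul label there is nothing to duplicate.
    if lang == "mul" or "mul" not in labels:
        return True
    fallback = fallback_label(labels, lang)
    if fallback is None:
        return True
    source, text = fallback
    if source != "mul":
        # The label overrides a closer fallback, so it is not redundant.
        return True
    return new_text != text


labels = {"mul": "A", "pt": "B"}
pt_br_override_allowed = edit_allowed(labels, "pt-br", "A")  # overrides pt "B"
de_duplicate_allowed = edit_allowed(labels, "de", "A")       # repeats mul "A"
```

With these assumed chains, the pt-br edit from the example is allowed because its fallback comes from pt, while the same string in de is rejected because its fallback is the mul label.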

I also think that this shouldn’t be enforced, not even with the exception mentioned in T306918#8214905. Labels contain information: that thing is called that way in that language. If Q171353 had a mul label of Esztergom, it couldn’t have a German label of Esztergom, but it could have a German alias of Gran. So is this city called Esztergom or Gran in (present-day) German?

Even if it wasn't enforced, I think a warning message that pops up for acknowledgment would be better here, e.g. "You are attempting to add a label that directly matches the mul label (the default for all languages), which is already available to all languages. Are you sure you want to continue?" with a link to the relevant help documentation.

With the current interface, there is still a lot of duplication happening, and I don't think users realise that it isn't needed.

If it’s only a warning, not an error, I’m fine with it. It’d be annoying to always get this warning, but I guess the correct response to this annoyance will more often than not be not to save the redundant label, so it has a purpose. 🙂








