Wikipedia:Wikipedia Signpost/2019-07-31/Recent research

Recent research

Most influential medical journals; detecting pages to protect

By FULBERT, SashiRolls, Tilman Bayer, Miriam Redi and Lane Rasberry

A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.

The Most Influential Medical Journals According to Wikipedia

Evidence for recentism: Cited medical journal articles by their year of publication

Reviewed by FULBERT

In the recent research paper "The Most Influential Medical Journals According to Wikipedia: Quantitative Analysis",^[1] the authors sought to rank the medical journals most cited in English Wikipedia, and evaluated the number of days between an article's publication and its citation on Wikipedia. They analyzed 11,325 medical articles in Wikipedia that cited 137,889 articles from over 15,000 journals. The five most cited journals were, in order, The Cochrane Database of Systematic Reviews, The New England Journal of Medicine, PLOS One, The BMJ, and JAMA: The Journal of the American Medical Association. This ranking, along with the next 25 journals identified in the study, was related to the highest-ranked journals in Journal Citation Reports and the Scientific Journal Ranking, yet because of Wikipedia's focus on reviews and meta-analyses, there were some clear differences. While evidence of recentism was identified, all journals that appeared in this study are directly related to medicine. The researchers suggested that similar studies be applied to other disciplinary areas, especially as "Wikipedia editing increases information literacy" (p. 9) and Wikipedia is becoming more widely used by academics.

Journal citations over time, showing e.g. a rise of PLOS One since 2009

Detecting Pages to Protect

Reviewed by FULBERT

The authors of this study, "Detecting pages to protect in Wikipedia across multiple languages",^[2] wanted to understand aspects of page protection, which is applied because of concerns about vandalism, libel, and edit wars, and to determine whether tools could help automate this process. The researchers studied two datasets: the 0.2% of pages that were protected in April 2016, and a similarly sized random selection of unprotected pages. Their system performed well in predicting candidates for protection and has been developed to work across languages. As their next step in automated page protection tests, the researchers hope to test this tool in live Wikipedias. (See also an earlier paper by some of the same authors: "DePP: A System for Detecting Pages to Protect in Wikipedia".)

Matching English and Chinese Wikipedia

Reviewed by SashiRolls

In "XLORE2: Large-scale Cross-lingual Knowledge Graph Construction and Application,"^[3] from the inaugural issue (Winter 2019) of Data Intelligence—a joint venture between MIT and the Chinese Academy of Sciences—the authors explore better methods of mapping concepts between Chinese and English in XLORE2, whose taxonomy "is derived from the Wikipedia category system." Fewer than 5% of the over 100,000 infobox attributes in English Wikipedia are matched in Chinese Wikipedia. The authors discuss methods for improving the quality of typological relationships derived from English Wikipedia. Besides English and Chinese Wikipedia, their knowledge base also uses data from Baidu Baike and Hudong Baike.

Presentation slides from a 15-minute video presentation of findings from the project.

Automated moderation of English Wikipedia

Presented by one of the research team, bluerasberry

In "Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia",^[4] researchers applied machine learning to analyze the user behavior of blocked accounts on English Wikipedia. This analysis identified activity patterns of misconduct and modeled an automated moderation system that could monitor unblocked user accounts and detect patterns of misconduct before human patrollers identify them. The research team's Wikimedia project page includes supplementary materials, including essays on ethical considerations for technological development in this direction.

Conferences and events

See the research events page on Meta-wiki for upcoming conferences and events, including submission deadlines, and the page of the monthly Wikimedia Research Showcase for videos and slides of past presentations.

Other recent publications

Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, are always welcome.

Compiled by Tilman Bayer and Miriam Redi

"Wikidata: Recruiting the Crowd to Power Access to Digital Archives"

From the abstract:^[5]

"This paper will look at how cultural heritage organizations [GLAMs] can work with Wikidata, positioning themselves to become a more useful and accessible knowledge resource to the world."

"WikiDataSets : Standardized sub-graphs from WikiData"

This paper^[6] provides a unified framework to extract topic specific subgraphs from Wikidata. These datasets can help develop new methods of knowledge graph processing and relational learning.

"Predicting Economic Development using Geolocated Wikipedia Articles"

From the abstract:^[7]

"...we propose a novel method for estimating socioeconomic indicators using open-source, geolocated textual information from Wikipedia articles. We demonstrate that modern NLP techniques can be used to predict community-level asset wealth and education outcomes using nearby geolocated Wikipedia articles. When paired with nightlights satellite imagery, our method outperforms all previously published benchmarks for this prediction task, indicating the potential of Wikipedia to inform both research in the social sciences and future policy decisions."

"Improving Knowledge Base Construction from Robust Infobox Extraction"

This paper^[8] offers an effective method to build a comprehensive knowledge base by extracting information from Wikipedia infoboxes.

"Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks"

From the abstract:^[9]

"The relationship between editors, derived from their sequence of editing activity, results in a directed network structure called the revision network, that potentially holds valuable insights into editing activity. In this paper we create revision networks to assess differences between controversial and non-controversial articles, as labelled by Wikipedia. Originating from complex networks, we apply motif analysis, which determines the under or over-representation of induced sub-structures, in this case triads of editors. We analyse 21,631 Wikipedia articles in this way, and use principal component analysis to consider the relationship between their motif subgraph ratio profiles. Results show that a small number of induced triads play an important role in characterising relationships between editors, with controversial articles having a tendency to cluster. This provides useful insight into editing behaviour [... and also] a potentially useful feature for future prediction of controversial Wikipedia articles."

"Schema Inference on Wikidata"

From the abstract:^[10]

"[Wikidata's] data quality is managed and monitored by its community using several quality control mechanisms, recently including formal schemas in the Shape Expressions language. However, larger schemas can be tedious to write, making automatic inference of schemas from a set of exemplary Items an attractive prospect. This thesis investigates this option by updating and adapting the RDF2Graph program to infer schemas from a set of Wikidata Items, and providing a web-based tool which makes this process available to the Wikidata community."

Wikidata as a candidate for the "bibliography of life"

From the abstract:^[11]

"This talk explores the role Wikidata [...] might play in the task of assembling biodiversity information into a single, richly annotated and cross linked structure known as the biodiversity knowledge graph [...] Much of the content of Wikispecies is being automatically added to Wikidata [...] Wikidata is a candidate for the 'bibliography of life' [...], a database of all taxonomic literature."

"SEDTWik: Segmentation-based Event Detection from Tweets Using Wikipedia"

This paper^[12] presents a system for detecting newsworthy events occurring at different locations of the world from a wide range of categories using Twitter and Wikipedia.

"Assessing the quality of information on wikipedia: A deep-learning approach"

Described as "the first comparative analysis of deep-learning models to assess Wikipedia article quality", this paper^[13] observes that the most important quality indicators (model features) appear to be the following: that the article has been reviewed by "quality" editors, the edit count of the contributors, and the number of its translations.

"The governance of Wikipedia: Examination of Ostrom's rules and theory"

From the English abstract:^[14]

"In this empirical study, we examine a facet of [Wikipedia's] governance: the modalities of construction of two rules related to the citation of sources. We show that these rules are discussed and written by a minority of contributors who are particularly involved. Thus, in Wikipedia, there is no “political class” cut off from the ground. The modalities for the elaboration of the two rules are studied and discussed using Ostrom’s theory of the Commons."

"A clustering approach to infer Wikipedia contributors' profile"

From the abstract:^[15]

"We show on both Romanian and Danish wikis that using only the edit and their distribution over time to feed clustering techniques, allows to build [editor's] profiles with good accuracy and stability. This suggests that light monitoring of newcomers may be sufficient to adapt the interaction with them and to increase the retention rate."

References

1. Jemielniak, Dariusz; Masukume, Gwinyai; Wilamowski, Maciej (2019-01-18). "The Most Influential Medical Journals According to Wikipedia: Quantitative Analysis". Journal of Medical Internet Research. 21 (1): e11429. doi:10.2196/11429. ISSN 1438-8871. PMC 6356187. PMID 30664451.
2. Spezzano, Francesca; Suyehira, Kelsey; Gundala, Laxmi Amulya (December 2019). "Detecting pages to protect in Wikipedia across multiple languages". Social Network Analysis and Mining. 9 (1). doi:10.1007/s13278-019-0555-0. ISSN 1869-5450. S2CID 256094610.
3. Jin, Hailong; Li, Chenjian; Zhang, Jing; Hou, Lei (2019-03-27). "XLORE2: Large-scale Cross-lingual Knowledge Graph Construction and Application". Data Intelligence. 1 (1): 77–98. doi:10.1162/dint_a_00003. S2CID 85527325. Retrieved 2019-07-20.
4. Rawat, Charu; Sarkar, Arnab; Singh, Sameer; Alvarado, Rafael; Rasberry, Lane (2019-06-13). "Automatic Detection of Online Abuse and Analysis of Problematic Users in Wikipedia". 2019 Systems and Information Engineering Design Symposium (SIEDS). pp. 1–6. doi:10.1109/SIEDS.2019.8735592. ISBN 978-1-7281-0998-5. S2CID 189825104.
5. Kapsalis, Effie (2019-01-02). "Wikidata: Recruiting the Crowd to Power Access to Digital Archives". Journal of Radio & Audio Media. 26 (1): 134–142. doi:10.1080/19376529.2019.1559520. ISSN 1937-6529. S2CID 204370555.
6. Boschin, Armand (2019-06-11). "WikiDataSets : Standardized sub-graphs from WikiData". arXiv:1906.04536 [cs.LG]. (datasets)
7. Sheehan, Evan; Meng, Chenlin; Tan, Matthew; Uzkent, Burak; Jean, Neal; Lobell, David; Burke, Marshall; Ermon, Stefano (2019-05-05). "Predicting Economic Development using Geolocated Wikipedia Articles". arXiv:1905.01627 [cs.LG].
8. Peng, Boya; Huh, Yejin; Ling, Xiao; Banko, Michele (June 2019). "Improving Knowledge Base Construction from Robust Infobox Extraction". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers). pp. 138–148.
9. Ashford, James R.; Turner, Liam D.; Whitaker, Roger M.; Preece, Alun; Felmlee, Diane; Towsley, Don (2019-04-17). "Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks". Companion Proceedings of the 2019 World Wide Web Conference on - WWW '19. pp. 1180–1187. arXiv:1904.08139. doi:10.1145/3308560.3316754. ISBN 9781450366755. S2CID 118627914.
10. Werkmeister, Lucas (2018). "Schema Inference on Wikidata" (Master's thesis). Karlsruhe Institute of Technology.
11. Page, Roderic (2019-06-13). "Wikidata and the biodiversity knowledge graph". Biodiversity Information Science and Standards. 3: e34742. doi:10.3897/biss.3.34742. ISSN 2535-0897. (conference abstract)
12. Morabia, Keval; Murthy, Neti Lalita Bhanu; Malapati, Aruna; Samant, Surender (June 2019). "SEDTWik: Segmentation-based Event Detection from Tweets Using Wikipedia". Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. pp. 77–85.
13. Wang, Ping; Li, Xiaodan (2019). "Assessing the quality of information on wikipedia: A deep-learning approach". Journal of the Association for Information Science and Technology. 71: 16–28. doi:10.1002/asi.24210. ISSN 2330-1643. S2CID 145850217.
14. Sahut, Gilles (2018-05-31). "La gouvernance de Wikipédia : élaboration de règles et théorie d'Ostrom". Tic & Société. 12 (1): 167–200. doi:10.4000/ticetsociete.2426. S2CID 150217819. (in French, with English and Spanish abstracts)
15. Krishna, Shubham; Billot, Romain; Jullien, Nicolas (2018). "A Clustering Approach to Infer Wikipedia Contributors' Profile". Proceedings of the 14th International Symposium on Open Collaboration. OpenSym '18. New York, NY, USA: ACM. pp. 24:1–24:5. doi:10.1145/3233391.3233968. ISBN 9781450359368.


Discuss this story


Those who are curious about influential journals on Wikipedia may want to take a look at WP:JCW, in particular WP:JCW/TAR. Headbomb 20:06, 31 July 2019 (UTC)

Does anyone at all understand Ashford et al's "Understanding the Signature of Controversial Wikipedia Articles through Motifs in Editor Revision Networks"? I've never seen anything like it. I asked in more detail at WP:JIMBOTALK#Ashford et al. in "Recent research". EllenCT (talk) 11:12, 1 August 2019 (UTC)

I understand the mathematical operations. I suspect the phrase "controversial articles exhibit more reciprocation", that they use, reflects that such articles are more subject to edit warring, or at least, pairs of editors negotiating acceptable content in successive edits. It's nevertheless interesting to see an algorithm that picks up such behaviour. I could see such a technique being useful to pick up pathological user behaviour in financial systems, too. William Avery (talk) 19:20, 1 August 2019 (UTC)

@William Avery: how do you map the ordered list of editors to the nodes of the triads? For their example in Figure 1, what are the corresponding triads of Figure 2, and which editors are associated with each of those resulting triads' three (left, right, top) nodes? EllenCT (talk) 09:37, 2 August 2019 (UTC)

Editors A, B and D in figure 1 form a one-way circular relationship corresponding to 030C in figure 2. Any editor could be any node. Triad A, B and C form the graph 111U: A is at bottom right, C is at bottom left (A and B both follow/precede each other in the sequence), B is top (B follows A; B and C never follow each other). The algorithm to get the graphs from an ordered list of edits (the edit history) wouldn't be difficult. As a procedural algorithm, merely move along the edit history considering sections that involve three different editors, then determine whether each editor follows/precedes each of the other two editors in that segment of the history. In practice you would use a tool like https://igraph.org/. William Avery (talk) 12:32, 2 August 2019 (UTC)

@William Avery: I'm still trying to get my head around the exact algorithm here, but let me ask: do you believe that the paper establishes that a meaningfully accurate classification procedure exists? Because I don't think it does, even if the algorithm is well defined. If so, any idea how accurate it is? EllenCT (talk) 02:20, 4 August 2019 (UTC)

No. I don't really have any opinions on those matters. Perhaps the authors do. William Avery (talk) 08:45, 5 August 2019 (UTC)
