Result of identification and evaluation of millions of information sources in Wikipedia

The principle of verifiability on Wikipedia mandates that all content must be verifiable by readers through reliable sources. Wikipedia insists on information being based on what has been previously reported in reputable publications, not the personal convictions or unpublished work of its contributors. If there are contrasting views from reliable sources, Wikipedia maintains an impartial stance by presenting each viewpoint proportionately.

All content in Wikipedia articles must be backed by reliable sources. Inline citations are required for all quotations and for any content that has been questioned or is likely to be questioned. Without proper inline citations, such content is subject to removal.

Wikipedia emphasizes the importance of grounding its articles in dependable, autonomous sources known for their diligence in fact-checking and accuracy. Such sources should be published, which in Wikipedia’s context means they should be accessible to the public in any format. Wikipedia does not consider unpublished materials reliable. It is important to use sources that adequately support the content and are suitable for the statements being made, especially when dealing with sensitive topics like biographies of living individuals or medical information.

Given the vastness of the internet, which hosts over a billion websites, it is challenging for Wikipedia users to evaluate the reliability of each source individually. Some language editions of Wikipedia have specific guidelines detailing which sources may be deemed reliable. However, there is no complete list of websites that can be used in Wikipedia as reliable sources of information. Additionally, the reliability and reputation of a website can vary over time and with the language and subject matter, necessitating frequent updates to these lists. Therefore, a more comprehensive and current compilation of such trusted sources would benefit not only the editors who curate Wikipedia’s content but also the readers who rely on the encyclopedia for accurate information.

BestRef serves as a tool to evaluate the importance of information sources utilized in Wikipedia. It offers insights into the most significant sources of information across various language editions of Wikipedia, facilitating the assessment of the quality and credibility of the content presented within this vast online encyclopedia. This aids in ensuring that Wikipedia remains a trustworthy repository of knowledge.

Currently, the BestRef database contains assessment results for 3.8 million websites in over 300 language versions of Wikipedia. An analysis of over 60 million Wikipedia articles in October 2023 made it possible to extract information about over 330 million references. This in turn made it possible to identify the best information sources of Wikipedia using different assessment models. The table below shows the results of reference extraction for selected language versions, together with the number of unique websites (links lead to rankings of the best sources of information in the selected language versions):

| Wiki | Language version | Article count | Reference count | Unique websites |
|------|------------------|---------------|-----------------|-----------------|
| ar | Arabic | 1,219,168 | 6,355,164 | 294,089 |
| ca | Catalan | 735,551 | 3,895,389 | 197,470 |
| cs | Czech | 532,602 | 2,752,877 | 119,313 |
| de | German | 2,839,878 | 14,473,501 | 622,551 |
| en | English | 6,722,214 | 79,687,819 | 1,942,579 |
| es | Spanish | 1,833,749 | 12,558,623 | 509,313 |
| fa | Persian | 975,931 | 2,477,763 | 133,634 |
| fi | Finnish | 559,931 | 3,371,084 | 138,320 |
| fr | French | 2,557,559 | 19,455,752 | 576,523 |
| he | Hebrew | 342,285 | 1,867,068 | 103,848 |
| hi | Hindi | 162,954 | 496,057 | 47,617 |
| hu | Hungarian | 530,977 | 2,545,152 | 124,536 |
| id | Indonesian | 661,844 | 2,672,604 | 162,924 |
| it | Italian | 1,829,095 | 8,856,574 | 278,232 |
| ja | Japanese | 1,388,532 | 14,684,917 | 359,446 |
| ko | Korean | 646,717 | 1,885,878 | 91,918 |
| nl | Dutch | 2,133,536 | 3,010,002 | 112,318 |
| no | Norwegian | 616,624 | 2,102,507 | 107,343 |
| pl | Polish | 1,583,919 | 8,847,928 | 242,835 |
| pt | Portuguese | 1,110,209 | 7,692,600 | 319,534 |
| ru | Russian | 1,940,113 | 15,461,960 | 454,351 |
| sv | Swedish | 2,572,575 | 11,791,609 | 134,081 |
| th | Thai | 158,905 | 1,010,438 | 70,395 |
| tr | Turkish | 533,201 | 2,773,455 | 146,854 |
| uk | Ukrainian | 1,289,727 | 5,455,954 | 217,787 |
| vi | Vietnamese | 1,288,093 | 3,796,577 | 147,041 |
| zh | Chinese | 1,379,496 | 8,130,187 | 283,516 |

Models

BestRef assessed the importance of each website using three models (described in research published in 2020):

  1. F-model: based on the frequency (F) of source usage.
  2. PR-model: based on the cumulative pageviews (P) of the article in which the source appears, divided by the number of references (R) in that article.
  3. AR-model: based on the number of authors (A) of the article in which the source appears, divided by the number of references (R) in that article.
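As a rough illustration of the three models listed above, the following Python sketch computes each score from toy data. It is not BestRef's implementation: the function names, the dictionary keys (`refs`, `pageviews`, `authors`), and the example URLs are all hypothetical, and the "domain" is simply the URL host with a leading `www.` removed. It assumes each article contributes its pageviews (or author count) divided by its total number of references, once for every reference that uses the source.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a reference URL, without a leading 'www.'.

    Simplification for this sketch: real domain grouping may be more involved.
    """
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def score_sources(articles):
    """Compute F, PR and AR scores per domain.

    `articles` is a list of dicts with hypothetical keys:
      'refs'      - list of reference URLs in the article
      'pageviews' - cumulative pageviews of the article, V(i)
      'authors'   - number of registered non-bot authors, E(i)
    """
    f = defaultdict(int)
    pr = defaultdict(float)
    ar = defaultdict(float)
    for art in articles:
        total_refs = len(art["refs"])  # C(i)
        if total_refs == 0:
            continue  # an article without references contributes nothing
        per_domain = defaultdict(int)  # Cs(i) for every source s in article i
        for url in art["refs"]:
            per_domain[domain_of(url)] += 1
        for dom, cs in per_domain.items():
            f[dom] += cs
            pr[dom] += art["pageviews"] / total_refs * cs
            ar[dom] += art["authors"] / total_refs * cs
    return dict(f), dict(pr), dict(ar)


# Toy data, invented for illustration only.
articles = [
    {"refs": ["https://www.example.org/a",
              "https://example.org/b",
              "https://example.com/x"],
     "pageviews": 900, "authors": 30},
    {"refs": ["https://example.com/y"], "pageviews": 100, "authors": 5},
]
f_scores, pr_scores, ar_scores = score_sources(articles)
print(f_scores, pr_scores, ar_scores)
```

In this toy data both domains are cited twice, so the F-model ranks them equally, while the PR and AR scores favour `example.org` because its citations sit in the more viewed and more edited article.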

In the F-model, the frequency of source usage is the number of references that contain the analyzed domain in their URL. This method has been commonly used in other research. The F-model therefore counts every appearance of a reference, i.e., if the same source is cited 3 times, the frequency is equal to 3. Equation [1] shows the calculation for the F-model, where s is the source, n is the number of considered Wikipedia articles, and Cs(i) is the number of references using source s (e.g., domain in URL) in article i.
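Reading these symbol definitions literally, equation [1] can be written as follows (a reconstruction from the stated symbols, so the original notation may differ):

```latex
F(s) = \sum_{i=1}^{n} C_s(i) \tag{1}
```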

The PR-model uses cumulative pageviews divided by the total number of references in the considered article. Compared to the previous model, it additionally takes into account the popularity of the Wikipedia article and the visibility of the references that use the analyzed source. The model assumes that, in general, the more references an article has, the less visible each specific reference is. Equation [2] shows the calculation of the measure using the PR-model, where s is the source, n is the number of considered Wikipedia articles, C(i) is the total number of references in article i, Cs(i) is the number of references using source s (e.g., domain in URL) in article i, and V(i) is the cumulative pageviews value of article i. Please note that anomalously high ("overclocked") pageview values for some Wikipedia articles were reduced.
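Under the same reading, and assuming each article's pageviews-per-reference ratio is weighted by the number of its references that use source s, equation [2] can be written as follows (again a reconstruction from the stated symbols):

```latex
PR(s) = \sum_{i=1}^{n} \frac{V(i)}{C(i)} \cdot C_s(i) \tag{2}
```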

Because the pageviews value of an article mostly reflects its readers, another important measure addresses popularity among authors, i.e., the number of users who decided to add content or make changes to the article. The AR-model follows the assumptions of the previous model but is based on authors. It is described by equation [3], where s is the source, n is the number of considered Wikipedia articles, C(i) is the total number of references in article i, Cs(i) is the number of references using source s (e.g., domain in URL) in article i, and E(i) is the total number of registered authors (non-bots) of article i.
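By analogy with the PR-model, with the author count E(i) taking the place of the pageviews V(i), equation [3] can be written as follows (a reconstruction from the stated symbols, so the original notation may differ):

```latex
AR(s) = \sum_{i=1}^{n} \frac{E(i)}{C(i)} \cdot C_s(i) \tag{3}
```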

More detailed information on the use of these and other models can be found in relevant scientific publications: