These datasets list over 43 million Wikipedia articles in 55 languages, each with a quality score from WikiRank. They also contain the quality measures (metrics) that directly affect these scores.
The quality measures were extracted from Wikipedia dumps from April 2022.
- page_id – identifier of the Wikipedia article (int), e.g. 840191
- page_name – title of the Wikipedia article (UTF-8 string), e.g. Sagittarius A*
- wikirank_quality – quality score of the Wikipedia article on a scale from 0 to 100 (as of April 1, 2022). This is a synthetic measure calculated from the metrics below (also included in the datasets).
- norm_len – normalized “page length”
- norm_refs – normalized “number of references”
- norm_img – normalized “number of images”
- norm_sec – normalized “number of sections”
- norm_reflen – normalized “references per length ratio”
- norm_authors – normalized “number of authors” (without bots and anonymous users)
- flawtemps – flaw templates
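
The fields above can be read into typed records with a short Python sketch. The file layout is an assumption: the description does not state the format, so this sketch supposes comma-separated text with a header row that uses the field names listed above. `read_articles` and `COLUMN_TYPES` are hypothetical names introduced for illustration, the metric values in the sample row are made-up placeholders (only the page_id and page_name come from the examples above), and treating every metric, including flawtemps, as a float is likewise an assumption.

```python
import csv
import io

# Column names come from the field list above; the types are assumptions,
# except page_id, which the description marks as int.
COLUMN_TYPES = {
    "page_id": int,
    "page_name": str,
    "wikirank_quality": float,
    "norm_len": float,
    "norm_refs": float,
    "norm_img": float,
    "norm_sec": float,
    "norm_reflen": float,
    "norm_authors": float,
    "flawtemps": float,
}


def read_articles(handle, delimiter=","):
    """Yield one dict per data row, with each column cast to its assumed type."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        record = {name: cast(row[name]) for name, cast in COLUMN_TYPES.items()}
        # The description states that the score lies on a 0-100 scale.
        if not 0 <= record["wikirank_quality"] <= 100:
            raise ValueError(f"score out of range for page_id {record['page_id']}")
        yield record


# Placeholder sample: every numeric value after page_name is invented.
SAMPLE = (
    "page_id,page_name,wikirank_quality,norm_len,norm_refs,norm_img,"
    "norm_sec,norm_reflen,norm_authors,flawtemps\n"
    "840191,Sagittarius A*,50.0,50.0,50.0,50.0,50.0,50.0,50.0,0\n"
)

articles = list(read_articles(io.StringIO(SAMPLE)))
print(articles[0]["page_name"], articles[0]["wikirank_quality"])
```

For a real file, an open file handle (opened with `encoding="utf-8"` and `newline=""`) would replace the `io.StringIO` object. If the actual files use a different separator, such as tabs, it can be passed through the `delimiter` argument.
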
The datasets are available under the CC BY 4.0 licence: https://dx.doi.org/10.6084/m9.figshare.19762927.v1