WikiRank dataset on Hugging Face

A comprehensive new dataset containing WikiRank quality scores for approximately 47 million Wikipedia articles across 55 language editions has been released on Hugging Face. The release marks a significant milestone for researchers, developers, and organizations interested in assessing and improving the quality of Wikipedia content globally.

The WikiRank quality score is an innovative metric designed to evaluate the overall quality of Wikipedia articles. The significance of this dataset lies in its vast scale and multilingual coverage, making it a uniquely valuable resource. Previously, assessing the quality of Wikipedia articles at scale, particularly across diverse languages, posed significant challenges due to limitations in data accessibility, standardization, and computational resources. The availability of this extensive dataset on Hugging Face, an accessible and popular platform for machine learning resources, democratizes data-driven research and innovation related to Wikipedia’s content quality.

This dataset can be leveraged in numerous impactful ways, for example:

  • Quality improvement initiatives. Wikipedia communities, editors, and content moderators can utilize WikiRank scores to systematically identify articles requiring improvement or updates, ensuring the encyclopedia’s content remains accurate, reliable, and comprehensive.
  • Multilingual content development. By highlighting quality disparities across different language editions of Wikipedia, the dataset can guide targeted content creation or translation efforts, fostering equity in knowledge representation and accessibility globally.
  • Education and information literacy. Educators and academic institutions can use this dataset to teach students about information quality assessment, critical analysis of sources, and digital literacy, providing concrete examples of varying content standards and their indicators.
  • Natural Language Processing (NLP) and AI development. Researchers and developers can utilize WikiRank scores to train machine learning models for automated content assessment, generating summaries, or content recommendation systems that prioritize high-quality information.
  • Benchmarking. Analysts and researchers can leverage the dataset to perform comparative studies on information quality, editorial processes, and knowledge gaps across different cultural and linguistic contexts.

The complete dataset is available for download on Hugging Face.
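As a minimal sketch of the first use case above (flagging articles for improvement and comparing quality across language editions), the snippet below works on a few invented sample records. The record shape, score scale, and the repository ID shown in the comment are all assumptions for illustration, not details taken from the actual release:

```python
# Hypothetical sketch: the repo ID, column names, and 0-100 score scale
# are assumptions, not taken from the actual dataset card.
#
# In practice one would stream the data from the Hub, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("<wikirank-repo-id>", split="train", streaming=True)

from collections import defaultdict

# Assumed record shape: (language edition, article title, WikiRank score).
sample = [
    ("en", "Alan Turing", 92.4),
    ("en", "Obscure Village", 18.7),
    ("de", "Alan Turing", 76.1),
    ("de", "Obscure Village", 41.3),
    ("pl", "Alan Turing", 88.0),
]

# Use case 1: flag low-scoring articles for improvement drives.
THRESHOLD = 50.0
needs_work = [(lang, title) for lang, title, score in sample if score < THRESHOLD]

# Use case 2: compare mean quality across language editions.
by_lang = defaultdict(list)
for lang, _title, score in sample:
    by_lang[lang].append(score)
mean_by_lang = {lang: sum(s) / len(s) for lang, s in by_lang.items()}

print(needs_work)    # articles below the threshold
print(mean_by_lang)  # rough per-edition quality comparison
```

The same two aggregations scale directly to the full dataset when iterated over a streaming split, without loading all 47 million rows into memory.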