Measuring the Quality of Data Visualizations in Wikipedia

A new scientific publication presents the first comprehensive analysis of data visualizations across the Polish-language edition of Wikipedia. The study examines more than one million articles, extracting and classifying visual elements such as tables, charts, diagrams, and maps, and proposing a scalable framework for assessing their quality.

The authors of the study “Quality Measures for Data Visualization: A Case Study of Polish Wikipedia” identified visualizations using a combination of algorithmic classification and direct image analysis. Each article was assigned to one of 22 thematic groups (e.g., STEM, geography, culture, history), enabling domain-level comparisons. The enriched dataset includes:

  • Over 30 structural, contextual, and qualitative measures of visualization quality.
  • Metadata and semantic labels generated by a locally deployed multimodal (language–vision) model.
  • Thematic categorization based on relevant Wikidata items.
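
To make the idea of domain-level comparison concrete, here is a minimal sketch of how the share of each visualization type could be computed per thematic group. It is not the authors' code: the record layout, the article titles, and the function name type_shares_by_group are all hypothetical, and the sample values are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical records: (article title, thematic group, visualization type).
# Names and values are illustrative only, not taken from the study's dataset.
records = [
    ("Article A", "geography", "map"),
    ("Article A", "geography", "table"),
    ("Article B", "STEM", "chart"),
    ("Article B", "STEM", "table"),
    ("Article C", "STEM", "table"),
    ("Article D", "history", "table"),
]


def type_shares_by_group(rows):
    """Return, for each thematic group, the share of each visualization type."""
    counts = defaultdict(Counter)
    for _title, group, vis_type in rows:
        counts[group][vis_type] += 1

    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {t: n / total for t, n in counter.items()}
    return shares


shares = type_shares_by_group(records)
for group, dist in sorted(shares.items()):
    print(group, dist)
```

With a real dataset, the same grouping step would let the distribution of tables, charts, diagrams, and maps be compared across the 22 thematic groups.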

The results show an uneven distribution of visualizations across domains. Tables are the most common format in nearly all fields. Maps dominate geographically oriented articles, while charts are especially prevalent in technical and STEM-related content.

The study contributes to several ongoing discussions within the Wikimedia and academic communities:

  • Visual literacy and knowledge representation: it provides empirical evidence on how visual knowledge is structured and distributed in a major encyclopedic platform.
  • Quality monitoring at scale: the proposed methodology allows systematic evaluation of visual content, complementing existing textual quality assessments.
  • Design best practices in open knowledge: by defining more than 30 quality measures, the study creates a practical benchmark for improving visual communication.

Importantly, this is the first large-scale, systematic exploration of Wikipedia’s visual layer, opening a new research direction for Wikimedia scholarship.

Several communities can put this research to practical use. For Wikipedians and Wikimedia contributors, the findings provide evidence-based guidance on where visual content is lacking, unevenly distributed, or potentially below established quality standards, helping editors improve clarity, accessibility, and consistency across articles. The proposed quality measures can also support community discussions about best practices and inform the development of automated tools that help detect low-quality or misleading visualizations.

For information designers and science communicators, the study offers a large-scale empirical overview of how visual knowledge is currently presented in a real-world, collaborative encyclopedia. This makes it possible to benchmark design strategies, identify common weaknesses in visual communication, and develop more effective approaches to presenting complex information to broad audiences.

Digital humanities researchers can use the dataset and methodology to explore patterns of knowledge representation across thematic domains, investigate how different fields rely on specific visualization types, and conduct comparative analyses across language editions of Wikipedia or other open knowledge platforms. The study also opens new avenues for examining how visual structures shape public understanding of science, history, geography, and culture.

For developers of AI and content-support tools, the research provides a tested framework for multimodal analysis that combines textual, structural, and visual signals. This framework can be adapted to build systems for automated quality monitoring, visualization enrichment, semantic labeling, and large-scale content analysis in collaborative digital environments.

Finally, educators and advocates of visual literacy can draw on the findings to demonstrate how visualizations function within one of the world’s largest knowledge repositories, using the results to promote critical evaluation of visual information and to improve training in data communication skills.

The publication is available in open access: doi.org/10.1016/j.procs.2025.09.553