The free, multilingual online encyclopedia Wikipedia has been with us for more than twenty years, its content contributed and curated by a community of volunteer editors. The value of this vast repository of information could be even greater than superficial access to facts and figures if its content followed a standardised structure. Writing in the International Journal of Metadata, Semantics and Ontologies, a team from Brazil describes its analysis of Wikipedia and its structural characteristics.
Johny Moreira, Everaldo Costa Neto, and Luciano Barbosa of the Centro de Informática at the Universidade Federal de Pernambuco explain that the main content of Wikipedia does not follow a standard structure from entry to entry. However, they demonstrate that the “infoboxes” within each page do follow a standard structure. Unfortunately, only around one in every two Wikipedia entries carries an infobox.
As such, while infoboxes might be the component of Wikipedia best suited to automated data mining tools, the fact that only 54 percent of entries carry one limits their usefulness for data mining, search-engine augmentation, and database construction, at least until the user community adds standard infoboxes to the majority of Wikipedia entries. Of course, there might be ways to extract standardised information from the entries that lack an infobox in order to create one. However, several different templates have already been used to create infoboxes, even within the same Wikipedia categories.
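To illustrate why infoboxes lend themselves to automated extraction, the following is a minimal sketch (not the authors' method) of pulling key–value pairs out of a simple infobox written in wikitext. It assumes an idealised infobox with one `| key = value` pair per line and no nested templates, which real pages often violate; production tools use a proper wikitext parser instead of this simplified approach.

```python
import re

def parse_infobox(wikitext):
    """Extract key-value pairs from a simple {{Infobox ...}} template.

    Simplifying assumptions (hypothetical, for illustration only):
    the infobox contains no nested templates, and each field sits on
    its own line in the form "| key = value".
    """
    # Grab everything between "{{Infobox" and the closing "}}".
    match = re.search(r"\{\{Infobox[^|]*\|(.*)\}\}", wikitext, re.DOTALL)
    if not match:
        return {}
    fields = {}
    for line in match.group(1).splitlines():
        # Strip the leading "|" field separator and surrounding whitespace.
        line = line.strip().lstrip("|").strip()
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields

# A toy infobox in wikitext form (invented example data).
sample = """{{Infobox person
| name = Ada Lovelace
| birth_date = 1815
| occupation = Mathematician
}}"""

print(parse_infobox(sample))
```

Because every infobox of a given template shares the same field names, output like this can be mapped directly onto database columns, which is precisely what makes the missing 46 percent of entries a limitation.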
One might suggest that, by its very nature, Wikipedia is always a work in progress, but some of that work is more fundamental than the creation of content. Part of the community may need to be enlisted and directed to create the missing infoboxes and, if at all possible, to standardise the infobox templates.
The team explains that there is considerable interest in the infobox data found in Wikipedia. The researchers have now analysed many aspects of this content with the aim of helping the Wikipedia community to “uncover some data limitations and to guide researchers and practitioners interested in performing tasks using this data.”
Indeed, the team is itself working toward this goal in its own efforts: “Our next step for improving and extending the work presented here is to apply deep learning techniques for automatic measurement and classification of the quality of the defined infoboxes and articles in Wikipedia,” the researchers conclude.
Moreira, J., Costa Neto, E. and Barbosa, L. (2021) ‘Analysis of structured data on Wikipedia’, Int. J. Metadata, Semantics and Ontologies, Vol. 15, No. 1, pp.71–86.