Probing a nation’s web domain — the historical development of the Danish web — part 2

This project is guided by the following research question: What has characterised the Danish web and its development from 2005 onwards? 

This project builds upon the insights from the Pilot project P001, ‘Probing a nation’s web domain — the historical development of the Danish web’, and extends it with analyses of types of material that P001 did not include, namely hyperlinks, images, audio and video. 

P001 has led to the development of robust procedures for extracting large amounts of data from Netarkivet to be analysed on the Cultural Heritage Cluster (the ETL process, documented in a report by Per Møldrup-Dalum). In addition, procedures for pre-processing/cleaning the material have been developed, as well as an algorithm that is a necessary prerequisite for delimiting and establishing one corpus per year in which each website is present only once. Neither of these processes has been developed elsewhere internationally, and they would not have been developed had it not been for the Cultural Heritage Cluster and the support that comes with being a Pilot project. In the remaining weeks of P001’s Pilot project period, the project team will make initial analyses of the material to test and showcase what can be done with this type of material. These analyses will be based on the two types of data that have been extracted, crawl logs and the text from the HTML pages, and they will focus on research questions such as: Which parts of the Danish web are protected by passwords? How up to date is the Danish web? What are the most prevalent languages? What are the most frequently used words? Where are specific topics (refugee crisis, politics, sports events…) discussed on the Danish web? Several other questions will probably follow as we get to know the material better. 

However, the Pilot project period of P001 has made it clear that extracting and analysing material in the Danish Netarkivet is more complicated than anticipated (in terms of technology, curatorial expertise, project organisation, etc.), and therefore it has not been possible to establish the procedures for extracting, pre-processing/cleaning and analysing other types of material that are very important for an understanding of the Danish web and its development. 

The aim of the present project, ‘Probing a nation’s web domain — the historical development of the Danish web — part 2’, is to develop the procedures for extracting, pre-processing/cleaning and analysing one of the fundamental web-specific features, namely hyperlinks. In addition, the project aims at extracting, pre-processing/cleaning and analysing multimedia content such as images, audio and video material. This will allow for hyperlink-based analyses, such as analyses of hyperlink networks (which are the 100 most central websites on the Danish web?) or of the link structure out of the Danish web, including which countries and social media platforms the Danish web links to. The extracted multimedia content will allow for analyses of billions of images: What are the most prevalent scene types, colours, etc.? 
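To make the hyperlink-based analyses more concrete, the following is a minimal sketch in Python and not the project's actual procedure. It assumes the links have already been extracted as (source URL, target URL) pairs; the function names `top_linked_hosts` and `outlinks_by_tld` are hypothetical. Centrality is approximated here by the number of distinct hosts linking to a host, and the top-level domain of the target is used as a crude proxy for where the Danish web links to.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def top_linked_hosts(links, n=100):
    """Rank hosts by in-degree: the number of distinct other hosts linking to them.

    `links` is an iterable of (source_url, target_url) pairs (assumed format).
    """
    # Collapse page-level links into unique host-to-host edges.
    edges = {(host(s), host(t)) for s, t in links}
    indegree = Counter(t for s, t in edges if s and t and s != t)
    return indegree.most_common(n)


def outlinks_by_tld(links, home_tld="dk"):
    """Count links from hosts under `home_tld` to hosts under other top-level domains."""
    counts = Counter()
    for s, t in links:
        source_host, target_host = host(s), host(t)
        if not source_host or not target_host:
            continue
        source_tld = source_host.rsplit(".", 1)[-1]
        target_tld = target_host.rsplit(".", 1)[-1]
        if source_tld == home_tld and target_tld != home_tld:
            counts[target_tld] += 1
    return counts
```

In-degree is only one of several possible centrality measures, and a top-level domain does not reliably identify a country or a platform (a social media site under .com, for instance), so a real analysis would likely need a curated mapping on top of this.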

As was the case with P001, the developed methods are generic, so they can be used in smaller-scale studies and with other subsets of the holdings of Netarkivet. Undoubtedly, the Pilot project P001 has constituted a great leap forward for Big Data studies of the Danish web and its historical development, and the project has been agenda-setting internationally. This is reflected in, among other things, the edited volume The Historical Web and Digital Humanities: The Case of National Web domains (Routledge, manuscript submitted today), which is edited by two of the project participants and includes chapters by all project participants. But it is important to keep the momentum of the project and to include elements that will help future-proof this type of analysis as well as further develop the use of the Cultural Heritage Cluster.

The project is led by Niels Brügger.