Website archiving: A successful project of the University Archives

The website is the medium through which TU Ilmenau presents itself to the public worldwide. It has undergone a comprehensive relaunch, and now that the new site has been online for half a year, the old one will be switched off on 1 November. From then on, data from the previous website will no longer be accessible, unless it has been properly archived.


For the past two years, a small team consisting of Dr. Anja Kürbis, head of the University Archives, and Maximilian Gagewi, IT staff member of the University Library, has dedicated itself to this task, without additional funding and alongside an already extensive range of regular duties. In the following article, Dr. Anja Kürbis and Maximilian Gagewi present the background, development and challenges of the "Web Archiving" project at TU Ilmenau.

Websites have high source value

The university's website is a central component of its online communication. It presents the institution as it would like to be seen, conveys the university's self-image to its intended audience, and offers information and services for the sake of transparency and publicity. The website therefore has a source value that should not be underestimated: it can be used, for example, to explore how the institution sees itself and how it is seen by others, to research information about structures or people, or to reconstruct events in teaching and research.

But what happens to a website that is taken offline, and to the information it holds about university life, teaching, research and science? Is its content lost if it has not been archived? At the time of the previous relaunch of the TU Ilmenau website in 2010, no web archiving procedure existed at the university, but the American non-profit organization "Internet Archive" had been making snapshots, i.e. copies, of the university website at irregular intervals since 1998. A closer look, however, reveals considerable shortcomings in these snapshots: some cannot be accessed at all, content is missing or inadequately reproduced, or time periods are mixed up. Reliable information cannot be obtained from these snapshots, let alone used for scholarly research. A loss of information like the one that occurred with the 2010 relaunch could not be accepted again with the current relaunch.

The challenge of web archiving

The challenge of preserving such content lies in the ephemeral nature of the digital medium. After all, apart from the major disruption of a web relaunch, the particular appeal of a website is that content can be changed, deleted and added whenever necessary. But even this aspect is currently in flux: the university's web presence is increasingly used for static information with longer validity, while up-to-date information is moving to the university's social media channels. As communication behavior changes, so does the medium.

For archives and libraries struggling to preserve online resources, this is both a truism and a challenge. And there is a second, no less significant challenge associated with web archiving: the cost in resources. For this reason, only a few libraries and archives, mostly large ones such as the German National Library, take on this task. Within Thuringia, no web archiving activities are currently apparent apart from a declaration of intent by the Thuringian State Library Jena.

The website archiving project

As part of the web archiving project, around 150 university-related web presences were examined first-hand in an elaborate review, assessing the value and quality of the information on the one hand and the associated copyrights on the other. Thirty web presences were classified as worth archiving, and their archiving was prepared accordingly. In parallel, an extensive evaluation of commercial and open-source tools for crawling websites and subsequently viewing them was carried out. Two open-source tools were selected, Heritrix and Pywb, both of which are under continuous development and had to be adapted to our needs. The timing and extent of the crawl process were determined, and work began on obtaining initial archiving rights from the respective website owners and on applying for additional storage space from the data centre. A first test crawl of the domain www.tu-ilmenau.de was carried out as early as December 2019. The intranet, on the other hand, which is hidden behind the employee login, could only be backed up manually, page by page. The website is archived together with its metadata in WARC, an ISO-standardized format that is currently the most promising one for web archiving. These files currently occupy about 180 GB.
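For readers unfamiliar with the format, the following is a minimal sketch of how a single WARC record is laid out: a version line, named header fields, a blank line, the payload, and two closing line breaks. It is an illustration only and not the project's tooling (in the project, the WARC files are produced by the crawler and replayed with the viewer). The function names, the sample payload and the sample date are hypothetical, and only a handful of header fields are shown.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple


def build_warc_record(target_uri: str, payload: bytes, when: datetime) -> bytes:
    """Serialise one 'resource' record in the WARC 1.0 layout (illustrative)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {stamp}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/html",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLF pairs after the payload block.
    return head + payload + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a single record back into its header fields and payload."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")   # split at the first colon only
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# Hypothetical sample: a tiny stand-in page and an illustrative capture date.
sample_record = build_warc_record(
    "https://www.tu-ilmenau.de/",
    b"<html>snapshot</html>",
    datetime(2019, 12, 1, tzinfo=timezone.utc),
)
sample_fields, sample_body = parse_warc_record(sample_record)
print(sample_fields["WARC-Target-URI"], len(sample_body), "bytes")
```

Because the metadata (address, capture time, content type) travels in the same file as the captured content, a WARC file remains interpretable without the crawler that produced it.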

The first results have recently been made available in advance to all members of TU Ilmenau via the archive's SharePoint presence. A time slice of the website and of the employee intranet is offered for each of the periods 2019/2020 and 2021. This gives members the opportunity to check what content has been saved and therefore does not necessarily need to be transferred to the new website. After all, streamlining the website was an important goal of the web relaunch.

Web archive portal accessible to all

In the course of further work, a web archive portal for TU Ilmenau is to be created that is accessible to all. In the future, the university web presence will be crawled twice a year. Extraordinary events, such as the coronavirus pandemic, have been and will continue to be captured additionally and at shorter intervals. The university's social media channels on Twitter and Facebook are archived as well. Other websites that have been deemed worth archiving, such as those of student societies, will also be archived in the future, provided the Archives hold the rights to do so, and will be made available to all either on the Archives' premises or online.

Core task of the archive

One of the basic functions of a university archive is to relieve the university's active data stock by archiving it, and to make the old but still historically, culturally and legally valuable data accessible in a legally separate space. This applies to analogue files as well as to websites, which are ultimately nothing other than official records. In this respect, web archiving is not a project but a core task of the archive. And yet one last step is still missing: the actual long-term archiving. Creating the appropriate structures for this is the university's task in the coming years.


Dr. Anja Kürbis

Head of the University Archives