Introduction

What is a DeiC National pilot project on the Cultural Heritage Cluster?

The Cultural Heritage Cluster (KAC) provides a technical platform and technical abilities but does not provide knowledge related to the specific research domain. Neither does KAC provide knowledge about methods that will be suitable for the specific research question.

Working with KAC requires neither prior knowledge of code-based quantitative analysis nor experience with using KB’s collections as data. However, any prior knowledge will increase the value of the pilot project and raise the expectations of achievable results.

What does a DeiC pilot project on the Cultural Heritage Cluster contain?

A National DeiC pilot project contains the following elements:

  • Access to KAC for six consecutive months
  • 120 hours of consultancy, which is provided by the IT Development department at KB

General IT support is included along with the consultancy. KAC therefore offers support for problems such as non-functioning software, login trouble, etc. Time spent solving such problems does not count towards the allocated consultancy hours.

Structure of a pilot project

A typical project runs through the following phases:

  • Dialogue phase
    • Focus on determining data to be used
  • Milestone 1
    • The signing of the agreement to use KAC and signing of an extradition agreement for data from KB’s collections
  • ETL phase
    • Data extraction
  • Milestone 2
    • Extradition of data
  • Phase 1
    • Analysis work with intensive consultancy
  • Phase 2
    • Analysis work including sparring with a KB consultant
  • Milestone 3
    • Project closure

Dialogue phase

In the dialogue phase, the project process is planned. The following elements will be discussed.

  • How should the consultancy hours be distributed throughout the project?
    • E.g. three months with 80% of the hours followed by three months with the remaining 20%. This is equivalent to roughly one day per week for the first three months.
  • How should the collaboration be structured?
    • This could mean planning half or full days during the week where we sit separately but work together on the project, or it could be a looser structure spread over the workweek.
  • At which time of the month will our status meeting take place?
    • E.g. the first Wednesday of every month. These meetings outline the status of the pilot project and of the work in KAC. They are not intended for technical issues.
  • Will there be a need for one or two longer breaks in the access to KAC?
    • E.g. due to teaching a course or vacation.
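The example distribution above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `distribute_hours` is hypothetical, and counting three months as 13 weeks is an assumption, not something stated in the agreement.

```python
def distribute_hours(total_hours, phase1_percent, weeks_phase1, weeks_phase2):
    """Split consultancy hours between two phases and report hours per week."""
    phase1 = total_hours * phase1_percent / 100
    phase2 = total_hours - phase1
    return {
        "phase1_hours": phase1,
        "phase2_hours": phase2,
        "phase1_hours_per_week": phase1 / weeks_phase1,
        "phase2_hours_per_week": phase2 / weeks_phase2,
    }


# 120 hours in total, 80% in the first three months.
# Assumption: three months are counted as 13 weeks.
plan = distribute_hours(120, 80, 13, 13)
print(plan["phase1_hours"], plan["phase2_hours"])   # 96.0 24.0
print(round(plan["phase1_hours_per_week"], 1))      # 7.4
print(round(plan["phase2_hours_per_week"], 1))      # 1.8
```

With these assumptions the first three months come to a little over seven hours per week, which matches the "roughly one day per week" rule of thumb.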

Collaboration tools and channels are also determined in this phase. KAC uses e-mail as the primary channel for project-specific communication. General information is available on our website, which provides technical documentation, an FAQ and links to the main tools.

The project also needs a place to record the work log, decisions and developed code. Tools that could support this include Microsoft Teams, Atlassian Confluence, Basecamp, GitHub, etc. The responsibility for choosing these tools lies with the research project, as KB/KAC does not provide such tools.

Experience shows that a chat platform also benefits the collaboration. Numerous platforms are available, and many of them are integrated with the tools mentioned above. Slack is another example.

Another element to be dealt with in the dialogue phase is determining the data that is going to be used in the project. If the project desires to work with data from KB’s collections, KAC mediates the contact with relevant curators at KB.

Milestone 1 – the agreements

Before a project can be granted access to KAC, an agreement that describes what the access contains and involves for KB and the research project must be signed. The agreement is signed by the head of the institute/department, the leading researcher on the project and KB.

If the project desires to work with data from KB’s collections, an extradition agreement must also be drawn up. This is signed by the head of the institute/department, the leading researcher on the project and the responsible vice president for the digital cultural heritage at KB.

Both agreements are available as templates that are approved by legal offices at the IT University and Aarhus University.

ETL phase

ETL is an abbreviation of Extract, Transform, and Load. It covers the process of extracting data from KB’s collections, transforming the data into an analyzable form and making the data available on KAC.

This phase is technical and consists primarily of work done by the consultant at KB. However, questions about the data often arise along the way, and the project team must be prepared to address them.

The hours the consultant spends in this phase count towards the total number of consultancy hours granted to the project.
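To make the three ETL steps concrete, here is a minimal sketch in plain Python. It does not show KB's actual pipeline: the sample rows, the column names (`id`, `date`, `text`) and the JSON Lines output are all made up for illustration, and a real extraction would run against KB's collections rather than an in-memory string.

```python
import csv
import io
import json

# Made-up sample export; real data would come from KB's collections.
RAW_EXPORT = """id,date,text
1,1850-01-02,  Kjøbenhavn  den 2. Januar
2,1850-01-03,Aarhus den 3. Januar
"""


def extract(raw_csv):
    """Extract: read rows from the (hypothetical) source export."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: fix types, normalise whitespace, derive a year field."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "id": int(row["id"]),
            "year": int(row["date"][:4]),
            "text": " ".join(row["text"].split()),
        })
    return cleaned


def load(records):
    """Load: serialise to JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


records = transform(extract(RAW_EXPORT))
print(load(records))
```

Questions of the kind the project team has to answer typically surface in the transform step, for example how to treat historical spellings or incomplete dates.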

Milestone 2 – Data extradition

Once the ETL phase is complete, data are ready to be extradited to the project.

Phase 1 and phase 2

In these two phases, the actual analysis takes place. The difference between the phases is the intensity with which the consultant participates in the project. Typically, the consultant will use 80% of the granted hours in phase 1. These hours are primarily scheduled for whole-day collaborations. In phase 2, the remaining 20% is typically used for short, weekly sparring sessions.

Methods for analysis and tools

KAC is primarily based on an Apache Spark HPC solution, and all tools used for data analysis have to be compatible with this setup. To a limited extent, KAC also supports work with various kinds of machine learning.

As KAC is an Apache Spark cluster, all jobs must be run on Spark. Spark integrates well with both R and Python. The integration with R is described here: https://spark.rstudio.com. The corresponding solution for Python is called PySpark and is available here: https://pypi.org/project/pyspark/
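As a rough impression of what a Spark job looks like from Python, the sketch below counts words, first in plain Python and then with PySpark. The word count is a stand-in for a real analysis, and the function names and the application name `kac-sketch` are hypothetical. How a Spark session is obtained on KAC may differ from the generic `SparkSession.builder` call assumed here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def local_word_counts(texts):
    """Count words in plain Python; fine for small samples."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return dict(counts)


def spark_word_counts(spark, texts):
    """The same count, distributed with Spark (`spark` is a SparkSession)."""
    rdd = spark.sparkContext.parallelize(texts)
    pairs = rdd.flatMap(tokenize).map(lambda word: (word, 1))
    return dict(pairs.reduceByKey(lambda a, b: a + b).collect())


def run_on_spark(texts):
    """Start a session, run the count, and stop the session again."""
    # Imported here so the rest of the sketch runs without pyspark installed.
    from pyspark.sql import SparkSession

    # Assumption: a generic session; the setup on KAC may differ.
    spark = SparkSession.builder.appName("kac-sketch").getOrCreate()
    try:
        return spark_word_counts(spark, texts)
    finally:
        spark.stop()
```

Calling `local_word_counts(["a b a", "B"])` returns `{"a": 2, "b": 2}`. `run_on_spark` is meant to produce the same counts on a machine where PySpark is installed, with the work distributed across the cluster.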

In both cases, the main programming languages are Python and R, which can be accessed through RStudio Server, Jupyter Notebook, or SSH and the command line.

The research project is not expected to be experienced in Python or R from the outset. However, for the project to succeed, the project team must either develop competencies in Python or R or employ others who possess them. Both languages can be used, either separately or in combination. If R is chosen, it is natural to use RStudio. For more information on how to get started with R and RStudio, please take a look at these blog posts from our website: Getting Started and Introduction.

A widely used textbook on R and data science, R for Data Science, is freely available at https://r4ds.had.co.nz. The RMarkdown format is also useful for combining text, code and results in a single document. More information about RMarkdown can be found at https://rmarkdown.rstudio.com.

KAC offers remote access to the workspace of the research project. This remote connection is protected by a VPN, which requires OpenVPN client software to be installed on the computer from which access is desired.

Once secure and encrypted access to KAC has been established through the VPN, it is possible to access all the tools and services covered by the agreement on the use of KAC, e.g. RStudio Server, a Python Jupyter Notebook or an SSH connection.

RStudio can be tried freely in the cloud. At https://rstudio.cloud it is possible to sign up and learn how to use RStudio without having to install software on your computer. The same applies to Jupyter Notebook at https://jupyter.org/try. From these pages it is also possible to learn more and to download the full programs.

In these software systems, it is possible to load data, perform analyses, create diagrams and export results. For small data volumes, this can be done in the local work area. For larger data volumes, the analysis is sent to the connected HPC installation. For projects that want to work with machine learning, Keras is a useful tool to explore: https://keras.rstudio.com.