Aarhus City Council minutes (Aarhus Byraads forhandlingsprotokoller) 1930-1940: term frequency (tf) – inverse document frequency (idf)

This report documents the processing of data from Aarhus Stadsarkiv's GitHub repository [1]. In this case the starting point is the part of the dataset covering the years 1930 to 1940.

The dataset is structured as follows:

DESCRIPTION
The datasets consist of the transcribed and proof-read text from the annually printed minutes. Text from one specific agenda item on one specific page produces one row. If the same agenda item runs across several pages, it just produces several rows of text.

Each row has the following columns:

date_of_meeting
The date of the meeting (yyyy-mm-dd)

publication_page
The original pagenumber from the printed minutes

page_url
Link to a scanned copy of the printed page

record_ids
One or more record_ids that the current agenda item references. The ids are assigned by the City Council

text
The transcribed text [2]

The dataset was processed in the statistical program R, which offers a wide range of possibilities for statistical work and for subsequent graphical presentation of the results. In R you work with packages, which add various functionalities to the core set of R functions. In this case the relevant packages are:

library(tidyverse)
library(tidytext)
library(lubridate)
library(ggplot2)

Documentation for the individual packages:
*https://www.tidyverse.org/packages/
*https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
*https://lubridate.tidyverse.org/
*https://ggplot2.tidyverse.org/

For more information about R in general:
https://www.r-project.org/

Loading the data

First the dataset is loaded into R. This is done with a link to the dataset on Aarhus Stadsarkiv's GitHub:

meetings_1930_1940 <- read_csv("https://raw.githubusercontent.com/aarhusstadsarkiv/datasets/master/minutes/city-council/city-council-minutes-1930-1940.csv")

 

Cleaning the data

The data processing is based on the Tidy Data principle as implemented in the tidytext package. The idea is to take a text and split it into individual words, so that only one word appears per row in the dataset. This is, however, a problem for proper names of the form "M. O. Pedersen". With the tidytext approach this name becomes "M", "O", and "Pedersen", each on its own row. All punctuation is removed by the tidytext format, so the initials are reduced to "M" and "O". This entails a loss of meaning, since "M" and "O" on their own tell us nothing. We want to avoid this loss, which is done with regular expressions such as:

"([A-Z])\\. ([A-Z])\\. ([A-z-]+)", "\\1_\\2_\\3"

This expression makes R look for every case where a capital letter is followed by a period, a space, another capital letter, a period, a space, and then a capital letter followed by one or more further letters. The periods and spaces are then replaced with the character "_", so that:

"M. O. Pedersen" is changed to "M_O_Pedersen"
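
As a quick check of the idea, the first pattern can be tried on the example name on its own. This is just a small sketch; str_replace_all comes from the stringr package, which is loaded with the tidyverse:

str_replace_all("M. O. Pedersen",
                "([A-Z])\\. ([A-Z])\\. ([A-z-]+)",
                "\\1_\\2_\\3")
## [1] "M_O_Pedersen"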

A look at the minutes shows that the first name "Christian" is abbreviated as "Chr." followed by a surname. This and similar cases are also handled with regular expressions, as shown below:

meetings_1930_1940 %>% 
  mutate(text = 
           str_replace_all(
                          text, 
                          pattern = 
                            "([A-Z])\\. ([A-Z])\\. ([A-z-]+)", "\\1_\\2_\\3")) %>%
  mutate(text = 
           str_replace_all(
                          text, 
                          pattern = 
                            "([A-Z])\\. ([A-Z])\\. ([A-Z])\\. ([A-z-]+)", 
                            "\\1_\\2_\\3_\\4")) %>% 
  mutate(text = 
           str_replace_all(
                          text, 
                          pattern = 
                            "([A-Z])\\. ([A-Z][a-z]+)", "\\1_\\2")) %>% 
  mutate(text = 
           str_replace_all(
                          text, 
                          pattern = 
                            "Chr\\. ([A-z-]+)", "Chr_\\1" )) %>% 
  mutate(text = 
           str_replace_all(
                          text, 
                          pattern = 
                            "Vald\\. ([A-z]+)", "Vald_\\1")) -> meetings_1930_1940

This may well turn out to be insufficient, since other names may be abbreviated in similar ways, just as there may be more middle names than the three handled here.

The aim of this study is to find the most important words per year in the minutes of the Aarhus City Council. The problem, however, is that the time format in the Stadsarkiv data is a date of the form YYYY-MM-DD. Since we are only interested in the year, we can use the function 'year' from the lubridate package to extract the year and put it in its own column:

meetings_1930_1940 %>% 
  mutate(aar = year(date_of_meeting)) %>% 
  select(aar, text, record_ids)
## # A tibble: 14,508 x 3
##      aar text                                                   record_ids
##    <dbl> <chr>                                                  <chr>     
##  1  1930 Mødet den 3. April 1930. (For lukkede Døre). Fraværen… <NA>      
##  2  1930 Indstilling fra Skolekommission og Skoleudvalg angaae… 54-1930   
##  3  1930 IV. 1. J_Jensen, do. 2. E_Nordvig-Petersen, do. 3. K_… 54-1930   
##  4  1930 Ting talte for, at man ikke skulde forbigaa de gifte … 54-1930   
##  5  1930 "Andragende fra Restauratør Carl Hansen, \"Pavillonen… 47-1930   
##  6  1930 Andragende fra Drejer H_C_Salbro om Eftergivelse af F… 17_31-1929
##  7  1930 Andragende fra Arbejdsmand Aage Nielsen om Eftergivel… 17_39-1929
##  8  1930 Andragende fra Enke Amalie Rodenberg om Eftergivelse … 17_38-1929
##  9  1930 Mødet den 3. April 1930. Fraværende: Chr_Nielsen og V… <NA>      
## 10  1930 Indenrigsministeriets Samtykke til Køb af Ejendommen … 716-1929  
## # ... with 14,498 more rows

The next step is to transform the data into the aforementioned tidytext format, where each word ends up on its own row. This is done with the unnest_tokens function:

meetings_1930_1940 %>% 
  mutate(aar = year(date_of_meeting)) %>% 
  select(aar, text, record_ids) %>% 
  unnest_tokens(word, text)
## # A tibble: 1,157,594 x 3
##      aar record_ids word       
##    <dbl> <chr>      <chr>      
##  1  1930 <NA>       mødet      
##  2  1930 <NA>       den        
##  3  1930 <NA>       3          
##  4  1930 <NA>       april      
##  5  1930 <NA>       1930       
##  6  1930 <NA>       for        
##  7  1930 <NA>       lukkede    
##  8  1930 <NA>       døre       
##  9  1930 <NA>       fraværende 
## 10  1930 <NA>       chr_nielsen
## # ... with 1,157,584 more rows

Analysis

We are now interested in finding the words that occur most frequently per year in the period 1930-1940 covered by our dataset.

meetings_1930_1940 %>% 
  mutate(aar = year(date_of_meeting)) %>% 
  select(aar, text, record_ids) %>% 
  unnest_tokens(word, text) %>% 
  count(aar, word, sort = TRUE)
## # A tibble: 125,715 x 3
##      aar word      n
##    <dbl> <chr> <int>
##  1  1935 at     4979
##  2  1938 at     4010
##  3  1936 at     3949
##  4  1931 at     3922
##  5  1934 at     3863
##  6  1939 at     3818
##  7  1937 at     3732
##  8  1935 til    3499
##  9  1932 at     3374
## 10  1933 at     3238
## # ... with 125,705 more rows

Not surprisingly, it is small function words that occur most often each year. This is not particularly interesting for this study, so we now need a measure that lets us compare word frequencies across the years. We can get one by calculating the frequency of the word, or term:

\[f = \frac{n_{\textrm{term}}}{N_{\textrm{year}}}\]

Before we can take this step, however, we need R to count how many words there are in each individual year. This is done with the function group_by followed by summarise:

meetings_1930_1940 %>% 
  mutate(aar = year(date_of_meeting)) %>% 
  select(aar, text, record_ids) %>% 
  unnest_tokens(word, text) %>% 
  count(aar, word, sort = TRUE) %>% 

  group_by(aar) %>% 
  summarise(total = sum(n)) -> total_words


total_words
## # A tibble: 11 x 2
##      aar  total
##    <dbl>  <int>
##  1  1930  71014
##  2  1931 115062
##  3  1932 102449
##  4  1933 102346
##  5  1934 117568
##  6  1935 140307
##  7  1936 117619
##  8  1937 115134
##  9  1938 120169
## 10  1939 120286
## 11  1940  35640

Next we add the total number of words to our data frame, which is done with left_join:

meetings_1930_1940 %>% 
  mutate(aar = year(date_of_meeting)) %>% 
  select(aar, text, record_ids) %>% 
  unnest_tokens(word, text) %>% 
  count(aar, word, sort = TRUE) %>% 

  left_join(total_words, by = "aar") -> meetings_1930_1940
meetings_1930_1940
## # A tibble: 125,715 x 4
##      aar word      n  total
##    <dbl> <chr> <int>  <int>
##  1  1935 at     4979 140307
##  2  1938 at     4010 120169
##  3  1936 at     3949 117619
##  4  1931 at     3922 115062
##  5  1934 at     3863 117568
##  6  1939 at     3818 120286
##  7  1937 at     3732 115134
##  8  1935 til    3499 140307
##  9  1932 at     3374 102449
## 10  1933 at     3238 102346
## # ... with 125,705 more rows

We now have the numbers needed to calculate the word frequencies. Here we calculate it for "at" in 1935:

\[f(\textrm{at})=\frac{4979}{140307}=0.0354864690\]

By calculating term frequencies we can compare terms across years. It is, however, not particularly interesting to compare the use of the word "at" between years. We therefore need a way to "punish" words that occur frequently in all the years. For this we can use the inverse document frequency (idf):
\[\textrm{idf}(\textrm{term})=\ln\left(\frac{n}{N}\right)\]
where n is the total number of documents (in our case years) and N is the number of years in which the word occurs.

\[\textrm{idf}(\textrm{at})=\ln\left(\frac{11}{11}\right)=0\]
In this way we punish words that occur with high frequency in all, or many, of the years. A word that occurs in every year cannot tell us anything distinctive about a given year; such words get an idf of 0, and their tf_idf therefore also becomes 0, since tf_idf is defined as tf multiplied by idf.
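
For illustration, the same quantities can also be computed by hand with dplyr. The following is only a sketch, assuming the data frame built above with the columns aar, word, n and total:

meetings_1930_1940 %>% 
  group_by(word) %>% 
  mutate(years_with_word = n_distinct(aar)) %>%  # number of years in which the word occurs
  ungroup() %>% 
  mutate(tf = n / total,                         # term frequency within the year
         idf = log(n_distinct(aar) / years_with_word),
         tf_idf = tf * idf)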

Fortunately, R can calculate tf and tf_idf for all the words in one go with the bind_tf_idf function:

meetings_1930_1940 <- meetings_1930_1940 %>% 
  bind_tf_idf(word, aar, n)
meetings_1930_1940
## # A tibble: 125,715 x 7
##      aar word      n  total     tf   idf tf_idf
##    <dbl> <chr> <int>  <int>  <dbl> <dbl>  <dbl>
##  1  1935 at     4979 140307 0.0355     0      0
##  2  1938 at     4010 120169 0.0334     0      0
##  3  1936 at     3949 117619 0.0336     0      0
##  4  1931 at     3922 115062 0.0341     0      0
##  5  1934 at     3863 117568 0.0329     0      0
##  6  1939 at     3818 120286 0.0317     0      0
##  7  1937 at     3732 115134 0.0324     0      0
##  8  1935 til    3499 140307 0.0249     0      0
##  9  1932 at     3374 102449 0.0329     0      0
## 10  1933 at     3238 102346 0.0316     0      0
## # ... with 125,705 more rows

Nevertheless, we still do not see any interesting words. This is because R lists the words in ascending order, lowest tf_idf first.
We ask it to sort in descending order instead, so the highest tf_idf values come first:

meetings_1930_1940 %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))
## # A tibble: 125,715 x 6
##      aar word              n       tf   idf   tf_idf
##    <dbl> <chr>         <int>    <dbl> <dbl>    <dbl>
##  1  1938 1938            452 0.00376  0.452 0.00170 
##  2  1940 flyvepladsen     18 0.000505 1.70  0.000861
##  3  1933 j_chr_møller    103 0.00101  0.788 0.000793
##  4  1940 eksercerplads    16 0.000449 1.70  0.000765
##  5  1932 j_chr_møller     92 0.000898 0.788 0.000708
##  6  1937 raadhus          99 0.000860 0.788 0.000678
##  7  1940 stadsdyrlægen    10 0.000281 2.40  0.000673
##  8  1936 1936            245 0.00208  0.318 0.000663
##  9  1939 1939            392 0.00326  0.201 0.000654
## 10  1930 j_chr_møller     56 0.000789 0.788 0.000622
## # ... with 125,705 more rows

We see that 1938, 1936 and 1939 rank quite high on the list. This is probably because the record numbers are formed from the year in question. Before we make a graphical visualisation, we therefore remove all year numbers from the text.

stopord <- data_frame(word = c("1930", "1931", "1932", "1933", "1934", "1935", 
                                   "1936", "1937", "1938", "1939", "1940"))
meetings_1930_1940 <- anti_join(meetings_1930_1940, stopord, by = "word")

Visualisation

We can then move on to a graphical visualisation.

meetings_1930_1940 %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(aar) %>% 
  top_n(15) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf)) +
  geom_col(show.legend = FALSE, fill = "skyblue2") +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~aar, ncol = 3, scales = "free") +
  scale_y_continuous(labels = scales::comma_format(accuracy = 0.0001)) +
  coord_flip()
## Selecting by tf_idf

(Plot: the 15 words with the highest tf-idf for each year, 1930-1940.)

[1]: https://github.com/aarhusstadsarkiv/datasets/tree/master/minutes/city-council
[2]: Quoted from https://github.com/aarhusstadsarkiv/datasets/tree/master/minutes/city-council

Posted by Max Odsbjerg Pedersen in Uncategorized

A Tour of Denmark

In the coming months, Kulturarvsclusteret, together with DigHumLab, will be visiting the Danish universities and any other institutions that may wish it.

The purpose of these visits is to present Kulturarvsclusteret as a project, as technical infrastructure, and as a platform that gives Danish researchers the opportunity to try their hand at quantitative digital methods on large or enormous amounts of data.

You can read more about what Kulturarvsclusteret is at https://kulturarvscluster.kb.dk/

We bring a programme that will typically take up half a day. If your institution is interested in a visit, please get in touch with Kulturarvsclusteret by e-mail.

So far we have planned the following visits:

SDU, 29 November 2018: contact Katrine Frøkjær Baunvig at baunvig@sdu.dk if you are interested in participating.

KU, 12 December 2018: registration and contact at https://kubkalender.kb.dk/event/3356480?&hs=a


AAU: being planned

Posted by Per Møldrup-Dalum in event

One-Day Event: Large Scale Computational Humanities

Come and join DeiC and the Royal Danish Library's one-day event, which puts the spotlight on Kulturarvsclusteret (KAC).

The event, Large Scale Computational Humanities, takes place on 22 November from 10:00 to 15:00 at Jens Chr. Skous Vej 4 in Aarhus.

The event is about what Kulturarvsclusteret is and about the opportunities you, as a future user of the cluster, will have.

Kulturarvsclusteret uses modern technologies from data science and makes it possible, for the first time, to carry out quantitative research projects on the digital Danish cultural heritage, for example radio and TV broadcasts, archived websites, or historical newspapers.

Two experts, Kristoffer Nielbo and Rasmus Handberg, will talk about working with large amounts of data in the humanities and the natural sciences, respectively.

Supercomputing and humanists go together

Among DeiC's tasks is spreading High Performance Computing (HPC) to new research areas such as the humanities and the social sciences.

To meet this goal, DeiC and the Royal Danish Library have entered into an agreement to establish DeiC Nationale Kulturarvscluster, Det Kgl. Bibliotek. The establishment of the cultural heritage cluster strengthens research in the humanities, where the use of large datasets has so far been limited.

The Royal Danish Library has always taken part in national and international research and research-infrastructure projects based on Danish digital cultural heritage. The library therefore has solid knowledge of, and competencies in, what it takes to offer, for example, searches for structures and patterns in large amounts of data.

We look forward to seeing you.

REGISTRATION PAGE

Posted by Per Møldrup-Dalum in event

Podcast: Humanists use supercomputers too

In this episode of the podcast series Supercomputing i Danmark, you can hear an interview with Bjarne Andersen, the Royal Danish Library's deputy director for IT Development and Infrastructure, and Per Møldrup-Dalum, project manager for Kulturarvsclusteret. The interview addresses the following questions:

  • What is Kulturarvsclusteret?
  • How can it be used, and how is it being used?
  • How do you get access as a researcher?
  • And not least: what might the future hold for researchers in Large Scale Computational Humanities in Denmark?

The interview has been published on DeiC's knowledge portal.

In the coming months you will be able to hear more about the work on Kulturarvsclusteret in this podcast series.

Happy listening!

Posted by Per Møldrup-Dalum in repost

A Very Short Introduction to R

Introduction

The text you are reading now was actually created as an R Markdown notebook in RStudio. The best way to read it would be to open the notebook in RStudio and follow along, evaluating the code. You can get the notebook file here.

R code can be evaluated either by pressing Ctrl-Shift-Enter with the cursor inside some code or by pressing the little green arrow at the right margin of the code blocks.

On Getting Started with R you can read a short introduction on how to install R and RStudio.

Okay, so the aim of this short R walk-through is to show a path from basic R to running code on the DeIC Cultural Heritage Cluster at the Royal Danish Library (CHC).

Prologue

Computer languages like R, which have been around for a long time, live through different styles and opinionated principles. One such principle is expressed by the Tidyverse, which I like and advocate.

You enter the Tidyverse by loading the tidyverse library – almost. On a fresh R installation, you first need to install the libraries on your computer. This is done using the install.packages function in the following code block. Note that this is only necessary once!

#install.packages("tidyverse")
#install.packages("rlist")

Then you can load the tidyverse and some other necessary libraries for this tutorial.

library(tidyverse)
library(rlist) # for list.filter
library(readr) # for write_csv and read_csv

The very basics of R

Standard operators, i.e. plus, minus and so on, work as we expect

2 + 2
## [1] 4

Naming values

Values can be stored for later re-use by giving them a name. Naming is done with an arrow, which can point either to the left or to the right.

2 + 2         -> a_named_value
another_named_value <-  2 + 3

a_named_value + another_named_value
## [1] 9

Pipelines

To me, one of the things that make R wonderful to work with is the pipe operator: %>%. This operator takes what's on the left and sends it to whatever is on the right, e.g. "The Quick Brown Fox" %>% length() sends the sentence on to the length function.

You enable the pipe operator by loading the magrittr library.

library(magrittr)

Okay, so what can we do with this pipe?

Let's say we have a string of words that we want to count. Evaluating such a sentence just gives us the same thing back:

"Gollum or Frodo And Sam"
## [1] "Gollum or Frodo And Sam"

To count elements in a list, i.e. the words in the sentence, we can use the length function:

"Gollum or Frodo And Sam" %>% length()
## [1] 1

Okay, so length receives a list with one element: the whole sentence. Let's split that sentence into words (observe that it's okay to break pipes across multiple lines):

"Gollum or Frodo And Sam" %>%
  str_split(" ", simplify = TRUE) %>% 
  length()
## [1] 5

Now, simplify = TRUE is needed because str_split can do a lot more than just split a sentence, but for now we just need the simple stuff.
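
To see what simplify does, compare with the same pipe without it: str_split then returns a list holding one character vector of five words, and length counts that list as a single element again (a small aside):

"Gollum or Frodo And Sam" %>%
  str_split(" ") %>% 
  length()
## [1] 1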

Well, we're not content yet, as we don't want to count words like "or" and "and". Such words are called stop words. In other words, we want to ignore words that belong to a list of stop words we define. In R, such a list of words is defined like this:

c("or", "and")
## [1] "or"  "and"

If we want to know whether a word is contained in such a list, i.e. we want to ask whether "or" is in the list ("or", "and"), we can do it like this:

"or" %in% c("or", "and")
## [1] TRUE

But we really want to ask whether "or" is not in the list. In most computer languages, a true or false statement can be reversed with the ! character.

!FALSE
## [1] TRUE

So, our list checking expression becomes

!("or" %in% c("or", "and"))
## [1] FALSE

Back to our word counting example. We can now filter the list of words using the above with the list.filter function

"Gollum or Frodo And Sam" %>%
  str_split(" ", simplify = TRUE) %>% 
  list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>% 
  length()
## [1] 4

Four?! Well, as we should know, computers are stupid and don't understand that when we say "and", we also mean "And". This is remedied by adding one more step to the pipeline:

"Gollum or Frodo And Sam" %>%
  tolower() %>% 
  str_split(" ", simplify = TRUE) %>% 
  list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>% 
  length()
## [1] 3

Voila!

Data in tables

Most data come in tables in one form or another. Data could be in an Excel spreadsheet, a csv file, a database table, an HTML table, and so on. R understands all these forms and can import them into an R data table, or data frame, as they are called in R.

A very easy way to create a data table, or data frame, is to use the tibble package, again part of the Tidyverse. The following call uses the tribble function to create a data frame with two columns named letter_code and value:

tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
)
## # A tibble: 5 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 a            2   
## 2 b            3   
## 3 c            4   
## 4 π            3.14
## 5 a            9

Let's do that again and also give the table a name

tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
) -> some_data_frame

Data frames can be mutated, filtered, grouped etc. As an example, let's look at all the rows that have value greater than 3:

some_data_frame %>% 
  filter(value > 3)
## # A tibble: 3 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 c            4   
## 2 π            3.14
## 3 a            9

Look at the same data in a visual way

some_data_frame %>% 
  filter(value > 3) %>% 
  ggplot() +
    geom_point(aes(x = letter_code, y = value))

(Plot: points of value against letter_code for the filtered rows.)

The ggplot2 library is the best plotting library for R. It can produce everything from simple plots to animations and high quality plots ready for publication. It's also a part of the Tidyverse.

Here is a plot that aggregates the values into letter_codes:

some_data_frame %>% 
  ggplot() +
    geom_col(aes(x = letter_code, y = value))

(Plot: values aggregated per letter_code, shown as columns.)

Getting ready for large scale

Okay, so let's take R code to the next level. R is normally developed and run on a desktop computer or laptop, but it can also run as a server with a web browser interface — and you can hardly tell the difference.

As stated in the introduction, the aim of this text is to show how to run R analysis on the Cultural Heritage Cluster. This cluster is primarily an Apache Spark cluster, and of course R, through the Tidyverse, has an interface to such a Spark cluster.

Now, let's see how that works, but be aware: we're trying to break a butterfly upon a wheel…

First ensure that the package for the Spark integration is installed:

#install.packages("sparklyr")

Now, sparklyr can work against two kinds of Spark clusters: a real cluster running on physical or virtual hardware in some server room, or a local pseudo cluster. The latter makes it easy for us to develop the necessary code for the analysis before turning to the Real Big Thing.

Load the Spark library:

library(sparklyr)

If you want to run against a local pseudo instance, do this, which installs Apache Spark on your machine.

spark_install(version = "2.1.0")
## Spark 2.1.0 for Hadoop 2.7 or later already installed.

The only difference for us is how to initiate the cluster, pseudo or not:

# Sys.setenv(SPARK_HOME='/usr/hdp/current/spark2-client') # for connecting to the CHC
# sc <- spark_connect(master = 'yarn-client')             # for connecting to the CHC
sc <- spark_connect(master = "local", version = "2.1.0")   # for connection to a local Spark
## Re-using existing Spark connection to local

Interlude: Get some data

Fetch the works of Mark Twain. Text mining with Spark & sparklyr has a more in-depth example using this data.

Well, all of Mark Twain's works are available from Project Gutenberg, and R has an interface to that treasure trove.

# install.packages("gutenbergr") # evaluate if the package isn't installed already
library(gutenbergr)

Now, let's use the pipe operator and some functions from the gutenbergr package to fetch and store Mark Twain's writings. The expression takes a few minutes to complete.

gutenberg_works()  %>%
  filter(author == "Twain, Mark") %>%
  pull(gutenberg_id) %>%
  gutenberg_download() %>%
  pull(text) %>%
  writeLines("mark_twain.txt")

Okay, so what did we get?

readLines("mark_twain.txt", 20)
##  [1] "WHAT IS MAN? AND OTHER ESSAYS"        
##  [2] ""                                     
##  [3] ""                                     
##  [4] "By Mark Twain"                        
##  [5] ""                                     
##  [6] "(Samuel Langhorne Clemens, 1835-1910)"
##  [7] ""                                     
##  [8] ""                                     
##  [9] ""                                     
## [10] ""                                     
## [11] "CONTENTS:"                            
## [12] ""                                     
## [13] "     What Is Man?"                    
## [14] ""                                     
## [15] "     The Death of Jean"               
## [16] ""                                     
## [17] "     The Turning-Point of My Life"    
## [18] ""                                     
## [19] "     How to Make History Dates Stick" 
## [20] ""

Load the texts onto Spark

Now, the texts are on the local file system, but we want them in Spark. Remember that we are breaking butterflies on wheels here!

twain <-  spark_read_text(sc, "twain", "mark_twain.txt")

Analysis

The texts now have a copy in the Spark system, cluster, machine, or whatever we should call that thing. What's important is that we can use that copy for very large scale analysis. Here, we'll just do some very simple visualization.

First, let's get the data into a tidy form, i.e. remove all punctuation, remove stop words, and transform the text into a form with one word per row.

twain %>%

  filter(nchar(line) > 0) %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%

  ft_tokenizer(input.col = "line",
               output.col = "word_list") %>%

  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words") %>%

  mutate(word = explode(wo_stop_words)) %>%
  select(word) %>%
  filter(nchar(word) > 2) %>%
  compute("tidy_words") -> tidy_words

That snippet of code does a lot:

  • The first filter function removes all empty lines (keeps only lines with more than zero characters)
  • the mutate function replaces all punctuation with spaces
  • the ft_tokenizer function transforms each line into a list of words
  • the ft_stop_words_remover removes a set of pre-defined stop words
  • the second mutate takes the list of words on each line and transforms that list into multiple rows, one per word
  • the select function removes all columns except the column with the word
  • the last filter function removes words with only one or two letters
  • the compute function stores the result in the Spark cluster for easy retrieval later
  • and lastly the result is saved under the R name tidy_words

Count the word frequencies

Okay, so that can be used to perform a word count. The arrange function sorts a data frame, and the desc function gives us descending order, i.e. the largest number first. n is an implicit name created by the count function, referring to the count of the thing being counted.

tidy_words %>%
  count(word) %>% 
  arrange(desc(n)) -> word_count

So, what were the ten most used words by Twain?

word_count %>% head(10)
## # Source:     lazy query [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(n)
##    word      n
##    <chr> <dbl>
##  1 one   20028
##  2 would 15735
##  3 said  13204
##  4 could 11301
##  5 time  10502
##  6 man    8391
##  7 see    8138
##  8 two    7829
##  9 like   7589
## 10 good   7534
## # ... with more rows

Show me the data

Again, a visualization adds something extra, so we will now create a word cloud of Twain's words:

#install.packages("wordcloud")
library(wordcloud)
word_count %>%
  arrange(desc(n)) %>% 
  head(70) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word, 
    n,
    colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")
  ))

(Plot: word cloud of the 70 most frequent words in Twain's works.)

Next steps

And so on, towards ∞

Posted by Per Møldrup-Dalum in R, tech

Getting Started with R

The DeIC National Cultural Heritage Cluster at the Royal Danish Library (CHC) has R as one of its two main interfaces, Python being the other. R is very widespread in data-centric communities, including the digital humanities. This blog post describes how to get started with R, with the main objective of enabling the use of R on the CHC. Still, most of the descriptions here are generic and platform agnostic.

The R Project describes R in the following way:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.


R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.


One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.


R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

https://www.r-project.org/about.html

The R Project describes what is called the R environment in the following way

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.


The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

https://www.r-project.org/about.html

We propose to use the RStudio platform for working with R. RStudio is a commercial organisation developing tools and methods for and with R, and their mission is:

RStudio has a mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment. These tools further the cause of equipping everyone, regardless of means, to participate in a global economy that increasingly rewards data literacy.

We offer open source and enterprise ready tools for the R computing environment. Our flagship product is an Integrated Development Environment (IDE) which makes it easy for anyone to analyze data with R. We also offer many R packages, including Shiny and R Markdown, and a platform for sharing interactive applications and reproducible reports with others.

https://www.rstudio.com/about/

Getting and installing R

As we propose to use RStudio for all things R, two things are needed: the R environment itself and the RStudio platform, where R is the language (and implementation) and RStudio is the workbench.

To download and install R, go to CRAN and select the package matching your platform. Windows, Linux, and macOS are all supported.

To download and install RStudio, go to RStudio Download and select the RStudio Desktop – Open Source License matching your operating system. Again all major systems are supported.

The first code

A very fine and highly recommended introduction to R and Data Science using R is the 2017 book R for Data Science by Hadley Wickham and Garrett Grolemund. This book is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 and can be read freely on the web or bought from O’Reilly.

Some notes on coding in R

As R is several decades old, a lot of R code has been written in many different styles, following many principles, and using many extension libraries that add functionality to base R. In recent years, the biggest movement within the R community has been the Tidyverse. The Tidyverse is, in their own words:

R packages for data science


The tidyverse is an opinionated collection of R packages designed for data science.


All packages share an underlying design philosophy, grammar, and data structures.

https://www.tidyverse.org

The "tidy" in Tidyverse refers to an underlying principle for the structure of the data to be analyzed. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. This principle makes data much easier to clean, explore, visualize, analyse, and so on. An in-depth description of, and argument for, the tidy data principle can be found in Tidy Data by Hadley Wickham (also published in the Journal of Statistical Software, vol. 59, 2014).
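
As a small illustration with made-up numbers, the same toy dataset can be written in a wide, untidy layout and then reshaped into a tidy one, where city, year, and value are each a column and every row is one observation. This is only a sketch, using tribble and tidyr's pivot_longer:

library(tidyverse)

# A hypothetical, untidy table: the year variable is spread across two columns
untidy <- tribble(
  ~city,    ~`2017`, ~`2018`,
  "Aarhus",      10,      12,
  "Odense",       7,       9
)

# The tidy version: one column per variable, one row per observation
untidy %>% 
  pivot_longer(cols = c(`2017`, `2018`),
               names_to = "year",
               values_to = "value")
## # A tibble: 4 x 3
##   city   year  value
##   <chr>  <chr> <dbl>
## 1 Aarhus 2017     10
## 2 Aarhus 2018     12
## 3 Odense 2017      7
## 4 Odense 2018      9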

Oh, and Hadley Wickham has written an R style guide.

So, the Tidyverse contains libraries for creating and manipulating tidy data, but also tools for writing and communicating code and results: from interactive notebooks and book-writing systems to interactive web applications, all at your R-enabled fingertips.

Further reading

Books

When finished with R for Data Science, the next logical step could be one of the other R books available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license:

Text Mining with R by Julia Silge and David Robinson

Advanced R by Hadley Wickham

Communities and online resources

If you want to learn R, make it a habit to visit R-bloggers, which has daily news and tutorials about R contributed by over 750 bloggers.

The R community is also very active on Twitter, where most R tweets are tagged with #rstats. Some important tweeters are:

  • Hadley Wickham hadleywickham is the main author of much of the Tidyverse (not so long ago it was actually called the Hadleyverse) and of ggplot2, the primary plotting library for R. He is also the author of the books R for Data Science and Advanced R
  • Mara Averick dataandme tweets a lot on everything R and does so in a fun and entertaining way.
  • The Dane Thomas Lin Pedersen thomasp85 tweets a lot about data visualisation and is the author of a number of very interesting R packages.


Posted by Per Møldrup-Dalum in R, tech