Introduction
The text you are reading now is actually created as an RMarkdown Notebook in RStudio. The best way to read this would be to open the notebook in RStudio and follow along, evaluating the code. You can get the notebook file here.
R code can be evaluated either by pressing Ctrl-Shift-Enter with the cursor inside some code or by pressing the little green arrow at the right margin of the code blocks.
On Getting Started with R, you can read a short introduction on how to install R and RStudio.
Okay, so the aim of this short R walk-through is to show a path from basic R to running code on the DeIC Cultural Heritage Cluster at the Royal Danish Library (CHC).
Prologue
Computer languages like R, which have been around for a long time, live through different styles and opinionated principles. One such principle is expressed by the Tidyverse, which I like and advocate.
You enter the Tidyverse by loading the tidyverse library – almost. On a newly created R installation, you first need to install the libraries on your computer. This is done using the install.packages function in the following code part. Note that this is only necessary once!
#install.packages("tidyverse")
#install.packages("rlist")
Then you can load the tidyverse and some other necessary libraries for this tutorial.
library(tidyverse)
library(rlist) # for list.filter
library(readr) # for write_csv and read_csv
The very basics of R
Standard operators, i.e. plus, minus and so on, work as we expect
2 + 2
## [1] 4
Naming values
Values can be stored for later re-use by giving them a name. Naming values is done with an arrow, which can point either to the left or to the right.
2 + 2 -> a_named_value
another_named_value <- 2 + 3
a_named_value + another_named_value
## [1] 9
Pipelines
To me, one of the things that makes R wonderful to work with is the pipe operator: %>%. This operator takes what's on the left and sends it to whatever is on the right, e.g. "The Quick Brown Fox" %>% length applies the length function to the given sentence.
You enable the pipe operator by loading the magrittr library.
library(magrittr)
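Before we use the pipe in earnest, a tiny illustration: a pipe is just another way of writing an ordinary function call. Here we use the base nchar function (which counts characters) to show that both forms give the same answer.
nchar("The Quick Brown Fox")
## [1] 19
"The Quick Brown Fox" %>% nchar()
## [1] 19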
Okay, so what can we do with this pipe?
Let's say we have a string of words that we want to count. Evaluating such a sentence just gives us the same thing back:
"Gollum or Frodo And Sam"
## [1] "Gollum or Frodo And Sam"
To count elements in a list, i.e. the words in the sentence, we can use the length function:
"Gollum or Frodo And Sam" %>% length()
## [1] 1
Okay, so length receives a list with one element: the whole sentence. Let's split that sentence into words (observe that it's okay to break pipes into multiple lines):
"Gollum or Frodo And Sam" %>%
str_split(" ", simplify = TRUE) %>%
length()
## [1] 5
Now, the simplify = TRUE is needed because str_split can do a lot more than just split a sentence, but for now we just need the simple stuff.
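To see what simplify = TRUE changes, compare the two forms: without it, str_split returns a list holding one character vector, which length() again counts as a single element.
"Gollum or Frodo And Sam" %>% str_split(" ") %>% length()
## [1] 1
"Gollum or Frodo And Sam" %>% str_split(" ", simplify = TRUE) %>% length()
## [1] 5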
Well, we're not content yet, as we don't want to count words like “or” and “and”. Such words are called stop words. In other words, we want to ignore words that belong to a list of stop words we define. In R, a list of words is defined thus:
c("or", "and")
## [1] "or" "and"
If we want to know whether a word is contained in such a list, i.e. we want to ask whether “or” is in the list (“or”, “and”), we can do like this:
"or" %in% c("or", "and")
## [1] TRUE
But we really want to ask whether “or” is not in the list. In most computer languages, a true or false statement can be negated with the ! character.
!FALSE
## [1] TRUE
So, our list checking expression becomes
!("or" %in% c("or", "and"))
## [1] FALSE
Back to our word counting example. We can now filter the list of words using the above with the list.filter function:
"Gollum or Frodo And Sam" %>%
str_split(" ", simplify = TRUE) %>%
list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>%
length()
## [1] 4
Four?! Well, as we should know, computers are stupid and don't understand that when we say “and”, we also mean “And”. This is remedied by adding one more step to the pipeline:
"Gollum or Frodo And Sam" %>%
tolower() %>%
str_split(" ", simplify = TRUE) %>%
list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>%
length()
## [1] 3
Voila!
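If you are curious which words actually survive the filtering, just drop the final length() from the pipeline:
"Gollum or Frodo And Sam" %>%
  tolower() %>%
  str_split(" ", simplify = TRUE) %>%
  list.filter(some_word ~ !(some_word %in% c("or", "and")))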
Data in tables
Most data come in tables in one form or another. Data could be in an Excel spreadsheet, a csv file, a database table, an HTML table, and so on. R understands all these forms and can import them into an R data table, or data frame, as they are called in R.
A very easy way to create a data table, or frame, is to use the tibble package, again part of the Tidyverse. The following tribble call creates a data frame with two columns named letter_code and value:
tribble(
~letter_code, ~value,
"a", 2,
"b", 3,
"c", 4,
"π", pi,
"a", 9
)
## # A tibble: 5 x 2
## letter_code value
## <chr> <dbl>
## 1 a 2
## 2 b 3
## 3 c 4
## 4 π 3.14
## 5 a 9
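By the way, tribble is just a row-wise convenience. An equivalent column-wise sketch, using the tibble function from the same package, looks like this:
tibble(
  letter_code = c("a", "b", "c", "π", "a"),
  value       = c(2, 3, 4, pi, 9)
)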
Let's do that again and also give the table a name:
tribble(
~letter_code, ~value,
"a", 2,
"b", 3,
"c", 4,
"π", pi,
"a", 9
) -> some_data_frame
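Since tables so often arrive as csv files, here is a quick round-trip sketch using the write_csv and read_csv functions from the readr library we loaded earlier (the file name some_data.csv is made up for this example):
some_data_frame %>% write_csv("some_data.csv")  # export the table to a csv file
read_csv("some_data.csv")                       # and import it back as a data frame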
Data frames can be mutated, filtered, grouped, etc. As an example, let's look at all the rows that have value greater than 3:
some_data_frame %>%
filter(value > 3)
## # A tibble: 3 x 2
## letter_code value
## <chr> <dbl>
## 1 c 4
## 2 π 3.14
## 3 a 9
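The other verbs work in the same way. As a small sketch, mutate adds a computed column, while group_by and summarise aggregate values within groups:
some_data_frame %>%
  mutate(value_doubled = value * 2)  # add a computed column

some_data_frame %>%
  group_by(letter_code) %>%          # group the rows by letter
  summarise(total = sum(value))      # and sum the values per group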
Let's look at the filtered data in a visual way:
some_data_frame %>%
filter(value > 3) %>%
ggplot() +
geom_point(aes(x = letter_code, y = value))

The ggplot2 library is the best plotting library for R. It can produce everything from simple plots to animations and high quality plots ready for publication. It's also a part of the Tidyverse.
Here is a plot that aggregates the values by letter_code:
some_data_frame %>%
ggplot() +
geom_col(aes(x = letter_code, y = value))

Getting ready for large scale
Okay, so let's take R code to the next level. R is normally developed and run on a desktop computer or laptop, but it can also run as a server with a web browser interface — and you can hardly tell the difference.
As stated in the introduction, the aim of this text is to show how to run R analysis on the Cultural Heritage Cluster. This cluster is primarily an Apache Spark cluster, and of course R, through the Tidyverse, has an interface to such a Spark cluster.
Now, let's see how that works, but be aware: we're trying to break a butterfly upon a wheel…
First ensure that the package for the Spark integration is installed:
#install.packages("sparklyr")
Now, sparklyr can work against two different kinds of Spark clusters: a real cluster running on physical or virtual hardware in some server room, or a local pseudo cluster. The latter makes it easy for us to create the necessary code for an analysis before turning to the Real Big Thing.
Load the Spark library:
library(sparklyr)
If you want to run against a local pseudo instance, do this, which installs Apache Spark on your machine.
spark_install(version = "2.1.0")
## Spark 2.1.0 for Hadoop 2.7 or later already installed.
The only difference for us is how to initiate the cluster, pseudo or not:
# Sys.setenv(SPARK_HOME='/usr/hdp/current/spark2-client') # for connecting to the CHC
# sc <- spark_connect(master = 'yarn-client') # for connecting to the CHC
sc <- spark_connect(master = "local", version = "2.1.0") # for connection to a local Spark
## Re-using existing Spark connection to local
Interlude: Get some data
Fetch the works of Mark Twain. Text mining with Spark & sparklyr has a more in-depth example using this data.
Well, all of Mark Twain's works are available at Project Gutenberg, and R has an interface to that treasure trove.
# install.packages("gutenbergr") # evaluate if the package isn't installed already
library(gutenbergr)
Now, let's use the pipe operator and some functions from the gutenbergr package to fetch and store Mark Twain's writings. The expression takes a few minutes to complete.
gutenberg_works() %>%
filter(author == "Twain, Mark") %>%
pull(gutenberg_id) %>%
gutenberg_download() %>%
pull(text) %>%
writeLines("mark_twain.txt")
Okay, so what did we get?
readLines("mark_twain.txt", 20)
## [1] "WHAT IS MAN? AND OTHER ESSAYS"
## [2] ""
## [3] ""
## [4] "By Mark Twain"
## [5] ""
## [6] "(Samuel Langhorne Clemens, 1835-1910)"
## [7] ""
## [8] ""
## [9] ""
## [10] ""
## [11] "CONTENTS:"
## [12] ""
## [13] " What Is Man?"
## [14] ""
## [15] " The Death of Jean"
## [16] ""
## [17] " The Turning-Point of My Life"
## [18] ""
## [19] " How to Make History Dates Stick"
## [20] ""
Load the texts onto Spark
Now, the texts are on the local file system, but we want them in Spark. Remember that we are breaking butterflies on wheels here!
twain <- spark_read_text(sc, "twain", "mark_twain.txt")
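As a quick sanity check that the texts actually landed in Spark, we can count the rows with an ordinary dplyr verb, which sparklyr translates into a Spark job (the exact number of lines depends on what was downloaded):
twain %>% count()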
Analysis
The texts now have a copy in the Spark system, cluster, machine, or whatever we should call that thing. What's important is that we can use that copy for very large scale analysis. Here, we'll just do some very simple visualization.
First, let's get the data into a tidy form, i.e. remove all punctuation, remove stop words, and transform the text into a form with one word per row.
twain %>%
filter(nchar(line) > 0) %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
ft_tokenizer(input.col = "line",
output.col = "word_list") %>%
ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words") %>%
mutate(word = explode(wo_stop_words)) %>%
select(word) %>%
filter(nchar(word) > 2) %>%
compute("tidy_words") -> tidy_words
That snippet of code does a lot:
- The first filter function removes all empty lines (it keeps lines whose number of characters is greater than zero)
- the mutate function replaces all punctuation with spaces
- the ft_tokenizer function transforms each line into a list of words
- the ft_stop_words_remover removes a set of pre-defined stop words
- the second mutate takes the list of words on each line and transforms that list into multiple rows, one per word
- the select function removes all columns except the column with the word
- the last filter function removes words with only one or two letters
- the compute function stores the result in the Spark cluster for easy retrieval later
- and lastly the arrow saves that Spark result under an R name called tidy_words
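To convince yourself that the result looks as expected, peek at the first few rows (exactly which words you see depends on Spark's internal ordering):
tidy_words %>% head(6)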
Count the word frequencies
Okay, so that can be used to perform a word count. The arrange function sorts a data frame, and the desc function gives us descending order, i.e. the largest number first. n is an implicit name created by the count function; it refers to the number of occurrences of each thing counted.
tidy_words %>%
count(word) %>%
arrange(desc(n)) -> word_count
So, what were the ten most used words by Twain?
word_count %>% head(10)
## # Source: lazy query [?? x 2]
## # Database: spark_connection
## # Ordered by: desc(n)
## word n
## <chr> <dbl>
## 1 one 20028
## 2 would 15735
## 3 said 13204
## 4 could 11301
## 5 time 10502
## 6 man 8391
## 7 see 8138
## 8 two 7829
## 9 like 7589
## 10 good 7534
## # ... with more rows
Show me the data
Again, a visualization gives something extra, so we will now create a word cloud of Twain's words.
#install.packages("wordcloud")
library(wordcloud)
word_count %>%
arrange(desc(n)) %>%
head(70) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")
))

Next steps
And so on, towards ∞