# Introduction

The text you are reading now is actually created as an RMarkdown Notebook in RStudio. The best way to read this would be to open the notebook in RStudio and follow along, evaluating the code. You can get the notebook file here.

R code can be evaluated either by pressing Ctrl-Shift-Enter with the cursor inside some code or by pressing the little green arrow at the right margin of the code blocks.

The post Getting Started with R gives a short introduction on how to install R and RStudio.

Okay, so the aim of this short R walk-through is to show a path from basic R to running code on the DeIC Cultural Heritage Cluster at the Royal Danish Library (CHC).

# Prologue

Computer languages like R, which have been around for a long time, live through different styles and opinionated principles. One such principle is expressed by the Tidyverse, which I like and advocate.

You enter the Tidyverse by loading the tidyverse library – almost. On a newly created R installation, you first need to install the libraries on your computer. This is done with the install.packages function in the following code part. Note that this is only necessary once!

#install.packages("tidyverse")
#install.packages("rlist")
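
If you are unsure whether the packages are already installed, a small check can tell you. This is only a sketch: missing_packages is a hypothetical helper name of my own, not something from the Tidyverse, and the actual install call is left commented out, just like above.

```r
# Hypothetical helper: returns those package names that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "rlist"))
to_install                       # character(0) when everything is in place
# install.packages(to_install)   # uncomment to install whatever is missing
```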


Then you can load the tidyverse and some other necessary libraries for this tutorial.

library(tidyverse)
library(rlist) # for list.filter


# The very basics of R

Standard operators, i.e. plus, minus and so on, work as we expect

2 + 2

## [1] 4


## Naming values

Values can be stored for later re-use by giving them a name. Naming values is done with an arrow and it can be performed both to the left and to the right.

2 + 2         -> a_named_value
another_named_value <-  2 + 3

a_named_value + another_named_value

## [1] 9


## Pipelines

To me, one of the things that makes R wonderful to work with is the pipe operator: %>%. This operator takes what's on the left and sends it to whatever is on the right, e.g. "The Quick Brown Fox" %>% length sends the given sentence to the length function.

You enable the pipe operator by loading the magrittr library.

library(magrittr)
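
As a small sketch of what the pipe actually does: x %>% f(y) is evaluated as f(x, y), so a pipeline and the corresponding nested call give the same result. The numbers are made up for the illustration.

```r
library(magrittr)  # provides the %>% operator

# The pipeline reads left to right ...
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# ... while the nested form must be read inside out
nested <- sum(sqrt(c(1, 4, 9)))

identical(piped, nested)  # TRUE, both are 6
```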


Okay, so what can we do with this pipe?

Let's say we have a string of words that we want to count. Evaluating such a sentence just gives us the same thing back:

"Gollum or Frodo And Sam"

## [1] "Gollum or Frodo And Sam"


To count elements in a list, i.e. the words in the sentence, we can use the length function:

"Gollum or Frodo And Sam" %>% length()

## [1] 1


Okay, so length receives a list with one element: the sentence. Let's split that sentence into words (observe that it's okay to break pipes into multiple lines):

"Gollum or Frodo And Sam" %>%
str_split(" ", simplify = TRUE) %>%
length()

## [1] 5


Now, the simplify = TRUE is needed because str_split can do a lot more than just split a sentence, but for now we just need the simple stuff.
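
To see why simplify matters, here is a minimal sketch comparing the two forms. Without simplify, str_split returns a list with one entry per input string, so length would still report 1 for our single sentence. The names words_list and words_matrix are only used for this illustration.

```r
library(stringr)

words_list   <- str_split("Gollum or Frodo And Sam", " ")
words_matrix <- str_split("Gollum or Frodo And Sam", " ", simplify = TRUE)

class(words_list)     # "list"
length(words_list)    # 1: one entry per input string
dim(words_matrix)     # 1 5: one row per input string, one column per word
length(words_matrix)  # 5
```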

Well, we're not content yet, as we don't want to count words like “or” and “and”. Such words are called stop words. In other words, we want to ignore words that belong to a list of stop words we define. In R, a list of words is defined thus:

c("or", "and")

## [1] "or"  "and"


If we want to know whether a word is contained in such a list, i.e. we want to ask whether “or” is in the list (“or”, “and”), we can do it like this:

"or" %in% c("or", "and")

## [1] TRUE


But we really want to ask whether “or” is not in the list. In most computer languages, a truth or false statement can be reversed by the ! character.

!FALSE

## [1] TRUE


So, our list checking expression becomes

!("or" %in% c("or", "and"))

## [1] FALSE
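
Both %in% and ! work on whole vectors at once, so the check can be applied to all words in one go. A small sketch in base R, with the words typed in by hand in lower case:

```r
words      <- c("gollum", "or", "frodo", "and", "sam")
stop_words <- c("or", "and")

is_stop_word <- words %in% stop_words  # FALSE TRUE FALSE TRUE FALSE
kept <- words[!is_stop_word]           # keep only the words that are not stop words

kept  # "gollum" "frodo" "sam"
```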


Back to our word counting example. We can now filter the list of words using the above with the list.filter function

"Gollum or Frodo And Sam" %>%
str_split(" ", simplify = TRUE) %>%
list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>%
length()

## [1] 4


Four?! Well, as we should know, computers are stupid and don't understand that when we say “and”, we also mean “And”. This is remedied by one more part to the pipeline

"Gollum or Frodo And Sam" %>%
tolower() %>%
str_split(" ", simplify = TRUE) %>%
list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>%
length()

## [1] 3


Voila!
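
If the pipeline is needed more than once, it can be wrapped in a function. The following is just one possible sketch: count_words is a hypothetical name, and the filtering is done by summing the vectorised %in% check instead of using list.filter.

```r
library(magrittr)
library(stringr)

# Hypothetical helper: counts the words in a sentence that are not stop words
count_words <- function(sentence, stop_words = c("or", "and")) {
  words <- sentence %>%
    tolower() %>%
    str_split(" ", simplify = TRUE)
  sum(!(words %in% stop_words))  # TRUE counts as 1, FALSE as 0
}

count_words("Gollum or Frodo And Sam")  # 3
```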

## Data in tables

Most data come in tables in one form or another. Data could be in an Excel spreadsheet, a csv file, a database table, an HTML table, and so on. R understands all these forms and can import them into an R data table, or data frame, as they are called in R.

A very easy way to create a data frame is to use the tibble package, again part of the Tidyverse. The following call to its tribble function creates a data frame with two columns named letter_code and value:

tribble(
~letter_code, ~value,
"a", 2,
"b", 3,
"c", 4,
"π", pi,
"a", 9
)

## # A tibble: 5 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 a            2
## 2 b            3
## 3 c            4
## 4 π            3.14
## 5 a            9


Let's do that again and also give the table a name

tribble(
~letter_code, ~value,
"a", 2,
"b", 3,
"c", 4,
"π", pi,
"a", 9
) -> some_data_frame


Data frames can be mutated, filtered, grouped etc. As an example, let's look at all the rows that have value greater than 3:

some_data_frame %>%
filter(value > 3)

## # A tibble: 3 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 c            4
## 2 π            3.14
## 3 a            9
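
Mutating and sorting work the same way. Here is a minimal sketch that adds a column and orders the rows. The data frame is repeated so the snippet can run on its own, and the names doubled and mutated_frame are only used for this illustration.

```r
library(dplyr)
library(tibble)

# Same data as above
some_data_frame <- tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
)

some_data_frame %>%
  mutate(doubled = value * 2) %>%        # add a new column computed from value
  arrange(desc(value)) -> mutated_frame  # largest value first

mutated_frame
```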


Look at the same data in a visual way

some_data_frame %>%
filter(value > 3) %>%
ggplot() +
geom_point(aes(x = letter_code, y = value))


The ggplot2 library is the best plotting library for R. It can produce everything from simple plots to animations and high quality plots ready for publication. It's also a part of the Tidyverse.

Here is a plot that aggregates the values into letter_codes:

some_data_frame %>%
ggplot() +
geom_col(aes(x = letter_code, y = value))
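
The bar for "a" ends up at 11 because geom_col stacks the two rows with that letter code. The numbers behind the bars can be computed explicitly by grouping and summarising. This is a sketch with the data repeated so it runs on its own; totals and total are names chosen for the illustration.

```r
library(dplyr)
library(tibble)

# Same data as above
some_data_frame <- tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
)

totals <- some_data_frame %>%
  group_by(letter_code) %>%       # one group per letter code
  summarise(total = sum(value))   # add up the values within each group

totals  # four rows; the total for "a" is 2 + 9 = 11
```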


# Getting ready for large scale

Okay, so let's take R code to the next level. R is normally developed and run on a desktop computer or laptop, but it can also run as a server with a web browser interface — and you can hardly tell the difference.

As stated in the introduction, the aim of this text is to show how to run R analysis on the Cultural Heritage Cluster. This cluster is primarily an Apache Spark cluster and, of course, R has an interface to such a Spark cluster through the Tidyverse.

Now, let's see how that works, but be aware: we're trying to break a butterfly upon a wheel…

First ensure that the package for the Spark integration is installed:

#install.packages("sparklyr")


Now, sparklyr can work against two different kinds of Spark clusters: a real cluster running on physical or virtual hardware in some server room, and a local pseudo cluster. The latter makes it easy for us to create the necessary code for analysis before turning to the Real Big Thing.

library(sparklyr)


If you want to run against a local pseudo instance, do this, which installs Apache Spark on your machine.

spark_install(version = "2.1.0")

## Spark 2.1.0 for Hadoop 2.7 or later already installed.


The only difference for us is how to initiate the cluster, pseudo or not:

# Sys.setenv(SPARK_HOME='/usr/hdp/current/spark2-client') # for connecting to the CHC
# sc <- spark_connect(master = 'yarn-client')             # for connecting to the CHC
sc <- spark_connect(master = "local", version = "2.1.0")   # for connection to a local Spark

## Re-using existing Spark connection to local
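
To check that the connection works, a tiny data frame can be copied to Spark and queried with the usual dplyr verbs. This is only a sketch under the assumption that the local Spark 2.1.0 from above is installed; the table name "numbers" and the data are made up.

```r
library(sparklyr)
library(dplyr)

# Same local connection as above; assumes spark_install() has been run
sc <- spark_connect(master = "local", version = "2.1.0")

# Copy a small, made-up data frame into Spark as the table "numbers"
numbers_tbl <- copy_to(sc, data.frame(x = 1:5), "numbers", overwrite = TRUE)

# The filter runs inside Spark; collect() brings the result back into R
big_numbers <- numbers_tbl %>%
  filter(x > 3) %>%
  collect()

big_numbers  # the rows with x equal to 4 and 5
```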


## Interlude: Get some data

Fetch the works of Mark Twain. Text mining with Spark & sparklyr has a more in-depth example using this data.

Well, all of Mark Twain's works are available at Project Gutenberg, and R has an interface to that treasure trove.

# install.packages("gutenbergr") # evaluate if the package isn't installed already
library(gutenbergr)


Now, let's use the pipe operator and some functions from the gutenbergr package to fetch and store Mark Twain's writings. The expression takes a few minutes to complete.

gutenberg_works()  %>%
filter(author == "Twain, Mark") %>%
pull(gutenberg_id) %>%
gutenberg_download() %>% # download the text of every listed work
pull(text) %>%
writeLines("mark_twain.txt")


Okay, so what did we get?

readLines("mark_twain.txt", 20)

##  [1] "WHAT IS MAN? AND OTHER ESSAYS"
##  [2] ""
##  [3] ""
##  [4] "By Mark Twain"
##  [5] ""
##  [6] "(Samuel Langhorne Clemens, 1835-1910)"
##  [7] ""
##  [8] ""
##  [9] ""
## [10] ""
## [11] "CONTENTS:"
## [12] ""
## [13] "     What Is Man?"
## [14] ""
## [15] "     The Death of Jean"
## [16] ""
## [17] "     The Turning-Point of My Life"
## [18] ""
## [19] "     How to Make History Dates Stick"
## [20] ""


## Load the texts onto Spark

Now, the texts are on the local file system, but we want them in Spark. Remember that we are breaking butterflies on wheels here!

twain <-  spark_read_text(sc, "twain", "mark_twain.txt")


# Analysis

The texts now have a copy in the Spark system, cluster, machine, or whatever we should call that thing. What's important is that we can use that copy for very large scale analysis. Here, we'll just do some very simple visualization.

First, let's get the data into a tidy form, i.e. remove all punctuation, remove stop words, and transform the text into a form with one word per row.

twain %>%

filter(nchar(line) > 0) %>%
mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%

ft_tokenizer(input.col = "line",
output.col = "word_list") %>%

ft_stop_words_remover(input.col = "word_list",
output.col = "wo_stop_words") %>%

mutate(word = explode(wo_stop_words)) %>%
select(word) %>%
filter(nchar(word) > 2) %>%
compute("tidy_words") -> tidy_words


That snippet of code does a lot:

• the first filter function removes all empty lines (it keeps only lines with more than zero characters)
• the first mutate function replaces all punctuation with spaces
• the ft_tokenizer function transforms each line into a list of words
• the ft_stop_words_remover function removes a set of pre-defined stop words
• the second mutate takes the list of words on each line and transforms that list into multiple rows, one per word
• the select function removes all columns except the column with the word
• the last filter function removes words with only one or two letters
• the compute function stores the result in the Spark cluster for easy retrieval later
• and lastly, the result is given the R name tidy_words
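
The steps above can be tried out locally, without Spark, on a couple of made-up lines. This is only a sketch of the same idea using stringr and tidyr instead of the Spark functions: the stop-word list is a tiny hypothetical one, and unnest plays the role of explode.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

lines      <- c("The Quick, brown fox!", "", "It jumps over the lazy dog.")
stop_words <- c("the", "it", "over")  # tiny hypothetical stop-word list

tidy_local <- tibble(line = lines) %>%
  filter(nchar(line) > 0) %>%                                        # drop empty lines
  mutate(line = str_replace_all(line, "[_\"'():;,.!?\\-]", " ")) %>% # punctuation to spaces
  mutate(word = str_split(str_squish(tolower(line)), " ")) %>%       # list of words per line
  unnest(word) %>%                                                   # one row per word
  filter(!(word %in% stop_words)) %>%                                # remove stop words
  select(word) %>%
  filter(nchar(word) > 2)                                            # drop very short words

tidy_local$word  # "quick" "brown" "fox" "jumps" "lazy" "dog"
```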

### Count the word frequencies

Okay, so that can be used to perform a word count. The arrange function sorts a data frame, and the desc function gives us descending order, i.e. the largest number first. n is an implicit name created by the count function; it refers to the number of occurrences of each thing counted.

tidy_words %>%
count(word) %>%
arrange(desc(n)) -> word_count


So, what were the ten most used words by Twain?

word_count %>% head(10)

## # Source:     lazy query [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(n)
##    word      n
##    <chr> <dbl>
##  1 one   20028
##  2 would 15735
##  3 said  13204
##  4 could 11301
##  5 time  10502
##  6 man    8391
##  7 see    8138
##  8 two    7829
##  9 like   7589
## 10 good   7534
## # ... with more rows
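
The same verbs behave identically on a small local data frame, which makes it easy to see where the n column comes from. A minimal sketch with made-up words:

```r
library(dplyr)
library(tibble)

tiny_words <- tibble(word = c("river", "boat", "river", "raft", "river", "boat"))

tiny_counts <- tiny_words %>%
  count(word) %>%      # creates the column n with the number of rows per word
  arrange(desc(n))     # most frequent word first

tiny_counts  # river 3, boat 2, raft 1
```

For local data frames, count(word, sort = TRUE) is a shortcut for the count and arrange steps.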


### Show me the data

Again, a visualization adds something extra, so we will now create a word cloud of Twain's words:

#install.packages("wordcloud")
library(wordcloud)

word_count %>%
arrange(desc(n)) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")
))


# Next steps

And so on, towards ∞

Posted by Per Møldrup-Dalum in R, Tech

## Getting Started with R

The DeIC National Cultural Heritage Cluster at the Royal Danish Library (CHC) has R as one of its two main interfaces, Python being the other one. R is very widespread in data-centric communities, including the digital humanities. This blog post describes how to get started with R with the main objective of enabling the use of R at the CHC. Still, most of the descriptions here are generic and platform agnostic.

The R Project describes R in the following way:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

The R Project describes what is called the R environment in the following way

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

We propose to use the RStudio platform for working with R. RStudio is a commercial organisation that develops tools and methods for and with R, and their mission is:

RStudio has a mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment. These tools further the cause of equipping everyone, regardless of means, to participate in a global economy that increasingly rewards data literacy.

We offer open source and enterprise ready tools for the R computing environment. Our flagship product is an Integrated Development Environment (IDE) which makes it easy for anyone to analyze data with R. We also offer many R packages, including Shiny and R Markdown, and a platform for sharing interactive applications and reproducible reports with others.

## Getting and installing R

As we propose to use RStudio for all things R, two things are needed: the R environment itself and the RStudio platform, where R is the language (and implementation) and RStudio is the workbench.

To download and install R, go to CRAN and select the package matching your platform. Windows, Linux, and macOS are all supported.

## The first code

A very fine and highly recommended introduction to R and Data Science using R is the 2017 book R for Data Science by Hadley Wickham and Garrett Grolemund. This book is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 and can be read freely on the web or bought from O’Reilly.

## Some notes on coding in R

As R is several decades old, a lot of R code has been written using a lot of styles and principles and a lot of extension libraries that add functionality to the base of R. In recent years, the biggest movement within the R community has been the Tidyverse. The Tidyverse is, in their own words:

R packages for data science

The tidyverse is an opinionated collection of R packages designed for data science.

All packages share an underlying design philosophy, grammar, and data structures.

https://www.tidyverse.org

The “tidy” in Tidyverse refers to an underlying principle for the structure of the data to be analyzed. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. This principle makes data much easier to clean, explore, visualize, analyse, and so on. An in-depth description of, and argumentation for, the tidy data principle can be found in Tidy data by Hadley Wickham (also published in The Journal of Statistical Software, vol. 59, 2014).
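
As a small illustration of the principle, here is a sketch that reshapes a made-up wide table into tidy form with pivot_longer from the tidyr package. The words, book names, and counts are hypothetical.

```r
library(tibble)
library(tidyr)

# Not tidy: the variable "book" is spread across two columns
wide <- tribble(
  ~word,   ~book_a, ~book_b,
  "river", 12,      30,
  "boat",   5,      18
)

# Tidy: one column per variable (word, book, n) and one row per observation
tidy <- pivot_longer(wide, cols = c(book_a, book_b),
                     names_to = "book", values_to = "n")

tidy  # four rows, one for each combination of word and book
```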

Oh, and Hadley Wickham has written an R style guide.

So, the Tidyverse contains libraries for creating and manipulating tidy data, but also tools for writing and communicating code and results: interactive notebooks, book writing systems, and interactive web applications, all at your R-enabled fingertips.

### Books

When finished with R for Data Science, a next logical step could be another R book under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license:

Text Mining with R by Julia Silge and David Robinson