Getting Started with R

The DeIC National Cultural Heritage Cluster (CHC) at the Royal Danish Library has R as one of its two main interfaces, Python being the other. R is very widespread in data-centric communities, including the digital humanities. This blog post describes how to get started with R, with the main objective of enabling the use of R at the CHC. Still, most of the descriptions here are generic and platform agnostic.

The R Project describes R in the following way:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.


R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.


One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.


R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

https://www.r-project.org/about.html

The R Project describes what is called the R environment in the following way:

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.


The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it of an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

https://www.r-project.org/about.html

We propose to use the RStudio platform for working with R. RStudio is a commercial organisation that develops tools and methods for and with R. They describe their mission as follows:

RStudio has a mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment. These tools further the cause of equipping everyone, regardless of means, to participate in a global economy that increasingly rewards data literacy.

We offer open source and enterprise ready tools for the R computing environment. Our flagship product is an Integrated Development Environment (IDE) which makes it easy for anyone to analyze data with R. We also offer many R packages, including Shiny and R Markdown, and a platform for sharing interactive applications and reproducible reports with others.

https://www.rstudio.com/about/

Getting and installing R

As we propose to use RStudio for all things R, two things are needed: the R environment itself and the RStudio platform. R is the language (and its implementation) and RStudio is the workbench.

To download and install R, go to CRAN and select the package matching your platform. Windows, Linux, and macOS are all supported.

To download and install RStudio, go to RStudio Download and select the RStudio Desktop – Open Source License matching your operating system. Again, all major systems are supported.
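Once both are installed, a quick way to check that everything works is to ask R for its own version in the RStudio console. The following is only a minimal sketch. The version threshold "3.0.0" is an arbitrary example chosen for illustration, not a CHC requirement.

```r
# Print the version string of the running R installation.
print(R.version.string)

# getRversion() returns the version as an object that can be compared.
# The threshold "3.0.0" is an arbitrary example, not an official requirement.
r_is_recent <- getRversion() >= "3.0.0"
print(r_is_recent)
```

If the first line prints a version string, R itself is working and RStudio has found it.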

The first code

A very fine and highly recommended introduction to R and data science using R is the 2017 book R for Data Science by Hadley Wickham and Garrett Grolemund. The book is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license and can be read freely on the web or bought from O’Reilly.
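Before opening the book, it can help to type a few lines into the RStudio console just to see R respond. The sketch below uses only base R. The numbers are made up for illustration, and the names `word_counts` and `share_of_total` are hypothetical.

```r
# Made-up example data: word counts of five hypothetical documents.
word_counts <- c(120, 340, 95, 410, 230)

# R is vectorised: functions and comparisons work on whole vectors at once.
mean_count <- mean(word_counts)
long_docs <- word_counts[word_counts > mean_count]

print(mean_count)  # 239
print(long_docs)   # 340 410

# New functionality is added by defining functions.
share_of_total <- function(x) x / sum(x)
shares <- share_of_total(word_counts)
print(round(shares, 2))
```

The same lines can be saved in a script file in RStudio and run again later, which is the first step towards reproducible analysis.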

Some notes on coding in R

As R is several decades old, a lot of R code has been written in many different styles, following different principles and relying on many extension libraries that add functionality to base R. In recent years, the biggest movement within the R community has been the Tidyverse. The Tidyverse is, in their own words:

R packages for data science


The tidyverse is an opinionated collection of R packages designed for data science.


All packages share an underlying design philosophy, grammar, and data structures.

https://www.tidyverse.org

The “tidy” in Tidyverse refers to an underlying principle about the structure of the data to be analyzed. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. This principle makes data much easier to clean, explore, visualize, analyse, and so on. An in-depth description of, and argument for, the tidy data principle can be found in Tidy data by Hadley Wickham (also published in The Journal of Statistical Software, vol. 59, 2014).
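To make the principle concrete, here is a small sketch that turns a "wide" table into a tidy one. The data are invented for illustration. The example uses base R's `reshape()` so that it runs without extra packages. In the Tidyverse the same step would normally be done with the tidyr package (for instance its `pivot_longer()` function), which is not shown here.

```r
# A "wide" table: one row per library, one column per year.
# The variable "year" is hidden in the column names, so this is not tidy.
# All names and numbers are made up for illustration.
wide <- data.frame(
  library = c("A", "B"),
  `2016` = c(10, 20),
  `2017` = c(15, 25),
  check.names = FALSE
)

# Reshape to tidy form: one row per observation (library, year),
# one column per variable (library, year, visits).
tidy <- reshape(
  wide,
  direction = "long",
  varying = c("2016", "2017"),
  v.names = "visits",
  timevar = "year",
  times = c(2016, 2017),
  idvar = "library"
)
rownames(tidy) <- NULL

print(tidy)
```

In the tidy version, "year" is an ordinary column, so filtering, grouping and plotting by year need no special handling of column names.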

Oh, and Hadley Wickham has written an R style guide.

So, the Tidyverse contains libraries for creating and manipulating tidy data, but also tools for writing and communicating code and results: interactive notebooks, book writing systems, and interactive web applications, all at your R-enabled fingertips.
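The Tidyverse packages are installed from CRAN like any other R package. The following is a minimal sketch of how one might check whether the `tidyverse` package is available before loading it. The helper name `is_installed` is hypothetical and not part of R, and the actual installation step needs an internet connection, so it is only suggested in a message.

```r
# Returns TRUE if the namespace of the named package can be loaded
# in this R installation. `is_installed` is a hypothetical helper name.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (!is_installed("tidyverse")) {
  # Installation requires an internet connection, so it is not run here.
  message("tidyverse is not installed; run install.packages(\"tidyverse\")")
} else {
  library(tidyverse)
}
```

In RStudio, packages can also be installed from the Packages pane instead of the console.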

Further reading

Books

When finished with R for Data Science, a logical next step could be one of these R books, also under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license:

Text Mining with R by Julia Silge and David Robinson

Advanced R by Hadley Wickham

Communities and online resources

If you want to learn R, make it a habit to visit R-bloggers, which has daily news and tutorials about R contributed by over 750 bloggers.

The R community is also very active on Twitter, where most R tweets are tagged with #rstats. Some important tweeters are:

  • Hadley Wickham (@hadleywickham) is the main author of much of the Tidyverse (not so long ago it was actually called the Hadleyverse) and of ggplot2, the primary plotting library for R. He is also the author of the books R for Data Science and Advanced R.
  • Mara Averick (@dataandme) tweets a lot on everything R and does so in a fun and entertaining way.
  • The Dane Thomas Lin Pedersen (@thomasp85) tweets a lot on data visualisation and is the author of many very interesting R packages.


Posted by Per Møldrup-Dalum