Per Møldrup-Dalum

How we observed an unexpectedly high peak in the number of hits for ‘kierkegaard’ in 2013

Background

In the initial phase of the DeiC National Pilot Project N.F.S. Grundtvig i danske medier, we have been discussing the design of the project and, consequently, which material from the Royal Danish Library to use for the quantitative analyses. One of the cultural heritage collections will be the Danish Netarchive.

The project aims to explore Grundtvig’s influence on Danish culture by looking at how his name appears, in terms of, but not limited to, frequency, semantics, and graph networks.

So, in order to get to know the data, we decided to use Kierkegaard as a comparison to Grundtvig.

This text describes the discovery of an anomaly in the frequency of Kierkegaard hits, i.e. we observed far more documents containing the term ‘kierkegaard’ than our intuition would expect. We also try to explain this anomaly, discuss how it might affect studies using the Net Archive, and suggest some paths going forward.

Exploring the data

We counted all text documents in the Net Archive that contain either ‘kierkegaard’ or ‘grundtvig’. To normalise the results, we also counted every text document. All counts were grouped by the month in which the documents were harvested.

In order to remove fluctuations caused by the harvest process, we use the relative number of hits compared to the total number of documents harvested in a given month, i.e. the percentage of documents containing the search term.

When we visualise the relative counts as a function of the harvest month, the anomaly mentioned above becomes apparent.
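As a rough sketch of how such a relative count can be computed and plotted with the Tidyverse, consider the following lines. The data frame monthly_counts and its columns month, kierkegaard_hits, and total_docs are hypothetical stand-ins for the actual counts; the real code is published in the gist linked at the end of this post.

library(tidyverse)

monthly_counts %>%                                       # hypothetical data frame of counts per month
  mutate(pct = 100 * kierkegaard_hits / total_docs) %>%  # percentage of documents with a hit
  ggplot() +
    geom_line(aes(x = month, y = pct)) +
    labs(x = "Harvest month", y = "% of documents containing 'kierkegaard'")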

Is it possible that 2% of all documents in August 2013 mentioned Kierkegaard? Well, 2013 marked the 200th anniversary of his birth (he was born on May 5th, 1813), which is probably part of the explanation, but it would be extreme if that alone accounted for the entire observation.

Exploring the data by searching and digging into this anomaly, we discovered that very few domains accounted for most of the hits. One of these domains was a Danish newspaper that ran a topic on Søren Kierkegaard throughout 2013. The topic was implemented as a drop-down menu on every web page, containing a link to the topic labelled with Kierkegaard’s name (‘Kierkegaard – 200 år’); therefore, every document harvested from that newspaper that year appears to mention Kierkegaard. On top of that, said newspaper was, and still is, harvested with a very high frequency, which boosts the effect. We confirmed the use of Kierkegaard in the menus by visual inspection, as can be seen in this screenshot. An interactive example can be enjoyed at The Internet Archive: JP 25. august 2013.

Example of Kierkegaard being part of a web page on an unrelated topic. Source: Jyllands-Posten.

So, to answer the question from above: yes, 2% of the documents from August 2013 in the archive did indeed contain the word Kierkegaard. Just not in a very semantically valuable form.

At the moment, we have no methods implemented to discern between ‘kierkegaard’ appearing in menus and ‘kierkegaard’ appearing in the actual content of a web page.

Going forward

This is an example of how the technical design of a web page can completely overshadow the actual content that one tries to analyse. We have always suspected this, but to our knowledge this is one of the first examples of it actually skewing the results. We were lucky that the skew was huge and easily observable.

Even though we have no methods implemented to remedy this, we do have a few ideas:

  • For the specific newspaper, we could remove all instances of the text “Kierkegaard – 200 år” from the results, as that specific wording is used in the menu of at least one newspaper. Still, that would only handle the skew for that specific newspaper.
  • We could identify all websites with an unreasonably high count of ‘kierkegaard’ and eliminate those websites completely from the results. This would, of course, introduce other kinds of skew.
  • We could look only at the broad harvests, as done by the Probing a Nation’s Web Domain project. Like the above, this introduces other forms of skew.
  • We could use an advanced re-rendering of the source HTML and try to identify how to discern between design and content elements. As there is no standard way of building web pages, this would also have to be implemented per domain/media house/web publisher.
  • Instead of re-rendering the complete page, we could use existing tools to extract just the visible content of a web page from the HTML code (see the sketch after this list).
  • We could come up with a heuristic for identifying when Kierkegaard is used in a sentence rather than in a design element. This could be a more general solution, as it is based on linguistics and not on web design or technicalities.
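As a sketch of the idea of extracting only the visible content, the rvest and xml2 packages can parse a harvested HTML document, drop typical design elements such as menus and navigation bars, and search only in the text that remains. The file name and the CSS selectors below are assumptions; a real solution would need selectors tuned to each publisher.

library(rvest)
library(xml2)

page <- read_html("harvested_page.html")   # hypothetical local copy of a harvested page

# Drop typical design elements before searching the text
xml_remove(html_nodes(page, "nav, header, footer, .menu, .dropdown"))

# The rendered text of what remains
content_text <- html_text2(page)

# Does the actual content mention Kierkegaard?
grepl("kierkegaard", tolower(content_text))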

It is important to realize that without some sort of processing, this text data cannot be used for topic modelling in any form.

Source code

The code producing the above graph and the analysis is for the time being published as a gist: why-so-many-hits.Rmd

Posted by Per Møldrup-Dalum in Uncategorized

A Tour of Denmark

In the coming months, the Cultural Heritage Cluster, together with DigHumLab, will be visiting the Danish universities and any other institutions that might wish for a visit.

The purpose of these visits is to present the Cultural Heritage Cluster as a project, as a technical infrastructure, and as a platform that gives Danish researchers the opportunity to try out quantitative digital methods on large or enormous amounts of data.

You can read more about what the Cultural Heritage Cluster is at https://kulturarvscluster.kb.dk/

We will bring a programme that typically fills half a day. Should your institution be interested in a visit, please contact the Cultural Heritage Cluster by e-mail.

So far, we have planned the following visits:

SDU on 29 November 2018 — Contact Katrine Frøkjær Baunvig baunvig@sdu.dk if you are interested in participating.

KU on 12 December 2018 — Registration and contact at https://kubkalender.kb.dk/event/3356480?&hs=a


AAU: still being planned

Posted by Per Møldrup-Dalum in Event

One-day event: Large Scale Computational Humanities

Come and join DeiC and the Royal Danish Library’s one-day event, which puts the spotlight on the Cultural Heritage Cluster (KAC).

The event, Large Scale Computational Humanities, takes place on 22 November from 10:00 to 15:00 at Jens Chr. Skous Vej 4 in Aarhus.

The event is about what the Cultural Heritage Cluster is and about the possibilities you have as a future user of the cluster.

The Cultural Heritage Cluster uses modern technologies from data science and, for the first time, makes it possible to carry out quantitative research projects on the digital Danish cultural heritage, for example radio and TV broadcasts, archived websites, or historical newspapers.

Two experts, Kristoffer Nielbo and Rasmus Handberg, will talk about working with large amounts of data in the humanities and the natural sciences, respectively.

Supercomputing and humanists go well together

Among DeiC’s tasks is to spread High Performance Computing (HPC) to new research areas such as the humanities and the social sciences.

To accommodate this, DeiC and the Royal Danish Library have entered into an agreement to establish the DeiC National Cultural Heritage Cluster at the Royal Danish Library. The establishment of the cluster strengthens research in the humanities, where the use of large datasets has so far been limited.

The Royal Danish Library has always taken part in national and international research and research infrastructure projects based on the Danish digital cultural heritage. The library therefore has solid knowledge of, and competences in, what it takes to offer, for example, searching for structures and patterns in large amounts of data.

We look forward to seeing you.

REGISTRATION PAGE

Posted by Per Møldrup-Dalum in Event

Podcast: Humanists also use supercomputers

In this episode of the podcast series Supercomputing i Danmark, you can hear an interview with the Royal Danish Library’s deputy director for IT Development and Infrastructure, Bjarne Andersen, and the project manager for the Cultural Heritage Cluster, Per Møldrup-Dalum. The interview addresses the following questions:

  • What is the Cultural Heritage Cluster?
  • How can it be used, and how is it being used?
  • How do you, as a researcher, get access?
  • And not least: what might the future hold for researchers within Large Scale Computational Humanities in Denmark?

This interview has been published on DeiC’s knowledge portal.

In the coming months, you will be able to hear more in this podcast series about the work on the Cultural Heritage Cluster.

Happy listening!

Posted by Per Møldrup-Dalum in Repost

A Very Short Introduction to R

Introduction

The text you are reading now is actually created as an RMarkdown Notebook in RStudio. The best way to read this would be to open the notebook in RStudio and follow along, evaluating the code. You can get the notebook file here.

R code can be evaluated either by pressing Ctrl-Shift-Enter with the cursor inside some code or by pressing the little green arrow at the right margin of the code blocks.

On Getting Started with R you can read a short introduction on how to install R and RStudio.

Okay, so the aim of this short R walk-through is to show a path from basic R to running code on the DeIC Cultural Heritage Cluster at the Royal Danish Library (CHC).

Prologue

Computer languages like R that have been around for a long time live through different styles and opinionated principles. One such principle is expressed by the Tidyverse, which I like and advocate.

You enter the Tidyverse by loading the tidyverse library – almost. On a newly created R installation, you first need to install the libraries on your computer. This is done using the install.packages function in the following code block. Note that this is only necessary once!

#install.packages("tidyverse")
#install.packages("rlist")

Then you can load the tidyverse and some other necessary libraries for this tutorial.

library(tidyverse)
library(rlist) # for list.filter
library(readr) # for write_csv and read_csv

The very basics of R

Standard operators, i.e. plus, minus, and so on, work as we expect:

2 + 2
## [1] 4

Naming values

Values can be stored for later re-use by giving them a name. Naming values is done with an arrow, and the arrow can point either to the left or to the right.

2 + 2         -> a_named_value
another_named_value <-  2 + 3

a_named_value + another_named_value
## [1] 9

Pipelines

To me, one of the things that make R wonderful to work with is the pipe operator: %>%. This operator takes what's on the left and sends it to whatever is on the right, e.g. "The Quick Brown Fox" %>% length() is the same as length("The Quick Brown Fox").

You enable the pipe operator by loading the magrittr library.

library(magrittr)

Okay, so what can we do with this pipe?

Let's say we have a string of words that we want to count. Evaluating such a sentence just gives us the same thing back:

"Gollum or Frodo And Sam"
## [1] "Gollum or Frodo And Sam"

To count elements in a list, i.e. the words in the sentence, we can use the length function:

"Gollum or Frodo And Sam" %>% length()
## [1] 1

Okay, so length receives a list with one element: the sentence. Let's split that sentence into words (observe that it's okay to break pipes into multiple lines):

"Gollum or Frodo And Sam" %>%
  str_split(" ", simplify = TRUE) %>% 
  length()
## [1] 5

Now, the simplify = TRUE is needed because str_split can do a lot more than just split a sentence, but for now we just need the simple stuff.
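To see the difference, compare the two calls below (output omitted): without simplify, str_split returns a list with one character vector per input string, so its length is 1, whereas simplify = TRUE returns a matrix of words, which is what we want to count here.

str_split("Gollum or Frodo And Sam", " ")                   # a list with one element
str_split("Gollum or Frodo And Sam", " ", simplify = TRUE)   # a matrix with one column per word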

Well, we're not content yet, as we don't want to count words like “or” and “and”. Such words are called stop words. In other words, we want to ignore words that belong to a list of stop words we define. In R, a list of words is defined thus:

c("or", "and")
## [1] "or"  "and"

If we want to know whether a word is contained in such a list, i.e. we want to ask whether “or” is in the list (“or”, “and”), we can do it like this:

"or" %in% c("or", "and")
## [1] TRUE

But we really want to ask whether “or” is not in the list. In most computer languages, a true or false statement can be negated with the ! character.

!FALSE
## [1] TRUE

So, our list checking expression becomes

!("or" %in% c("or", "and"))
## [1] FALSE

Back to our word counting example. We can now filter the list of words, using the expression above, with the list.filter function

"Gollum or Frodo And Sam" %>%
  str_split(" ", simplify = TRUE) %>% 
  list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>% 
  length()
## [1] 4

Four?! Well, as we should know, computers are stupid and don't understand that when we say “and”, we also mean “And”. This is remedied by adding one more step to the pipeline

"Gollum or Frodo And Sam" %>%
  tolower() %>% 
  str_split(" ", simplify = TRUE) %>% 
  list.filter(some_word ~ !(some_word %in% c("or", "and"))) %>% 
  length()
## [1] 3

Voila!

Data in tables

Most data come in tables in one form or another. Data could be in an Excel spreadsheet, a csv file, a database table, an HTML table, and so on. R understands all these forms and can import them into an R data table, or data frame, as they are called in R.
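For example, a csv file can be read into a data frame with the readr package loaded earlier. The file name here is just a hypothetical placeholder:

# Read a (hypothetical) csv file into a data frame
some_table <- read_csv("some_data.csv")

# write_csv does the reverse and stores a data frame as a csv file
write_csv(some_table, "some_data_copy.csv")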

A very easy way to create a data table, or data frame, is to use the tribble function from the tibble package, again part of the Tidyverse. The following call creates a data frame with two columns named letter_code and value:

tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
)
## # A tibble: 5 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 a            2   
## 2 b            3   
## 3 c            4   
## 4 π            3.14
## 5 a            9

Let's do that again and also give the table a name

tribble(
  ~letter_code, ~value,
  "a", 2,
  "b", 3,
  "c", 4,
  "π", pi,
  "a", 9
) -> some_data_frame

Data frames can be mutated, filtered, grouped, etc. As an example, let's look at all the rows that have a value greater than 3:

some_data_frame %>% 
  filter(value > 3)
## # A tibble: 3 x 2
##   letter_code value
##   <chr>       <dbl>
## 1 c            4   
## 2 π            3.14
## 3 a            9

Look at the same data in a visual way

some_data_frame %>% 
  filter(value > 3) %>% 
  ggplot() +
    geom_point(aes(x = letter_code, y = value))

(Plot: value plotted as points against letter_code for the rows with value greater than 3)

The ggplot2 library is the best plotting library for R. It can produce everything from simple plots to animations and high quality plots ready for publication. It's also a part of the Tidyverse.

Here is a plot that aggregates the values by letter_code:

some_data_frame %>% 
  ggplot() +
    geom_col(aes(x = letter_code, y = value))

(Plot: column chart of the summed value per letter_code)

Getting ready for large scale

Okay, so let's take R code to the next level. R is normally developed and run on a desktop computer or laptop, but it can also run as a server with a web browser interface — and you can hardly tell the difference.

As stated in the introduction, the aim of this text is to show how to run R analyses on the Cultural Heritage Cluster. This cluster is primarily an Apache Spark cluster, and of course R, through the Tidyverse, has an interface to such a Spark cluster.

Now, let's see how that works, but be aware: we're trying to break a butterfly upon a wheel…

First ensure that the package for the Spark integration is installed:

#install.packages("sparklyr")

Now, sparklyr works with two different kinds of Spark clusters: one is a real cluster running on physical or virtual hardware in some server room, and the other is a local pseudo cluster. The latter makes it easy for us to create the necessary code for the analysis before turning to the Real Big Thing.

Load the Spark library:

library(sparklyr)

If you want to run against a local pseudo instance, do the following, which installs Apache Spark on your machine.

spark_install(version = "2.1.0")
## Spark 2.1.0 for Hadoop 2.7 or later already installed.

The only difference for us is how to initiate the cluster, pseudo or not:

# Sys.setenv(SPARK_HOME='/usr/hdp/current/spark2-client') # for connecting to the CHC
# sc <- spark_connect(master = 'yarn-client')             # for connecting to the CHC
sc <- spark_connect(master = "local", version = "2.1.0")   # for connection to a local Spark
## Re-using existing Spark connection to local

Interlude: Get some data

Fetch the works of Mark Twain. Text mining with Spark & sparklyr has a more in-depth example using this data.

Well, all of Mark Twain's works are available at Project Gutenberg, and R has an interface to that treasure trove.

# install.packages("gutenbergr") # evaluate if the package isn't installed already
library(gutenbergr)

Now, let's use the pipe operator and some functions from the gutenbergr package to fetch and store Mark Twain's writings. The expression takes a few minutes to complete.

gutenberg_works()  %>%
  filter(author == "Twain, Mark") %>%
  pull(gutenberg_id) %>%
  gutenberg_download() %>%
  pull(text) %>%
  writeLines("mark_twain.txt")

Okay, so what did we get?

readLines("mark_twain.txt", 20)
##  [1] "WHAT IS MAN? AND OTHER ESSAYS"        
##  [2] ""                                     
##  [3] ""                                     
##  [4] "By Mark Twain"                        
##  [5] ""                                     
##  [6] "(Samuel Langhorne Clemens, 1835-1910)"
##  [7] ""                                     
##  [8] ""                                     
##  [9] ""                                     
## [10] ""                                     
## [11] "CONTENTS:"                            
## [12] ""                                     
## [13] "     What Is Man?"                    
## [14] ""                                     
## [15] "     The Death of Jean"               
## [16] ""                                     
## [17] "     The Turning-Point of My Life"    
## [18] ""                                     
## [19] "     How to Make History Dates Stick" 
## [20] ""

Load the texts into Spark

Now, the texts are on the local file system, but we want them in Spark. Remember that we are breaking butterflies on wheels here!

twain <-  spark_read_text(sc, "twain", "mark_twain.txt")

Analysis

The texts now have a copy in the Spark system, cluster, machine, or whatever we should call that thing. What's important is that we can use that copy for very large-scale analysis. Here, we'll just do some very simple visualization.

First, let's get the data into a tidy form, i.e. remove all punctuation, remove stop words, and transform the text into a form with one word per row.

twain %>%

  filter(nchar(line) > 0) %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%

  ft_tokenizer(input.col = "line",
               output.col = "word_list") %>%

  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words") %>%

  mutate(word = explode(wo_stop_words)) %>%
  select(word) %>%
  filter(nchar(word) > 2) %>%
  compute("tidy_words") -> tidy_words

That snippet of code does a lot:

  • The first filter function removes all empty lines (it keeps lines where the number of characters is greater than zero)
  • the mutate function replaces all punctuation with spaces
  • the ft_tokenizer function transforms each line into a list of words
  • the ft_stop_words_remover removes a set of pre-defined stop words
  • the second mutate takes the list of words on each line and transforms that list into multiple rows, one per word
  • the select function removes all columns except the column with the word
  • the last filter function removes words with only one or two letters
  • the compute function stores the result in the Spark cluster for easy retrieval later
  • and lastly, the result is given the R name tidy_words

Count the word frequencies

Okay, so that can be used to perform a word count. The arrange function sorts a data frame, and the desc function gives us descending order, i.e. the largest number first. n is an implicit column name created by the count function, and it refers to the number of occurrences of the thing being counted.

tidy_words %>%
  count(word) %>% 
  arrange(desc(n)) -> word_count

So, what were the ten most used words by Twain?

word_count %>% head(10)
## # Source:     lazy query [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(n)
##    word      n
##    <chr> <dbl>
##  1 one   20028
##  2 would 15735
##  3 said  13204
##  4 could 11301
##  5 time  10502
##  6 man    8391
##  7 see    8138
##  8 two    7829
##  9 like   7589
## 10 good   7534
## # ... with more rows

Show me the data

Again, a visualization adds something extra, so we will now create a word cloud of Twain's words

#install.packages("wordcloud")
library(wordcloud)
word_count %>%
  arrange(desc(n)) %>% 
  head(70) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word, 
    n,
    colors = c("#999999", "#E69F00", "#56B4E9","#56B4E9")
  ))

(Plot: word cloud of the 70 most frequent words in Twain's works)

Next steps

And so on, towards ∞

Posted by Per Møldrup-Dalum in R, Tech

Getting Started with R

The DeIC National Cultural Heritage Cluster at the Royal Danish Library (CHC) has R as one of its two main interfaces, Python being the other. R is very widespread in data-centric communities, including the digital humanities. This blog post describes how to get started with R, with the main objective of enabling the use of R at the CHC. Still, most of the descriptions here are generic and platform agnostic.

The R Project describes R in the following way:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.


R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.


One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.


R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

https://www.r-project.org/about.html

The R Project describes what is called the R environment in the following way

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.


The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it of an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

https://www.r-project.org/about.html

We propose to use the RStudio platform for working with R. RStudio is a commercial organisation developing tools and methods for and with R, and their mission is:

RStudio has a mission to provide the most widely used open source and enterprise-ready professional software for the R statistical computing environment. These tools further the cause of equipping everyone, regardless of means, to participate in a global economy that increasingly rewards data literacy.

We offer open source and enterprise ready tools for the R computing environment. Our flagship product is an Integrated Development Environment (IDE) which makes it easy for anyone to analyze data with R. We also offer many R packages, including Shiny and R Markdown, and a platform for sharing interactive applications and reproducible reports with others.

https://www.rstudio.com/about/

Getting and installing R

As we propose to use RStudio for all things R, two things are needed: the R environment itself and the RStudio platform, where R is the language (and implementation) and RStudio is the workbench.

To download and install R, go to CRAN and select the package matching your platform. Windows, Linux, and macOS are all supported.

To download and install RStudio, go to RStudio Download and select the RStudio Desktop – Open Source License matching your operating system. Again all major systems are supported.

The first code

A very fine and highly recommended introduction to R and Data Science using R is the 2017 book R for Data Science by Hadley Wickham and Garrett Grolemund. This book is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 and can be read freely on the web or bought from O’Reilly.

Some notes on coding in R

As R is several decades old, a lot of R code has been written using many different styles and principles, and a lot of extension libraries add functionality to base R. In recent years, the biggest movement within the R community has been the Tidyverse. The Tidyverse is, in their own words:

R packages for data science


The tidyverse is an opinionated collection of R packages designed for data science.


All packages share an underlying design philosophy, grammar, and data structures.

https://www.tidyverse.org

The “tidy” in Tidyverse refers to an underlying principle for the structure of the data to be analyzed. In tidy data, each variable is a column, each observation is a row, and each type of observational unit is a table. This principle makes data much easier to clean, explore, visualize, analyse, and so on. An in-depth description of, and argument for, the tidy data principle can be found in Tidy Data by Hadley Wickham (also published in the Journal of Statistical Software, vol. 59, 2014).
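As a small, made-up illustration of the principle, the following sketch uses the tidyr package from the Tidyverse to reshape a table with one column per year into a tidy table with one observation per row. Both the data and the column names are invented for the example:

library(tidyverse)

# Untidy: the year variable is spread across two columns
untidy <- tribble(
  ~country,  ~`2013`, ~`2014`,
  "Denmark",      10,      12,
  "Norway",        7,       9
)

# Tidy: each row is one observation of (country, year, value)
tidy <- untidy %>%
  pivot_longer(cols = c(`2013`, `2014`), names_to = "year", values_to = "value")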

Oh, and Hadley Wickham has written an R style guide.

So, the Tidyverse contains libraries for creating and manipulating tidy data, but also tools for writing and communicating code and results: from interactive notebooks and book writing systems to interactive web applications, all at your R-enabled fingertips.

Further reading

Books

When finished with R for Data Science, a logical next step could be one of these other R books, also under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license:

Text Mining with R by Julia Silge and David Robinson

Advanced R by Hadley Wickham

Communities and online resources

If you want to learn R, make a habit of visiting R-bloggers, which offers daily news and tutorials about R contributed by over 750 bloggers.

The R community is also very active on Twitter, where most R tweets are tagged with #rstats. Some important tweeters are:

  • Hadley Wickham hadleywickham is the main author of much of the Tidyverse (not so long ago it was actually called the Hadleyverse) and of ggplot2, the primary plotting library for R. He is also the author of the books R for Data Science and Advanced R.
  • Mara Averick dataandme tweets a lot about everything R and does so in a fun and entertaining way.
  • The Dane Thomas Lin Pedersen thomasp85 tweets a lot about data visualisation and is the author of many very interesting R packages.


Posted by Per Møldrup-Dalum in R, Tech