SSH

Quite apart from the fancy web-based platforms, you can also do your research the old-fashioned way. Of course we offer normal SSH access to your project machine.

Note that the security system prevents you from SSH'ing into any KAC machine except your assigned project machine.

ssh {INITIALS}p{XXX}@kac-proj-{XXX}.kach.sblokalnet

where INITIALS are your initials and XXX is the project number.

From here, you can start R or Python directly, or run Java programs as needed.
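
For example (a minimal sketch; the interpreter names and the myprogram.jar file are assumptions and may differ on your machine):

R                         # interactive R session
python3                   # interactive Python session
java -jar myprogram.jar   # run a Java program you have uploaded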

Not all jobs can be written in R (sparklyr) or PySpark. If you need to start a Java-based Hadoop job, you can do it from this interface.
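
For instance, a Java MapReduce job packaged as a jar can be launched with the hadoop command (a minimal sketch; the jar name, main class and HDFS paths are hypothetical):

hadoop jar myjob.jar com.example.MyJob /inputPath /outputPath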

Here are examples of how Spark jobs can be written, without any dependencies on RStudio or Jupyter – https://spark.apache.org/examples.html
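
A standalone PySpark script can likewise be submitted straight from the shell (a minimal sketch; my_spark_job.py is a hypothetical script, and --master yarn assumes the cluster runs YARN):

spark-submit --master yarn my_spark_job.py

The same spark-submit command also accepts a jar for jobs written in Java or Scala.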

See MapReduce Streaming for another way to write a shell-based Hadoop job – https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html
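
With streaming, the mapper and reducer can be ordinary shell commands (a minimal sketch along the lines of the linked documentation; the location and version of the streaming jar, and the HDFS paths, depend on the installation):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /inputPath \
  -output /outputPath \
  -mapper /bin/cat \
  -reducer /usr/bin/wc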

HDFS and Unix tools

You can access the HDFS contents as detailed in the HDFS Guide.

If you need to use traditional Unix tools such as grep and awk on the data, this will probably be the way to do it:

hdfs dfs -cat /filepath | grep "linesMatchingThis" | hdfs dfs -put - /newFilePath

This command reads the data file as a stream, runs grep on each line and writes the matching lines back into HDFS; the "-" tells hdfs dfs -put to read from standard input. All of this happens without storing the data contents, which can be huge, on the local disk.
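
The same pattern works with other line-oriented tools such as awk (a hedged sketch; the semicolon delimiter, the column number and the HDFS paths are hypothetical):

hdfs dfs -cat /filepath | awk -F';' '{ print $3 }' | hdfs dfs -put - /thirdColumnPath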