Fancy web-based platforms aside, you can also do your research the old-fashioned way: we offer normal SSH access to your project machine.
Note that the security system prevents you from SSH'ing into any KAC machine other than your assigned project machine.
ssh {INITIALS}p{XXX}@kac-proj-{XXX}.kach.sblokalnet
where INITIALS are your initials and XXX is the project number.
From here you can start R or Python directly, or run Java programs as needed.
Not all jobs can be written in R (sparklyr) or PySpark. If you need to start a Java-based Hadoop job, you can do it from this shell; see the sketch below.
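As a minimal sketch, a Java-based Hadoop job is submitted with the hadoop jar command; the jar name, driver class and HDFS paths below are placeholders for your own job, not actual project artifacts:
# my-job.jar, com.example.MyJobDriver and the paths are placeholders
hadoop jar my-job.jar com.example.MyJobDriver /inputPath /outputPath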
Here are examples of how Spark jobs can be written without any dependency on RStudio or Jupyter: https://spark.apache.org/examples.html
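Such a job is submitted from the shell with spark-submit. As a sketch, the SparkPi example bundled with Spark can be run like this, assuming the cluster runs on YARN; the examples jar path and its version suffix depend on the local Spark installation:
# jar path is installation-specific; adjust to the local Spark version
spark-submit --master yarn --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_*.jar 100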
See MapReduce Streaming for another way to write a shell-based Hadoop job: https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html
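A streaming job lets ordinary executables act as mapper and reducer. A minimal sketch, assuming the streaming jar sits in the usual share/hadoop/tools/lib location (this varies between Hadoop distributions, and the HDFS paths are placeholders):
# streaming jar location and HDFS paths are placeholders
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /inputPath \
    -output /outputPath \
    -mapper /bin/cat \
    -reducer /usr/bin/wc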
HDFS and Unix tools
You can access the HDFS contents as detailed in the HDFS Guide.
If you need to use traditional Unix tools such as grep and awk on the data, this is probably the way to do it:
hdfs dfs -cat /filepath | grep "linesMatchingThis" | hdfs dfs -put - /newFilePath
This command reads the data file as a stream, runs grep on each line and writes the matching lines back into HDFS; the trailing "-" tells -put to read from standard input. All of this happens without storing the data contents, which can be huge, on the local disk.
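The same pattern works with awk. For example, to keep only the second column of a tab-separated file (the field separator, column number and paths are placeholders for your own data):
# field separator, column and paths are placeholders
hdfs dfs -cat /filepath | awk -F '\t' '{print $2}' | hdfs dfs -put - /newFilePath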