...
8. Open a browser session on your desktop and enter the URL 'localhost:8080' (i.e. the string within the single quotes) into its address bar. You should now see the notebook.
Spark
Spark is a fast and general-purpose cluster computing framework for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports cyclic data flow and in-memory computing.
For a short overview of how Spark runs on clusters, refer to the Spark Cluster Mode Overview.
The spark-slurm script launches Spark in standalone cluster mode and is integrated with the Slurm scheduler, automatically setting up a Spark mini-cluster on the nodes allocated to your job.
To use the script, you must first launch a job that allocates at least 2 nodes.
The script performs the following steps:
Launches a spark master process.
Launches a spark worker process on each allocated node, pointing each one to the master process.
Sets various default environment variables, some of which can be overridden.
The spark-slurm script on Habanero is a slightly modified version of the spark-slurm script available on GitHub.
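Conceptually, the script does something like the following sketch. This is a simplified, hypothetical illustration only, not the actual Habanero script; the port number, options, and variable names shown here are assumptions, and the real script sets additional defaults.
Code Block
#!/bin/bash
# Simplified sketch of a spark-slurm-style launcher (illustration only;
# the real script sets additional environment variables and defaults).

module load spark
export JAVA_HOME=/usr

# The node running this script acts as the Spark master.
MASTER_HOST=$(hostname)
MASTER_URL="spark://${MASTER_HOST}:7077"   # 7077 is the default standalone port

# Launch the master process in the background.
"$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.master.Master \
    --host "$MASTER_HOST" --port 7077 &
sleep 5   # give the master time to start listening

# Launch one worker per allocated node, each pointing at the master.
srun --ntasks="$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
    "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker "$MASTER_URL" &

echo "starting master: $MASTER_URL"
wait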
Set your environment:
Before running spark-slurm, JAVA_HOME must be set and the Spark environment module loaded:
Code Block
$ export JAVA_HOME=/usr
$ module load spark
To run spark within an interactive job allocation of 3 nodes (replacing <account> with your account):
Code Block
$ salloc -N 3 -A <account> --cpus-per-task 24 --mem=120G
...
Code Block
$ spark-slurm
Or, if you'd like to save the console output to a log file:
Code Block
$ spark-slurm > ~/.spark/spark-${SLURM_JOB_ID}.log &
...
Code Block
$ less ~/.spark/spark-${SLURM_JOB_ID}.log
After spark-slurm has successfully started a Spark cluster, look for the line beginning with starting master: ... and use that URL for your spark-shell or spark-submit commands.
To extract the Spark master URL from the log file:
Code Block
$ awk '/master:/ {print $NF}' ~/.spark/spark-${SLURM_JOB_ID}.log
spark://10.43.4.220:7077
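For example, one way to start an interactive spark-shell against that master is to capture the URL into a shell variable first:
Code Block
$ sparkmaster=$(awk '/master:/ {print $NF}' ~/.spark/spark-${SLURM_JOB_ID}.log)
$ spark-shell --master ${sparkmaster}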
...
Running spark as a non-interactive batch job
Example submit script spark-submit.sh:
Code Block
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --account=<your_account>
#SBATCH --nodes=3
#SBATCH --mem=120G
#SBATCH --cpus-per-task=24
#SBATCH --mail-user=<your_email>
#SBATCH --mail-type=ALL

# Set up the Spark environment.
module load spark
export JAVA_HOME=/usr

# Start the standalone Spark cluster in the background and log its output.
SPARK_LOG=~/.spark/spark-${SLURM_JOB_ID}.log
spark-slurm > $SPARK_LOG &

# Give the master and workers time to start before reading the log.
sleep 20

# Extract the master URL from the log, then submit the application to it.
sparkmaster=$(awk '/master:/ {print $NF}' $SPARK_LOG)
echo sparkmaster="$sparkmaster"
spark-submit --master $sparkmaster $SPARK_HOME/examples/src/main/python/wordcount.py $SPARK_HOME/README.md
Submit the job:
Code Block
$ sbatch spark-submit.sh
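You can monitor the job with the usual Slurm commands, for example:
Code Block
$ squeue -u $USER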
The console/log will also indicate the master WebUI port. To determine which port it was started on:
Code Block
$ grep UI ~/.spark/spark-9463479.log | grep Success
2018-10-20 09:42:53 INFO Utils:54 - Successfully started service 'MasterUI' on port 8082.
To connect to the Spark master WebUI, you can launch google-chrome from a login node over X11 (X Windows). If you are using Windows, install and run Xming, then connect with PuTTY with SSH X11 forwarding enabled.
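From a Linux or macOS desktop, the equivalent is to enable X11 forwarding on the ssh command line (the user and hostname below are placeholders; use your own account and the cluster's login node):
Code Block
$ ssh -X <user>@<login_node>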
Run chrome in the Xwindows session.
Code Block
$ google-chrome &
This should bring up a new browser window which is running on the login node. This is necessary since you cannot directly connect to the compute node's internal network from your personal computer. In that browser, load the URL for the master WebUI, for example:
Example URL only; replace with the actual master node and port shown in the log file:
node220:8082
Use spark-submit to submit Spark programs to the Spark cluster, using the master URL obtained above.
Submit the Spark wordcount example program to the Spark cluster:
Code Block
$ sparkmaster=spark://10.43.4.220:7077
$ spark-submit --master ${sparkmaster} $SPARK_HOME/examples/src/main/python/wordcount.py $SPARK_HOME/README.md
View help:
Code Block
$ spark-submit -h
If needed, you can set the total number of executor cores or the executor memory by supplying flags to spark-submit. For example, specifying total executor cores and executor memory:
Code Block
$ spark-submit --total-executor-cores 48 --executor-memory 5G --master ${sparkmaster} $SPARK_HOME/examples/src/main/python/wordcount.py $SPARK_HOME/README.md
When you exit the job allocation or when your job ends, your Spark master and worker processes will be killed.
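If you want to shut the Spark cluster down before the walltime expires, simply end the allocation; for a batch job you can also cancel it explicitly (the job ID below is a placeholder):
Code Block
$ scancel <job_id>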