Install Java Module into Puppet
/etc/puppetlabs/code/environments/production$ sudo /opt/puppetlabs/bin/puppet module install puppetlabs/java
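To double-check that the module and its dependencies landed where Puppet expects them, you can list the installed modules (this assumes the default production environment used above):
sudo /opt/puppetlabs/bin/puppet module list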
Create the Spark manifest at /etc/puppetlabs/code/environments/production/manifests/spark.pp. Note that this hard-codes server names. Not ideal, but it's a starting point.
$master_hostname = 'spark-master.bpopp.net'

class { 'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.bpopp.net', 'spark2.bpopp.net'],
}

class { 'spark':
  master_hostname        => $master_hostname,
  hdfs_hostname          => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.bpopp.net' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /spark(1|2).bpopp.net/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.bpopp.net' {
  include hadoop::frontend
  include spark::frontend
}
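Once the manifest is in place, each node should pick up its role on its next Puppet run. A minimal sketch of triggering a run by hand, assuming the agents on each node already point at this Puppet server:
sudo /opt/puppetlabs/bin/puppet agent -t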
Manual Configuration
SSH Setup
The master must be able to SSH to the slaves without a password. To do this, you typically generate a key pair on the master and copy the public key to each slave. From the master:
ssh-keygen -t rsa
ssh-copy-id spark@spark1.lab.bpopp.net
ssh-copy-id spark@spark2.lab.bpopp.net
ssh-copy-id spark@spark3.lab.bpopp.net
Make sure it worked by trying:
ssh spark@localhost
You shouldn't be prompted for a password.
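It's worth repeating the check against every slave. A quick loop, assuming the same hostnames used with ssh-copy-id above:
for h in spark1 spark2 spark3; do ssh spark@$h.lab.bpopp.net hostname; done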
Spark Config
/usr/local/spark/conf/slaves
# A Spark Worker will be started on each of the machines listed below.
spark1
spark2
spark3
#spark4
/usr/local/spark/conf/spark-env.sh
# Make the Hadoop jars visible to Spark
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop classpath)
# Equivalent, using the default config directory:
# export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)
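Before relying on this in spark-env.sh, you can confirm the command actually resolves to a classpath string:
/usr/local/hadoop/bin/hadoop classpath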
/usr/local/spark/conf/spark-defaults.conf
# Example:
spark.master            spark://spark1.lab.bpopp.net:7077
spark.executor.memory   4500m
spark.driver.memory     3g
Create a Spark Service
/lib/systemd/system/spark.service
[Unit]
Description=Apache Spark Service

[Service]
User=spark
Group=spark
Type=forking
ExecStart=/usr/local/spark/sbin/start-all.sh
ExecStop=/usr/local/spark/sbin/stop-all.sh
WorkingDirectory=/home/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Environment=SPARK_HOME=/usr/local/spark
TimeoutStartSec=2min
Restart=on-failure
PIDFile=spark-spark-org.apache.spark.deploy.master.Master-1.pid

[Install]
WantedBy=multi-user.target
Hadoop Config
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/spark/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/spark/hdfs/dfs</value>
  </property>
</configuration>
/usr/local/hadoop/etc/hadoop/core-site.xml
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://0.0.0.0:9000</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>spark</value>
  </property>
</configuration>
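On a brand-new cluster, the namenode usually has to be formatted once before HDFS will start cleanly. A sketch, run as the spark user and assuming the directories configured in hdfs-site.xml above (this wipes any existing HDFS metadata, so only do it on a fresh install):
/usr/local/hadoop/bin/hdfs namenode -format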
Create a Hadoop Service
/lib/systemd/system/hadoop.service
[Unit]
Description=Hadoop DFS namenode and datanode

[Service]
User=spark
Group=spark
Type=forking
ExecStart=/usr/local/hadoop/sbin/start-all.sh
ExecStop=/usr/local/hadoop/sbin/stop-all.sh
WorkingDirectory=/home/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Environment=HADOOP_HOME=/usr/local/hadoop
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/hadoop-spark-namenode.pid

[Install]
WantedBy=multi-user.target
Jupyter Config
Install
sudo python3 -m pip install jupyter
ssh spark@localhost
jupyter notebook --generate-config
jupyter notebook password
/home/spark/.jupyter/jupyter_notebook_config.py
## The IP address the notebook server will listen on.
c.NotebookApp.ip = '0.0.0.0'  # default = localhost
Create Jupyter Service
/home/spark/.jupyter/env
PYSPARK_PYTHON=/usr/bin/python3
HADOOP_HOME=/usr/local/hadoop
SPARK_DIST_CLASSPATH=/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*
SPARK_HOME=/usr/local/spark
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
PYSPARK_SUBMIT_ARGS=--master spark://spark1.lab.bpopp.net:7077 pyspark-shell
Exit back to a sudo user
exit
/lib/systemd/system/jupyter.service
[Unit]
Description=Jupyter Notebook Server

[Service]
Type=simple
PIDFile=/run/jupyter.pid
EnvironmentFile=/home/spark/.jupyter/env
# Jupyter Notebook: change PATHs as needed for your system
ExecStart=/usr/local/bin/jupyter notebook
User=spark
Group=spark
WorkingDirectory=/home/spark/work
Restart=always
RestartSec=10
#KillMode=mixed

[Install]
WantedBy=multi-user.target
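After creating or editing any of these unit files, reload systemd so it picks up the changes (a standard step not shown above):
sudo systemctl daemon-reload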
Start the Services
ssh spark@localhost
/usr/local/spark/sbin/start-all.sh
/usr/local/hadoop/sbin/start-all.sh
Or, better yet, use the services:
sudo service spark start
sudo service hadoop start
sudo service jupyter start
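To confirm everything came up, check the unit status; if the full JDK is installed, jps run as the spark user should also list the Master, Worker, NameNode, and DataNode daemons on their respective hosts:
sudo systemctl status spark hadoop jupyter
sudo -u spark jps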
Finally, enable the services to start at boot:
sudo systemctl enable spark
sudo systemctl enable hadoop
sudo systemctl enable jupyter
Install Some Packages
Install pyspark
sudo pip3 install pyspark
sudo pip3 install plotly
sudo pip3 install pandas
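As a final smoke test, you can submit the Pi example that ships with Spark against the standalone master; the paths and master URL below assume the install locations and spark-defaults.conf from earlier:
/usr/local/spark/bin/spark-submit --master spark://spark1.lab.bpopp.net:7077 /usr/local/spark/examples/src/main/python/pi.py 10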