Installing Apache Spark via Puppet

Revision as of 18:22, 15 October 2020 by Bpopp

Puppet Setup

Install Java Module into Puppet

/etc/puppetlabs/code/environments/production$ sudo /opt/puppetlabs/bin/puppet module install puppetlabs/java

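Add the Spark manifest at /etc/puppetlabs/code/environments/production/manifests/spark.pp. The hadoop and spark classes it declares must come from Puppet modules installed in the same environment. Note that this hard-codes the server names. Not ideal, but it's a starting point.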

$master_hostname='spark-master.bpopp.net'

class{'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.bpopp.net', 'spark2.bpopp.net'],
}

class{'spark':
  master_hostname        => $master_hostname,
  hdfs_hostname          => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.bpopp.net' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /^spark(1|2)\.bpopp\.net$/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.bpopp.net' {
  include hadoop::frontend
  include spark::frontend
}
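
Before the agents pick up the manifest, it can be worth checking that it at least parses. The following is a minimal sketch, assuming Puppet's `parser validate` subcommand and the paths used above. The `validate_manifest` function and the `PUPPET_BIN` override are hypothetical helpers for illustration, not part of Puppet.

```shell
# validate_manifest [MANIFEST]
# Hypothetical helper: runs "puppet parser validate" on the manifest.
# PUPPET_BIN is an assumed override; it defaults to the path used on this page.
validate_manifest() {
  puppet_bin="${PUPPET_BIN:-/opt/puppetlabs/bin/puppet}"
  manifest="${1:-/etc/puppetlabs/code/environments/production/manifests/spark.pp}"
  if "$puppet_bin" parser validate "$manifest"; then
    echo "syntax ok: $manifest"
  else
    echo "syntax errors in: $manifest" >&2
    return 1
  fi
}

# Usage (on the Puppet server):
#   validate_manifest
```

This only checks syntax. It does not confirm that the hadoop and spark classes resolve, which only shows up when an agent compiles its catalog.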

Removing An Existing Connection from DHCP

Update Hosts

Debian adds a line for the machine's own hostname to /etc/hosts, and it needs to be removed. Remove the spark1.lab.bpopp.net entry from /etc/hosts.
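
The removal can be scripted so it is repeatable across workers. This is a sketch under assumptions: the `remove_host_entry` function is a hypothetical helper, and it assumes the unwanted entry is any line that contains the fully qualified hostname as a whole word. It keeps a backup next to the file.

```shell
# remove_host_entry FILE HOSTNAME
# Hypothetical helper: copies FILE to FILE.bak, then rewrites FILE without
# any line that mentions HOSTNAME.
remove_host_entry() {
  hosts_file="$1"
  host="$2"
  cp "$hosts_file" "$hosts_file.bak"
  # -F: treat the hostname as a fixed string (dots are literal)
  # -w: match whole words only, -v: keep the lines that do NOT match
  # grep exits non-zero when nothing is left to print, hence "|| true"
  grep -vwF -e "$host" "$hosts_file.bak" > "$hosts_file" || true
}

# Usage (as root):
#   remove_host_entry /etc/hosts spark1.lab.bpopp.net
```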


Removing Existing DHCP connection

I ran into a situation where existing reservations were not correctly removed from the DHCP server, and I was getting:

2020-01-22T19:18:14 048c816a [E] Failed to add DHCP reservation for spark1.lab.bpopp.net (192.168.2.157 / 00:50:56:a9:85:94): Entry already exists

Corrected this issue by running:

omshell
connect
new host
set name="spark1.lab.bpopp.net"
open 
remove
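
The same omshell session can be fed from a script rather than typed interactively. This is a sketch only: `remove_dhcp_host` is a hypothetical wrapper, and it assumes, as in the session above, that a bare `connect` works without extra settings. Servers that require OMAPI authentication need `server`, `port` and `key` lines before `connect`.

```shell
# remove_dhcp_host HOSTNAME
# Hypothetical wrapper: pipes the omshell commands shown above into omshell,
# substituting the host name passed as the first argument.
remove_dhcp_host() {
  omshell <<EOF
connect
new host
set name="$1"
open
remove
EOF
}

# Usage (on the DHCP server):
#   remove_dhcp_host spark1.lab.bpopp.net
```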

Removing DNS entries from pfsense

I also ran into issues with pfSense caching DNS entries and Foreman not being able to overwrite them.

1) Use Status > DHCP Leases to remove the existing entry.

2) Flush the cached entry from the pfSense command line:

unbound-control -c /var/unbound/unbound.conf flush spark4.lab.bpopp.net

And then restart the unbound service from Status > Services > unbound.

Manual Configuration

SSH Setup

The master must be able to SSH to the slaves without a password. To do this, you typically generate an SSH key pair on the master and copy its public key to each slave. From the master:

ssh spark@localhost
ssh-keygen -t rsa
ssh-copy-id spark@spark1.lab.bpopp.net
ssh-copy-id spark@spark2.lab.bpopp.net
ssh-copy-id spark@spark3.lab.bpopp.net

Make sure it worked by trying:

ssh spark@localhost

You shouldn't be prompted for a password.
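
The localhost check only covers one machine. To check every worker in one go, something like the following could be used. It is a sketch: `check_workers` is a hypothetical helper, and it relies on ssh's `BatchMode=yes` option, which makes ssh fail instead of prompting for a password.

```shell
# check_workers HOST...
# Hypothetical helper: tries a passwordless login as "spark" on each host and
# reports the result. Returns non-zero if any host failed.
check_workers() {
  failed=0
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so a missing key is an error.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "spark@$host" true; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (as the spark user on the master):
#   check_workers spark1.lab.bpopp.net spark2.lab.bpopp.net spark3.lab.bpopp.net
```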

Spark Config

/usr/local/spark/conf/slaves

# A Spark Worker will be started on each of the machines listed below.
spark1
spark2
spark3
#spark4

/usr/local/spark/conf/spark-env.sh

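# Build the classpath from the Hadoop install. The second form is equivalent
# when Hadoop's default configuration directory is in use, so only one is needed.
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop classpath)
#export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)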

/usr/local/spark/conf/spark-defaults.conf

# Example:
spark.master                     spark://spark1.lab.bpopp.net:7077
spark.executor.memory           4500m
spark.driver.memory             3g

Create a Spark Service

/lib/systemd/system/spark.service

[Unit]
Description=Apache Spark Service
[Service]
User=spark
Group=spark
Type=forking
ExecStart=/usr/local/spark/sbin/start-all.sh
ExecStop=/usr/local/spark/sbin/stop-all.sh
WorkingDirectory=/home/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Environment=SPARK_HOME=/usr/local/spark
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/spark-spark-org.apache.spark.deploy.master.Master-1.pid


[Install]
WantedBy=multi-user.target

Hadoop Config

/usr/local/hadoop/etc/hadoop/hdfs-site.xml


<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/dfs</value>
    </property>
</configuration>

/usr/local/hadoop/etc/hadoop/core-site.xml

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://0.0.0.0:9000</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>spark</value>
    </property>

</configuration>
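
To confirm that Hadoop actually reads the values from hdfs-site.xml and core-site.xml, the effective settings can be queried with `hdfs getconf -confKey`. This is a sketch: `show_hdfs_settings` is a hypothetical helper, and `HDFS_BIN` is an assumed override that defaults to the install path used on this page.

```shell
# show_hdfs_settings
# Hypothetical helper: prints the effective value of each property set in the
# two files above, as resolved by "hdfs getconf".
show_hdfs_settings() {
  hdfs_bin="${HDFS_BIN:-/usr/local/hadoop/bin/hdfs}"
  for key in fs.defaultFS dfs.replication dfs.namenode.name.dir dfs.datanode.data.dir; do
    printf '%s = %s\n' "$key" "$("$hdfs_bin" getconf -confKey "$key")"
  done
}

# Usage (as the spark user):
#   show_hdfs_settings
```

If a printed value differs from what the files contain, Hadoop is most likely reading a different configuration directory.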

Create a Hadoop Service

/lib/systemd/system/hadoop.service

[Unit]
Description=Hadoop DFS namenode and datanode
[Service]
User=spark
Group=spark
Type=forking
ExecStart=/usr/local/hadoop/sbin/start-all.sh
ExecStop=/usr/local/hadoop/sbin/stop-all.sh
WorkingDirectory=/home/spark
Environment=JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
Environment=HADOOP_HOME=/usr/local/hadoop
TimeoutStartSec=2min
Restart=on-failure
PIDFile=/tmp/hadoop-spark-namenode.pid


[Install]
WantedBy=multi-user.target

Jupyter Config

Install

sudo python3 -m pip install jupyter 
ssh spark@localhost
jupyter notebook --generate-config
jupyter notebook password

/home/spark/.jupyter/jupyter_notebook_config.py

## The IP address the notebook server will listen on.
c.NotebookApp.ip = '0.0.0.0'  #default= localhost


Create Jupyter Service

/home/spark/.jupyter/env

PYSPARK_PYTHON=/usr/bin/python3
HADOOP_HOME=/usr/local/hadoop
SPARK_DIST_CLASSPATH=/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*
SPARK_HOME=/usr/local/spark
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64/
PYSPARK_SUBMIT_ARGS=--master spark://spark1.lab.bpopp.net:7077 pyspark-shell
PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.7-src.zip
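
The py4j file name in PYTHONPATH is version-specific, and it has to match the archive that ships with the installed Spark release. A quick way to look it up is sketched below. `find_py4j_zip` is a hypothetical helper, and it assumes the usual Spark layout of python/lib/py4j-VERSION-src.zip under the Spark home.

```shell
# find_py4j_zip [SPARK_HOME]
# Hypothetical helper: prints the py4j source archive(s) bundled with Spark.
# Returns non-zero if none is found.
find_py4j_zip() {
  spark_home="${1:-/usr/local/spark}"
  found=1
  for zip in "$spark_home"/python/lib/py4j-*-src.zip; do
    # When the glob matches nothing it stays literal, so test for existence.
    if [ -e "$zip" ]; then
      echo "$zip"
      found=0
    fi
  done
  return "$found"
}

# Usage:
#   find_py4j_zip /usr/local/spark
```

Whatever path it prints is the one to use in the PYTHONPATH line of the env file.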

Exit back to a user with sudo privileges:

exit

/lib/systemd/system/jupyter.service

[Unit]
Description=Jupyter Notebook Server

[Service]
Type=simple
PIDFile=/run/jupyter.pid

EnvironmentFile=/home/spark/.jupyter/env

# Jupyter Notebook: change PATHs as needed for your system
ExecStart=/usr/local/bin/jupyter notebook

User=spark
Group=spark
WorkingDirectory=/home/spark/work
Restart=always
RestartSec=10
#KillMode=mixed

[Install]
WantedBy=multi-user.target

Start the Services

ssh spark@localhost
/usr/local/spark/sbin/start-all.sh
/usr/local/hadoop/sbin/start-all.sh

Or, better yet, use the services:

sudo service spark start
sudo service hadoop start
sudo service jupyter start

Finally, enable the services so they start on boot:

sudo systemctl enable spark
sudo systemctl enable hadoop
sudo systemctl enable jupyter
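
To see at a glance whether all three units are running and enabled, a small status loop can help. This is a sketch: `report_services` is a hypothetical helper built on `systemctl is-active` and `systemctl is-enabled`. Note that systemd only notices new or edited unit files after `sudo systemctl daemon-reload`, so run that first if a service is reported as not found.

```shell
# report_services
# Hypothetical helper: prints the active and enabled state of each unit
# created on this page.
report_services() {
  for svc in spark hadoop jupyter; do
    printf '%s: active=%s enabled=%s\n' "$svc" \
      "$(systemctl is-active "$svc")" "$(systemctl is-enabled "$svc")"
  done
}

# Usage:
#   report_services
```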

Install Some Packages

Install pyspark, plotly, and pandas:

sudo pip3 install pyspark
sudo pip3 install plotly
sudo pip3 install pandas
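
After installing, it may be worth confirming that the interpreter Jupyter uses can import each package. This is a sketch: `check_python_packages` is a hypothetical helper, and it assumes `python3` on the PATH is the same interpreter as the /usr/bin/python3 set in PYSPARK_PYTHON.

```shell
# check_python_packages PACKAGE...
# Hypothetical helper: tries to import each package with python3 and reports
# the result. Returns non-zero if any import failed.
check_python_packages() {
  missing=0
  for pkg in "$@"; do
    if python3 -c "import $pkg" 2>/dev/null; then
      echo "$pkg: installed"
    else
      echo "$pkg: MISSING"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (as the spark user):
#   check_python_packages pyspark plotly pandas
```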