Installing Apache Spark via Puppet


Revision as of 04:26, 9 January 2020

Install Java Module into Puppet

/etc/puppetlabs/code/environments/production$ sudo /opt/puppetlabs/bin/puppet module install puppetlabs/java

Save the Spark manifest as /etc/puppetlabs/code/environments/production/manifests/spark.pp. Note that this hard-codes server names. Not ideal, but it's a starting point.

$master_hostname='spark-master.bpopp.net'

class{'hadoop':
  realm         => '',
  hdfs_hostname => $master_hostname,
  slaves        => ['spark1.bpopp.net', 'spark2.bpopp.net'],
}

class{'spark':
  master_hostname        => $master_hostname,
  hdfs_hostname          => $master_hostname,
  historyserver_hostname => $master_hostname,
  yarn_enable            => false,
}

node 'spark-master.bpopp.net' {
  include spark::master
  include spark::historyserver
  include hadoop::namenode
  include spark::hdfs
}

node /^spark(1|2)\.bpopp\.net$/ {
  include spark::worker
  include hadoop::datanode
}

node 'client.bpopp.net' {
  include hadoop::frontend
  include spark::frontend
}
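
Regex node names are easy to get subtly wrong, so it can help to try a pattern against a few hostnames before relying on it. The sketch below uses grep -E as a stand-in for Puppet's matcher (an assumption that holds for simple patterns like this one; Puppet itself uses Ruby regular expressions). The function name matches_node and the test hostnames are made up for illustration.

```shell
#!/bin/sh
# matches_node REGEX HOSTNAME
# Succeeds (exit 0) if HOSTNAME matches the extended regex REGEX.
# Hypothetical helper; grep -E only approximates Puppet's Ruby regexes.
matches_node() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Example (commented out so that sourcing this file runs nothing):
#   matches_node 'spark(1|2).bpopp.net' 'spark1xbpopp.net.example' && echo "loose pattern matches"
#   matches_node '^spark(1|2)\.bpopp\.net$' 'spark1xbpopp.net.example' || echo "strict pattern rejects"
```

An unanchored pattern with unescaped dots accepts any hostname that merely contains something shaped like the intended name. Anchoring with ^ and $ and escaping the dots limits the match to the exact worker names.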


SSH Setup

The master must be able to SSH to the slaves without a password. To do this, generate an SSH key pair on the master and copy its public key to each slave. From the master:

ssh-keygen -t rsa
ssh-copy-id spark@spark1.lab.bpopp.net
ssh-copy-id spark@spark2.lab.bpopp.net
ssh-copy-id spark@spark3.lab.bpopp.net

Make sure it worked by logging in to one of the slaves:

ssh spark@spark1.lab.bpopp.net

You shouldn't be prompted for a password.
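
To check every slave in one pass rather than by hand, something like the following could be used. It is only a sketch: check_passwordless_ssh and the SSH_BIN override are hypothetical names introduced here, and the spark user is taken from the commands above. BatchMode=yes makes ssh fail instead of prompting, so a host that still wants a password shows up as FAIL.

```shell
#!/bin/sh
# SSH_BIN is a hypothetical override hook (defaults to the real ssh client).
SSH_BIN="${SSH_BIN:-ssh}"

# check_passwordless_ssh HOST...
# Prints OK or FAIL per host; returns non-zero if any host failed.
check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes disables password prompts; "true" is a no-op remote command.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "spark@$host" true </dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example (commented out so that sourcing this file contacts no hosts):
#   check_passwordless_ssh spark1.lab.bpopp.net spark2.lab.bpopp.net spark3.lab.bpopp.net
```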


Spark Config

/usr/local/spark/conf/slaves

# A Spark Worker will be started on each of the machines listed below.
spark1
spark2
spark3
#spark4
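
Lines starting with # in this file are comments, so #spark4 above is disabled. As a quick way to see which workers the file actually enables, a small helper along these lines could be used (list_workers is a hypothetical name, not part of Spark):

```shell
#!/bin/sh
# list_workers FILE
# Prints one active worker hostname per line, dropping comments and blank lines.
list_workers() {
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
}

# Example (commented out so that sourcing this file runs nothing):
#   list_workers /usr/local/spark/conf/slaves
```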

/usr/local/spark/conf/spark-env.sh

# Add Hadoop's jars to Spark's classpath.
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop --config /usr/local/hadoop/etc/hadoop classpath)

/usr/local/spark/conf/spark-defaults.conf

# Example:
spark.master                       spark://spark1.lab.bpopp.net:7077
# spark.driver.memory              2g
spark.executor.memory              2g
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
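
Because most of this file is commented out, it is easy to lose track of which settings are live. The sketch below prints the value of one property and ignores commented lines. get_spark_conf is a hypothetical helper, and it assumes the whitespace-separated "key value" layout used above; if a key appears more than once it prints the last occurrence.

```shell
#!/bin/sh
# get_spark_conf FILE KEY
# Prints the value of KEY from a spark-defaults.conf style file, or nothing if
# KEY is absent or commented out. Runs of spaces inside a value are collapsed.
get_spark_conf() {
    awk -v key="$2" '
        $1 == key { $1 = ""; sub(/^[ \t]+/, ""); value = $0; found = 1 }
        END { if (found) print value }
    ' "$1"
}

# Example (commented out so that sourcing this file runs nothing):
#   get_spark_conf /usr/local/spark/conf/spark-defaults.conf spark.master
```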

Hadoop Config

/usr/local/hadoop/etc/hadoop/hdfs-site.xml


<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/dfs</value>
    </property>
</configuration>
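
The local directories named in this file need to exist and be writable by the user running HDFS. As a rough sketch, the paths can be pulled out of the file rather than typed again. hdfs_local_dirs is a hypothetical helper, and it assumes each value sits on one line in the file:/path form shown above.

```shell
#!/bin/sh
# hdfs_local_dirs FILE
# Prints the local path from every <value>file:/path</value> line in FILE.
hdfs_local_dirs() {
    sed -n 's|.*<value>file:\(.*\)</value>.*|\1|p' "$1"
}

# Example (commented out so that sourcing this file changes nothing on disk):
#   hdfs_local_dirs /usr/local/hadoop/etc/hadoop/hdfs-site.xml | xargs mkdir -p
```

On a fresh install the NameNode directory normally also has to be formatted once before HDFS is first started, for example with /usr/local/hadoop/bin/hdfs namenode -format run on the master. Formatting erases existing HDFS metadata, so it should not be repeated on a cluster that already holds data.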