Spark/Hadoop Cluster: Difference between revisions

No edit summary
Line 66: Line 66:
         <value>1</value>
         <value>1</value>
     </property>
     </property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/var/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.checkpoint.dir</name>
<value>file:/var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.checkpoints.edits.dir</name>
<value>file:/var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/var/data/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>


</configuration>
</configuration>

Revision as of 08:19, 29 January 2024

Getting Started

This assumes the spark/hadoop cluster were configured in a particular way. You can see the general configuration from the Foreman page, but in general, spark was configured in the /usr/local/spark directory and hadoop was installed to /usr/local/hadoop.

This is a good guide for general setup of a single-node cluster

Once everything is up and running, these URL's should be available:

Passwordless SSH from Master

To allow the spark master user to ssh to itself (for a local worker) and also the workers, you need ssh passwordless to be enabled. This can be done by logging into the spark user on the master server and doing:

ssh-keygen -t rsa -P ""

Once the key has been generated, it will be in /home/spark/.ssh/id_rsa (by default). Copy it to the authorized hosts file (to allow spark to ssh to itself):


cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Or, for each worker, do something like:

ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net


Binding Spark to External Interface

If you want to be able to connect to your spark-master from an external PC, you will probably need to make the following change. If you do a

lsof -i -P -n | grep LISTEN

You may notice that spark is binding to a 127.0.0.1:7077 interface. This won't allow external connections. To fix it, you need to make sure the /etc/hosts file is mapping to your hostname:

127.0.0.1       localhost
192.168.2.31    spark1.lab.bpopp.net    spark1

And then in /usr/local/spark/conf/spark-env.sh, add:

export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net

Starting Spark

su spark
cd /usr/local/spark/sbin
./start-all.sh

Hadoop Configuration

From /usr/local/hadoop/etc/hadoop/core-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/var/data/hadoop/hdfs/nn</value>
	</property>
	<property>
		<name>dfs.checkpoint.dir</name>
		<value>file:/var/data/hadoop/hdfs/snn</value>
	</property>
	<property>
		<name>dfs.checkpoints.edits.dir</name>
		<value>file:/var/data/hadoop/hdfs/snn</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/var/data/hadoop/hdfs/dn</value>
	</property>
	<property>
		<name>dfs.webhdfs.enabled</name>
		<value>true</value>
	</property>

</configuration>


From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

</configuration>

Starting Hadoop

Note that the namenode needs to be formatted prior to startup or it will not work.

(assuming still spark user)

hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh

Spark UI