Spark/Hadoop Cluster: Difference between revisions

 
= Getting Started =
 
This assumes the Spark/Hadoop cluster was configured in a particular way. You can see the general configuration on the Foreman page, but in general, Spark was installed to the /usr/local/spark directory and Hadoop was installed to /usr/local/hadoop.
 
This is a [https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html good guide] for general setup of a single-node cluster.
 
Once everything is up and running, these URLs should be available:
 
* Spark: [http://spark1.lab.bpopp.net:8080 http://spark1.lab.bpopp.net:8080]
* Hadoop: [http://spark1.lab.bpopp.net:9870 http://spark1.lab.bpopp.net:9870]
* Jupyter: [http://spark1.lab.bpopp.net:8889/ http://spark1.lab.bpopp.net:8889/]
* Zeppelin: [http://spark1.lab.bpopp.net:8890/ http://spark1.lab.bpopp.net:8890/]
 
== pyenv ==
 
Install dependencies
<pre>
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
</pre>
 
Install pyenv using pyenv-installer:
 
<pre>curl https://pyenv.run | bash</pre>
 
Add to ~/.profile
<pre>
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile</pre>
 
 
List versions
<pre>pyenv install --list</pre>
 
Install Version
<pre>pyenv install 3.10.13</pre>
 
= Passwordless SSH from Master =
 
To allow the spark user on the master to ssh to itself (for a local worker) and to the workers, passwordless SSH must be enabled. Log in as the spark user on the master server and generate a key pair with an empty passphrase:


<pre>ssh-keygen -t rsa -P ""</pre>
Then copy the public key to the master itself and to each worker:


<pre>ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net</pre>
 
 
= Binding Spark to External Interface =
 
If you want to connect to your spark-master from an external PC, you will probably need to make the following change. First check which interface the master is listening on:
 
<pre>lsof -i -P -n | grep LISTEN</pre>
 
You may notice that Spark is bound to 127.0.0.1:7077. A loopback binding won't accept external connections. To fix it, make sure /etc/hosts maps your hostname to the machine's network IP address rather than to 127.0.0.1:
 
<pre>
127.0.0.1      localhost
192.168.2.31    spark1.lab.bpopp.net    spark1</pre>
 
And then in /usr/local/spark/conf/spark-env.sh, add:
 
<pre>
export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
</pre>
 
= Hadoop 3.3 and Java =
 
Hadoop apparently doesn't like newer versions of Java. Lots of random errors and problems and I just got tired of fighting them. Downgrading to a [https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html very old version of the JRE] (1.8) seemed to fix many issues.
 
I copied the folder to the /usr/local folder and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
 
<pre>export JAVA_HOME=/usr/local/jre1.8.0_202</pre>
 
= Starting Spark =
 
<pre>
su spark
cd /usr/local/spark/sbin
./start-all.sh
</pre>
 
= Hadoop Configuration =
 
From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
 
<pre>
 
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/datanode</value>
    </property>

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
</pre>
 
 
From /usr/local/hadoop/etc/hadoop/core-site.xml:
 
<pre>
 
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://spark1.lab.bpopp.net:9000</value>
    </property>
</configuration>
</pre>
 
= Starting Hadoop =
 
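Note that the namenode needs to be formatted once before the first startup or HDFS will not start. Formatting an existing namenode erases its filesystem metadata, so don't repeat it on later startups.
 
(assuming you are still the spark user)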
<pre>
hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh</pre>
 
= Spark UI =

Latest revision as of 22:17, 3 February 2024

Getting Started

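This assumes the Spark/Hadoop cluster was configured in a particular way. You can see the general configuration on the Foreman page, but in general, Spark was installed to the /usr/local/spark directory and Hadoop was installed to /usr/local/hadoop.

This is a good guide for general setup of a single-node cluster: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

Once everything is up and running, these URLs should be available:

Spark: http://spark1.lab.bpopp.net:8080
Hadoop: http://spark1.lab.bpopp.net:9870
Jupyter: http://spark1.lab.bpopp.net:8889/
Zeppelin: http://spark1.lab.bpopp.net:8890/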

pyenv

Install dependencies

sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Install pyenv using pyenv-installer:

curl https://pyenv.run | bash

Add to ~/.profile

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile


List versions

pyenv install --list

Install Version

pyenv install 3.10.13
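
Installing a version does not by itself make it the active interpreter. Below is a minimal sketch of one way to install the version only if it is missing and then select it for the current user. The function name use_python is made up for illustration, and pyenv global is only one option (pyenv local pins a version per directory instead).

```shell
# Hypothetical helper: install a Python version with pyenv if needed,
# then make it the default interpreter for the current user.
use_python() {
    version="$1"
    # -s (--skip-existing) turns the install into a no-op if the version is already present
    pyenv install -s "$version" || return 1
    # Select the version as this user's default
    pyenv global "$version"
}

# Usage:
#   use_python 3.10.13 && python --version
```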

Passwordless SSH from Master

To allow the spark user on the master to ssh to itself (for a local worker) and to the workers, passwordless SSH must be enabled. Log in as the spark user on the master server and generate a key pair with an empty passphrase:

ssh-keygen -t rsa -P ""

Once the key has been generated, the private key will be in /home/spark/.ssh/id_rsa (by default) and the public key in /home/spark/.ssh/id_rsa.pub. Append the public key to the authorized_keys file (to allow spark to ssh to itself):


cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Alternatively, use ssh-copy-id, which also works for each worker:

ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net
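
After the keys are copied, it helps to confirm that login really works without a prompt before starting any daemons. This is a rough sketch, not part of the original setup; check_passwordless_ssh is a made-up helper name and the 5 second timeout is an arbitrary choice.

```shell
# Hypothetical helper: report whether key-based login works for each host passed as an argument.
check_passwordless_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "spark@$host" true 2>/dev/null; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (as the spark user on the master):
#   check_passwordless_ssh localhost spark2.lab.bpopp.net
```

If a host fails even though the key was copied, file permissions are a common cause: with sshd's default StrictModes setting, keys are ignored when ~/.ssh or authorized_keys is writable by other users (chmod 700 ~/.ssh and chmod 600 ~/.ssh/authorized_keys usually fixes it).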


Binding Spark to External Interface

If you want to connect to your spark-master from an external PC, you will probably need to make the following change. First check which interface the master is listening on:

lsof -i -P -n | grep LISTEN

You may notice that Spark is bound to 127.0.0.1:7077. A loopback binding won't accept external connections. To fix it, make sure /etc/hosts maps your hostname to the machine's network IP address rather than to 127.0.0.1:

127.0.0.1       localhost
192.168.2.31    spark1.lab.bpopp.net    spark1

And then in /usr/local/spark/conf/spark-env.sh, add:

export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
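
After restarting Spark, the lsof check can be repeated to confirm the change took effect. The sketch below wraps that check in a small filter; check_master_binding is a hypothetical name, and it assumes the usual lsof layout in which the address is the field just before "(LISTEN)".

```shell
# Hypothetical helper: read `lsof -i -P -n` output on stdin and report how a port is bound.
# Prints "loopback only", "external", or "not listening". Port defaults to 7077.
check_master_binding() {
    port="${1:-7077}"
    awk -v port="$port" '
        /LISTEN/ {
            # The address (e.g. 127.0.0.1:7077 or *:7077) is the field before "(LISTEN)"
            name = $(NF-1)
            n = split(name, parts, ":")
            if (parts[n] != port) next
            found = 1
            if (name ~ /^127\./ || name ~ /^\[::1\]/ || name ~ /^\[::ffff:127\./)
                loopback = 1
            else
                external = 1
        }
        END {
            if (!found) print "not listening"
            else if (external) print "external"
            else print "loopback only"
        }'
}

# Usage:
#   lsof -i -P -n | check_master_binding 7077
```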

Hadoop 3.3 and Java

Hadoop apparently doesn't like newer versions of Java. Lots of random errors and problems and I just got tired of fighting them. Downgrading to a very old version of the JRE (1.8) seemed to fix many issues.

I copied the folder to the /usr/local folder and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

export JAVA_HOME=/usr/local/jre1.8.0_202
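
To double-check which Java that JAVA_HOME actually points at, the version string can be reduced to a major version number. A small sketch follows; java_major is a made-up helper, and it only handles the two common formats (the legacy "1.8.0_202" style and the newer "11.0.2" style).

```shell
# Hypothetical helper: turn a Java version string into its major version number.
java_major() {
    v="$1"
    # Legacy scheme: 1.8.0_202 means Java 8, so drop the leading "1."
    case "$v" in
        1.*) v="${v#1.}" ;;
    esac
    # Keep everything before the first "." or "_"
    echo "${v%%[._]*}"
}

# Usage (java -version writes to stderr, hence 2>&1):
#   version="$("$JAVA_HOME/bin/java" -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
#   java_major "$version"
```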

Starting Spark

su spark
cd /usr/local/spark/sbin
./start-all.sh
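
Once the daemons are up, one way to confirm the master accepts jobs is to submit the SparkPi example that ships with Spark. This is a sketch under assumptions: it assumes the standalone master listens on port 7077 (the port seen in the binding section) and that the examples jar sits under examples/jars as in the stock binary distribution. spark_master_url is a made-up helper.

```shell
# Hypothetical helper: build the standalone master URL from a host and an optional port.
spark_master_url() {
    printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# Usage (as the spark user, on a machine with Spark installed):
#   /usr/local/spark/bin/spark-submit \
#       --master "$(spark_master_url spark1.lab.bpopp.net)" \
#       --class org.apache.spark.examples.SparkPi \
#       /usr/local/spark/examples/jars/spark-examples_*.jar 10
```

The matching stop-all.sh script in the same sbin directory shuts the master and workers down again.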

Hadoop Configuration

From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/datanode</value>
    </property>

    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>


From /usr/local/hadoop/etc/hadoop/core-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://spark1.lab.bpopp.net:9000</value>
    </property>
</configuration>
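
To confirm that Hadoop is actually picking up these files, the effective values can be read back with hdfs getconf. The wrapper below is only a sketch; check_conf is a hypothetical name and the message format is arbitrary.

```shell
# Hypothetical helper: compare the effective value of a Hadoop config key with an expected value.
check_conf() {
    key="$1"
    expected="$2"
    # hdfs getconf -confKey prints the value Hadoop resolves for the key
    actual="$(hdfs getconf -confKey "$key")" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "ok: $key=$actual"
    else
        echo "MISMATCH: $key=$actual (expected $expected)"
        return 1
    fi
}

# Usage (as the spark user, with /usr/local/hadoop/bin on the PATH):
#   check_conf fs.defaultFS hdfs://spark1.lab.bpopp.net:9000
#   check_conf dfs.replication 1
```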

Starting Hadoop

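Note that the namenode needs to be formatted once before the first startup or HDFS will not start. Formatting an existing namenode erases its filesystem metadata, so don't repeat it on later startups.

(assuming you are still the spark user)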

hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh
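
Because reformatting wipes the namenode metadata, it can be safer to guard the format step so it runs only on a fresh directory. A minimal sketch, relying on the fact that a formatted namenode directory contains a current/VERSION file; namenode_needs_format is a made-up helper name.

```shell
# Hypothetical helper: succeed if the given namenode directory has not been formatted yet.
# A formatted namenode directory contains a current/VERSION file.
namenode_needs_format() {
    [ ! -f "$1/current/VERSION" ]
}

# Usage, with the dfs.namenode.name.dir path from hdfs-site.xml:
#   if namenode_needs_format /home/spark/hdfs/namenode; then
#       hdfs namenode -format
#   fi
#   /usr/local/hadoop/sbin/start-all.sh
#   hdfs dfs -ls /        # quick check that HDFS answers
```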

Spark UI