Spark/Hadoop Cluster

 
= Getting Started =
 
This assumes the Spark/Hadoop cluster was configured in a particular way. The general configuration is visible on the Foreman page; in short, Spark is installed in /usr/local/spark and Hadoop in /usr/local/hadoop.
 
This is a [https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html good guide] for the general setup of a single-node cluster.
 
Once everything is up and running, these URLs should be available:
 
* Spark: [http://spark1.lab.bpopp.net:8080 http://spark1.lab.bpopp.net:8080]
* Hadoop: [http://spark1.lab.bpopp.net:9870 http://spark1.lab.bpopp.net:9870]
* Jupyter: [http://spark1.lab.bpopp.net:8889/ http://spark1.lab.bpopp.net:8889/]
* Zeppelin: [http://spark1.lab.bpopp.net:8890/ http://spark1.lab.bpopp.net:8890/]
 
== pyenv ==
 
Install dependencies
<pre>
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
</pre>
 
Install pyenv using pyenv-installer:
 
<pre>curl https://pyenv.run | bash</pre>
 
Add to ~/.profile
<pre>
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile</pre>
 
 
List versions
<pre>pyenv install --list</pre>
 
Install Version
<pre>pyenv install 3.10.13</pre>
 
= Passwordless SSH from Master =
 
The Spark master user must be able to ssh without a password both to itself (for a local worker) and to each worker. To enable this, log in as the spark user on the master server and generate a key pair:


<pre>ssh-keygen -t rsa -P ""</pre>
Then copy the public key to the master itself and to each worker:
<pre>ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net</pre>
= Binding Spark to External Interface =
If you want to connect to the Spark master from an external PC, you will probably need to make the following change. If you run
<pre>lsof -i -P -n | grep LISTEN</pre>
you may notice that Spark is bound to 127.0.0.1:7077, which does not accept external connections. To fix this, make sure /etc/hosts maps your hostname to the machine's network address rather than to the loopback address:
<pre>
127.0.0.1      localhost
192.168.2.31    spark1.lab.bpopp.net    spark1</pre>
And then in /usr/local/spark/conf/spark-env.sh, add:
<pre>
export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
</pre>
= Hadoop 3.3 and Java =
Hadoop 3.3 apparently doesn't like newer versions of Java. I hit lots of seemingly random errors and got tired of fighting them. Downgrading to a [https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html very old version of the JRE] (1.8) seemed to fix many of the issues.
I copied the JRE folder into /usr/local and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
<pre>export JAVA_HOME=/usr/local/jre1.8.0_202</pre>
= Starting Spark =
<pre>
su spark
cd /usr/local/spark/sbin
./start-all.sh
</pre>
= Hadoop Configuration =
From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
<pre>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/spark/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/spark/hdfs/datanode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
    <property>
      <name>dfs.permissions</name>
      <value>false</value>
    </property>
</configuration>
</pre>
From /usr/local/hadoop/etc/hadoop/core-site.xml:
<pre>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
<property>
      <name>fs.defaultFS</name>
      <value>hdfs://spark1.lab.bpopp.net:9000</value>
</property>
</configuration>
</pre>
= Starting Hadoop =
Note that the namenode must be formatted before the first startup or HDFS will not work. Formatting erases any existing HDFS metadata, so only do it once.
Still as the spark user:
<pre>
hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh</pre>
= Spark UI =

Latest revision as of 22:17, 3 February 2024

Getting Started

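This assumes the Spark/Hadoop cluster was configured in a particular way. The general configuration is visible on the Foreman page; in short, Spark is installed in /usr/local/spark and Hadoop in /usr/local/hadoop.

The Hadoop single-node setup guide (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html) is a good guide for the general setup of a single-node cluster.

Once everything is up and running, these URLs should be available:

Spark: http://spark1.lab.bpopp.net:8080
Hadoop: http://spark1.lab.bpopp.net:9870
Jupyter: http://spark1.lab.bpopp.net:8889/
Zeppelin: http://spark1.lab.bpopp.net:8890/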

pyenv

Install dependencies

sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

Install pyenv using pyenv-installer:

curl https://pyenv.run | bash

Add to ~/.profile

echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile


List versions

pyenv install --list

Install Version

pyenv install 3.10.13
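
Installing a version does not by itself make it the default interpreter. A minimal sketch of selecting it for the current user is below; the helper name set_default_python is hypothetical, and it assumes pyenv is already on the PATH as set up in ~/.profile above.

```shell
# Hypothetical helper: make an installed pyenv version the user's default.
set_default_python() {
    local version=$1
    if ! command -v pyenv >/dev/null 2>&1; then
        echo "pyenv is not on PATH; re-login or source ~/.profile" >&2
        return 1
    fi
    # 'pyenv versions --bare' lists installed versions, one per line.
    if ! pyenv versions --bare | grep -Fqx "$version"; then
        echo "Python $version is not installed; run: pyenv install $version" >&2
        return 1
    fi
    pyenv global "$version"   # writes the default version for this user
    pyenv version-name        # prints the version that is now active
}
```

For the version installed above, this would be called as set_default_python 3.10.13.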

Passwordless SSH from Master

The Spark master user must be able to ssh without a password both to itself (for a local worker) and to each worker. To enable this, log in as the spark user on the master server and generate a key pair:

ssh-keygen -t rsa -P ""

Once the key has been generated, the private key will be in /home/spark/.ssh/id_rsa and the public key in /home/spark/.ssh/id_rsa.pub (by default). Append the public key to the authorized_keys file to allow spark to ssh to itself:


cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Alternatively, use ssh-copy-id for localhost and for each worker:

ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net
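
It can be worth confirming that the key-based login actually works before starting Spark. The sketch below is one way to do that; the function name check_passwordless_ssh is hypothetical, and the spark user is assumed to exist on every host, as in the commands above.

```shell
# Hypothetical helper: report which hosts accept key-based login as spark.
check_passwordless_ssh() {
    local failed=0 host
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "spark@${host}" true 2>/dev/null; then
            echo "OK: ${host}"
        else
            echo "FAILED: ${host}"
            failed=1
        fi
    done
    return "$failed"
}
```

For this cluster the call would be check_passwordless_ssh localhost spark2.lab.bpopp.net. Because BatchMode disables every prompt, a host whose key has not been accepted yet will most likely also show up as FAILED.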


Binding Spark to External Interface

If you want to connect to the Spark master from an external PC, you will probably need to make the following change. If you run

lsof -i -P -n | grep LISTEN

you may notice that Spark is bound to 127.0.0.1:7077, which does not accept external connections. To fix this, make sure /etc/hosts maps your hostname to the machine's network address rather than to the loopback address:

127.0.0.1       localhost
192.168.2.31    spark1.lab.bpopp.net    spark1

And then in /usr/local/spark/conf/spark-env.sh, add:

export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
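
To check which address a hostname maps to in /etc/hosts without restarting Spark, something like the following can help. This is only a sketch: hosts_lookup is a hypothetical helper that reads a hosts file directly, so it ignores DNS and any other resolver sources.

```shell
# Hypothetical helper: print the first address a hosts file maps a name to.
hosts_lookup() {
    local hosts_file=$1 name=$2
    awk -v name="$name" '
        {
            sub(/#.*/, "")                 # drop comments
            for (i = 2; i <= NF; i++)      # field 1 is the address, the rest are names
                if ($i == name) { print $1; exit }
        }
    ' "$hosts_file"
}
```

With the hosts file shown above, hosts_lookup /etc/hosts spark1.lab.bpopp.net should print 192.168.2.31. If it prints an address starting with 127., Spark would most likely still bind to loopback.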

Hadoop 3.3 and Java

Hadoop 3.3 apparently doesn't like newer versions of Java. I hit lots of seemingly random errors and got tired of fighting them. Downgrading to a very old version of the JRE (1.8) seemed to fix many of the issues.

I copied the JRE folder into /usr/local and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

export JAVA_HOME=/usr/local/jre1.8.0_202
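
A quick sanity check is to confirm which Java major version a version string corresponds to, since Java 8 reports itself as 1.8 while later releases report 11, 17 and so on. The helper java_major below is hypothetical and only parses a version string; it does not inspect the JRE itself.

```shell
# Hypothetical helper: reduce a Java version string to its major version.
java_major() {
    local version=$1
    case $version in
        1.*) version=${version#1.} ;;   # legacy scheme: 1.8.0_202 -> 8.0_202
    esac
    echo "${version%%[._-]*}"           # keep everything before the first separator
}
```

The version string itself can be taken from the quoted part of the output of "$JAVA_HOME/bin/java" -version (which is written to stderr); for the JRE referenced above that string is expected to be 1.8.0_202, which java_major reduces to 8.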

Starting Spark

su spark
cd /usr/local/spark/sbin
./start-all.sh
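
start-all.sh returns before the master is necessarily ready, so a script that submits work right away may want to wait for the port first. A minimal sketch follows, assuming bash (it relies on bash's /dev/tcp pseudo-device); the function name wait_for_port is hypothetical.

```shell
# Hypothetical helper: wait until a TCP port accepts connections (bash only).
wait_for_port() {
    local host=$1 port=$2 tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        # Opening /dev/tcp/HOST/PORT attempts a TCP connection; the subshell
        # closes the descriptor again as soon as it exits.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, wait_for_port spark1.lab.bpopp.net 7077 waits up to roughly 30 seconds for the master port mentioned above; the same call with port 8080 checks the web UI.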

Hadoop Configuration

From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/home/spark/hdfs/namenode</value>
	</property>

	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/home/spark/hdfs/datanode</value>
	</property>

	<property>
		<name>dfs.webhdfs.enabled</name>
		<value>true</value>
	</property>


    <property>
       <name>dfs.permissions</name>
       <value>false</value>
    </property>


</configuration>


From /usr/local/hadoop/etc/hadoop/core-site.xml:


<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

<property>
      <name>fs.defaultFS</name>
      <value>hdfs://spark1.lab.bpopp.net:9000</value>
</property>


</configuration>
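
To confirm that Hadoop actually picks up these files, the effective value of a key can be read back with hdfs getconf. The wrapper below is a sketch; check_hadoop_conf is a hypothetical name, and it assumes the hdfs command (from /usr/local/hadoop/bin in this setup) is on the PATH.

```shell
# Hypothetical helper: compare an effective Hadoop setting with the expected value.
check_hadoop_conf() {
    local key=$1 expected=$2 actual
    actual=$(hdfs getconf -confKey "$key") || return 1
    if [ "$actual" = "$expected" ]; then
        echo "ok: ${key}=${actual}"
    else
        echo "mismatch: ${key}=${actual} (expected ${expected})"
        return 1
    fi
}
```

With the core-site.xml above, check_hadoop_conf fs.defaultFS hdfs://spark1.lab.bpopp.net:9000 should report ok.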

Starting Hadoop

Note that the namenode must be formatted before the first startup or HDFS will not work. Formatting erases any existing HDFS metadata, so only do it once.

Still as the spark user:

hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh
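
Since formatting wipes the namenode metadata, a startup script may want to skip it when the directory has already been formatted. The sketch below relies on the fact that a formatted namenode directory contains a current/VERSION file; format_namenode_once is a hypothetical helper name.

```shell
# Hypothetical helper: format the namenode only if it has not been formatted yet.
format_namenode_once() {
    local dir=$1
    if [ -f "${dir}/current/VERSION" ]; then
        echo "namenode at ${dir} is already formatted; skipping"
    else
        hdfs namenode -format
    fi
}
```

With the dfs.namenode.name.dir value from hdfs-site.xml above, the call would be format_namenode_once /home/spark/hdfs/namenode, run before start-all.sh.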

Spark UI