Spark/Hadoop Cluster
= Getting Started =

This assumes the Spark/Hadoop cluster was configured in a particular way. You can see the general configuration from the Foreman page, but in general, Spark was configured in the /usr/local/spark directory and Hadoop was installed to /usr/local/hadoop.

This is a [https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html good guide] for general setup of a single-node cluster.

Once everything is up and running, these URLs should be available:

* Spark: [http://spark1.lab.bpopp.net:8080 http://spark1.lab.bpopp.net:8080]
* Hadoop: [http://spark1.lab.bpopp.net:9870 http://spark1.lab.bpopp.net:9870]
* Jupyter: [http://spark1.lab.bpopp.net:8889/ http://spark1.lab.bpopp.net:8889/]
* Zeppelin: [http://spark1.lab.bpopp.net:8890/ http://spark1.lab.bpopp.net:8890/]
== pyenv ==

Install dependencies:

<pre>
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
</pre>
Install pyenv using pyenv-installer:

<pre>curl https://pyenv.run | bash</pre>
Add to ~/.profile:

<pre>
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile
</pre>
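To pick up the new settings without logging out, restart the login shell and confirm pyenv is on the PATH (a quick sanity check; assumes a bash login shell that reads ~/.profile):

<pre>
# re-read ~/.profile by restarting the login shell
exec "$SHELL" -l

# should print the pyenv version if the PATH changes took effect
pyenv --version
</pre>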
List available versions:

<pre>pyenv install --list</pre>
Install a version:

<pre>pyenv install 3.10.13</pre>
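Once the build finishes, the new interpreter still has to be selected before anything will use it. A minimal sketch, assuming the spark user should default to 3.10.13 and that Spark should run Python jobs with the same interpreter (the PYSPARK_PYTHON line can go in spark-env.sh):

<pre>
# make 3.10.13 the default python for this user
pyenv global 3.10.13
python --version

# optional: point Spark at the pyenv interpreter
export PYSPARK_PYTHON="$HOME/.pyenv/versions/3.10.13/bin/python"
</pre>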
= Passwordless SSH from Master =

To allow the spark user on the master to ssh to itself (for a local worker) and also to the workers, passwordless SSH needs to be enabled. This can be done by logging into the spark user on the master server and doing:
<pre>ssh-keygen -t rsa -P ""</pre>
Once the key has been generated, it will be in /home/spark/.ssh/id_rsa (by default). Copy it to the authorized_keys file (to allow spark to ssh to itself):

<pre>cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys</pre>

Or, for each worker, do something like:

<pre>ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net</pre>
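A quick way to confirm the keys are working (hostnames match the examples above; adjust for your own workers):

<pre>
# each command should print the remote hostname without prompting for a password
ssh spark@localhost hostname
ssh spark@spark2.lab.bpopp.net hostname
</pre>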
= Binding Spark to External Interface =

If you want to be able to connect to your spark-master from an external PC, you will probably need to make the following change. If you run:

<pre>lsof -i -P -n | grep LISTEN</pre>

you may notice that Spark is binding to 127.0.0.1:7077, which won't accept external connections. To fix it, first make sure the /etc/hosts file maps your hostname to the machine's LAN address:
<pre>
127.0.0.1      localhost
192.168.2.31   spark1.lab.bpopp.net spark1
</pre>
And then in /usr/local/spark/conf/spark-env.sh, add:

<pre>
export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
</pre>
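After restarting the master, the same lsof check should show port 7077 bound to the LAN address instead of 127.0.0.1. A rough check, using the hostname from the configuration above (nc is just a quick reachability test from another PC):

<pre>
# on the master: 7077 should no longer be bound to 127.0.0.1
lsof -i -P -n | grep 7077

# from an external PC: the master port should now accept connections
nc -vz spark1.lab.bpopp.net 7077
</pre>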
= Hadoop 3.3 and Java =

Hadoop apparently doesn't like newer versions of Java. I ran into lots of random errors and problems and eventually got tired of fighting them. Downgrading to a [https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html very old version of the JRE] (1.8) seemed to fix many of the issues.

I copied the JRE folder to /usr/local and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:

<pre>export JAVA_HOME=/usr/local/jre1.8.0_202</pre>
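A quick sanity check that the downgraded runtime is the one Hadoop will pick up (paths match the install location above):

<pre>
# confirm the 1.8 runtime works
/usr/local/jre1.8.0_202/bin/java -version

# confirm hadoop-env.sh points at it
grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh
</pre>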
= Starting Spark =

<pre>
su spark
cd /usr/local/spark/sbin
./start-all.sh
</pre>
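To confirm the master and worker came up, jps should list a Master and a Worker process, and the bundled SparkPi example can be submitted against the cluster (the examples jar path follows the stock Spark distribution layout; the exact version suffix will vary):

<pre>
jps

# run the bundled SparkPi example against the standalone master
/usr/local/spark/bin/spark-submit \
  --master spark://spark1.lab.bpopp.net:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_*.jar 100
</pre>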
= Hadoop Configuration =

From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:

<pre>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/datanode</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
</pre>
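The namenode and datanode directories referenced above need to exist and be writable by the spark user before the namenode is formatted; as the spark user:

<pre>
# create the HDFS storage directories from hdfs-site.xml
mkdir -p /home/spark/hdfs/namenode /home/spark/hdfs/datanode
</pre>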
From /usr/local/hadoop/etc/hadoop/core-site.xml:

<pre>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://spark1.lab.bpopp.net:9000</value>
    </property>
</configuration>
</pre>
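Once both files are in place, Hadoop's own tooling can confirm which values it will actually use (run as the spark user with /usr/local/hadoop/bin on the PATH):

<pre>
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication
</pre>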
= Starting Hadoop =

Note that the namenode needs to be formatted prior to startup or it will not work.

Still as the spark user:

<pre>
hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh
</pre>
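Once the daemons are up, jps should list NameNode, DataNode, and SecondaryNameNode (plus the YARN daemons, since start-all.sh starts those too), and a small HDFS smoke test should work:

<pre>
jps

# write a test file into HDFS and read it back
echo "hello" > /tmp/hello.txt
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -put /tmp/hello.txt /tmp/smoke/
hdfs dfs -cat /tmp/smoke/hello.txt
</pre>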
= Spark UI = |