Spark/Hadoop Cluster
Getting Started
This assumes the Spark/Hadoop cluster was configured in a particular way. You can see the general configuration on the Foreman page, but in short, Spark was installed to /usr/local/spark and Hadoop to /usr/local/hadoop.
This is a good guide for the general setup of a single-node cluster.
Once everything is up and running, these URLs should be available:
- Spark: http://spark1.lab.bpopp.net:8080
- Hadoop: http://spark1.lab.bpopp.net:9870
- Jupyter: http://spark1.lab.bpopp.net:8889/
- Zeppelin: http://spark1.lab.bpopp.net:8890/
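If you want a quick sanity check that the web UIs are responding, a short curl loop works. This is just a sketch using the hostnames and ports listed above:

for url in http://spark1.lab.bpopp.net:8080 http://spark1.lab.bpopp.net:9870 \
    http://spark1.lab.bpopp.net:8889 http://spark1.lab.bpopp.net:8890; do
    # Print the HTTP status each UI returns (200 means it is up)
    curl -s -o /dev/null -w "$url -> %{http_code}\n" "$url"
done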
pyenv
Install dependencies:
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
Install pyenv using pyenv-installer:
curl https://pyenv.run | bash
Add the following to ~/.profile:
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile
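These lines won't take effect in the current shell until the profile is re-read. Re-sourcing it (or logging out and back in) and checking the version is a quick sanity check:

# Reload the profile so the PYENV_ROOT/PATH changes apply to this shell
source ~/.profile
# Confirm pyenv is now on the PATH
pyenv --version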
List the available versions:
pyenv install --list
Install a version:
pyenv install 3.10.13
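After the build finishes, something like the following makes the new interpreter the default for this user and verifies it (assuming 3.10.13 installed cleanly):

# Make 3.10.13 the default Python for this user
pyenv global 3.10.13
# Verify which interpreter pyenv now resolves to
pyenv version
python --version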
Passwordless SSH from Master
To allow the spark user on the master to SSH to itself (for a local worker) and to the workers, passwordless SSH needs to be enabled. This can be done by logging in as the spark user on the master server and running:
ssh-keygen -t rsa -P ""
Once the key has been generated, it will be in /home/spark/.ssh/id_rsa (by default). Append the public key to the authorized_keys file (to allow spark to SSH to itself):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Or, for each worker, do something like:
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@localhost
ssh-copy-id -i ~/.ssh/id_rsa.pub spark@spark2.lab.bpopp.net
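A quick way to confirm the keys took is to SSH to each host and check that no password prompt appears; the hostnames below are the ones used elsewhere on this page:

# Both of these should print the remote hostname without asking for a password
ssh spark@localhost hostname
ssh spark@spark2.lab.bpopp.net hostname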
Binding Spark to External Interface
If you want to be able to connect to the Spark master from an external PC, you will probably need to make the following change. Check what Spark is listening on with:
lsof -i -P -n | grep LISTEN
You may notice that Spark is binding to 127.0.0.1:7077, which won't allow external connections. To fix it, make sure the /etc/hosts file maps your hostname to the machine's external IP address:
127.0.0.1       localhost
192.168.2.31    spark1.lab.bpopp.net spark1
And then in /usr/local/spark/conf/spark-env.sh, add:
export SPARK_LOCAL_IP=spark1.lab.bpopp.net
export SPARK_MASTER_HOST=spark1.lab.bpopp.net
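After restarting the master, re-running the lsof check should show port 7077 bound to the external address rather than loopback. A sketch of what to look for:

# Port 7077 should now be listening on 192.168.2.31 (or the hostname), not 127.0.0.1
lsof -i -P -n | grep 7077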
Hadoop 3.3 and Java
Hadoop apparently doesn't like newer versions of Java. I ran into lots of random errors and problems and eventually got tired of fighting them. Downgrading to a very old version of the JRE (1.8) seemed to fix most of the issues.
I copied the extracted JRE folder to /usr/local and then referenced it from /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/local/jre1.8.0_202
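To confirm Hadoop will pick up the right Java, checking the JRE directly is a quick sanity check (assuming the path above):

# Should report java version "1.8.0_202"
/usr/local/jre1.8.0_202/bin/java -version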
Starting Spark
su spark
cd /usr/local/spark/sbin
./start-all.sh
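Once the scripts finish, the master (and any local worker) JVMs should be running, and the master UI at http://spark1.lab.bpopp.net:8080 should list the workers. A rough way to check from the shell:

# The standalone Master and Worker processes should show up in the process list
ps aux | grep -E 'spark.deploy.(master.Master|worker.Worker)' | grep -v grep
# The master web UI should mention the registered workers
curl -s http://spark1.lab.bpopp.net:8080 | grep -ci worker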
Hadoop Configuration
From /usr/local/hadoop/etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/spark/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/spark/hdfs/datanode</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
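The namenode and datanode directories referenced above need to exist and be writable, or the daemons will fail at startup. A sketch, assuming the paths from hdfs-site.xml and running as the spark user:

# Create the local directories HDFS will store its metadata and blocks in
mkdir -p /home/spark/hdfs/namenode /home/spark/hdfs/datanode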
From /usr/local/hadoop/etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://spark1.lab.bpopp.net:9000</value>
    </property>
</configuration>
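fs.defaultFS is also the URI clients use to reach HDFS. Once Hadoop is up, a path can be referenced either by the full URI or by a bare path that resolves against the default; a quick example using the hdfs CLI:

# These two should list the same directory once the namenode is running
hdfs dfs -ls hdfs://spark1.lab.bpopp.net:9000/
hdfs dfs -ls /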
Starting Hadoop
Note that the namenode needs to be formatted prior to startup or it will not work.
(still assuming the spark user)
hdfs namenode -format
cd /usr/local/hadoop/sbin
./start-all.sh
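Once the daemons are up, the NameNode UI at http://spark1.lab.bpopp.net:9870 should respond, and a quick write/read confirms the datanode registered. A minimal sketch (the test file name is arbitrary):

# Report overall HDFS health and the number of live datanodes
hdfs dfsadmin -report
# Write a small test file from stdin, read it back, then clean up
echo "hello" | hdfs dfs -put - /smoke-test.txt
hdfs dfs -cat /smoke-test.txt
hdfs dfs -rm /smoke-test.txt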