Single Node Apache Spark, Hive and Hadoop Installation

Step-by-step single-node installation of Apache Hadoop and Apache Spark with a MySQL-backed Hive metastore

1) Download and Install SSH

sudo apt-get install ssh
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
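
A quick check that passwordless SSH to localhost works (the Hadoop start scripts rely on it); accept the host key fingerprint if prompted:

ssh localhost
exit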

2) Download and Install Oracle Java 8

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
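
To verify the installation and note the install path (the oracle-java8-installer package places Java under /usr/lib/jvm/java-8-oracle, which is used as JAVA_HOME in section 7):

java -version
ls /usr/lib/jvm/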

3) Download and Install MySQL Server

sudo apt-get install mysql-server
mysql -u root -p
CREATE DATABASE metastore;
USE metastore;

SELECT * from information_schema.user_privileges;

# To also allow remote connections for this user, run the statement below
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
FLUSH PRIVILEGES;

SELECT * from information_schema.user_privileges;

quit;

Now try to connect to MySQL to make sure connectivity is set up properly

telnet 127.0.0.1 3306

Edit the /etc/mysql/mysql.conf.d/mysqld.cnf file to allow remote connections to MySQL

Change line
bind-address = 127.0.0.1
to
#bind-address = 127.0.0.1

Restart MySQL (sudo service mysql restart) so the change takes effect, then use telnet again to verify

telnet xx.xx.xx.xx 3306

4) Download and Configure Apache Hadoop

Download the Hadoop binary tarball from http://hadoop.apache.org/releases.html
Extract the tarball and move the extracted folder to /usr/local using the command below; also change the permissions and owner of this folder. The other folders mentioned below also need to be created.

sudo mv /home/spark/Desktop/hadoop /usr/local/

sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
sudo mkdir -p /usr/local/hadooptmp

sudo chmod -R 775 /usr/local/hadoop_store/hdfs/namenode
sudo chmod -R 775 /usr/local/hadoop_store/hdfs/datanode
sudo chmod -R 775 /usr/local/hadooptmp
sudo chmod -R 775 /usr/local/hadoop

sudo chown spark:spark -R /usr/local/hadoop_store
sudo chown spark:spark -R /usr/local/hadooptmp
sudo chown spark:spark -R /usr/local/hadoop

Create/edit the files below, located at /usr/local/hadoop/etc/hadoop
hdfs-site.xml

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
<property>
   <name>dfs.namenode.name.dir</name>
   <value>file:///usr/local/hadoop_store/hdfs/namenode</value>
 </property>
 <property>
   <name>dfs.datanode.data.dir</name>
   <value>file:///usr/local/hadoop_store/hdfs/datanode</value>
 </property>
</configuration>

core-site.xml

<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadooptmp</value>
 </property>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
 </property>
</configuration>
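
Hadoop also needs JAVA_HOME set in its own environment file. A minimal addition to /usr/local/hadoop/etc/hadoop/hadoop-env.sh, assuming the Oracle Java path from step 2:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle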

5) Download and Configure Apache Hive

Download the Hive binary tarball from https://hive.apache.org/downloads.html
Extract the tarball and move the extracted folder to /usr/local using the commands below, and also change the permissions for this folder

sudo mv /home/spark/Desktop/hive /usr/local/
sudo chmod -R 775 /usr/local/hive

# Create Hive directories within HDFS. The 'warehouse' directory is where Hive stores its tables and data.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp

# These commands give write permission to the group:
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp

gedit conf/hive-env.sh

# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=/usr/local/hadoop

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/usr/local/hive/conf
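
Before initializing the schema, the MySQL JDBC driver has to be on Hive's classpath. A minimal sketch, assuming the connector jar was downloaded to the Desktop (the jar name and path are placeholders and vary by version):

cp /home/spark/Desktop/mysql-connector-java.jar /usr/local/hive/lib/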

# Run this after hive-site.xml below is configured, to initialize the metastore schema
bin/schematool -initSchema -dbType mysql

# Set environment variables as shown in section 7

Now configure conf/hive-site.xml as shown below

<configuration>
   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://127.0.0.1:3306/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>
  <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
      <description>user name for connecting to mysql server</description>
   </property>
   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
      <description>password for connecting to mysql server</description>
   </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <!-- for an HDFS path -->
    <value>/user/hive/warehouse</value>
    <!-- for a local path use instead: <value>file:///media/sf_spark-datawarehouse</value> -->
    <description>Hive warehouse directory</description>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NOSASL</value>
  </property>
</configuration>

Also note that the warehouse path specified here is saved in the metastore database and is used going forward as the default warehouse location in both Hive and Spark
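
The connection settings above reference a hiveuser account that the MySQL step does not create. A minimal sketch for creating it, with the user name, password and database taken from the values above:

mysql -u root -p -e "CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword'; GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost'; FLUSH PRIVILEGES;"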

For hive thrift server authentication see https://www.simba.com/products/Spark/doc/JDBC_InstallGuide/content/jdbc/hi/authoptions.htm

# Make sure you can start Hive using the command below
hive --hiveconf hive.root.logger=DEBUG,console

# Run this command to create a database and verify that it is created at the warehouse location
CREATE DATABASE hivetest;

# Now log in to MySQL and inspect the metastore tables
mysql -u root -p
SELECT * FROM metastore.DBS;
SELECT * FROM metastore.VERSION;
SELECT * FROM metastore.GLOBAL_PRIVS;
SELECT * FROM metastore.ROLES;

6) Download and Configure Apache Spark

Download the Spark binary tarball from http://spark.apache.org/downloads.html
Extract the tarball and move the extracted folder to /usr/local using the commands below, and also change the permissions for this folder

sudo mv /home/spark/Desktop/spark /usr/local/
sudo chmod -R 775 /usr/local/spark

Copy the file hive-site.xml into the /usr/local/spark/conf directory
Download the SQL Server and MySQL JDBC drivers and move these jars into the /usr/local/spark/jars folder
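
A sketch of these copy steps, assuming the driver jars were downloaded to the Desktop (the jar file names are placeholders and vary by version):

cp /usr/local/hive/conf/hive-site.xml /usr/local/spark/conf/
cp /home/spark/Desktop/mysql-connector-java.jar /usr/local/spark/jars/
cp /home/spark/Desktop/sqljdbc4.jar /usr/local/spark/jars/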

7) Set ENVIRONMENT variables

gedit ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HIVE_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$SPARK_HOME/bin
#HADOOP VARIABLES END
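
Reload the shell configuration and do a quick sanity check that the binaries are on the PATH:

source ~/.bashrc
hadoop version
hive --version
spark-submit --version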

8) Format and Start HDFS Cluster

hdfs namenode -format
start-dfs.sh
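
To verify that HDFS came up, jps should list NameNode, DataNode and SecondaryNameNode, and the report should show one live datanode:

jps
hdfs dfsadmin -report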

9) Start SPARK Cluster

spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker spark://your-ip-address:7077 --cores 12 --memory 10G
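
Alternatively, the bundled sbin scripts start the same master and worker as background daemons. A sketch using the script names shipped with Spark 2.x (start-slave.sh was renamed start-worker.sh in Spark 3.x); adjust the master URL to your host:

/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-slave.sh spark://your-ip-address:7077 --cores 12 --memory 10G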

10) Run Application SPARK SHELL

spark-shell --executor-memory 3G --driver-memory 2G --master spark://your-ip-address:7077 --conf spark.cores.max=5

11) Monitoring Process, Memory

https://askubuntu.com/questions/9642/how-can-i-monitor-the-memory-usage

jps | sort
ps aux | grep java | grep -v grep
netstat -plten | grep java | sort -k9
htop
watch ps_mem.py #provided in the link above

12) Download and Install Zeppelin

https://zeppelin.apache.org/
Edit the file /usr/local/zeppelin/conf/zeppelin-env.sh

export SPARK_HOME=/usr/local/spark
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export MASTER=spark://your-ip-address:7077
export ZEPPELIN_INTP_CLASSPATH_OVERRIDES=/usr/local/hive/conf
export SPARK_SUBMIT_OPTIONS="--jars /usr/local/hadoopjars/mysql-connector-java.jar,/usr/local/hadoopjars/sqljdbc4.jar"

Edit the file /usr/local/zeppelin/conf/zeppelin-site.xml and change the port to something other than 8080, since the Spark master web UI already uses 8080

<configuration>
  <property>
    <name>zeppelin.server.port</name>
    <value>8082</value>
    <description>Server port.</description>
  </property>
</configuration>

Make sure all the shell scripts in the bin directory are executable with proper permissions

 sudo chmod -R 775 /usr/local/zeppelin/bin
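
Zeppelin can then be started and stopped with its daemon script; with the port configured above, the UI is available at http://localhost:8082

/usr/local/zeppelin/bin/zeppelin-daemon.sh start
/usr/local/zeppelin/bin/zeppelin-daemon.sh stop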

Run the interpreter and Zeppelin separately on Windows
https://issues.apache.org/jira/browse/ZEPPELIN-1584
.\bin\interpreter.cmd -d E:\zeppelin\interpreter\spark\ -p 1234 -l E:\zeppelin\local-repo\2C9VKCB5N

13) Running Spark Hive Thrift Server

/usr/local/spark/sbin/start-thriftserver.sh --conf spark.sql.warehouse.dir=file:///media/sf_spark-datawarehouse \
  --conf hive.server2.authentication=NOSASL --conf hive.server2.thrift.bind.host=10.0.2.15 \
  --conf hive.server2.thrift.port=10000 --conf spark.sql.shuffle.partitions=4 \
  --conf spark.default.parallelism=4 --conf spark.driver.maxResultSize=1G \
  --conf spark.scheduler.mode=FAIR --conf spark.cores.max=16 \
  --executor-memory 6G --driver-memory 2G --driver-cores 1 --master spark://10.0.2.15:7077
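
Once the thrift server is up, beeline (shipped with both Hive and Spark) can be used to verify the connection; the auth=noSasl URL parameter matches the NOSASL setting above, and you may also need to pass a user name with -n:

beeline -u "jdbc:hive2://10.0.2.15:10000/default;auth=noSasl"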

14) Create passwordless public/private key authentication

1) Create a public key and private key using PuTTYgen
2) Open the public key file and convert the multi-line key to a single line
3) Append this public key to ~/.ssh/authorized_keys using the command below
echo "###copied-public-key###" >> ~/.ssh/authorized_keys