Running Apache Spark Application on Cloudera Quickstart CDH5

#####==== Running 1st Spark application in Cloudera CDH5 ====######
CDH5 Shared Folder :
https://www.youtube.com/watch?v=5Ijhj2IcdFQ&list=LL5ex7s65w00iV4JtmIC-ePQ&index=3&t=0s
su (password: cloudera)
mount -t vboxsf Shared_Folder_CDH5_Windows10 /home/cloudera/shared_folder_cdh5_windows
Add dependencies in build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.3" % "provided"
1. In Local mode
Set the application input parameters in
Edit Configurations
Program arguments (separated by spaces)
Run the application
2. In Master/Slave mode
Install Scala
Install SBT
Add the bin directories of SBT and Scala to the PATH variable in the .bashrc file of both the root user and the cloudera user.
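A minimal sketch of the lines that could go into .bashrc for this step. The install locations /usr/local/sbt and /usr/local/scala are assumptions, not paths taken from this setup; replace them with wherever SBT and Scala were actually unpacked. The helper prepend_to_path is a hypothetical name introduced here for illustration.

```shell
# Hypothetical install locations; adjust to your own SBT and Scala directories.
SBT_HOME="/usr/local/sbt"
SCALA_HOME="/usr/local/scala"

# Prepend a directory to PATH only if it is not already listed,
# so that re-sourcing .bashrc does not keep growing PATH.
prepend_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
}

prepend_to_path "$SBT_HOME/bin"
prepend_to_path "$SCALA_HOME/bin"
export PATH
```

After editing .bashrc, open a new terminal (or run source ~/.bashrc) so the change takes effect.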
Once the code is written in IntelliJ, follow the steps below:
open a terminal
go to the project path
cd /home/cloudera/IdeaProjects/CFamilySparkSample/
compile
sbt compile
package
sbt package
now check this directory for the created JAR file: /home/cloudera/IdeaProjects/CFamilySparkSample/target/scala-2.12/
copy the JAR file to a desired path in the local file system (LFS)
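The copy step above can be sketched as a small shell helper that picks the most recently built JAR from the sbt output directory. The function name copy_latest_jar is hypothetical, and the sketch assumes the JAR file names contain no spaces or newlines.

```shell
# Copy the most recently modified JAR from an sbt target directory
# to a destination directory (created if it does not exist).
copy_latest_jar() {
  src_dir="$1"
  dest_dir="$2"
  # ls -t sorts by modification time, newest first
  jar="$(ls -t "$src_dir"/*.jar 2>/dev/null | head -n 1)"
  if [ -z "$jar" ]; then
    echo "no JAR found in $src_dir" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$jar" "$dest_dir"/ && echo "copied $(basename "$jar")"
}

# Example call, to be run after the package step has produced a JAR:
#   copy_latest_jar /home/cloudera/IdeaProjects/CFamilySparkSample/target/scala-2.12 /home/cloudera/Desktop
```

Choosing the newest JAR avoids hard-coding the artifact name, which changes whenever the project name or version in build.sbt changes.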
Now, start Spark master and slaves and then perform spark-submit
======== Start Spark Master Machine ========
cd /usr/lib/spark/sbin/
sudo ./start-master.sh
Now, visit Spark UI
<CDH5_ip_address>:18080
That page should show the Spark master URL, which looks something like:
spark://quickstart.cloudera:7077
======== Start Spark Slave Machines ========
sudo ./start-slave.sh spark://quickstart.cloudera:7077
Now visit the Spark UI again and check that the worker has started
<CDH5_ip_address>:18080
======== Submit spark application to master ========
spark-submit \
--class FirstSparkClass \
file:///home/cloudera/Desktop/cfamilysparksample_2.12-0.1.jar \
spark://quickstart.cloudera:7077 \
OlympicsDataAnalytics \
/home/cloudera/Spark_Datasets/Spark_Input_Datasets/olympic_games.csv \
/home/cloudera/Spark_Datasets/Spark_Input_Datasets/Output1/
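If the code fails with the error "java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$",
then use SparkContext in the code instead of SparkSession. CDH5 ships Spark 1.6, which predates SparkSession (introduced in Spark 2.0), while the code is built against Spark 2.4.3 (2.12 is the Scala version, not the Spark version).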
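The version rule behind this error can be sketched as a small shell check: SparkSession exists only from Spark 2.0 onward. The function name has_spark_session is hypothetical, and the two version strings in the loop are illustrative values standing for the cluster's Spark 1.6 line and the 2.4.3 dependency in build.sbt.

```shell
# Succeeds (exit status 0) when the given Spark version provides SparkSession,
# which was introduced in Spark 2.0.
has_spark_session() {
  major="${1%%.*}"          # text before the first dot, e.g. "1" from "1.6.0"
  [ "$major" -ge 2 ]
}

# Compare an illustrative cluster version with the version used in build.sbt.
for v in 1.6.0 2.4.3; do
  if has_spark_session "$v"; then
    echo "Spark $v: SparkSession available"
  else
    echo "Spark $v: use SparkContext / SQLContext"
  fi
done
```

The version installed on the cluster is shown in the banner printed by spark-submit --version.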