Sarthak Khore
2 min readMay 29, 2021

Running Apache Spark Application on Cloudera Quickstart CDH5

#####==== Running 1st Spark application in Cloudera CDH5 ====######

CDH5 Shared Folder :
https://www.youtube.com/watch?v=5Ijhj2IcdFQ&list=LL5ex7s65w00iV4JtmIC-ePQ&index=3&t=0s

su (password : cloudera)
mount -t vboxsf Shared_Folder_CDH5_Windows10 /home/cloudera/shared_folder_cdh5_windows

Add dependencies in build.sbt

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"

// https://mvnrepository.com/artifact/org.apache.spark/spark-hive
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.3" % "provided"

1. In Local mode

Set the application input parameters in IntelliJ:
Edit Configurations
Program arguments (separated by spaces)
Run the application

2. In Master/Slave mode

Install Scala
Install SBT
Add the bin directories of SBT and Scala to the PATH variable in the .bashrc file of both the root user and the cloudera user.
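
A minimal sketch of the PATH step, assuming Scala and SBT were unpacked under /usr/local (these install locations are hypothetical, so substitute your own). The helper function add_bin_to_path is not part of any tool; it only appends the export line when it is not already present, so it can be run repeatedly without cluttering .bashrc.

```shell
# Append a PATH export for a tool's bin directory to an rc file, only once.
# Usage: add_bin_to_path <bin_dir> <rc_file>
add_bin_to_path() {
  local bin_dir="$1" rc_file="$2"
  local line="export PATH=\"\$PATH:${bin_dir}\""
  # -x matches the whole line, -F treats the pattern as a fixed string.
  if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Example calls (the /usr/local paths are assumptions, adjust to your setup):
# add_bin_to_path /usr/local/scala/bin /home/cloudera/.bashrc
# add_bin_to_path /usr/local/sbt/bin   /home/cloudera/.bashrc
# add_bin_to_path /usr/local/scala/bin /root/.bashrc
# add_bin_to_path /usr/local/sbt/bin   /root/.bashrc
```

Open a new terminal (or run source ~/.bashrc) afterwards so the updated PATH takes effect.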

Once the code is written in IntelliJ, follow the steps below:
open a terminal
go to the project path
cd /home/cloudera/IdeaProjects/CFamilySparkSample/
compile (sbt compile)
package (sbt package)
now check the directory for the created JAR file : /home/cloudera/IdeaProjects/CFamilySparkSample/target/scala-2.12/
copy the JAR file to a desired path in the local file system (LFS)
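
The build steps above can be sketched as a short script. The helper expected_jar_path is hypothetical; it relies on sbt's usual naming of packaged artifacts (lower-cased project name, Scala binary version, project version) and simplifies the name normalization to lower-casing only. The version 0.1 is taken from the JAR name used later in this post, and the Desktop destination is just one possible target.

```shell
# Predict the JAR path that `sbt package` writes, given the project directory,
# project name, Scala binary version and project version.
expected_jar_path() {
  local project_dir="$1" name="$2" scala_bin="$3" version="$4"
  local normalized
  # Simplification: only lower-case the project name.
  normalized="$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
  printf '%s/target/scala-%s/%s_%s-%s.jar\n' \
    "$project_dir" "$scala_bin" "$normalized" "$scala_bin" "$version"
}

# Prints:
# /home/cloudera/IdeaProjects/CFamilySparkSample/target/scala-2.12/cfamilysparksample_2.12-0.1.jar
expected_jar_path /home/cloudera/IdeaProjects/CFamilySparkSample CFamilySparkSample 2.12 0.1

# Typical build sequence (commented out because it needs sbt and the project):
# cd /home/cloudera/IdeaProjects/CFamilySparkSample
# sbt compile
# sbt package
# cp "$(expected_jar_path "$PWD" CFamilySparkSample 2.12 0.1)" /home/cloudera/Desktop/
```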

Now, start Spark master and slaves and then perform spark-submit

======== Start Spark Master Machine ========

cd /usr/lib/spark/sbin/
sudo ./start-master.sh

Now, visit Spark UI
<CDH5_ip_address>:18080

On that web page, spark master URL must be shown :
spark://quickstart.cloudera:7077 (something like this)

======== Start Spark Slave Machines ========

sudo ./start-slave.sh spark://quickstart.cloudera:7077

Now, again visit Spark UI and check if worker is started
<CDH5_ip_address>:18080

======== Submit spark application to master ========

spark-submit \
--class FirstSparkClass \
file:///home/cloudera/Desktop/cfamilysparksample_2.12-0.1.jar \
spark://quickstart.cloudera:7077 \
OlympicsDataAnalytics \
/home/cloudera/Spark_Datasets/Spark_Input_Datasets/olympic_games.csv \
/home/cloudera/Spark_Datasets/Spark_Input_Datasets/Output1/
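
A note on argument order: spark-submit treats everything after the JAR path as arguments to the application's main method. In the command above the master URL comes after the JAR, so it reaches the application as its first argument, presumably to be read by the code when it configures Spark. The sketch below uses a hypothetical helper, build_submit_cmd, that only prints the command (a dry run), so a mistyped flag or path can be spotted before submitting.

```shell
# Print the spark-submit command for a main class, a JAR and application arguments.
build_submit_cmd() {
  local main_class="$1" jar="$2"
  shift 2
  printf 'spark-submit --class %s %s' "$main_class" "$jar"
  # Everything after the JAR is handed to the application unchanged.
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Dry run with the values used in this post.
build_submit_cmd FirstSparkClass \
  file:///home/cloudera/Desktop/cfamilysparksample_2.12-0.1.jar \
  spark://quickstart.cloudera:7077 \
  OlympicsDataAnalytics \
  /home/cloudera/Spark_Datasets/Spark_Input_Datasets/olympic_games.csv \
  /home/cloudera/Spark_Datasets/Spark_Input_Datasets/Output1/
```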

If the code fails with the error "java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$",
use SparkContext in the code instead of SparkSession. CDH5 ships Spark 1.6, which has no SparkSession (it was introduced in Spark 2.0), while the code is built against Spark 2.4.3 and Scala 2.12.
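
To confirm which Spark version the VM actually runs, spark-submit --version prints it. The small check below is a sketch around that idea: has_spark_session is a hypothetical helper that takes a version string and succeeds only for Spark 2.0 or later, the first release that provides SparkSession.

```shell
# Succeed (exit status 0) if the given Spark version provides SparkSession,
# which first appeared in Spark 2.0.
has_spark_session() {
  # Keep only the major version, e.g. "1.6.0" -> "1".
  local major="${1%%.*}"
  [ "$major" -ge 2 ]
}

# Example: read the version from `spark-submit --version` and pass it in by hand.
# spark-submit --version
# has_spark_session 1.6.0 || echo "Spark 1.x: use SparkContext instead of SparkSession"
```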
