Problem Scenario 55: You have been given the below code snippet.
val pairRDD1 = sc.parallelize(List(("cat", 2), ("cat", 5), ("book", 4), ("cat", 12)))
val pairRDD2 = sc.parallelize(List(("cat", 2), ("cup", 5), ("mouse", 4), ("cat", 12)))
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)),
(mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))),
(cat,(Some(2),Some(12))), (cat,(Some(5),Some(2))), (cat,(Some(5),Some(12))),
(cat,(Some(12),Some(2))), (cat,(Some(12),Some(12))))
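The expected output pairs every combination of values for keys present in both RDDs and wraps a missing side in None, which is the behavior of a full outer join. A minimal sketch of one possible operation1:
// Full outer join: keeps keys from both RDDs; each side's value is an
// Option, with None where the key is absent from that RDD.
val operation1 = pairRDD1.fullOuterJoin(pairRDD2).collect()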
Problem Scenario 5: You have been given the following MySQL database details.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. List all the tables in retail_db using a sqoop command.
2. Write a simple sqoop eval command to check whether you have permission to read the database tables.
3. Import all the tables as Avro files into /user/hive/warehouse/retail_cca174.db.
4. Import the departments table as a text file into /user/cloudera/departments. (Possible commands for all four steps are sketched below.)
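A possible set of commands (a sketch; the table queried by eval is an illustrative choice):
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera
sqoop eval --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --query "SELECT count(1) FROM departments"
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --as-avrodatafile --warehouse-dir=/user/hive/warehouse/retail_cca174.db
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table departments --as-textfile --target-dir=/user/cloudera/departments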
Problem Scenario 52: You have been given the below code snippet.
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
Operation_xyz
Write a correct code snippet for Operation_xyz which will produce the below output.
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
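The result is an immutable Map of element to occurrence count returned to the driver, which matches Spark's countByValue action. A likely Operation_xyz:
// Counts how many times each element appears and returns the
// result to the driver as a Map[Int, Long].
b.countByValue()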
Problem Scenario 70: Write a Spark application in Python which reads a file "Content.txt" (on HDFS) with the following content, does a word count, and saves the results in a directory called "problem85" (on HDFS).
Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session Spark is faster than MapReduce
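A minimal sketch of one possible solution, to be run with spark-submit (paths are taken from the scenario):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

content = sc.textFile("Content.txt")                  # read the file from HDFS
counts = (content.flatMap(lambda line: line.split())  # split each line into words
                 .map(lambda word: (word, 1))         # pair each word with a count of 1
                 .reduceByKey(lambda a, b: a + b))    # sum the counts per word
counts.saveAsTextFile("problem85")                    # save the results to HDFS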
Problem Scenario 15: You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. In the MySQL departments table, insert the following record: Insert into departments values(9999, '"Data Science"');
2. Now there is a downstream system which will process dumps of this table. However, the system is designed such that it can only process files whose fields are enclosed in single quotes ('), whose fields are separated by a dash (-), and whose lines are terminated by a colon (:).
3. If the data itself contains a double quote ("), it should be escaped by \.
4. Import the departments table into a directory called departments_enclosedby, such that the files can be processed by the downstream system. (A possible command is sketched below.)
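A possible sqoop import matching those requirements (a sketch; the enclosure, escape, and terminator characters all follow the scenario):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table departments --target-dir /user/cloudera/departments_enclosedby \
  --enclosed-by \' --escaped-by \\ --fields-terminated-by '-' --lines-terminated-by ':'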
Problem Scenario 48: You have been given the below Python code snippet, with intermediate output.
We want to take a list of records about people, sum up their ages, and count the records.
So for this example, the element type in the RDD will be a dictionary in the format {name: NAME, age: AGE, gender: GENDER}.
The result type will be a tuple that looks like (Sum of Ages, Count).
people = []
people.append({'name':'Amit', 'age':45,'gender':'M'})
people.append({'name':'Ganga', 'age':43,'gender':'F'})
people.append({'name':'John', 'age':28,'gender':'M'})
people.append({'name':'Lolita', 'age':33,'gender':'F'})
people.append({'name':'Dont Know', 'age':18,'gender':'T'})
peopleRdd = sc.parallelize(people)  # Create an RDD
peopleRdd.aggregate((0, 0), seqOp, combOp)  # Output of the above line: (167, 5)
Now define the two operations seqOp and combOp, such that
seqOp: sums the ages of all people and counts them, within each partition.
combOp: combines the results from all partitions.
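A possible pair of definitions (a sketch; the accumulator is the (sum, count) tuple seeded by aggregate's zero value (0, 0)):
# seqOp folds one person record into a partition's (sum, count) accumulator
seqOp = (lambda acc, person: (acc[0] + person['age'], acc[1] + 1))
# combOp merges the (sum, count) accumulators of two partitions
combOp = (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))

peopleRdd.aggregate((0, 0), seqOp, combOp)  # (167, 5)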
Problem Scenario 81: You have been given a MySQL DB with following details, as well as the following product.csv file.
product.csv
productID,productCode,name,quantity,price
1001,PEN,Pen Red,5000,1.23
1002,PEN,Pen Blue,8000,1.25
1003,PEN,Pen Black,2000,1.25
1004,PEC,Pencil 2B,10000,0.48
1005,PEC,Pencil 2H,8000,0.49
1006,PEC,Pencil HB,0,9999.99
Now accomplish the following activities.
1. Create a Hive ORC table using SparkSQL.
2. Load this data into the Hive table.
3. Create a Hive Parquet table using SparkSQL and load data into it. (A sketch covering all three steps follows.)
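A possible sketch using a HiveContext (Spark 1.x style); the table names and the HDFS path of product.csv are illustrative assumptions:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

// 1. Create a Hive ORC table
sqlContext.sql("CREATE TABLE IF NOT EXISTS product_orc_table (productid INT, code STRING, name STRING, quantity INT, price FLOAT) STORED AS ORC")

// 2. Read the CSV, drop the header, and insert the rows into the ORC table
val products = sc.textFile("/user/cloudera/product.csv")
val header = products.first()
val prodDF = products.filter(_ != header).map(_.split(","))
  .map(p => (p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat))
  .toDF("productid", "code", "name", "quantity", "price")
prodDF.registerTempTable("products")
sqlContext.sql("INSERT INTO TABLE product_orc_table SELECT * FROM products")

// 3. Load the same data into a Hive Parquet table
sqlContext.sql("CREATE TABLE IF NOT EXISTS product_parquet_table (productid INT, code STRING, name STRING, quantity INT, price FLOAT) STORED AS PARQUET")
sqlContext.sql("INSERT INTO TABLE product_parquet_table SELECT * FROM products")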
Problem Scenario 34: You have been given a file named spark6/user.csv. Data is given below:
user.csv
id,topic,hits
Rahul,scala,120
Nikita,spark,80
Mithun,spark,1
myself,cca175,180
Now write Spark code in Scala which removes the header and creates an RDD of values as below for all remaining rows, and also filters out any row whose id is "myself".
Map(id -> Rahul, topic -> scala, hits -> 120)
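A possible sketch (assuming the header is the first line of the file):
val csv = sc.textFile("spark6/user.csv")
val header = csv.first()                     // "id,topic,hits"
val keys = header.split(",")
val result = csv.filter(_ != header)         // remove the header row
  .map(_.split(","))
  .filter(fields => fields(0) != "myself")   // filter out the "myself" row
  .map(fields => keys.zip(fields).toMap)     // Map(id -> ..., topic -> ..., hits -> ...)
result.collect().foreach(println)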
Problem Scenario 43: You have been given the following code snippet.
val grouped = sc.parallelize(Seq(((1, "two"), List((3, 4), (5, 6)))))
val flattened = grouped.flatMap { A =>
  groupValues.map { value => B }
}
You need to generate the following output; hence replace A and B.
Array((1,two,3,4), (1,two,5,6))
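One possible answer: A destructures the key pair and binds the list of values, while B flattens everything into a 4-tuple:
val flattened = grouped.flatMap { case ((num, word), groupValues) =>
  groupValues.map { value => (num, word, value._1, value._2) }
}
flattened.collect()  // Array((1,two,3,4), (1,two,5,6))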
Problem Scenario 3: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Import data from the categories table, where category=22 (data should be stored in categories_subset).
2. Import data from the categories table, where category>22 (data should be stored in categories_subset_2).
3. Import data from the categories table, where category is between 1 and 22 (data should be stored in categories_subset_3).
4. While importing categories data, change the delimiter to '|' (data should be stored in categories_subset_5).
5. Import data from the categories table, restricting the import to the category_name and category_id columns only, with '|' as the delimiter.
6. Add null values to the table using the below SQL statements: ALTER TABLE categories MODIFY category_department_id int(11); INSERT INTO categories VALUES (60, NULL, 'TESTING');
7. Import data from the categories table (into a categories_subset_17 directory) using the '|' delimiter, with category_id between 1 and 61, and encode null values for both string and non-string columns.
8. Import the entire retail_db schema into a directory categories_subset_all_tables. (Representative commands are sketched below.)
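A possible set of representative commands for steps 1, 4, 7, and 8 (a sketch; the filter column is assumed to be category_id, and the null-encoding values are illustrative):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table categories --where "category_id = 22" --target-dir categories_subset
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table categories --fields-terminated-by '|' --target-dir categories_subset_5
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table categories --where "category_id between 1 and 61" --fields-terminated-by '|' \
  --null-string '\\N' --null-non-string '\\N' --target-dir categories_subset_17
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --warehouse-dir categories_subset_all_tables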