Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?
A. 1, 10
B. 1, 8
C. 10
D. 7, 9, 10
E. 1, 4, 6, 9
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:
A. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
B. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
C. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
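For reference, a minimal sketch of how from_unixtime renders unix epoch timestamps as formatted strings via withColumn, reusing the column names from the question; the pattern shown is simply how Spark spells a month/day/year format:
from pyspark.sql.functions import from_unixtime
# Add a string column with the epoch timestamps in transactionDate
# rendered as month/day/year.
transactionsDf.withColumn("transactionDateFormatted",
                          from_unixtime("transactionDate", format="MM/dd/yyyy"))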
Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?
A. from pyspark.sql import types as T
   evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
   transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))
B. evaluateTestSuccessUDF = udf(evaluateTestSuccess)
   transactionsDf.withColumn("result", evaluateTestSuccessUDF(storeId))
C. from pyspark.sql import types as T
   evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.IntegerType())
   transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))
D. evaluateTestSuccessUDF = udf(evaluateTestSuccess)
   transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))
E. from pyspark.sql import types as T
   evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
   transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))
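For reference, a minimal sketch of wrapping a Python function as a UDF with an explicit return type and applying it to a column; evaluateTestSuccess is assumed to be a boolean-returning Python function defined elsewhere, as in the question:
from pyspark.sql import types as T
from pyspark.sql.functions import udf, col
# Wrap the Python function as a UDF, declaring its boolean return type,
# then apply it to column storeId and attach the result as a new column.
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))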
The code block shown below should read all files with the file ending .png in directory path into Spark. Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)
A. 1. read()  2. format  3. "binaryFile"  4. "recursiveFileLookup"  5. load
B. 1. read  2. format  3. "binaryFile"  4. "pathGlobFilter"  5. load
C. 1. read  2. format  3. binaryFile  4. pathGlobFilter  5. load
D. 1. open  2. format  3. "image"  4. "fileType"  5. open
E. 1. open  2. as  3. "binaryFile"  4. "pathGlobFilter"  5. load
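For reference, a minimal sketch of reading binary files with a glob filter; path is assumed to be a string variable naming the directory, as in the question:
# Read only files whose names match *.png, returning one row per file
# with its content as binary data.
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)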
Which of the following describes properties of a shuffle?
A. Operations involving shuffles are never evaluated lazily.
B. Shuffles involve only single partitions.
C. Shuffles belong to a class known as "full transformations".
D. A shuffle is one of many actions in Spark.
E. In a shuffle, Spark writes data to disk.
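To make the term concrete, a small sketch of a wide transformation that causes a shuffle; storeId is reused from the DataFrames in the surrounding questions:
# groupBy is a wide transformation: rows with the same storeId must end up in
# the same partition, so Spark shuffles data across the cluster, writing
# intermediate shuffle files to local disk. Like other transformations it is
# evaluated lazily; nothing runs until an action such as show() is called.
transactionsDf.groupBy("storeId").count().show()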
The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where executor memory is insufficient, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
A. Caching is not supported in Spark, data are always recomputed.
B. Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
C. The storage level is inappropriate for fault-tolerant storage.
D. The code block uses the wrong operator for caching.
E. The DataFrameWriter needs to be invoked.
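For reference, a minimal sketch of the persist API with an explicit storage level; the import is an assumption not shown in the question's code block:
from pyspark import StorageLevel
# persist is a DataFrame method and is evaluated lazily; MEMORY_AND_DISK keeps
# partitions in executor memory and spills to disk when memory is insufficient.
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)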
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf.
Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))
A. The operator used for adding the column does not add column predErrorAdded to the DataFrame.
B. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
C. The udf() method does not declare a return type.
D. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
E. The Python function is unable to handle null values, resulting in the code block crashing on execution.
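For reference, a minimal sketch of how a UDF result is typically attached as a new column, reusing the function from the question's code block; udf() without an explicit return type defaults to a string return type:
from pyspark.sql.functions import udf, col
# Wrap the Python function as a UDF and attach its result as a new column.
add_2_if_geq_3_udf = udf(add_2_if_geq_3)
transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError")))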
The code block displayed below contains an error. The code block should return all rows of DataFrame transactionsDf, but including only columns storeId and predError. Find the error.
Code block:
spark.collect(transactionsDf.select("storeId", "predError"))
A. Instead of select, DataFrame transactionsDf needs to be filtered using the filter operator.
B. Columns storeId and predError need to be represented as a Python list, so they need to be wrapped in brackets ([]).
C. The take method should be used instead of the collect method.
D. Instead of collect, collectAsRows needs to be called.
E. The collect method is not a method of the SparkSession object.
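For reference, collect is a method of the DataFrame itself, so a minimal sketch of selecting the two columns and pulling the rows to the driver reads:
# select narrows the DataFrame to the two columns; collect() returns the
# resulting rows to the driver as a list of Row objects.
transactionsDf.select("storeId", "predError").collect()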
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?
A. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
B. itemsDf.sample(fraction=0.1, seed=87238)
C. itemsDf.sample(fraction=1000, seed=98263)
D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
E. itemsDf.sample(fraction=0.1)
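For reference, a minimal sketch of DataFrame.sample; fraction gives the approximate share of rows to keep, sampling is without replacement by default, and a fixed seed makes the result reproducible across runs:
# Keep roughly 10% of the rows, reproducibly, because a seed is supplied.
itemsDf.sample(fraction=0.1, seed=87238)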
In which order should the code blocks shown below be run in order to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrame transactionsDf and itemsDf on columns productId and itemId, respectively?
1. .filter(~isnull(col('value')))
2. .count()
3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))
4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')
5. .filter(col('value').isnotnull())
6. .sum(col('value'))
A. 4, 1, 2
B. 3, 1, 6
C. 3, 1, 2
D. 3, 5, 2
E. 4, 6
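For reference, a sketch of what such a chain looks like when composed end to end; isnull is imported from pyspark.sql.functions:
from pyspark.sql.functions import col, isnull
# Inner join on the key columns, keep rows whose value is not null, count them.
(transactionsDf
    .join(itemsDf, transactionsDf.productId == itemsDf.itemId, how='inner')
    .filter(~isnull(col('value')))
    .count())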