A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model bycomparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?
A. The second model is much more accurate than the first model
B. The data scientist failed to exponentiate the predictions in the second model prior tocomputingthe RMSE
C. The datascientist failed to take the logof the predictions in the first model prior to computingthe RMSE
D. The first model is much more accurate than the second model
E. The RMSE is an invalid evaluation metric for regression problems
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrametrain_dfto train the model.
The Spark DataFrametrain_dfhas the following schema:
The machine learning engineer shares the following code block:
Which of the following changes does the machine learning engineer need to make to complete the task?
A. They need to call the transform method on train df
B. They need to convert the features column to be a vector
C. They do not need to make any changes
D. They need to utilize a Pipeline to fit the model
E. They need to split thefeaturescolumn out into one column for each feature
A data scientist is wanting to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
A. spark_df.summary ()
B. spark_df.stats()
C. spark_df.describe().head()
D. spark_df.printSchema()
E. spark_df.toPandas()
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
A. The vectorized pandas UDFs allow for the use of type hints
B. The vectorized pandas UDFs process data in batches rather than one row at a time
C. The vectorized pandas UDFs allow for pandas API use inside of the function
D. The vectorized pandas UDFs work on distributed DataFrames
E. The vectorized pandas UDFs process data in memory rather than spilling to disk
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
A. RMSE
B. Precision
C. Area under the residual operating curve
D. Accuracy
E. Recall
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library'sfminoperation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with theobjective_functionbeing passed as an argument tofmin.
They use the following code block to create theobjective_function:
Which of the following changes does the data scientist need to make to theirobjective_functionin order to produce a more accurate model?
A. Add test set validation process
B. Add a random_state argument to the RandomForestRegressor operation
C. Remove the mean operation that is wrapping the cross_val_score operation
D. Replace the r2 return value with-r2
E. Replace the fmin operation with the fmax operation
A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?
A. Implement MLflow Experiment Tracking
B. Scale up with Spark ML
C. Enable autoscaling clusters
D. Parallelize with Hyperopt
Which of the following machine learning algorithms typically uses bagging?
A. IGradient boosted trees
B. K-means
C. Random forest
D. Decision tree
A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the pathmodel_urifor the DataFramebatch_df.
batch_dfhas the following schema:
customer_id STRING
The machine learning engineer runs the following code block to perform inference onbatch_dfusing the linear regression model atmodel_uri:
In which situation will the machine learning engineer's code block perform the desired inference?
A. When the Feature Store feature set was logged with the model at model_uri
B. When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark
C. When the model at model_uri only uses customer_id as a feature
D. This code block will not perform the desired inference in any situation.
E. When all of the features used by the model at model_uri are in a single Feature Store table
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal 0.
Which of the following code blocks will accomplish this task?
A. spark_df.loc[:,spark_df["discount"] <= 0]
B. spark_df[spark_df["discount"] <= 0]
C. spark_df.filter (col("discount") <= 0)
D. spark_df.loc(spark_df["discount"] <= 0, :]