Databricks Certified Machine Learning Associate Exam (Databricks-Machine-Learning-Associate) Practice Test

Page: 1 / 7
Total 74 questions

Databricks Certified Machine Learning Associate Exam Questions and Answers

Question 1

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

[Options A through D were presented as code images, which are not reproduced here.]

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D
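
Because the option images are missing, the idea behind the question can only be sketched: "best" RMSE means the lowest value, so the task is to sort runs by RMSE in ascending order and take the first one. The run records and the helper name below are hypothetical stand-ins for what MLflow would return, not the exam's code.

```python
# Hypothetical run records; in practice these would come from MLflow tracking.
runs = [
    {"run_id": "run_a", "rmse": 4.2},
    {"run_id": "run_b", "rmse": 3.1},
    {"run_id": "run_c", "rmse": 5.7},
]


def best_run_id_by_rmse(run_records):
    """Return the run_id with the smallest RMSE (lower RMSE is better)."""
    if not run_records:
        raise ValueError("no runs to compare")
    return min(run_records, key=lambda record: record["rmse"])["run_id"]


best_run_id = best_run_id_by_rmse(runs)
print(best_run_id)
```

With MLflow itself, a comparable query would likely use mlflow.search_runs(experiment_ids=[experiment_id], order_by=["metrics.rmse ASC"], max_results=1), assuming the metric was logged under the key rmse.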

Question 2

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A.

A holdout set is not necessary when using a train-validation split

B.

Reproducibility is achievable when using a train-validation split

C.

Fewer hyperparameter values need to be tested when using a train-validation split

D.

Bias is avoidable when using a train-validation split

E.

Fewer models need to be trained when using a train-validation split
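
To make the training-cost comparison in this question concrete, here is a minimal sketch that counts model fits for a hyperparameter grid under each validation strategy. The function name and the grid size of 6 are illustrative assumptions, not part of the exam material.

```python
def count_model_fits(num_param_combinations: int, num_folds: int = 1) -> int:
    """Return how many models are trained during tuning.

    num_folds=1 stands for a single train-validation split;
    num_folds=k stands for k-fold cross-validation.
    """
    if num_param_combinations < 1 or num_folds < 1:
        raise ValueError("both arguments must be positive")
    return num_param_combinations * num_folds


# Hypothetical grid with 6 hyperparameter combinations.
split_fits = count_model_fits(6)             # one fit per combination
cv_fits = count_model_fits(6, num_folds=5)   # five fits per combination
print(split_fits, cv_fits)
```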

Question 3

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

Options:

A.

Manually configure the cluster

B.

Write out the split data sets to persistent storage

C.

Set a seed in the data splitting operation

D.

Manually partition the input data

Question 4

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.

They use the following code block to create the objective_function:

[Code block image not reproduced]

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

Options:

A.

Add test set validation process

B.

Add a random_state argument to the RandomForestRegressor operation

C.

Remove the mean operation that is wrapping the cross_val_score operation

D.

Replace the r2 return value with -r2

E.

Replace the fmin operation with the fmax operation
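
Hyperopt's fmin searches for the input that minimizes the value returned by the objective function. The sketch below uses a toy pure-Python minimizer as a stand-in (it is not Hyperopt) and hypothetical R-squared scores to show what happens when a score where higher is better is returned unchanged. It assumes, as the options suggest, that the original objective returns an R-squared value.

```python
def toy_fmin(fn, candidates):
    """Toy stand-in for a minimizer: return the candidate with the lowest loss."""
    return min(candidates, key=fn)


# Hypothetical cross-validated R^2 scores for three max_depth settings.
r2_by_depth = {2: 0.55, 4: 0.71, 8: 0.83}


def objective_returning_r2(depth):
    # Higher R^2 is better, so minimizing this picks the worst setting.
    return r2_by_depth[depth]


def objective_returning_negative_r2(depth):
    # Negating the score turns "maximize R^2" into a minimization problem.
    return -r2_by_depth[depth]


picked_with_r2 = toy_fmin(objective_returning_r2, r2_by_depth)
picked_with_negative_r2 = toy_fmin(objective_returning_negative_r2, r2_by_depth)
print(picked_with_r2, picked_with_negative_r2)
```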

Question 5

Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Options:

A.

TrainValidationSplit

B.

DataFrame.where

C.

CrossValidator

D.

TrainValidationSplitModel

E.

DataFrame.randomSplit

Question 6

A team is developing guidelines on when to use various evaluation metrics for classification problems. The team needs to provide input on when to use the F1 score over accuracy.

Which of the following suggestions should the team include in their guidelines?

Options:

A.

The F1 score should be utilized over accuracy when the number of actual positive cases is identical to the number of actual negative cases.

B.

The F1 score should be utilized over accuracy when there are greater than two classes in the target variable.

C.

The F1 score should be utilized over accuracy when there is significant imbalance between positive and negative classes and avoiding false negatives is a priority.

D.

The F1 score should be utilized over accuracy when identifying true positives and true negatives are equally important to the business problem.
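
A small worked example can show how accuracy and F1 diverge on imbalanced data. The confusion-matrix counts below are hypothetical: 1,000 samples with only 10 actual positives, scored by a classifier that predicts "negative" for every sample.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Hypothetical counts: every one of the 10 positives is missed.
tp, fp, fn, tn = 0, 0, 10, 990

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(acc, f1)
```

Accuracy looks excellent because the majority class dominates, while F1 drops to zero because no positive case is found.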

Question 7

A data scientist is working with a feature set with the following schema:

[Schema image not reproduced]

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Options:

A.

customer_id, loyalty_tier

B.

loyalty_tier

C.

units

D.

spend

E.

customer_id

Question 8

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.

As a result, they have the following code block:

[Code block image not reproduced]

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options:

A.

Change SparkTrials() to Trials()

B.

Reduce num_evals to be less than 10

C.

Change fmin() to fmax()

D.

Remove the trials=trials argument

E.

Remove the algo=tpe.suggest argument

Question 9

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

[Schema image not reproduced]

The machine learning engineer shares the following code block:

[Code block image not reproduced]

Which of the following changes does the machine learning engineer need to make to complete the task?

Options:

A.

They need to call the transform method on train_df

B.

They need to convert the features column to be a vector

C.

They do not need to make any changes

D.

They need to utilize a Pipeline to fit the model

E.

They need to split the features column out into one column for each feature

Question 10

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition

Question 11

A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.summary()

B.

spark_df.stats()

C.

spark_df.describe().head()

D.

spark_df.printSchema()

E.

spark_df.toPandas()

Question 12

A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.

In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?

Options:

A.

When the tuning process is randomized

B.

When the entire data can fit on each core

C.

When the model is unable to be parallelized

D.

When the data is particularly long in shape

E.

When the data is particularly wide in shape
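
The memory constraint in this scenario can be checked with simple arithmetic: every core needs its own full copy of the broadcast data, and the total memory stays fixed. The sketch below assumes, purely for illustration, that cluster memory is shared evenly across cores; the 64 GB total and the data sizes are hypothetical.

```python
def copies_fit_in_memory(data_gb: float, total_memory_gb: float, num_cores: int) -> bool:
    """Return True if every core can hold its own full copy of the training data.

    Assumes (for illustration only) that memory is divided evenly across cores.
    """
    memory_per_core_gb = total_memory_gb / num_cores
    return data_gb <= memory_per_core_gb


total_memory_gb = 64.0  # hypothetical fixed cluster memory

# A 6 GB data set still fits when going from 4 to 8 cores; a 12 GB one does not.
small_ok_at_8 = copies_fit_in_memory(6.0, total_memory_gb, num_cores=8)
large_ok_at_4 = copies_fit_in_memory(12.0, total_memory_gb, num_cores=4)
large_ok_at_8 = copies_fit_in_memory(12.0, total_memory_gb, num_cores=8)
print(small_ok_at_8, large_ok_at_4, large_ok_at_8)
```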

Question 13

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[Code block image not reproduced]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

Question 14

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.

One-hot encoding is not supported by most machine learning libraries.

B.

One-hot encoding is dependent on the target variable's values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

E.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Question 15

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

Options:

A.

spark_df[spark_df["price"] > 0]

B.

spark_df.filter(col("price") > 0)

C.

SELECT * FROM spark_df WHERE price > 0

D.

spark_df.loc[spark_df["price"] > 0,:]

E.

spark_df.loc[:,spark_df["price"] > 0]

Question 16

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Singular value decomposition

C.

Iterative optimization

D.

Least-squares method

Question 17

Which of the following statements describes a Spark ML estimator?

Options:

A.

An estimator is a hyperparameter grid that can be used to train a model

B.

An estimator chains multiple algorithms together to specify an ML workflow

C.

An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D.

An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E.

An estimator is an evaluation tool to assess the quality of a model

Question 18

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

[Code block image not reproduced]

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

Options:

A.

They need to specify the method parameter to the OneHotEncoder.

B.

They need to remove the line with the fit operation.

C.

They need to use StringIndexer prior to one-hot encoding the features.

D.

They need to use VectorAssembler prior to one-hot encoding the features.

Question 19

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

B.

One-hot encoding is dependent on the target variable's values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

Question 20

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

Options:

A.

F1

B.

R-squared

C.

MAE

D.

MSE

Question 21

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.

Which of the following terms is used to describe this combination of models?

Options:

A.

Bootstrap aggregation

B.

Support vector machines

C.

Bucketing

D.

Ensemble learning

E.

Stacking

Question 22

A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".

Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?

Options:

A.

mlflow.register_model(run_id, "best_model")

B.

mlflow.register_model(f"runs:/{run_id}/model", "best_model")

C.

mlflow.register_model(f"runs:/{run_id}/model")

D.

mlflow.register_model(f"runs:/{run_id}/best_model", "model")
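
The options differ mainly in how the model URI is built. As a minimal sketch, a run-relative model URI follows the pattern runs:/<run_id>/<artifact_path>, where the artifact path is the name the model was logged under. The helper name and the run ID below are hypothetical.

```python
def build_run_model_uri(run_id: str, artifact_path: str) -> str:
    """Build a run-relative model URI of the form runs:/<run_id>/<artifact_path>."""
    return f"runs:/{run_id}/{artifact_path}"


run_id = "abc123"  # hypothetical run ID for illustration
model_uri = build_run_model_uri(run_id, "model")
print(model_uri)

# With MLflow installed and a tracking server configured, the registration
# call would then take the URI first and the registry name second:
#   import mlflow
#   mlflow.register_model(model_uri, "best_model")
```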
