
Databricks Databricks-Certified-Data-Engineer-Associate: Databricks Certified Data Engineer Associate Exam Practice Test

Page: 1 / 15
Total 153 questions

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

A data engineer has joined an existing project and they see the following query in the project repository:

CREATE STREAMING LIVE TABLE loyal_customers AS

SELECT customer_id

FROM STREAM(LIVE.customers)

WHERE loyalty_level = 'high';

Which of the following describes why the STREAM function is included in the query?

Options:

A.

The STREAM function is not needed and will cause an error.

B.

The table being created is a live table.

C.

The customers table is a streaming live table.

D.

The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.

E.

The data in the customers table has been updated since its last run.
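
For reference, the same streaming live table can be declared in Python inside a Delta Live Tables pipeline; this is a minimal sketch assuming the pipeline already defines a customers dataset:

import dlt
from pyspark.sql.functions import col

# dlt.read_stream() reads another pipeline dataset as a stream, mirroring
# STREAM(LIVE.customers) in the SQL definition above.
@dlt.table(name="loyal_customers")
def loyal_customers():
    return (dlt.read_stream("customers")
            .where(col("loyalty_level") == "high")
            .select("customer_id"))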

Question 2

Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?

Options:

A.

Cloud-specific integrations

B.

Simplified governance

C.

Ability to scale storage

D.

Ability to scale workloads

E.

Avoiding vendor lock-in

Question 3

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

Question # 3

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

A.

There is no way to indicate whether a table contains PII.

B.

"COMMENT PII"

C.

TBLPROPERTIES PII

D.

COMMENT "Contains PII"

E.

PII

Question 4

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company's travel reimbursement details. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

Options:

A.

CONSTRAINT valid_location EXPECT (location = NULL)

B.

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE

C.

CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW

D.

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL
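
For reference, Delta Live Tables expectations expose the same ON VIOLATION behaviors in Python through decorators; a minimal sketch, assuming a DLT pipeline with a reimbursements_raw dataset (the dataset and constraint names are illustrative):

import dlt

# Rows violating the expectation are dropped; the update keeps running.
@dlt.table
@dlt.expect_or_drop("valid_location", "location IS NOT NULL")
def reimbursements_clean():
    return dlt.read_stream("reimbursements_raw")

# A violation fails the update, terminating the pipeline run.
@dlt.table
@dlt.expect_or_fail("valid_location_strict", "location IS NOT NULL")
def reimbursements_strict():
    return dlt.read_stream("reimbursements_raw")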

Question 5

A data engineer wants to reduce costs and optimize cloud spending. The data engineer has decided to use Databricks Serverless for lowering cloud costs while maintaining existing SLAs.

What is the first step in migrating to Databricks Serverless?

Options:

A.

Legacy ingestion pipelines that include ingestion from source APIs, files, and JDBC/ODBC connections

B.

Low-frequency BI dashboarding and ad-hoc SQL analytics

C.

A frequently running and efficient Python-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

D.

A frequently running and efficient Scala-based data transformation pipeline compatible with the latest Databricks runtime and Unity Catalog

Question 6

A data engineer is designing an ETL pipeline to process both streaming and batch data from multiple sources The pipeline must ensure data quality, handle schema evolution, and provide easy maintenance. The team is considering using Delta Live Tables (DLT) in Databricks to achieve these goals. They want to understand the key features and benefits of DLT that make it suitable for this use case.

Why is Delta Live Tables (DLT) an appropriate choice?

Options:

A.

Automatic data quality checks, built-in support for schema evolution, and declarative pipeline development

B.

Manual schema enforcement, high operational overhead, and limited scalability

C.

Requires custom code for data quality checks, no support for streaming data, and complex pipeline maintenance

D.

Supports only batch processing, no data versioning, and high infrastructure costs

Question 7

A data engineer is working in a Python notebook on Databricks to process data, but notices that the output is not as expected. The data engineer wants to investigate the issue by stepping through the code and checking the values of certain variables during execution.

Which tool should the data engineer use to inspect the code execution and variables in real-time?

Options:

A.

Python Notebook Interactive Debugger

B.

Cluster Logs

C.

SQL Analytics

D.

Job Execution Dashboard

Question 8

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Options:

A.

An automated report needs to be refreshed as quickly as possible.

B.

An automated report needs to be made reproducible.

C.

An automated report needs to be tested to identify errors.

D.

An automated report needs to be version-controlled across multiple collaborators.

E.

An automated report needs to be runnable by all stakeholders.

Question 9

A global retail company sells products across multiple categories (e.g., Electronics, Clothing) and regions (e.g., North, South, East, West). The sales team has provided the data engineer with a PySpark DataFrame named sales_df as below, and the team wants the data engineer to analyze the sales data to help them make strategic decisions.

Question # 9

Options:

A.

Category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))

B.

Category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount")

C.

Category_sales = sales_df.agg(sum("sales_amount").groupBy("category").alias("total_sales_amount"))

D.

Category_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))

Question 10

Which TWO items are characteristics of the Gold Layer?

Choose 2 answers

Options:

A.

Read-optimized

B.

Normalised

C.

Raw Data

D.

Historical lineage

E.

De-normalised

Question 11

Which of the following tools is used by Auto Loader to process data incrementally?

Options:

A.

Checkpointing

B.

Spark Structured Streaming

C.

Data Explorer

D.

Unity Catalog

E.

Databricks SQL

Question 12

Which of the following commands will return the location of database customer360?

Options:

A.

DESCRIBE LOCATION customer360;

B.

DROP DATABASE customer360;

C.

DESCRIBE DATABASE customer360;

D.

ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');

E.

USE DATABASE customer360;

Question 13

A data engineer is writing a script that is meant to ingest new data from cloud storage. In the event of a schema change, the ingestion should fail, and it should continue to fail until the downstream source of the changes can be found and the changes verified as intended.

Which command will meet the requirements?

Options:

A.

addNewColumns

B.

failOnNewColumns

C.

rescue

D.

none
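
For reference, Auto Loader's behavior when new columns appear is controlled by the cloudFiles.schemaEvolutionMode option; a minimal sketch, with the paths used here as placeholders:

# Auto Loader read that fails the stream when previously unseen columns arrive.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schemas/ingest")       # placeholder path
      .option("cloudFiles.schemaEvolutionMode", "failOnNewColumns")
      .load("/input/path"))                                             # placeholder path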

Question 14

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

Question # 14

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

Options:

A.

processingTime(1)

B.

trigger(availableNow=True)

C.

trigger(parallelBatch=True)

D.

trigger(processingTime="once")

E.

trigger(continuous="once")
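
For context, a trigger that processes all currently available data in as many micro-batches as needed and then stops looks like the sketch below (table names and checkpoint path are placeholders, and this assumes a recent Databricks runtime):

(spark.readStream.table("source_table")
      .writeStream
      .option("checkpointLocation", "/tmp/checkpoints/example")  # placeholder path
      .trigger(availableNow=True)   # drain everything available, then stop
      .toTable("target_table"))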

Question 15

Question # 15

Calculate the total sales amount for each region and store the results in a new dataframe called region_sales.

Given the expected result:

Question # 15

Which code will generate the expected result?

Options:

A.

region_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))

B.

region_sales = sales_df.sum("sales_amount").groupBy("region").alias("total_sales_amount")

C.

region_sales = sales_df.groupBy("category").sum("sales_amount").alias("total_sales_amount")

D.

region_sales = sales_df.agg(sum("sales_amount").groupBy("region").alias("total_sales_amount"))

Question 16

A data engineer needs to combine sales data from an on-premises PostgreSQL database with customer data in Azure Synapse for a comprehensive report. The goal is to avoid data duplication and ensure up-to-date information

How should the data engineer achieve this using Databricks?

Options:

A.

Develop custom ETL pipelines to ingest data into Databricks

B.

Use Lakehouse Federation to query both data sources directly

C.

Manually synchronize data from both sources into a single database

D.

Export data from both sources to CSV files and upload them to Databricks

Question 17

Which SQL keyword can be used to convert a table from a long format to a wide format?

Options:

A.

TRANSFORM

B.

PIVOT

C.

SUM

D.

CONVERT
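
For context, long-to-wide reshaping is also available on DataFrames via groupBy().pivot(); a minimal sketch with hypothetical columns region, quarter, and sales on a hypothetical sales_long_df:

from pyspark.sql import functions as F

# One output column per distinct quarter value, holding the summed sales.
wide_df = (sales_long_df
           .groupBy("region")
           .pivot("quarter")
           .agg(F.sum("sales")))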

Question 18

A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).

Which of the following code blocks creates this SQL UDF?

Options:

A.

Option A18

B.

Option B18

C.

Option C18

D.

Option D18

E.

Option E18

Question 19

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

Options:

A.

INSERT INTO my_table VALUES ('a1', 6, 9.4)

B.

my_table UNION VALUES ('a1', 6, 9.4)

C.

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

D.

UPDATE my_table VALUES ('a1', 6, 9.4)

E.

UPDATE VALUES ('a1', 6, 9.4) my_table
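
For reference, appending a single record from a notebook can be done with spark.sql; a minimal sketch assuming my_table already exists with columns (id, rank, rating):

# Appends one row to the existing Delta table.
spark.sql("INSERT INTO my_table VALUES ('a1', 6, 9.4)")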

Question 20

Which tool is used by Auto Loader to process data incrementally?

Options:

A.

Spark Structured Streaming

B.

Unity Catalog

C.

Checkpointing

D.

Databricks SQL

Question 21

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

Options:

A.

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

INNER JOIN SELECT * FROM april_transactions;

B.

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

UNION SELECT * FROM april_transactions;

C.

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

OUTER JOIN SELECT * FROM april_transactions;

D.

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

INTERSECT SELECT * from april_transactions;

E.

CREATE TABLE all_transactions AS

SELECT * FROM march_transactions

MERGE SELECT * FROM april_transactions;
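
As context for the set operators involved: UNION removes duplicate rows across its inputs, while UNION ALL keeps them. A minimal sketch of the CTAS pattern run from a notebook:

# Combine the two monthly tables into one; UNION de-duplicates identical rows.
spark.sql("""
    CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    UNION
    SELECT * FROM april_transactions
""")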

Question 22

A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

Options:

A.

They can turn on the Auto Stop feature for the SQL endpoint.

B.

They can ensure the dashboard's SQL endpoint is not one of the included query's SQL endpoint.

C.

They can reduce the cluster size of the SQL endpoint.

D.

They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.

E.

They can set up the dashboard's SQL endpoint to be serverless.

Question 23

Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

Options:

A.

None of these

B.

Data lake

C.

Data warehouse

D.

All of these

E.

Data lakehouse

Question 24

A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

Question # 24

The data engineer runs the following query to join these tables together:

Question # 24

Which of the following will be returned by the above query?

Question # 24

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Question 25

An organization is looking for an optimized storage layer that supports ACID transactions and schema enforcement. Which technology should the organization use?

Options:

A.

Cloud File Storage

B.

Unity Catalog

C.

Data lake

D.

Delta Lake

Question 26

A data engineer wants to create a new table containing the names of customers who live in France.

They have written the following command:

CREATE TABLE customersInFrance

_____ AS

SELECT id,

firstName,

lastName

FROM customerLocations

WHERE country = 'FRANCE';

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which line of code fills in the above blank to successfully complete the task?

Options:

A.

COMMENT "Contains PIT

B.

PII

C.

"COMMENT PII"

D.

TBLPROPERTIES PII
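
For reference, Databricks CREATE TABLE DDL accepts both a COMMENT clause and an explicit TBLPROPERTIES clause before AS; a minimal sketch, where the property key contains_pii is only an illustrative naming choice:

spark.sql("""
    CREATE TABLE customersInFrance
    COMMENT 'Contains PII'
    TBLPROPERTIES ('contains_pii' = 'true')   -- key and value are illustrative
    AS SELECT id, firstName, lastName
    FROM customerLocations
    WHERE country = 'FRANCE'
""")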

Question 27

A data engineer has created a new database using the following command:

CREATE DATABASE IF NOT EXISTS customer360;

In which of the following locations will the customer360 database be located?

Options:

A.

dbfs:/user/hive/database/customer360

B.

dbfs:/user/hive/warehouse

C.

dbfs:/user/hive/customer360

D.

More information is needed to determine the correct response

Question 28

A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using Continuous Pipeline Mode.

What is the expected outcome after clicking Start to update the pipeline, assuming previously unprocessed data exists and all definitions are valid?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

C.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

D.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

Question 29

A data engineer needs to optimize the data layout and query performance for an e-commerce transactions Delta table. The table is partitioned by "purchase_date", a date column, which helps with time-based queries but does not optimize searches on user statistics such as "customer_id", a high-cardinality column.

The table is usually queried with filters on "customer_id" within specific date ranges, but since this data is spread across multiple files in each partition, queries result in full partition scans and increased runtime and costs.

How should the data engineer optimize the Data Layout for efficient reads?

Options:

A.

Alter the table implementing liquid clustering on "customer_id" while keeping the existing partitioning.

B.

Alter the table to partition by "customer_id".

C.

Enable delta caching on the cluster so that frequent reads are cached for performance.

D.

Alter the table implementing liquid clustering by "customer_id" and "purchase_date".
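
For context, liquid clustering can be declared on a Delta table with ALTER TABLE ... CLUSTER BY; a minimal sketch, assuming the table is named transactions and is not Hive-style partitioned (liquid clustering and partitioning cannot be combined on the same table):

# Cluster by the columns used in filters; OPTIMIZE rewrites existing files
# so that they follow the new clustering.
spark.sql("ALTER TABLE transactions CLUSTER BY (customer_id, purchase_date)")
spark.sql("OPTIMIZE transactions")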

Question 30

Which of the following commands will return the number of null values in the member_id column?

Options:

A.

SELECT count(member_id) FROM my_table;

B.

SELECT count(member_id) - count_null(member_id) FROM my_table;

C.

SELECT count_if(member_id IS NULL) FROM my_table;

D.

SELECT null(member_id) FROM my_table;

E.

SELECT count_null(member_id) FROM my_table;
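
For reference, count_if is an aggregate available in Databricks SQL, and a NULL count can also be derived from count(*); a minimal sketch run from a notebook:

# Two ways to count NULL member_id values.
spark.sql("SELECT count_if(member_id IS NULL) AS null_count FROM my_table")
spark.sql("SELECT count(*) - count(member_id) AS null_count FROM my_table")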

Question 31

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Development mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.

C.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

D.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

E.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

Question 32

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.

Which of the following data entities should the data engineer create?

Options:

A.

Database

B.

Function

C.

View

D.

Temporary view

E.

Table

Question 33

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which of the following approaches can be used to identify the owner of new_table?

Options:

A.

Review the Permissions tab in the table's page in Data Explorer

B.

All of these options can be used to identify the owner of the table

C.

Review the Owner field in the table's page in Data Explorer

D.

Review the Owner field in the table's page in the cloud storage solution

E.

There is no way to identify the owner of the table

Question 34

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

Options:

A.

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

B.

They can turn on the Auto Stop feature for the SQL endpoint.

C.

They can increase the cluster size of the SQL endpoint.

D.

They can turn on the Serverless feature for the SQL endpoint.

E.

They can increase the maximum bound of the SQL endpoint's scaling range

Question 35

A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.

Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?

Options:

A.

pyspark.sql.types.DateType

B.

datetime

C.

pyspark.sql.types.TimestampType

D.

Cron syntax

E.

There is no way to represent and submit this information programmatically

Question 36

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has an associated Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.

Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?

Options:

A.

They can set up an Alert with a custom template.

B.

They can set up an Alert with a new email alert destination.

C.

They can set up an Alert with a new webhook alert destination.

D.

They can set up an Alert with one-time notifications.

E.

They can set up an Alert without notifications.

Question 37

Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?

Options:

A.

SELECT * FROM my_table WHERE age > 25;

B.

UPDATE my_table WHERE age > 25;

C.

DELETE FROM my_table WHERE age > 25;

D.

UPDATE my_table WHERE age <= 25;

E.

DELETE FROM my_table WHERE age <= 25;
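
For reference, Delta tables support an in-place DELETE; a minimal sketch run from a notebook:

# Removes matching rows from the Delta table and commits a new table version.
spark.sql("DELETE FROM my_table WHERE age > 25")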

Question 38

Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?

Options:

A.

Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.

B.

Scheduled Workflows process data as it arrives at configured sources.

C.

Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.

D.

Scheduled Workflows run continuously until manually stopped.

Question 39

A data engineer needs to ingest from both streaming and batch sources for a firm that relies on highly accurate data. Occasionally, some of the data picked up by the sensors that provide a streaming input are outside the expected parameters. If this occurs, the data must be dropped, but the stream should not fail.

Which feature of Delta Live Tables meets this requirement?

Options:

A.

Monitoring

B.

Change Data Capture

C.

Expectations

D.

Error Handling

Question 40

A data engineer is working on a Databricks project that utilizes cloud storage. The data engineer wants to load several JSON files from containers in a storage account as soon as each file arrives in the storage account.

Which syntax should the data engineer follow to first load the files into a dataframe and check that it is working as expected using Python?

Options:

A.

df = spark.readStream.format("json").load("input/path")

B.

df = spark.readStream.format("cloud"),option("json").load("/input/path")

C.

df = spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .load("/input/path")

D.

df = spark.read.json("input/path")
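
For context, an Auto Loader read plus a quick notebook check might look like the sketch below (paths are placeholders; display() is the Databricks notebook helper):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schemas/landing")  # placeholder path
      .load("/input/path"))                                         # placeholder path

display(df)  # inspect the streaming results interactively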

Question 41

A new data engineering team has been assigned to an ELT project. The new data engineering team will need full privileges on the table sales to fully manage the project.

Which command can be used to grant full permissions on the table sales to the new data engineering team?

Options:

A.

grant all privileges on table sales TO team;

B.

GRANT SELECT ON TABLE sales TO team;

C.

GRANT SELECT CREATE MODIFY ON TABLE sales TO team;

D.

GRANT ALL PRIVILEGES ON TABLE team TO sales;
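
For reference, table privileges are granted with standard GRANT statements and can be verified with SHOW GRANTS; a minimal sketch, using the group name team from the scenario:

spark.sql("GRANT ALL PRIVILEGES ON TABLE sales TO `team`")
spark.sql("SHOW GRANTS ON TABLE sales")  # confirm the privileges that were applied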

Question 42

Which of the following describes the type of workloads that are always compatible with Auto Loader?

Options:

A.

Dashboard workloads

B.

Streaming workloads

C.

Machine learning workloads

D.

Serverless workloads

E.

Batch workloads

Question 43

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location "/transactions/raw".

Today, the data engineer runs the following command to complete this task:

Question # 43

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

Which of the following describes why the statement might not have copied any new records into the table?

Options:

A.

The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.

B.

The names of the files to be copied were not included with the FILES keyword.

C.

The previous day’s file has already been copied into the table.

D.

The PARQUET file format does not support COPY INTO.

E.

The COPY INTO statement requires the table to be refreshed to view the copied rows.

Question 44

A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Question # 44

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

Options:

A.

Replace predict with a stream-friendly prediction function

B.

Replace schema(schema) with option ("maxFilesPerTrigger", 1)

C.

Replace "transactions" with the path to the location of the Delta table

D.

Replace format("delta") with format("stream")

E.

Replace spark.read with spark.readStream
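
For context, the batch and streaming entry points differ only in the reader; a minimal sketch, assuming transactions is a Delta table registered in the metastore:

# Streaming read of the Delta table (batch equivalent shown for comparison).
stream_df = spark.readStream.table("transactions")
# batch_df = spark.read.table("transactions")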

Question 45

A data engineer has written a function in a Databricks Notebook to calculate the population of bacteria in a given medium.

Question # 45

Analysts use this function in the notebook and sometimes provide input arguments of the wrong data type, which can cause errors during execution.

Which Databricks feature will help the data engineer quickly identify if an incorrect data type has been provided as input?

Options:

A.

The Data Engineer should add print statements to find out what the variable is.

B.

The Databricks debugger enables breakpoints that will raise an error if the wrong data type is submitted

C.

The Spark User interface has a debug tab that contains the variables that are used in this session.

D.

The Databricks debugger enables the use of a variable explorer to see at a glance the value of the variables.
