Databricks Certified Data Engineer Associate Exam Practice Test

Databricks Certified Data Engineer Associate Exam Questions and Answers

Question 1

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?

Options:

A.

Unity Catalog

B.

Data Explorer

C.

Delta Lake

D.

Delta Live Tables

E.

Auto Loader

Question 2

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

Options:

A.

DROP

B.

IGNORE

C.

MERGE

D.

APPEND

E.

INSERT
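As background on the concept behind this question, the difference between an upsert-style write and a blind append can be imitated in plain Python. This is only a sketch of the semantics on lists of dictionaries, not Delta Lake code; the function names `merge_records` and `append_records` and the key column `id` are made up for illustration.

```python
def merge_records(target, updates, key="id"):
    """Imitate upsert semantics: update rows whose key matches, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # A matching key overwrites the existing row; a new key adds a row.
        merged[row[key]] = dict(row)
    return list(merged.values())


def append_records(target, updates):
    """Imitate a blind append: every incoming row is written, duplicates included."""
    return [dict(row) for row in target] + [dict(row) for row in updates]


target = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
updates = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]

merged = merge_records(target, updates)
appended = append_records(target, updates)
```

In this sketch the row with `id` 2 appears once after the merge-style write and twice after the append-style write.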

Question 3

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

Options:

A.

None of these changes will need to be made

B.

The pipeline will need to stop using the medallion-based multi-hop architecture

C.

The pipeline will need to be written entirely in SQL

D.

The pipeline will need to use a batch source in place of a streaming source

E.

The pipeline will need to be written entirely in Python

Question 4

What is the functionality of Auto Loader in Databricks?

Options:

A.

Auto Loader automatically ingests and processes new files from cloud storage, handling batch data with support for schema evolution.

B.

Auto Loader automatically ingests and processes new files from cloud storage, handling only streaming data with no support for schema evolution.

C.

Auto Loader automatically ingests and processes new files from cloud storage, handling batch and streaming data with no support for schema evolution.

D.

Auto Loader automatically ingests and processes new files from cloud storage, handling both batch and streaming data with support for schema evolution.

Question 5

A data engineer is setting up access control in Unity Catalog and needs to ensure that a group of data analysts can query tables but not modify data.

Which permission should the data engineer grant to the data analysts?

Options:

A.

SELECT

B.

INSERT

C.

MODIFY

D.

ALL PRIVILEGES

Question 6

Which SQL code snippet will correctly demonstrate a Data Definition Language (DDL) operation used to create a table?

Options:

A.

DROP TABLE employees;

B.

INSERT INTO employees (id, name) VALUES (1, 'Alice');

C.

CREATE TABLE employees (id INT, name STRING);

D.

ALTER TABLE employees ADD COLUMN salary DECIMAL(10,2);
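To see how the four kinds of statement above behave, they can be run against a throwaway database. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in (an assumption made for convenience; it is not Databricks SQL, and type handling differs), and the comments label each statement as DDL or DML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: defines a new object (the table and its columns).
cur.execute("CREATE TABLE employees (id INT, name STRING)")

# DML: changes rows, not the table definition.
cur.execute("INSERT INTO employees (id, name) VALUES (1, 'Alice')")

# DDL: changes the definition of an existing table.
cur.execute("ALTER TABLE employees ADD COLUMN salary DECIMAL(10,2)")

# Inspect the resulting columns and rows.
columns = [row[1] for row in cur.execute("PRAGMA table_info(employees)")]
rows = cur.execute("SELECT id, name, salary FROM employees").fetchall()

# DDL: removes the table definition together with its rows.
cur.execute("DROP TABLE employees")
remaining = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```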

Question 7

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group team.

Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

Options:

A.

GRANT VIEW ON CATALOG customers TO team;

B.

GRANT CREATE ON DATABASE customers TO team;

C.

GRANT USAGE ON CATALOG team TO customers;

D.

GRANT CREATE ON DATABASE team TO customers;

E.

GRANT USAGE ON DATABASE customers TO team;

Question 8

Identify a scenario to use an external table.

A Data Engineer needs to create a parquet bronze table and wants to ensure that it gets stored in a specific path in an external location.

Which table can be created in this scenario?

Options:

A.

An external table where the location points to a specific path in an external location.

B.

An external table where the schema has a managed location pointing to a specific path in an external location.

C.

A managed table where the catalog has a managed location pointing to a specific path in an external location.

D.

A managed table where the location points to a specific path in an external location.

Question 9

A Databricks workflow fails at the last stage due to an error in a notebook. This workflow runs daily. The data engineer fixes the mistake and wants to rerun the pipeline. This workflow is very costly and time-intensive to run.

Which action should the data engineer take in order to minimise downtime and cost?

Options:

A.

Switch to another cluster

B.

Repair run

C.

Re-run the entire workflow

D.

Restart the cluster

Question 10

Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

Options:

A.

Parquet files can be partitioned

B.

CREATE TABLE AS SELECT statements cannot be used on files

C.

Parquet files have a well-defined schema

D.

Parquet files have the ability to be optimized

E.

Parquet files will become Delta tables

Question 11

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which location can the data engineer review their permissions on the table?

Options:

A.

Jobs

B.

Dashboards

C.

Catalog Explorer

D.

Repos

Question 12

Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?

Options:

A.

Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.

B.

Scheduled Workflows process data as it arrives at configured sources.

C.

Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.

D.

Scheduled Workflows run continuously until manually stopped.

Question 13

A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following commands could the data engineering team use to access sales in PySpark?

Options:

A.

SELECT * FROM sales

B.

There is no way to share data between PySpark and SQL.

C.

spark.sql("sales")

D.

spark.delta.table("sales")

E.

spark.table("sales")

Question 14

Which of the following describes a scenario in which a data team will want to utilize cluster pools?

Options:

A.

An automated report needs to be refreshed as quickly as possible.

B.

An automated report needs to be made reproducible.

C.

An automated report needs to be tested to identify errors.

D.

An automated report needs to be version-controlled across multiple collaborators.

E.

An automated report needs to be runnable by all stakeholders.

Question 15

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

Options:

A.

Records that violate the expectation cause the job to fail.

B.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

C.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

D.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
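As background, Delta Live Tables expectations support three violation policies: keep the invalid rows and only record them, drop them (ON VIOLATION DROP ROW), or fail the update (ON VIOLATION FAIL UPDATE). The plain-Python sketch below only imitates those three policies on a list of dictionaries; it is not DLT code, and `apply_expectation`, `UpdateFailed`, and the sample rows are hypothetical names and data chosen for illustration.

```python
from datetime import datetime


class UpdateFailed(Exception):
    """Raised by this sketch when a FAIL UPDATE-style expectation is violated."""


def apply_expectation(rows, predicate, on_violation=None):
    """Return (written_rows, violation_count) under one of three policies.

    on_violation=None          -> keep invalid rows, only count them
    on_violation="DROP ROW"    -> drop invalid rows, count them
    on_violation="FAIL UPDATE" -> raise, so nothing is written
    """
    invalid = [row for row in rows if not predicate(row)]
    if invalid and on_violation == "FAIL UPDATE":
        raise UpdateFailed(f"{len(invalid)} row(s) violated the expectation")
    if on_violation == "DROP ROW":
        return [row for row in rows if predicate(row)], len(invalid)
    return list(rows), len(invalid)


def valid_timestamp(row):
    # Mirrors the constraint in the question: timestamp > '2020-01-01'.
    return row["timestamp"] > datetime(2020, 1, 1)


batch = [
    {"id": 1, "timestamp": datetime(2021, 6, 1)},
    {"id": 2, "timestamp": datetime(2019, 12, 31)},
]

kept, kept_violations = apply_expectation(batch, valid_timestamp)
dropped, dropped_violations = apply_expectation(batch, valid_timestamp, "DROP ROW")
try:
    apply_expectation(batch, valid_timestamp, "FAIL UPDATE")
    failed = False
except UpdateFailed:
    failed = True
```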

Question 16

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

Options:

A.

They can clone the existing task in the existing Job and update it to run the new notebook.

B.

They can create a new task in the existing Job and then add it as a dependency of the original task.

C.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

D.

They can create a new job from scratch and add both tasks to run concurrently.

E.

They can clone the existing task to a new Job and then edit it to run the new notebook.
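The notion of one task being a dependency of another can be illustrated with a small dependency graph. The sketch below uses Python's standard `graphlib` module rather than any Databricks Jobs API, and the task names `original_task` and `new_notebook_task` are hypothetical; it simply shows that a task listed as a dependency is ordered before the task that depends on it.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (task names are hypothetical).
dependencies = {
    "original_task": {"new_notebook_task"},
    "new_notebook_task": set(),
}

# static_order() yields tasks so that every dependency comes before its dependent.
run_order = list(TopologicalSorter(dependencies).static_order())
```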

Question 17

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

Options:

A.

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

B.

They can turn on the Auto Stop feature for the SQL endpoint.

C.

They can increase the cluster size of the SQL endpoint.

D.

They can turn on the Serverless feature for the SQL endpoint.

E.

They can increase the maximum bound of the SQL endpoint's scaling range.

Question 18

A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data.

They run the following command:

DROP TABLE IF EXISTS my_table

While the object no longer appears when they run SHOW TABLES, the data files still exist.

Which of the following describes why the data files still exist and the metadata files were deleted?

Options:

A.

The table’s data was larger than 10 GB

B.

The table’s data was smaller than 10 GB

C.

The table was external

D.

The table did not have a location

E.

The table was managed

Question 19

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

Options:

A.

They can use endpoints available in Databricks SQL

B.

They can use jobs clusters instead of all-purpose clusters

C.

They can configure the clusters to be single-node

D.

They can use clusters that are from a cluster pool

E.

They can configure the clusters to autoscale for larger data sizes

Question 20

A data engineer is developing an ETL process based on Spark SQL. The execution fails. The data engineer checks the Spark UI and can see the errors as follows:

[Screenshot of the Spark UI errors not reproduced in this text]

Which two corrective actions should the data engineer perform to resolve this issue?

Choose 2 answers

Options:

A.

Upsize the worker nodes and activate autoshuffle partitions

B.

Upsize the driver node and deactivate autoshuffle partitions

C.

Cache the dataset in order to boost the query performance

D.

Fix the shuffle partitions to 50 to ensure the allocation

E.

Narrow the filters in order to collect less data in the query

Question 21

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

Options:

A.

The table was managed

B.

The table's data was smaller than 10 GB

C.

The table's data was larger than 10 GB

D.

The table was external

E.

The table did not have a location

Question 22

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Options:

A.

Databricks Filesystem

B.

Jobs

C.

Dashboards

D.

Repos

E.

Data Explorer

Question 23

A data engineer needs to process SQL queries on a large dataset with fluctuating workloads. The workload requires automatic scaling based on the volume of queries, without the need to manage or provision infrastructure. The solution should be cost-efficient and charge only for the compute resources used during query execution.

Which compute option should the data engineer use?

Options:

A.

Databricks SQL Analytics

B.

Databricks Jobs

C.

Databricks Runtime for ML

D.

Serverless SQL Warehouse

Question 24

A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location.

Which of the following data entities should the data engineer create?

Options:

A.

Database

B.

Function

C.

View

D.

Temporary view

E.

Table

Question 25

Which TWO items are characteristics of the Gold Layer?

Choose 2 answers

Options:

A.

Read-optimized

B.

Normalised

C.

Raw Data

D.

Historical lineage

E.

De-normalised

Question 26

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

Options:

A.

They can set up separate expectations for each table when developing their DLT pipeline.

B.

They cannot determine which table is dropping the records.

C.

They can set up DLT to notify them via email when records are dropped.

D.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

E.

They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.

Question 27

Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

Options:

A.

B.

C.

D.

E.

Question 28

A data engineer is getting a partner organization up to speed with its Databricks account. Both teams share some business use cases. The data engineer has to share some Unity Catalog-managed Delta tables, and the notebook jobs that create those tables, with the partner organization.

How can the data engineer seamlessly share the required information?

Options:

A.

Zip all the code, share it via email, and allow data ingestion from the data lake.

B.

Data and Notebooks can be shared simply using Unity Catalog.

C.

Share access to the codebase via GitHub and allow them to ingest datasets from the data lake.

D.

Share required datasets and notebooks via Delta Sharing. Manage permissions via Unity Catalog.

Question 29

A new data engineering team, with the group name team, has been assigned to an ELT project. The new data engineering team will need full privileges on the table sales to fully manage the project.

Which of the following commands can be used to grant full permissions on the table to the new data engineering team?

Options:

A.

GRANT ALL PRIVILEGES ON TABLE sales TO team;

B.

GRANT SELECT CREATE MODIFY ON TABLE sales TO team;

C.

GRANT SELECT ON TABLE sales TO team;

D.

GRANT USAGE ON TABLE sales TO team;

E.

GRANT ALL PRIVILEGES ON TABLE team TO sales;

Question 30

A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database. They run the following command:

CREATE TABLE jdbc_customer360
USING ____
OPTIONS (
  url "jdbc:sqlite:/customers.db",
  dbtable "customer360"
)

Which line of code fills in the above blank to successfully complete the task?

Options:

A.

autoloader

B.

org.apache.spark.sql.jdbc

C.

sqlite

D.

org.apache.spark.sql.sqlite

Question 31

A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.

What is the expected outcome after clicking Start to update the pipeline assuming previously unprocessed data exists and all definitions are valid?

Options:

A.

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

B.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

C.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

D.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

Question 32

Which file format is used for storing a Delta Lake table?

Options:

A.

Parquet

B.

Delta

C.

CSV

D.

JSON

Question 33

A departing platform owner currently holds ownership of multiple catalogs and controls storage credentials and external locations. The data engineer wants to ensure continuity: transfer catalog ownership to the platform team group, delegate ongoing privilege management, and retain the ability to receive and share data via Delta Sharing.

Which role must be in place to perform these actions across the metastore?

Options:

A.

Account Admin, because account admins can only create metastores but cannot change ownership of catalogs.

B.

Workspace Admin, because workspace admins can transfer ownership of any Unity Catalog object.

C.

Metastore Admin, because metastore admins can transfer ownership and manage privileges across all metastore objects, including shares and recipients.

D.

Catalog Owner, because catalog owners can transfer any object in any catalog in the metastore.

Question 34

A data engineer is working in a Python notebook on Databricks to process data, but notices that the output is not as expected. The data engineer wants to investigate the issue by stepping through the code and checking the values of certain variables during execution.

Which tool should the data engineer use to inspect the code execution and variables in real-time?

Options:

A.

Python Notebook Interactive Debugger

B.

Cluster Logs

C.

SQL Analytics

D.

Job Execution Dashboard

Question 35

A data engineer needs to ingest from both streaming and batch sources for a firm that relies on highly accurate data. Occasionally, some of the data picked up by the sensors that provide a streaming input are outside the expected parameters. If this occurs, the data must be dropped, but the stream should not fail.

Which feature of Delta Live Tables meets this requirement?

Options:

A.

Monitoring

B.

Change Data Capture

C.

Expectations

D.

Error Handling

Question 36

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

Options:

A.

They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.

B.

They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.

C.

They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.

D.

They can schedule the query to run every 1 day from the Jobs UI.

E.

They can schedule the query to run every 12 hours from the Jobs UI.

Question 37

A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.

Which of the following explains why the data files are no longer present?

Options:

A.

The VACUUM command was run on the table

B.

The TIME TRAVEL command was run on the table

C.

The DELETE HISTORY command was run on the table

D.

The OPTIMIZE command was run on the table

E.

The HISTORY command was run on the table

Question 38

A data engineer is maintaining ETL pipeline code in a GitHub repository linked to their Databricks account. The data engineer wants to deploy the ETL pipeline to production as a Databricks workflow.

Which approach should the data engineer use?

Options:

A.

Databricks Asset Bundles (DAB) + GitHub Integration

B.

Maintain workflow_config.json and deploy it using Databricks CLI

C.

Manually create and manage the workflow in the UI

D.

Maintain workflow_config.json and deploy it using Terraform

Question 39

A data engineer needs to determine whether to use the built-in Databricks Notebooks versioning or version their project using Databricks Repos.

Which of the following is an advantage of using Databricks Repos over the Databricks Notebooks versioning?

Options:

A.

Databricks Repos automatically saves development progress

B.

Databricks Repos supports the use of multiple branches

C.

Databricks Repos allows users to revert to previous versions of a notebook

D.

Databricks Repos provides the ability to comment on specific changes

E.

Databricks Repos is wholly housed within the Databricks Lakehouse Platform

Question 40

Which of the following describes the type of workloads that are always compatible with Auto Loader?

Options:

A.

Dashboard workloads

B.

Streaming workloads

C.

Machine learning workloads

D.

Serverless workloads

E.

Batch workloads

Question 41

A data engineer wants to create a new table containing the names of customers that live in France.

They have written the following command:

[Image of the command not reproduced in this text]

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (PII).

Which of the following lines of code fills in the above blank to successfully complete the task?

Options:

A.

There is no way to indicate whether a table contains PII.

B.

"COMMENT PII"

C.

TBLPROPERTIES PII

D.

COMMENT "Contains PII"

E.

PII

Question 42

A data engineer manages multiple external tables linked to various data sources. The data engineer wants to manage these external tables efficiently and ensure that only the necessary permissions are granted to users for accessing specific external tables.

How should the data engineer manage access to these external tables?

Options:

A.

Create a single user role with full access to all external tables and assign it to all users.

B.

Use Unity Catalog to manage access controls and permissions for each external table individually.

C.

Set up Azure Blob Storage permissions at the container level, allowing access to all external tables.

D.

Grant permissions on the Databricks workspace level, which will automatically apply to all external tables.

Question 43

A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

Options:

A.

They can turn on the Auto Stop feature for the SQL endpoint.

B.

They can ensure the dashboard's SQL endpoint is not one of the included queries' SQL endpoints.

C.

They can reduce the cluster size of the SQL endpoint.

D.

They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.

E.

They can set up the dashboard's SQL endpoint to be serverless.

Question 44

A data engineer is attempting to write Python and SQL in the same command cell and is running into an error. The engineer thought that it was possible to use a Python variable in a SELECT statement.

Why does the command fail?

Options:

A.

Databricks supports multiple languages but only one per notebook.

B.

Databricks supports language interoperability in the same cell but only between Scala and SQL

C.

Databricks supports language interoperability but only if a special character is used.

D.

Databricks supports one language per cell.

Question 45

A global retail company sells products across multiple categories (e.g., Electronics, Clothing) and regions (e.g., North, South, East, West). The sales team has provided the data engineer with a PySpark DataFrame named sales_df, shown below, and wants the data engineer to analyze the sales data to help them make strategic decisions.

[Image of sales_df not reproduced in this text]

Options:

A.

Category_sales = sales_df.groupBy("category").agg(sum("sales_amount").alias("total_sales_amount"))

B.

Category_sales = sales_df.sum("sales_amount").groupBy("category").alias("total_sales_amount")

C.

Category_sales = sales_df.agg(sum("sales_amount").groupBy("category").alias("total_sales_amount"))

D.

Category_sales = sales_df.groupBy("region").agg(sum("sales_amount").alias("total_sales_amount"))

Question 46

What is the structure of an Asset Bundle?

Options:

A.

A single plain text file enumerating the names of assets to be migrated to a new workspace.

B.

A compressed archive (ZIP) that solely contains workspace assets without any accompanying metadata.

C.

A YAML configuration file that specifies the artifacts, resources, and configurations for the project.

D.

A Docker image containing runtime environments and the source code of the assets

Question 47

Which of the following commands will return the location of database customer360?

Options:

A.

DESCRIBE LOCATION customer360;

B.

DROP DATABASE customer360;

C.

DESCRIBE DATABASE customer360;

D.

ALTER DATABASE customer360 SET DBPROPERTIES ('location' = '/user');

E.

USE DATABASE customer360;