Once a month, a company receives a 100 MB .csv file compressed with gzip. The file contains 50,000 property listing records and is stored in Amazon S3 Glacier. The company needs its data analyst to query a subset of the data for a specific vendor.
What is the MOST cost-effective solution?
A. Load the data into Amazon S3 and query it with Amazon S3 Select.
B. Query the data from Amazon S3 Glacier directly with Amazon Glacier Select.
C. Load the data to Amazon S3 and query it with Amazon Athena.
D. Load the data to Amazon S3 and query it with Amazon Redshift Spectrum.
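For reference, S3 Select and Amazon Glacier Select accept the same style of SQL expression, so the query itself looks similar under options A and B. A minimal boto3 sketch of an S3 Select query against the gzip-compressed CSV, assuming hypothetical bucket, key, and column names:

```python
import boto3

s3 = boto3.client("s3")

# Run a SQL expression server-side against a gzip-compressed CSV object.
# Only matching rows are returned, so the client never downloads the
# full 100 MB file.
response = s3.select_object_content(
    Bucket="listings-bucket",          # hypothetical bucket
    Key="listings/2024-06.csv.gz",     # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.vendor_id = 'VENDOR123'",
    InputSerialization={
        "CSV": {"FileHeaderInfo": "USE"},
        "CompressionType": "GZIP",
    },
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream of Records payloads.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```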
A hospital uses wearable medical sensor devices to collect data from patients. The hospital is architecting a near-real-time solution that can ingest the data securely at scale. The solution should also be able to remove the patient's protected health information (PHI) from the streaming data and store the data in durable storage.
Which solution meets these requirements with the LEAST operational overhead?
A. Ingest the data using Amazon Kinesis Data Streams, which invokes an AWS Lambda function using the Kinesis Client Library (KCL) to remove all PHI. Write the data to Amazon S3.
B. Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Have Amazon S3 trigger an AWS Lambda function that parses the sensor data to remove all PHI in Amazon S3.
C. Ingest the data using Amazon Kinesis Data Streams to write the data to Amazon S3. Have the data stream launch an AWS Lambda function that parses the sensor data and removes all PHI in Amazon S3.
D. Ingest the data using Amazon Kinesis Data Firehose to write the data to Amazon S3. Implement a transformation AWS Lambda function that parses the sensor data to remove all PHI.
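For context on the Firehose transformation pattern described in option D, here is a minimal sketch of a transformation Lambda handler; the JSON record layout and PHI field names are hypothetical:

```python
import base64
import json

# Hypothetical PHI attributes to strip from each sensor record.
PHI_FIELDS = {"patient_name", "ssn", "date_of_birth"}

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Remove PHI attributes before Firehose delivers the record to S3.
        for field in PHI_FIELDS:
            payload.pop(field, None)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Newline-delimited JSON keeps the delivered S3 objects parseable.
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```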
A company is building a service to monitor fleets of vehicles. The company collects IoT data from a device in each vehicle and loads the data into Amazon Redshift in near-real time. Fleet owners upload .csv files containing vehicle reference data to Amazon S3 at different times throughout the day. A nightly process loads the vehicle reference data from Amazon S3 into Amazon Redshift. The company joins the IoT data from the devices with the vehicle reference data to power reporting and dashboards. Fleet owners are frustrated by waiting a day for the dashboards to update.
Which solution would provide the SHORTEST delay between uploading reference data to Amazon S3 and the change showing up in the owners' dashboards?
A. Use S3 event notifications to trigger an AWS Lambda function to copy the vehicle reference data into Amazon Redshift immediately when the reference data is uploaded to Amazon S3.
B. Create and schedule an AWS Glue Spark job to run every 5 minutes. The job inserts reference data into Amazon Redshift.
C. Send reference data to Amazon Kinesis Data Streams. Configure the Kinesis data stream to directly load the reference data into Amazon Redshift in real time.
D. Send the reference data to an Amazon Kinesis Data Firehose delivery stream. Configure Kinesis with a buffer interval of 60 seconds and to directly load the data into Amazon Redshift.
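The event-driven path in option A can be sketched with the Amazon Redshift Data API. Cluster, database, table, and IAM role identifiers below are hypothetical:

```python
import boto3

redshift_data = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated event, so the COPY runs within
    # seconds of an upload instead of waiting for a nightly batch.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # may need URL-decoding
        redshift_data.execute_statement(
            ClusterIdentifier="fleet-cluster",   # hypothetical cluster
            Database="fleet",                    # hypothetical database
            DbUser="loader",                     # hypothetical user
            Sql=(
                f"COPY vehicle_reference FROM 's3://{bucket}/{key}' "
                "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                "FORMAT AS CSV IGNOREHEADER 1;"
            ),
        )
```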
A company is migrating from an on-premises Apache Hadoop cluster to an Amazon EMR cluster. The cluster runs only during business hours. Due to a company requirement to avoid intraday cluster failures, the EMR cluster must be highly available. When the cluster is terminated at the end of each business day, the data must persist.
Which configurations would enable the EMR cluster to meet these requirements? (Choose three.)
A. EMR File System (EMRFS) for storage
B. Hadoop Distributed File System (HDFS) for storage
C. AWS Glue Data Catalog as the metastore for Apache Hive
D. MySQL database on the master node as the metastore for Apache Hive
E. Multiple master nodes in a single Availability Zone
F. Multiple master nodes in multiple Availability Zones
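A minimal boto3 sketch showing how several of these configurations appear together in a cluster definition; instance types, counts, and the release label are illustrative. Note that an EMR multi-master cluster runs its master nodes within a single Availability Zone:

```python
import boto3

emr = boto3.client("emr")

# Three master nodes provide intraday high availability, the AWS Glue
# Data Catalog serves as an external Hive metastore, and job data is
# read and written through EMRFS on S3 so it survives cluster
# termination at the end of the day.
emr.run_job_flow(
    Name="business-hours-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 3},   # multi-master for high availability
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            # Point Hive at the Glue Data Catalog instead of a local
            # MySQL metastore on the master node.
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```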
An operations team notices that a few AWS Glue jobs for a given ETL application are failing. The AWS Glue jobs read a large number of small JSON files from an Amazon S3 bucket and write the data to a different S3 bucket in Apache Parquet format with no major transformations. Upon initial investigation, a data engineer notices the following error message in the History tab on the AWS Glue console: "Command Failed with Exit Code 1."
Upon further investigation, the data engineer notices that the driver memory profile of the failed jobs crosses the safe threshold of 50% usage quickly and reaches 90–95% soon after. The average memory usage across all executors continues to be less than 4%.
The data engineer also notices a related error while examining the associated Amazon CloudWatch Logs.
What should the data engineer do to solve the failure in the MOST cost-effective way?
A. Change the worker type from Standard to G.2X.
B. Modify the AWS Glue ETL code to use the `'groupFiles': 'inPartition'` feature.
C. Increase the fetch size setting by using an AWS Glue dynamic frame.
D. Modify maximum capacity to increase the total maximum data processing units (DPUs) used.
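For context on option B, file grouping is set through the connection options of a dynamic frame read. Grouping many small files into larger input partitions keeps per-file state off the driver, which is what exhausts driver memory in this failure pattern. A minimal sketch, with hypothetical S3 paths and an illustrative group size:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the small JSON files, coalescing them into ~128 MB groups
# within each partition.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://source-bucket/json/"],   # hypothetical path
        "groupFiles": "inPartition",
        "groupSize": "134217728",                # ~128 MB per group
    },
    format="json",
)

# Write the grouped data out as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://target-bucket/parquet/"},
    format="parquet",
)
```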
A company receives data from its vendor in JSON format with a timestamp in the file name. The vendor uploads the data to an Amazon S3 bucket, and the data is registered into the company's data lake for analysis and reporting. The company has configured an S3 Lifecycle policy to archive all files to S3 Glacier after 5 days.
The company wants to ensure that its AWS Glue crawler catalogs data only from S3 Standard storage and ignores the archived files. A data analytics specialist must implement a solution to achieve this goal without changing the current S3 bucket configuration.
Which solution meets these requirements?
A. Use the exclude patterns feature of AWS Glue to identify the S3 Glacier files for the crawler to exclude.
B. Schedule an automation job that uses AWS Lambda to move files from the original S3 bucket to a new S3 bucket for S3 Glacier storage.
C. Use the excludeStorageClasses property in the AWS Glue Data Catalog table to exclude files on S3 Glacier storage.
D. Use the include patterns feature of AWS Glue to identify the S3 Standard files for the crawler to include.
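For context, exclude and include patterns (options A and D) are glob expressions matched against object paths, while excludeStorageClasses (option C) is evaluated when data is read through the Data Catalog by a Glue ETL job. A minimal sketch of the storage-class exclusion as documented for Glue ETL reads, with hypothetical database and table names:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Skip objects that have transitioned to the listed storage classes
# when reading the catalog table.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="vendor_data",     # hypothetical database
    table_name="listings",      # hypothetical table
    additional_options={
        "excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"]
    },
)
```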
An advertising company has a data lake that is built on Amazon S3. The company uses the AWS Glue Data Catalog to maintain the metadata. The data lake is several years old, and its overall size has increased exponentially as additional data sources and metadata are stored in the data lake. The data lake administrator wants to implement a mechanism to simplify permissions management between Amazon S3 and the Data Catalog to keep them in sync.
Which solution will simplify permissions management with minimal development effort?
A. Set AWS Identity and Access Management (IAM) permissions for AWS Glue
B. Use AWS Lake Formation permissions
C. Manage AWS Glue and S3 permissions by using bucket policies
D. Use Amazon Cognito user pools
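A minimal sketch of the Lake Formation permission model referenced in option B: a single grant governs access to the Data Catalog table and the underlying S3 data that Lake Formation registers, in place of separate IAM and bucket-policy management. The principal ARN, database, and table names are hypothetical:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# One grant covers both the catalog metadata and the registered
# S3 locations behind the table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/AnalystRole"  # hypothetical
    },
    Resource={
        "Table": {
            "DatabaseName": "ads_datalake",   # hypothetical database
            "Name": "impressions",            # hypothetical table
        }
    },
    Permissions=["SELECT"],
)
```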
A workforce management company has built business intelligence dashboards in Amazon QuickSight Enterprise edition to help the company understand staffing behavior for its customers. The company stores data for these dashboards in Amazon S3 and performs queries by using Amazon Athena. The company has millions of records from many years of data.
A data analytics specialist observes sudden changes in overall staffing revenue in one of the dashboards. The company thinks that these sudden changes are because of outliers in data for some customers. The data analytics specialist must implement a solution to explain and identify the primary reasons for these changes.
Which solution will meet these requirements with the LEAST development effort?
A. Add a box plot visual type in QuickSight to compare staffing revenue by customer.
B. Use the anomaly detection and contribution analysis feature in QuickSight.
C. Create a custom SQL script in Athena. Invoke an anomaly detection machine learning model.
D. Use S3 analytics storage class analysis to detect anomalies for data written to Amazon S3.
A company hosts a large data warehouse on Amazon Redshift. A business intelligence (BI) team requires access to tables in schemas A and B. However, the BI team must not have access to tables in schema C.
Members of the BI team connect to the Redshift cluster through a client that uses a JDBC connector. A data analytics specialist needs to set up access for these users.
Which combination of steps will meet these requirements? (Choose two.)
A. Create an IAM user for each BI team member who requires access. Create an IAM group for these users.
B. Create a database user for each BI team member who requires access. Create a database user group for these users.
C. Create an IAM policy that grants read and write permissions for schemas A and B to the BI IAM group and denies read and write permissions for schema C to the BI IAM group. Attach the policy to the BI IAM group.
D. Use the GRANT command to grant access to the BI database user group for schemas A and B. Use the REVOKE command to block access for the BI database user group for schema C.
E. Specify the WITH MANAGED ACCESS parameter during the creation of schema C.
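A sketch of the database-level setup described in options B and D, issued here through the Amazon Redshift Data API so it can run from a script; all identifiers are hypothetical:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# USAGE on a schema plus SELECT on its tables grants read access for
# the group; revoking all privileges on schema C keeps it inaccessible.
redshift_data.batch_execute_statement(
    ClusterIdentifier="dw-cluster",   # hypothetical cluster
    Database="analytics",             # hypothetical database
    DbUser="admin",                   # hypothetical admin user
    Sqls=[
        "CREATE GROUP bi_group;",
        "GRANT USAGE ON SCHEMA a TO GROUP bi_group;",
        "GRANT SELECT ON ALL TABLES IN SCHEMA a TO GROUP bi_group;",
        "GRANT USAGE ON SCHEMA b TO GROUP bi_group;",
        "GRANT SELECT ON ALL TABLES IN SCHEMA b TO GROUP bi_group;",
        "REVOKE ALL ON SCHEMA c FROM GROUP bi_group;",
    ],
)
```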
A marketing company collects clickstream data. The company sends the data to Amazon Kinesis Data Firehose and stores the data in Amazon S3. The company wants to build a series of dashboards that will be used by hundreds of users across different departments. The company will use Amazon QuickSight to develop these dashboards. The company has limited resources and wants a solution that can scale and provide daily updates about clickstream activity.
Which combination of options will provide the MOST cost-effective solution? (Choose two.)
A. Use Amazon Redshift to store and query the clickstream data.
B. Use QuickSight with a direct SQL query.
C. Use Amazon Athena to query the clickstream data in Amazon S3.
D. Use S3 analytics to query the clickstream data.
E. Use the QuickSight SPICE engine with a daily refresh.
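For context on option E, a SPICE dataset refresh can be triggered programmatically in addition to the scheduled refresh configured in QuickSight. A minimal boto3 sketch with hypothetical account and dataset IDs:

```python
import boto3

quicksight = boto3.client("quicksight")

# Kick off a SPICE ingestion for an Athena-backed dataset; a daily
# schedule can also run this from a scheduled job or the console.
quicksight.create_ingestion(
    AwsAccountId="123456789012",                 # hypothetical account
    DataSetId="clickstream-dataset-id",          # hypothetical dataset
    IngestionId="daily-refresh-2024-06-01",      # unique per refresh
)
```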