AWS Glue Interview Questions and Answers
AWS Glue Overview
AWS Glue is a fully managed, serverless data integration service designed to simplify the process of extracting, transforming, and loading (ETL) data for analytics, machine learning, and application development. It offers both visual and code-based interfaces, enabling users to integrate data with ease. AWS Glue automatically discovers and catalogs metadata about your data stores into a centralized Data Catalog, which can be queried by services like Amazon Athena and Amazon Redshift Spectrum.
Q: Can you explain the components of AWS Glue?
AWS Glue comprises several key components:
- Data Catalog: A centralized repository for storing structural and operational metadata about your data assets. It serves as a persistent and searchable metadata store for data discovery and ETL processes.
- Crawler: A tool that connects to data sources, applies classifiers to infer schemas, and creates metadata tables in the Data Catalog. Crawlers keep the Data Catalog updated by periodically running and detecting changes in the data.
- ETL Jobs: Scripts, either automatically generated by AWS Glue or custom-written, used to transform, flatten, and enrich data in various formats across data stores.
- Triggers: Mechanisms to start ETL jobs or Crawlers based on conditions. Triggers can be time-based, event-based, or on-demand.
- Development Endpoint: An interactive environment for developing and testing ETL scripts before deploying them.
Q: How does AWS Glue handle schema evolution?
AWS Glue helps manage schema evolution automatically. When a Crawler detects schema changes in a source, it updates the table definition in the Data Catalog according to its configured schema-change policy. Because Glue ETL jobs built on DynamicFrames infer schemas at runtime, data processing workflows can keep functioning without manual intervention even as data structures evolve.
Q: What are AWS Glue Crawlers and what do they do?
AWS Glue Crawlers automatically populate the Data Catalog with metadata tables by analyzing the schema of connected data sources. Crawlers:
- Connect to source or target data stores.
- Classify data formats and infer schemas.
- Create metadata tables in the Data Catalog.
- Can be scheduled to run periodically to keep the catalog up-to-date with changes in data stores.
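As a rough sketch of how this looks in practice, a Crawler can be created and started with the boto3 Glue client; the crawler name, database, S3 path, and IAM role below are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans an S3 prefix and writes tables into sales_db.
glue.create_crawler(
    Name="raw-orders-crawler",                    # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/orders/"}]},
    Schedule="cron(0 2 * * ? *)",                 # daily at 02:00 UTC
)

glue.start_crawler(Name="raw-orders-crawler")     # or run on demand
```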
Q: What is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a fully managed, centralized repository for metadata about your data assets. It integrates with services such as Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, providing a unified view of all data assets. The catalog simplifies data discovery and ETL processes across data silos.
Q: Explain the types of triggers in AWS Glue.
AWS Glue supports three types of triggers:
- Schedule-based Triggers: Start jobs at predefined times using a cron-like syntax.
- Event-based Triggers: Activate jobs in response to specific events, such as the completion of another job.
- On-demand Triggers: Allow users to manually start jobs whenever needed.
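For illustration, here is a hedged boto3 sketch of a schedule-based and an event-based (conditional) trigger; the job and trigger names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Time-based: run extract_job every day at 03:00 UTC.
glue.create_trigger(
    Name="daily-extract",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "extract_job"}],
    StartOnCreation=True,
)

# Event-based: run transform_job only after extract_job succeeds.
glue.create_trigger(
    Name="after-extract",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "extract_job",
            "State": "SUCCEEDED",
        }]
    },
    Actions=[{"JobName": "transform_job"}],
    StartOnCreation=True,
)
```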
Q: What is a Job Bookmark in AWS Glue?
A Job Bookmark is a feature in AWS Glue that tracks the state of data processing. It prevents reprocessing of previously processed data during subsequent ETL job runs. This is particularly beneficial for incremental data loads, ensuring that only new or updated data is processed.
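A minimal sketch of a bookmark-aware job script follows (database, table, and path names are placeholders); bookmarks must also be enabled on the job itself via the --job-bookmark-option job-bookmark-enable argument:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)          # loads bookmark state for this job

# transformation_ctx gives the bookmark a key for tracking what was read.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    transformation_ctx="orders_src",
)

glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/orders/"},
    format="parquet",
    transformation_ctx="orders_sink",
)

job.commit()                              # persists the updated bookmark
```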
Q: How does AWS Glue integrate with other AWS services?
AWS Glue seamlessly integrates with various AWS services to extend its capabilities:
- Amazon S3: Serves as both a data source and target for ETL jobs.
- AWS Lambda: Can trigger ETL jobs in response to specific events.
- Amazon Redshift: Glue can transform data and load the results directly into Redshift tables.
- Amazon Athena: Leverages the Data Catalog as a schema repository for querying data.
- Amazon RDS and Amazon DynamoDB: Act as sources or targets for ETL jobs.
AWS Glue’s interoperability with other AWS services makes it a versatile tool for comprehensive data integration workflows.
Q: What are the main use cases for AWS Glue?
A: AWS Glue is commonly used for the following scenarios:
- Data Preparation for Analytics: Transform and load raw data into data lakes, data warehouses, or analytical tools like Amazon Redshift.
- Data Integration: Combine data from multiple sources for applications or machine learning models.
- ETL Automation: Automate data pipelines with minimal manual intervention using triggers and crawlers.
- Schema Management: Manage and evolve schemas across data silos.
- Data Cataloging: Centralize metadata for easy data discovery and governance.
Q: How does AWS Glue work with data lakes?
A: AWS Glue integrates seamlessly with data lakes by enabling:
- Schema Inference: Crawlers infer the schema of data stored in Amazon S3.
- Metadata Management: Populates the AWS Glue Data Catalog with metadata about files in the data lake.
- ETL Pipelines: Transforms raw data into a structured format and stores it back in the data lake.
- Data Queries: Use services like Amazon Athena to query data using the catalog.
Q: What programming languages does AWS Glue support?
A: AWS Glue supports Python and Scala for developing ETL scripts. Users can write custom scripts or use the AWS Glue Studio visual interface for code-free development.
Q: What is AWS Glue Studio, and how is it different from standard AWS Glue?
A: AWS Glue Studio is a graphical interface for creating, managing, and running ETL workflows. It provides a no-code or low-code approach, allowing users to design workflows visually, unlike the standard AWS Glue interface, which primarily relies on writing scripts.
Q: Can AWS Glue process streaming data?
A: Yes, AWS Glue supports processing streaming data through AWS Glue Streaming ETL jobs. These jobs can consume data from streaming sources like Amazon Kinesis Data Streams or Apache Kafka, process it in real time, and store the results in data lakes or data warehouses.
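As a hedged sketch (table and bucket names are placeholders), a streaming job typically reads a stream-backed catalog table and processes micro-batches with forEachBatch:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Catalog table backed by a Kinesis stream (placeholder names).
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(batch_df, batch_id):
    # Convert each micro-batch to a DynamicFrame and land it in S3.
    dyf = DynamicFrame.fromDF(batch_df, glueContext, "batch")
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/clickstream/"},
        format="parquet",
    )

glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/clickstream/",
    },
)
```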
Q: How does AWS Glue ensure security?
A: AWS Glue provides several security features:
- Encryption: Encrypt data at rest in Amazon S3 and in transit using SSL/TLS.
- IAM Roles: Use AWS Identity and Access Management (IAM) to define permissions for accessing data sources, targets, and Glue resources.
- VPC Endpoints: Run jobs within a VPC to ensure secure connectivity.
- Data Access Policies: Integrate with AWS Lake Formation to control access to data.
Q: What are DynamicFrames in AWS Glue?
A: A DynamicFrame is a distributed data collection in AWS Glue. It is similar to a DataFrame in Apache Spark but includes schema inference and other Glue-specific features. DynamicFrames are optimized for AWS Glue’s ETL processes and can easily be converted to and from Spark DataFrames.
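A short sketch of the round trip, assuming an existing glueContext and an orders DynamicFrame (as in the bookmark example above):

```python
from awsglue.dynamicframe import DynamicFrame

df = orders.toDF()                         # DynamicFrame -> Spark DataFrame
df_clean = df.filter(df["amount"] > 0)     # use any native Spark operation

# Spark DataFrame -> DynamicFrame for Glue-specific sinks and transforms.
orders_clean = DynamicFrame.fromDF(df_clean, glueContext, "orders_clean")
```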
Q: How does AWS Glue optimize ETL performance?
A: AWS Glue optimizes ETL performance through:
- Job Bookmarking: Avoid reprocessing previously processed data.
- Data Partitioning: Process data in parallel by leveraging partitioning in Amazon S3.
- Auto-scaling: Automatically scales resources based on the job’s needs.
- Pushdown Predicates: Filters data at the source to reduce the amount of data processed.
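For example, a pushdown predicate can prune partitions before any data is read; the database, table, and partition keys below are placeholders, and glueContext is assumed to be set up as in the earlier sketches:

```python
# Only partitions matching the predicate are listed and read from S3.
events = glueContext.create_dynamic_frame.from_catalog(
    database="logs_db",
    table_name="events",
    push_down_predicate="year == '2024' and month == '06'",
)
```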
Q: What are the limitations of AWS Glue?
- Limited Language Support: Only supports Python and Scala.
- Dependency on AWS Services: Works best within the AWS ecosystem.
- Resource Constraints: Complex transformations may require tuning and could face resource limitations.
- Learning Curve: May require familiarity with Spark for advanced use cases.
Q: How does AWS Glue handle data quality?
A: AWS Glue can handle data quality in several ways:
- Schema Validation: Crawlers detect and validate schema consistency.
- Transformations: Cleanse, standardize, and enrich data using ETL scripts.
- Custom Rules: Implement data quality checks using Python or Scala in ETL scripts.
Q: Can AWS Glue work with non-AWS data sources?
A: Yes, AWS Glue supports various data sources, including:
- On-premises databases via JDBC connections.
- Third-party applications like Salesforce and SAP.
- External data sources integrated through connectors or APIs.
Q: What is the difference between AWS Glue and AWS Lake Formation?
- AWS Glue: Primarily a data integration and ETL service, focusing on preparing and cataloging data for analytics.
- AWS Lake Formation: A service to build and manage secure data lakes, providing additional capabilities like fine-grained access control and automated workflows for ingesting and transforming data.
Q: What are the prerequisites for using AWS Glue?
- Data stored in supported sources like Amazon S3, Amazon RDS, or external databases.
- IAM roles with necessary permissions for accessing data and Glue resources.
- Basic understanding of ETL processes and data formats.
Q: How does AWS Glue pricing work?
A: AWS Glue pricing is based on the following:
- Data Processing Units (DPUs): Charged per DPU-hour for running jobs.
- Data Catalog: Charged based on the number of objects stored and requests made.
- Crawlers: Charged per DPU-hour for crawler runs.
Q: What is the difference between AWS Glue ETL Jobs and AWS Glue Crawlers?
- ETL Jobs: These are scripts (Python/Scala) that transform, clean, and enrich data. They are responsible for the actual ETL process of moving and processing data from source to target.
- Crawlers: These discover data in your sources, infer the schema, and populate the Data Catalog with metadata. Crawlers do not perform ETL but enable ETL jobs by creating metadata about the data.
Q: How does AWS Glue handle error handling and monitoring?
AWS Glue provides error handling and monitoring through:
- AWS CloudWatch: Logs job execution details, errors, and performance metrics.
- Error Logging: Writes detailed driver and executor logs to CloudWatch, with Spark event logs optionally persisted to a configured Amazon S3 path.
- Retry Mechanisms: Automatically retries failed tasks based on configurations.
- Job Metrics Dashboard: Displays real-time and historical metrics for monitoring jobs.
- Notifications: Use Amazon Simple Notification Service (SNS) to send alerts when jobs fail or succeed.
Q: What is the difference between Glue’s Data Catalog and traditional metadata stores?
- Glue Data Catalog: Fully managed, serverless, and tightly integrated with AWS services like Athena and Redshift Spectrum. It supports schema evolution and periodic updates via Crawlers.
- Traditional Metadata Stores: Often self-managed, requiring manual updates and integration with external tools.
Q: Can AWS Glue process semi-structured or unstructured data?
Yes. AWS Glue handles semi-structured data in formats such as:
- JSON
- Parquet
- Avro
- XML
- CSV
AWS Glue can infer the schema of semi-structured data using Crawlers, making it easier to process.
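As a small sketch (paths are placeholders, glueContext as in earlier examples), semi-structured JSON can also be read directly from S3 without a catalog table, with the schema inferred at runtime:

```python
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/json/"], "recurse": True},
    format="json",
)
raw.printSchema()   # inspect the inferred (possibly nested) schema
```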
Q: What is a Glue Partition Index, and why is it used?
A Partition Index in AWS Glue allows faster query performance by reducing the amount of data scanned. Instead of scanning all partitions, the index helps locate relevant partitions based on query criteria. This is especially useful for large datasets with many partitions.
Q: What are Glue Connectors, and how do they work?
Glue Connectors are pre-built integrations that allow AWS Glue to connect to various data sources, including third-party databases, APIs, and external services. For example:
- JDBC Connector: Enables integration with on-premises or cloud-hosted databases.
- Custom Connectors: Extend Glue functionality to support additional sources.
They can be obtained from AWS Marketplace or developed as custom connectors.
Q: What is the AWS Glue Schema Registry?
The AWS Glue Schema Registry is a feature that allows you to validate and evolve streaming data schemas in real time. It integrates with services like Apache Kafka, Amazon MSK, and Amazon Kinesis Data Streams. Benefits include:
- Schema Versioning: Tracks changes to schemas over time.
- Validation: Ensures compatibility between producer and consumer schemas.
- Cost Efficiency: Reduces storage and compute costs by serializing data efficiently.
Q: How does AWS Glue handle incremental data updates?
AWS Glue supports incremental data processing through:
- Job Bookmarks: Tracks already processed data to avoid duplication.
- Partitioning: Processes only newly added partitions in the data.
- Custom Logic: Incorporates custom filters and logic in ETL scripts to handle changes.
Q: What are AWS Glue Workflows?
AWS Glue Workflows enable users to create and orchestrate complex ETL pipelines by combining Crawlers, ETL jobs, and triggers into a sequence. Features include:
- Event-based Triggers: Automatically starts downstream jobs upon completion of upstream jobs.
- Graphical Representation: Visualize workflows in the AWS Glue console.
- Monitoring: Tracks the status of each step in the workflow.
Q: How do you debug AWS Glue ETL Jobs?
Debugging AWS Glue jobs involves:
- AWS CloudWatch Logs: Check logs for errors and performance metrics.
- Development Endpoints: Test and debug scripts interactively.
- Error Outputs: Review error files stored in Amazon S3.
- Script Validation: Use the Glue console to validate scripts before execution.
Q: What is Glue DataBrew, and how is it different from AWS Glue?
Glue DataBrew is a visual data preparation tool for cleaning and transforming data without writing code. It focuses on data preparation for analytics, offering over 250 built-in transformations.
- AWS Glue: Geared toward building and running ETL pipelines with code-first or visual interfaces.
- DataBrew: No-code solution for data cleaning, profiling, and enrichment.
Q: Can you schedule Glue ETL jobs? How?
Yes, Glue ETL jobs can be scheduled using:
- Time-based Triggers: Use cron expressions to specify when jobs should run.
- Event-based Triggers: Trigger jobs based on specific events, such as file uploads to Amazon S3.
- On-demand Execution: Manually start jobs when required.
Q: How does AWS Glue integrate with AWS Step Functions?
AWS Glue integrates with AWS Step Functions to orchestrate ETL workflows as part of larger serverless workflows. Step Functions can:
- Trigger Glue Crawlers, ETL jobs, or Workflows.
- Handle retries and errors automatically.
- Combine Glue operations with other AWS services in a single workflow.
Q: What is a Glue Job Script?
A Glue Job Script is the Python or Scala code that defines the transformations and data flow for an ETL job. It typically includes:
- Reading data from a source.
- Applying transformations.
- Writing data to a target.
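A hedged end-to-end sketch (all names are placeholders) showing those three steps, using the built-in ApplyMapping transform to rename and retype columns:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# 1. Read from a source registered in the Data Catalog.
src = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders", transformation_ctx="src")

# 2. Apply transformations: (source col, source type, target col, target type).
mapped = ApplyMapping.apply(
    frame=src,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amt", "string", "amount", "double"),
    ],
    transformation_ctx="mapped",
)

# 3. Write to a target.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
    transformation_ctx="sink",
)
job.commit()
```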
Q: How can Glue help in GDPR and data compliance?
AWS Glue can assist with compliance by:
- Data Cataloging: Centralizes metadata for governance and audits.
- Sensitive Data Handling: Identifies sensitive data using crawlers and classifiers.
- Data Masking and Encryption: Ensures that sensitive data is transformed or encrypted before storage.
Q: How do you optimize cost in AWS Glue?
Cost optimization strategies include:
- Efficient Job Design: Avoid unnecessary transformations and data scans.
- Partitioning: Use partitions to minimize data processed.
- Crawler Scheduling: Run Crawlers only when necessary.
- DPU Scaling: Adjust the number of DPUs based on job requirements.
Q: What are Glue Classifiers, and how are they used?
Glue Classifiers are used by Crawlers to recognize data formats and infer schemas. AWS Glue provides built-in classifiers (e.g., for JSON, CSV, Avro, and XML) and allows users to create custom classifiers using Grok patterns, XML tags, JSON paths, or CSV specifications. Crawlers evaluate classifiers in priority order to determine the best match for the data.
Q: What is AWS Glue’s relationship with Apache Spark?
AWS Glue is built on Apache Spark, a distributed processing framework. Glue jobs execute on a Spark runtime environment, which enables large-scale data processing. The underlying Spark engine allows Glue to perform parallel processing, making it highly efficient for ETL workflows.
Q: How does AWS Glue handle large-scale data transformations?
AWS Glue can handle large-scale data by leveraging:
- Distributed Processing: Spark processes data in parallel across multiple nodes.
- DynamicFrames: Automatically manages schema inference and transformation.
- Pushdown Predicates: Filters data at the source to reduce the amount of data processed.
- Partitioning: Allows Glue to process only relevant partitions of the data.
Q: What are the best practices for writing Glue ETL scripts?
- Leverage DynamicFrames: Use them for schema inference and flexibility.
- Optimize Transformations: Avoid complex transformations that increase execution time.
- Partition Data: Use partitions for efficient data processing.
- Use Pushdown Predicates: Filter data early to minimize data processed.
- Enable Job Bookmarks: Prevent reprocessing of already processed data.
Q: How does AWS Glue handle dependencies between jobs?
AWS Glue uses Triggers and Workflows to manage dependencies between jobs:
- Event-based Triggers: Start jobs based on the success or failure of other jobs.
- Workflows: Combine multiple jobs and triggers into a single orchestration pipeline, with dependencies visualized in the console.
Q: What is the difference between Glue DynamicFrames and Spark DataFrames?
- DynamicFrames: Glue-specific data structures that provide schema flexibility and include additional operations for ETL tasks like schema inference and evolution.
- DataFrames: Native to Apache Spark, they require a predefined schema and are optimized for transformations and queries.
- Conversion: Glue allows conversion between DynamicFrames and DataFrames.
Q: Can AWS Glue connect to on-premises data sources? How?
Yes, AWS Glue can connect to on-premises data sources using:
- JDBC Connections: Create a Glue connection to the on-premises database.
- AWS Direct Connect or VPN: Securely establish network connectivity.
- Glue Connector Libraries: Extend connectivity to non-standard data sources.
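A hedged sketch of a JDBC read from an on-premises PostgreSQL database reachable over Direct Connect or VPN; all connection details are placeholders, and in practice credentials belong in a Glue Connection or AWS Secrets Manager rather than in the script:

```python
# glueContext is assumed to be set up as in the earlier sketches.
customers = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://10.0.0.12:5432/crm",   # on-prem endpoint
        "dbtable": "public.customers",
        "user": "glue_reader",
        "password": "********",
    },
)
```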
Q: How does Glue manage version control for ETL scripts?
AWS Glue doesn’t have native version control for ETL scripts. However, you can manage versions externally by:
- Storing Scripts in Git Repositories: Use version control systems like Git.
- S3 Buckets: Maintain versioning in the S3 bucket storing the scripts.
- Workflow Backups: Periodically export workflows and job configurations.
Q: What is the role of Apache Hive in AWS Glue?
AWS Glue Data Catalog is compatible with the Apache Hive Metastore API. It enables tools like Hive, Presto, and Spark to use the Glue Data Catalog as a central metadata repository, ensuring seamless integration and query capabilities.
Q: What are the key features of Glue’s Schema Registry?
- Schema Versioning: Keeps track of schema changes over time.
- Compatibility Checks: Ensures producer and consumer compatibility.
- Integration: Works with streaming platforms like Kafka and Kinesis.
- Serialization Formats: Supports formats like Avro and JSON.
Q: How do Glue jobs handle retries on failure?
AWS Glue lets you configure a retry policy for each job by setting a maximum number of retries. If a job run fails, Glue automatically retries it up to that limit, reducing manual intervention. (Glue does not expose a configurable delay between retries; for backoff behavior, orchestrate jobs through AWS Step Functions.)
Q: What is the difference between Glue Jobs and Glue Streaming Jobs?
- Glue Jobs: Process batch data with predefined start and end points.
- Glue Streaming Jobs: Continuously process streaming data in real-time from sources like Kinesis or Kafka.
Q: Can AWS Glue transform encrypted data?
Yes, AWS Glue can process encrypted data. It supports:
- S3 Server-Side Encryption: Read and write encrypted data in S3.
- AWS Key Management Service (KMS): Use KMS-managed keys for data encryption.
- End-to-End Encryption: Process and output data securely across services.
Q: How do you troubleshoot performance issues in Glue jobs?
- Increase DPUs: Allocate more Data Processing Units (DPUs) to the job.
- Enable Pushdown Predicates: Reduce data scanned by filtering early.
- Optimize Transformations: Simplify and minimize transformations.
- Partition Data: Leverage data partitioning for efficient reads.
- CloudWatch Logs: Analyze logs to identify bottlenecks.
Q: What is Glue’s role in a serverless data pipeline?
AWS Glue simplifies serverless data pipelines by:
- Data Discovery: Crawlers populate metadata in the Data Catalog.
- ETL Automation: Automates transformations and data movement.
- Integration: Works with AWS services like S3, Redshift, and Athena.
- Orchestration: Combines jobs and triggers into workflows.
Q: What is AWS Glue Elastic Views?
AWS Glue Elastic Views was a preview service that allowed users to build materialized views over data spread across multiple sources, using SQL queries to combine and replicate data in near real time. Note that AWS has since discontinued the preview; Elastic Views was never released to general availability.
Q: How does Glue support real-time analytics?
AWS Glue supports real-time analytics through:
- Streaming ETL Jobs: Processes streaming data from Kafka or Kinesis.
- Integration with Athena: Enables querying transformed data on S3 in near real-time.
- Schema Registry: Validates and manages streaming data schemas.
Q: Can Glue be used for data deduplication?
Yes, AWS Glue can deduplicate data by:
- Using transformations like dropDuplicates() in Spark scripts (see the sketch after this list).
- Applying custom logic in ETL jobs to filter duplicates based on unique identifiers.
- Leveraging Glue Studio for drag-and-drop deduplication workflows.
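A minimal sketch of the dropDuplicates() approach, assuming an existing glueContext and an orders DynamicFrame as in the earlier examples; the key column is a placeholder:

```python
from awsglue.dynamicframe import DynamicFrame

df = orders.toDF()                             # work in native Spark
deduped_df = df.dropDuplicates(["order_id"])   # keep one row per key
deduped = DynamicFrame.fromDF(deduped_df, glueContext, "deduped")
```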
Q: How does AWS Glue scale for large datasets?
AWS Glue automatically scales by:
- Dynamic Allocation: Adjusts resources (DPUs) based on workload.
- Distributed Processing: Uses Apache Spark to process data in parallel.
- Partition Pruning: Reads only the relevant partitions of data.
- Auto-scaling Crawlers: Efficiently handle large and complex datasets.
Q: What are the key metrics to monitor in AWS Glue?
- Job Completion Time: How long it takes for jobs to finish.
- DPU Usage: Measures resource utilization.
- Error Count: Tracks job errors or failures.
- Records Processed: Number of records read, transformed, and written.
- Crawler Metrics: Tracks crawled objects, schema changes, and runtime.
Q: What are AWS Glue Blueprints?
AWS Glue Blueprints are reusable templates for building data integration workflows. They allow users to define standardized ETL pipelines that can be parameterized and reused. Blueprints simplify the creation of workflows by abstracting repetitive tasks.
Q: How does AWS Glue integrate with machine learning?
AWS Glue supports machine learning in the following ways:
- Data Preparation: Cleans and transforms data for ML models.
- Integration with SageMaker: Use AWS Glue to prepare data and feed it into Amazon SageMaker for training.
- ML Transforms: Includes built-in transformations like FindMatches, which identifies duplicates or matches in datasets using ML.
- Feature Engineering: Applies transformations to create features for ML pipelines.
Q: What is the role of Amazon Athena with AWS Glue?
Amazon Athena uses the AWS Glue Data Catalog as its schema repository, allowing you to run SQL queries directly on data stored in Amazon S3. The Glue Data Catalog provides metadata about the data, enabling seamless integration and query execution.
Q: What are the limits of AWS Glue Crawlers?
- Number of Tables: Crawlers are limited to creating up to 1 million tables per Data Catalog.
- File Size: Crawlers may face performance issues with extremely large files.
- Timeouts: Crawlers are subject to a maximum timeout limit, which can vary depending on configurations.
- Schema Inference Complexity: Complex or deeply nested schemas may require manual adjustments.
Q: What is the difference between Glue Streaming Jobs and AWS Kinesis Data Analytics?
- Glue Streaming Jobs: Designed for real-time ETL pipelines, transforming streaming data and saving it to destinations like Amazon S3 or Redshift.
- Kinesis Data Analytics: Focused on real-time analytics, allowing users to run SQL queries on streaming data directly from sources like Amazon Kinesis or Kafka.
Q: How does AWS Glue handle nested data structures?
AWS Glue provides tools to handle nested data structures like JSON or Parquet by:
- DynamicFrames: Automatically interpreting and flattening nested structures.
- Transformations: Using operations like unnest() to normalize data (see the sketch after this list).
- Schema Evolution: Adapting to changes in nested fields dynamically.
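A short sketch using unnest(), plus Relationalize (a related built-in transform) for nested arrays; the raw DynamicFrame is assumed from the earlier JSON example, and the staging path is a placeholder:

```python
from awsglue.transforms import Relationalize

flat = raw.unnest()            # nested structs become top-level columns
flat.printSchema()

# Relationalize splits nested arrays into separate, joinable tables
# and returns a DynamicFrameCollection keyed by table name.
tables = Relationalize.apply(
    frame=raw,
    staging_path="s3://my-bucket/tmp/relationalize/",
    name="root",
)
```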
Q: What are Glue Job Workflows, and how are they structured?
Glue Workflows are orchestrated sequences of ETL jobs and crawlers, structured as directed acyclic graphs (DAGs). Each node represents a job or crawler, and edges define dependencies between them. Workflows can include:
- Triggers: To start or sequence jobs and crawlers.
- Parallelism: Run multiple jobs or crawlers concurrently.
- Conditional Execution: Execute jobs based on success, failure, or specific conditions.
Q: How does AWS Glue handle data partitioning?
AWS Glue supports partitioning to optimize data processing and reduce query costs. Data is divided into subsets based on partition keys (e.g., date, region). Glue Crawlers and ETL jobs automatically detect and use partitions for:
- Efficient Querying: Process only relevant partitions.
- Reduced Costs: Minimize data scans in analytical queries.
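For example, an ETL job can write partitioned output so that downstream queries prune partitions; the frame (mapped, from the job-script sketch above), path, and keys are placeholders:

```python
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/curated/events/",
        "partitionKeys": ["year", "month"],   # Hive-style year=/month= layout
    },
    format="parquet",
)
```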
Q: What are the key differences between AWS Glue and Amazon EMR?
- AWS Glue: Serverless, managed ETL service for data integration. Focused on ease of use and automation (e.g., Crawlers, Data Catalog).
- Amazon EMR: Managed big data platform supporting Spark, Hadoop, and other frameworks. Designed for custom, large-scale data processing jobs with fine-grained control.
Q: How does Glue ensure data lineage?
AWS Glue enables data lineage by:
- Data Catalog Metadata: Tracks schema changes and job transformations.
- Integration with Lake Formation: Provides lineage for data lakes.
- Custom Logging: Include lineage-specific logs in ETL scripts.
Q: What is the Glue FindMatches ML Transform?
FindMatches is an AWS Glue ML Transform that identifies duplicate or related records in datasets. It uses machine learning to match records even when fields are similar but not identical, making it ideal for deduplication or record linkage tasks.
Q: How can you configure Glue Crawlers for incremental updates?
To configure Crawlers for incremental updates:
- Enable incremental crawls (for Amazon S3 sources, the option to crawl new folders only) so only new data is scanned.
- Use Amazon S3 event notifications as the crawl source so the Crawler processes only changed objects.
- Schedule Crawlers to run periodically to pick up changes.
(Job Bookmarks play a complementary role, but for ETL jobs rather than Crawlers: they keep jobs from reprocessing data that has already been handled.)
Q: What are AWS Glue EventBridge integrations?
AWS Glue integrates with Amazon EventBridge to trigger Glue jobs or Crawlers based on events. For example:
- Start a Glue job when a file is uploaded to S3.
- Trigger Crawlers based on changes in the Data Catalog.
Q: How does Glue interact with AWS Lake Formation?
Glue integrates with Lake Formation for secure and governed data lakes:
- Shared Data Catalog: Use Glue’s Data Catalog as the metadata store.
- Access Control: Enforce fine-grained permissions with Lake Formation.
- Automated ETL Pipelines: Use Glue to populate and transform Lake Formation data.
Q: How does Glue handle schema conflicts?
AWS Glue resolves schema conflicts by:
- Schema Evolution: Updates the Data Catalog automatically when changes are detected.
- Custom Scripts: Modify ETL scripts to handle specific schema mismatches.
- Schema Registry: Validates schemas for compatibility before applying them.
Q: How does AWS Glue manage job concurrency?
AWS Glue allows job concurrency by:
- Configuring the Max Concurrent Runs parameter for jobs.
- Using workflows to run multiple jobs in parallel.
- Leveraging Spark’s parallel processing capabilities within jobs.
Q: What is AWS Glue Flex?
AWS Glue Flex is a flexible execution class for Glue jobs that runs on spare compute capacity at a lower price per DPU-hour. It is intended for non-urgent workloads, such as nightly batch jobs, that can tolerate variable start-up and execution times, making Glue more affordable for those use cases.
Q: What are AWS Glue Tags?
AWS Glue supports resource tagging, allowing users to assign metadata (tags) to Glue resources like jobs, Crawlers, and Workflows. Tags are useful for:
- Cost Allocation: Track expenses by tagging resources for projects or departments.
- Organization: Manage Glue resources systematically.
- Access Control: Define IAM policies based on tags.
Q: What is Glue’s checkpointing mechanism in Streaming Jobs?
Glue Streaming Jobs use checkpointing to track progress in processing streaming data. This ensures:
- Fault Tolerance: Resumes from the last checkpoint in case of job failure.
- Data Consistency: Prevents reprocessing of already processed data.
Q: How do Glue Streaming Jobs process late-arriving data?
Glue handles late-arriving data by:
- Windowing: Defines time windows for aggregating or processing data.
- Custom Logic: Include ETL script logic to handle late data.
- Reprocessing: Use checkpoints to reprocess older data if needed.