Overview
- Serverless ETL (extract, transform, load) service
- Uses Spark underneath
Glue Crawler
- This is not a crawler in the sense of pulling data out of data sources
- A crawler reads data from data sources ONLY TO determine its structure / schema
- Crawls databases using a connection (actually a connection profile)
- Crawls files on S3 without needing a connection
- After each crawl, virtual tables are created in the Data Catalog; these tables store metadata like the columns, data types, location, etc. of the data, but not the data itself
- The crawler is serverless; you pay per Data Processing Unit (DPU) for the time consumed crawling, with a 10 minute minimum duration for each crawl
- You can create tables directly without using crawlers
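To make the "metadata, not data" point concrete, here is a plain-Python sketch of what a catalog table entry roughly holds. The dict loosely mirrors the shape the Glue GetTable API returns; all names and values here are hypothetical.

```python
# A Data Catalog table stores only metadata about the data, never the rows
# themselves. This dict loosely mirrors the Glue GetTable response shape;
# every value below is a made-up illustration.
catalog_table = {
    "Name": "web_logs",                          # table name in the catalog
    "StorageDescriptor": {
        "Location": "s3://my-bucket/logs/",      # where the actual data lives
        "Columns": [
            {"Name": "ip", "Type": "string"},
            {"Name": "ts", "Type": "string"},    # crawled timestamps stay strings
            {"Name": "bytes", "Type": "bigint"},
        ],
    },
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
}

# The table itself contains no data rows, only pointers and schema:
column_types = {c["Name"]: c["Type"] for c in
                catalog_table["StorageDescriptor"]["Columns"]}
print(column_types)  # {'ip': 'string', 'ts': 'string', 'bytes': 'bigint'}
```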
Crawling Behaviors
- For database tables, column data types come from the table definition, so the crawler mostly shines on file-based data (i.e. a data lake)
- The crawler determines columns and data types by reading data from the files and "guessing" the format based on patterns and the similarity of the data format across different files
- It creates tables based on the prefix / folder name of similarly-formatted files, and creates partitions of tables as needed
- It uses “classifiers” to parse and understand data, classifiers are sets of pattern matchers
- User may create custom classifiers
- Classifiers are categorized based on file types, e.g. the Grok classifier is for text based files
- There are JSON and CSV classifiers for their respective file types
- Classifiers only classify fields into primitive data types; for example, even if a JSON file contains an ISO 8601 formatted timestamp, the crawler will still see it as a string
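The "guessing" behavior above can be sketched outside Glue. Below is a toy type inferencer (my own illustration, not Glue's actual classifier code) that picks the most specific primitive type matching every sample value in a column; note how an ISO 8601 timestamp falls through to string, just as the crawler behaves.

```python
import re

# Toy type inferencer illustrating how a classifier "guesses" a column type
# from sample values. Real Glue classifiers are far more elaborate; this
# only sketches the idea.
INT_RE = re.compile(r"^-?\d+$")
FLOAT_RE = re.compile(r"^-?\d+\.\d+$")

def infer_type(values):
    """Return the most specific primitive type that matches every value."""
    if all(INT_RE.match(v) for v in values):
        return "int"
    if all(FLOAT_RE.match(v) or INT_RE.match(v) for v in values):
        return "double"
    # Anything else, including ISO 8601 timestamps, falls through to string:
    return "string"

rows = [
    {"id": "1", "price": "9.99", "created_at": "2023-05-01T12:00:00Z"},
    {"id": "2", "price": "15",   "created_at": "2023-05-02T08:30:00Z"},
]
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # created_at is classified as string, not timestamp
```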
Glue Data Catalog
- An index to the location, schema and runtime metrics of your data
- Connections
- This is actually connection configuration for Glue to connect to databases
- If you access S3 via a VPC endpoint (VPCE), then you also need a NETWORK type connection
- Creating a NETWORK type connection without a VPCE will fail with the error "Cannot create enum from NETWORK value!", which is very confusing
- If you access S3 via its public endpoints then no connection is required
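For reference, a NETWORK connection mostly carries VPC placement details. This is a sketch of the ConnectionInput you would pass to Glue's CreateConnection API; the subnet, security group, and AZ values are hypothetical placeholders, and such a connection only works when the VPC (and its S3 VPCE) actually exists.

```python
# Sketch of a ConnectionInput for Glue's CreateConnection API.
# All IDs below are hypothetical placeholders.
connection_input = {
    "Name": "s3-via-vpce",
    "ConnectionType": "NETWORK",        # no JDBC properties needed
    "ConnectionProperties": {},
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-0123456789abcdef0",
        "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        "AvailabilityZone": "us-east-1a",
    },
}
```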
Glue ETL
- Glue ETL is EMR made serverless
- It runs Spark underneath
- Users can create Spark jobs and run them directly without provisioning servers
- Glue ETL provisions EMR clusters on-the-fly, so expect 5-10 minutes cold start time even for the simplest jobs
- Glue has built-in helpers to perform common tasks like casting string to timestamp (you will need this for crawled JSONs)
- ETL is useful for
- Type conversion for columns
- Converting data from plain formats (text, log, CSV, JSON) to columnar formats (Parquet)
- Data joining, so data can be scanned more easily
- Other data transforming and manipulation
- ETL Jobs can be connected to form a Workflow; every Job in the workflow can be triggered manually or automatically
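The string-to-timestamp cast for crawled JSON can be illustrated in plain Python. In a real job Glue would do this with its Spark helpers (e.g. an ApplyMapping transform); this sketch only shows the transformation itself, and the record is a made-up example.

```python
from datetime import datetime

# Plain-Python illustration of the string -> timestamp cast a Glue job would
# apply to crawled JSON records (Glue itself does this with Spark helpers;
# this only demonstrates the transformation).
record = {"id": 1, "created_at": "2023-05-01T12:00:00+00:00"}

def cast_timestamp(rec, field):
    """Return a copy of rec with the given string field parsed as a datetime."""
    out = dict(rec)
    out[field] = datetime.fromisoformat(rec[field])
    return out

casted = cast_timestamp(record, "created_at")
print(type(casted["created_at"]).__name__)  # datetime
```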
Dev Endpoint
- Glue ETL provisions EMR for every Job run, so it is too slow for development purposes
- If you are developing your Spark script and want to have an environment that is in the cloud and has access to the resources, use Dev Endpoint
- Basically Glue provisions a long-running (and long-charging) EMR cluster for you as a dev environment; remember to delete the endpoint after you are done to avoid surprise bills