Glue Cheat Sheet

Overview

  • Serverless ETL (extract, transform, load) service
  • Uses Spark underneath

Glue Crawler

  • This is not a crawler in the sense of something that pulls data out of data sources
  • A crawler reads data from data sources ONLY TO determine its structure / schema
    • Crawls databases using a connection (actually a connection profile)
    • Crawls files on S3 without needing a connection
  • After each crawl, virtual tables are created in the Data Catalog. These tables store metadata such as the columns, data types and location of the data, but not the data itself
  • The crawler is serverless. You pay per Data Processing Unit (DPU) for the time spent crawling, with a 10-minute minimum for each crawl
  • You can create tables directly without using crawlers
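
A minimal sketch of the two kinds of crawler targets described above, assuming the boto3 Glue client and its create_crawler / start_crawler calls. The crawler name, role ARN, database name and bucket are hypothetical placeholders, and the helper functions are illustration only.

```python
def crawler_request(name, role_arn, database, s3_paths=(), jdbc_targets=()):
    """Build keyword arguments for the Glue create_crawler call.

    jdbc_targets is a sequence of (connection_name, include_path) pairs.
    """
    targets = {}
    if s3_paths:
        # Files on S3 are crawled without a connection.
        targets["S3Targets"] = [{"Path": path} for path in s3_paths]
    if jdbc_targets:
        # Databases are crawled through a named Glue connection.
        targets["JdbcTargets"] = [
            {"ConnectionName": connection, "Path": path}
            for connection, path in jdbc_targets
        ]
    if not targets:
        raise ValueError("a crawler needs at least one target")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": targets,
    }


def create_and_start(glue, request):
    """Create the crawler and start one crawl.

    `glue` is assumed to be boto3.client("glue").
    """
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = crawler_request(
    name="example-logs-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_paths=["s3://example-bucket/logs/"],
)
```

With real credentials, create_and_start(boto3.client("glue"), request) would create the crawler and run it once. The resulting tables only appear in the Data Catalog after the crawl finishes.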

Crawling Behaviors

  • For database tables it is easy to determine the column data types, so the crawler mostly shines on file-based data (i.e. a data lake)
  • The crawler determines columns and data types by reading data from the files and “guessing” the format, based on patterns and on how similar the format is across files
  • It creates tables based on the prefix / folder name of similarly formatted files, and creates table partitions as needed
  • It uses “classifiers” to parse and understand data. Classifiers are sets of pattern matchers
    • Users may create custom classifiers
    • Classifiers are categorized by file type, e.g. the Grok classifier is for text-based files
      • There are also JSON and CSV classifiers for their respective file types
    • A classifier only maps values to primitive data types. For example, even if a JSON file contains an ISO 8601 formatted timestamp, the crawler will still see it as a string
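
To make the last point concrete, here is a toy illustration in plain Python. It is not Glue's actual classifier logic, and the type names are only meant to resemble catalog types. It shows why a timestamp inside JSON ends up as a string: JSON itself has no timestamp type, so a matcher that only looks at primitive types cannot tell it apart from any other text.

```python
import json


def primitive_type(value):
    """Map a parsed JSON value to a primitive type name (toy mapping)."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "unknown"  # JSON null carries no type information


def infer_schema(json_line):
    """Infer column name -> primitive type from one JSON record."""
    record = json.loads(json_line)
    return {column: primitive_type(value) for column, value in record.items()}


sample = '{"id": 42, "price": 9.5, "active": true, "created_at": "2020-05-01T12:34:56Z"}'
schema = infer_schema(sample)
print(schema)
```

The created_at column comes out as "string" even though its content is a valid ISO 8601 timestamp, so the cast has to happen later in an ETL job.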

Glue Data Catalog

  • An index to the location, schema and runtime metrics of your data
  • Connections
    • A connection is actually the configuration Glue uses to connect to databases
    • If you access S3 via a VPC endpoint (VPCE), then you also need a NETWORK type connection
      • Creating a NETWORK type connection without a VPCE causes a “Cannot create enum from NETWORK value!” error, which is very confusing
    • If you access S3 via its public endpoints, then no connection is required
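
A rough sketch of what a NETWORK type connection could look like, assuming the boto3 Glue client's create_connection call and its ConnectionInput structure. The connection name, subnet, security group and availability zone are hypothetical placeholders, and the empty ConnectionProperties is an assumption on my part.

```python
def network_connection_input(name, subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for a NETWORK type connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # Assumption: a NETWORK connection carries no JDBC URL or credentials,
        # only the VPC placement below.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(glue, connection_input):
    """`glue` is assumed to be boto3.client("glue")."""
    glue.create_connection(ConnectionInput=connection_input)


# All values below are hypothetical placeholders.
connection_input = network_connection_input(
    name="example-s3-vpce-connection",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="us-east-1a",
)
```

The subnet should be one whose route table actually reaches the S3 VPC endpoint, otherwise the confusing error mentioned above is likely to show up.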

Glue ETL

  • Glue ETL is EMR made serverless
  • It runs Spark underneath
  • Users can create Spark jobs and run them directly without provisioning servers
    • Glue ETL provisions EMR clusters on-the-fly, so expect a 5-10 minute cold start even for the simplest jobs
  • Glue has built-in helpers to perform common tasks like casting string to timestamp (you will need this for crawled JSONs)
  • ETL is useful for
    • Type conversion for columns
    • Converting data from plain formats (text, log, CSV, JSON) to a compressed columnar format (Parquet)
    • Data joining, so data can be scanned more easily
    • Other data transformation and manipulation
  • ETL jobs can be connected to form a Workflow. Every job in the workflow can be triggered manually or automatically
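
A sketch of the two most common uses listed above (casting a crawled string column to a timestamp, then writing Parquet), assuming the awsglue library that is available inside a Glue job. ApplyMapping takes tuples of (source name, source type, target name, target type). The column names, the timestamp_mappings helper and the run_job wrapper are hypothetical.

```python
def timestamp_mappings(columns, timestamp_columns):
    """Build ApplyMapping tuples that cast the chosen columns to timestamp.

    columns: list of (name, crawled_type) pairs, e.g. from the catalog table.
    """
    mappings = []
    for name, source_type in columns:
        target_type = "timestamp" if name in timestamp_columns else source_type
        mappings.append((name, source_type, name, target_type))
    return mappings


def run_job(glue_context, database, table_name, output_path, mappings):
    """Read a catalog table, apply the casts, write Parquet to S3.

    `glue_context` is assumed to be an awsglue.context.GlueContext.
    """
    # Imported here so the helper above can be tried outside a Glue job.
    from awsglue.transforms import ApplyMapping

    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    converted = ApplyMapping.apply(frame=source, mappings=mappings)
    glue_context.write_dynamic_frame.from_options(
        frame=converted,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )


# Hypothetical columns as a crawler might report them for a JSON data set.
crawled_columns = [("id", "int"), ("created_at", "string")]
mappings = timestamp_mappings(crawled_columns, {"created_at"})
print(mappings)
```

Only run_job needs the Glue runtime. Whether a given string format casts cleanly to a timestamp is worth checking on a small sample first.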

Dev Endpoint

  • Glue ETL provisions EMR for every Job run, so it is too slow for development purposes
  • If you are developing your Spark script and want to have an environment that is in the cloud and has access to the resources, use Dev Endpoint
  • Basically, Glue provisions a long-running (and long-charging) EMR cluster for you as a dev environment. Remember to delete the endpoint when you are done to avoid a runaway bill
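
A small cleanup sketch for the reminder above, assuming the boto3 Glue client's get_dev_endpoints and delete_dev_endpoint calls. The delete_dev_endpoints helper and its keep parameter are hypothetical.

```python
def delete_dev_endpoints(glue, keep=()):
    """Delete every dev endpoint except the names listed in `keep`.

    `glue` is assumed to be boto3.client("glue"). Returns the deleted names.
    """
    # First collect all endpoint names, following NextToken pagination.
    names = []
    kwargs = {}
    while True:
        page = glue.get_dev_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in page.get("DevEndpoints", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs = {"NextToken": token}

    # Then delete everything that is not explicitly kept.
    deleted = []
    for name in names:
        if name not in keep:
            glue.delete_dev_endpoint(EndpointName=name)
            deleted.append(name)
    return deleted
```

Calling delete_dev_endpoints(boto3.client("glue")) at the end of a working day would remove every endpoint in the current region, so pass keep=... for anything that should survive.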
