AWS Glue Studio
User Guide
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
Using Notebooks (Preview) .................................................................................................................. 1
Overview of using notebooks ...................................................................................................... 1
Getting started with notebooks in AWS Glue Studio ....................................................................... 2
Creating an ETL job using notebooks in AWS Glue Studio ........................................................ 2
Notebook editor components .............................................................................................. 3
Saving your notebook and job script .................................................................................... 3
Managing notebook sessions ............................................................................................... 4
AWS Glue Visual Job API (Preview) ...................................................................................................... 5
API design and CRUD APIs ......................................................................................................... 5
Getting started ......................................................................................................................... 5
API design and CRUD APIs ......................................................................................................... 9
SDK Onboarding ....................................................................................................................... 9
Appendix: Visual Job Examples and Model Definitions ................................................................... 9
Examples ......................................................................................................................... 9
Model Definitions ............................................................................................................ 11
Detect PII (Preview) ......................................................................................................................... 22
Choosing how you want the data to be scanned ......................................................................... 22
Choosing the PII entities to take action on ................................................................................. 23
Choosing what to do with identified PII data .............................................................................. 24
What is AWS Glue Studio? ................................................................................................................. 25
Features of AWS Glue Studio ..................................................................................................... 26
Visual job editor ............................................................................................................... 26
Notebook interface for interactively developing and debugging job scripts ............................... 26
Job script code editor ....................................................................................................... 27
Job performance dashboard .............................................................................................. 27
Support for dataset partitioning ........................................................................................ 27
When should I use AWS Glue Studio? ......................................................................................... 27
Accessing AWS Glue Studio ....................................................................................................... 28
Pricing for AWS Glue Studio ...................................................................................................... 28
Setting up ....................................................................................................................................... 29
Complete initial AWS configuration tasks .................................................................................... 29
Sign up for AWS .............................................................................................................. 29
Create an IAM administrator user ....................................................................................... 29
Sign in as an IAM user ...................................................................................................... 30
Review IAM permissions needed for the AWS Glue Studio user ....................................................... 31
AWS Glue service permissions ............................................................................................ 31
Creating Custom IAM Policies for AWS Glue Studio ............................................................... 31
Notebook and data preview permissions ............................................................................. 33
Amazon CloudWatch permissions ....................................................................................... 33
Review IAM permissions needed for ETL jobs ............................................................................... 34
Data source and data target permissions ............................................................................. 34
Permissions required for deleting jobs ................................................................................ 34
AWS Key Management Service permissions .......................................................................... 34
Permissions required for using connectors ........................................................................... 35
Set up IAM permissions for AWS Glue Studio ............................................................................... 35
Create an IAM Role ........................................................................................................... 35
Attach policies to the AWS Glue Studio user ........................................................................ 36
Create a trust policy for roles not named "AWSGlueServiceRole*" ............................................ 36
Configure a VPC for your ETL job ............................................................................................... 37
Populate the AWS Glue Data Catalog .......................................................................................... 38
Tutorial: Getting started .................................................................................................................... 39
Prerequisites ............................................................................................................................ 39
Step 1: Start the job creation process ......................................................................................... 39
Step 2: Edit the data source node in the job diagram .................................................................... 40
Overview of using notebooks
Data engineers can author AWS Glue jobs faster and more easily than before using the new interactive
notebook interface in AWS Glue Studio or interactive sessions in AWS Glue.
Topics
• Overview of using notebooks (p. 1)
• Getting started with notebooks in AWS Glue Studio (p. 2)
When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so
that you can explore your data and start developing your job script after only a few seconds. AWS Glue
Studio configures a Jupyter notebook with the AWS Glue Jupyter kernel. You don’t have to configure
VPCs, network connections, or development endpoints to use this notebook.
Getting started with notebooks in AWS Glue Studio
After your notebook is saved, your notebook is a full AWS Glue job. You can manage all aspects of the
job, such as scheduling job runs, setting job parameters, and viewing the job run history, right alongside
your notebook.
The following sections describe how to use AWS Glue Studio to create notebooks for ETL jobs.
Topics
• Creating an ETL job using notebooks in AWS Glue Studio (p. 2)
• Notebook editor components (p. 3)
• Saving your notebook and job script (p. 3)
• Managing notebook sessions (p. 4)
Creating an ETL job using notebooks in AWS Glue Studio
1. Attach AWS Identity and Access Management policies to the AWS Glue Studio user and create an
IAM role for your ETL job and notebook, as instructed in Set up IAM permissions for AWS Glue
Studio (p. 35).
2. Configure additional IAM security for notebooks, as described in Notebook and data preview
permissions (p. 33).
3. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.
4. Choose the Jobs link in the left-side navigation menu.
5. Choose Jupyter notebook and then choose Create to start a new notebook session.
6. On the Create job in Jupyter notebook page, provide the job name, the IAM role to use, and choose
which programming language you want to use within the notebook. Choose Create job.
When the notebook first opens, it contains a single cell with an example %%configure command
based on the information you provided on the Create job in Jupyter notebook page. You can
modify this cell to customize the notebook session.
7. Run the cell to start a new notebook session and generate a session ID.
8. Add cells, and enter code or markdown text. (A sample first code cell is sketched after this
procedure.)
For information about writing code using a Jupyter notebook interface, see The Jupyter Notebook
User Documentation.
9. To test your script, run the entire script, or individual cells. Any command output will be displayed in
the area beneath the cell.
10. After you have finished developing your script, you can save the job and then run it. For more
information about running jobs, see Start a job run (p. 106).
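For example, a first code cell for exploring a Data Catalog table might look like the following sketch. It assumes a Python (PySpark) session; the database and table names are placeholders that you would replace with objects from your own AWS Glue Data Catalog.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create the GlueContext and Spark session for this notebook session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read a table from the AWS Glue Data Catalog and inspect it
# ("database" and "table1" are placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table1"
)
dyf.printSchema()
dyf.show(5)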
Notebook editor components
Although the AWS Glue Studio notebook is similar to Jupyter notebooks, it differs in a few key ways. The notebook editor includes the following tabs:
• Notebook – Use this tab to view the job script using the notebook interface.
• Job details – Configure the environment and properties for the job runs.
• Runs – View information about previous runs of this job.
• Schedules – Configure a schedule for running your job at specific times.
Saving your notebook and job script
When you choose Save, the job script and notebook file are saved in the locations you specified.
• The job script is saved to the Amazon S3 location indicated by the job property Script path, in the
Scripts folder.
• The notebook file (.ipynb) is saved to the Amazon S3 location indicated by the job property Script
path, in the Notebooks folder.
When you save the job, the job script contains only the code cells from the notebook. The Markdown
cells aren't included.
After you save the job, you can then run the job using the script that you created in the notebook.
Managing notebook sessions
To modify the default session timeout for notebooks in AWS Glue Studio, enter the %idle_timeout magic
in a cell and specify the timeout value in minutes. For example, %idle_timeout 15 changes the default
timeout from 60 minutes to 15 minutes. If the session is not used for 15 minutes, the session is
automatically stopped.
To view a list of the available Python modules, see Using Python Libraries with AWS Glue.
You can also specify the Number of workers with %number_of_workers. For example, to specify 40
workers: %number_of_workers 40.
If you navigate away from the notebook in the AWS console, you will receive a warning message where
you can choose to stop the session.
AWS Glue Visual Job API (Preview)
Topics
• API design and CRUD APIs (p. 5)
• Getting started (p. 5)
• API design and CRUD APIs (p. 9)
• SDK Onboarding (p. 9)
• Appendix: Visual Job Examples and Model Definitions (p. 9)
AWS Glue provides an API that allows customers to create data integration jobs using the AWS Glue API
from a JSON object that represents a DAG. Customers can then use the visual editor in AWS Glue Studio
to work with these jobs.
API design and CRUD APIs
Updates to the codeGenConfigurationNodes field are made through the update-job AWS Glue API, in the
same way as create-job. Specify the entire field in update-job with the DAG changed as desired. A null
value is ignored and no update to the DAG is performed. An empty structure or string causes
codeGenConfigurationNodes to be set to empty and any previous DAG to be removed. The get-job API
returns the DAG if one exists, and the delete-job API also deletes any associated DAG.
Getting started
Follow the SDK Onboarding (p. 9).
To create a job, use the createJob function. The CreateJobRequest input has an additional field,
‘codeGenConfigurationNodes’, where you can specify the DAG object in JSON. For example:
{
"Name":"myjob1",
"Role":"arn:aws:iam::253723508848:role/myrole",
"Description":"",
"GlueVersion":"2.0",
"Command":{
"Name":"glueetl",
"ScriptLocation":"s3://myscripts/myjob1.py",
"PythonVersion":"3"
},
"MaxRetries":3,
"Timeout":2880,
"ExecutionProperty":{
"MaxConcurrentRuns":1
},
"NotificationProperty":{},
"DefaultArguments":{
"--class":"GlueApp",
"--job-language":"python",
"--job-bookmark-option":"job-bookmark-enable",
"--TempDir":"s3://assets/temporary/",
"--enable-metrics":"true",
"--enable-continuous-cloudwatch-log":"true",
"--enable-spark-ui":"true",
"--spark-event-logs-path":"s3://assets/sparkHistoryLogs/",
"--encryption-type":"sse-s3",
"--enable-glue-datacatalog":"true"
},
"Tags":{},
"DeveloperMode":false,
"WorkerType":"G.1X",
"NumberOfWorkers":10,
"CodeGenConfigurationNodes":{
"node-1":{
"S3CatalogSource":{
"Database":"database",
"Name":"S3 bucket",
"Table":"table1"
}
},
"node-2":{
"ApplyMapping":{
"Mapping":[
{
"FromPath":[
"col0"
],
"ToKey":"col0",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col1"
],
"ToKey":"col1",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col2"
],
"ToKey":"col2",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col3"
],
"ToKey":"col3",
"ToType":"string",
"FromType":"string",
"Dropped":false
}
],
"Inputs":[
"node-1"
],
"Name":"ApplyMapping"
}
},
"node-3":{
"S3CatalogTarget":{
"Path":"s3://mypath/",
"UpdateCatalogOptions":"none",
"Inputs":[
"node-1",
"node-2"
],
"SchemaChangePolicy":{
"enableUpdateCatalog":false
},
"Name":"S3 bucket",
"Format":"json",
"PartitionKeys":[],
"Compression":"none"
}
}
}
}
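As an illustration only, the following sketch shows how a request like this might be submitted with the AWS SDK for Python (boto3). It assumes an SDK build that already includes the CodeGenConfigurationNodes field (during the preview this may require the customized service model described in SDK Onboarding (p. 9)); the role ARN, bucket paths, and catalog names are placeholders.

import boto3

glue = boto3.client("glue")

# Minimal DAG: read a Data Catalog table backed by Amazon S3 and write it to S3 as JSON
dag = {
    "node-1": {
        "S3CatalogSource": {
            "Name": "S3 bucket",
            "Database": "database",
            "Table": "table1"
        }
    },
    "node-2": {
        "S3DirectTarget": {
            "Name": "S3 bucket",
            "Inputs": ["node-1"],
            "Path": "s3://mypath/",
            "Format": "json",
            "PartitionKeys": [],
            "Compression": "none",
            "SchemaChangePolicy": {"EnableUpdateCatalog": False}
        }
    }
}

response = glue.create_job(
    Name="myjob1",
    Role="arn:aws:iam::111122223333:role/myrole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://myscripts/myjob1.py",
        "PythonVersion": "3"
    },
    WorkerType="G.1X",
    NumberOfWorkers=10,
    CodeGenConfigurationNodes=dag
)
print(response["Name"])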
The following example creates a second job that joins two data sources. (The example is abbreviated; as
in the previous example, the node definitions belong under the CodeGenConfigurationNodes field of the
request.)
{
"Name": "myjob2",
"Role": "arn:aws:iam::253723508848:role/myrole",
"Description": "",
"GlueVersion": "2.0",
"Command": {
"Name": "glueetl",
"ScriptLocation": "s3://myscripts/myjob1.py",
"PythonVersion": "3"
},
"MaxRetries": 3,
"Timeout": 2880,
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"node-3": {
"S3DirectTarget": {
"Path": "s3://mypath/",
"UpdateCatalogOptions": "none",
"Inputs": [
"node-1624994219677"
],
"SchemaChangePolicy": {
"EnableUpdateCatalog": false
},
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Compression": "none"
}
},
"node-1624994205115": { "
CatalogSource": {
"Name": "AWS Glue Data Catalog",
"Database": "database2",
"Table": "table2"
}
},
"node-1624994219677": {
"Join": {
"Name": "Join",
"Inputs": [
"node-1624994205115",
"node-2"
],
"JoinType": "equijoin",
"Columns": [
{
"From": "node-1624994205115",
"Keys": [
"firstname"
]
},
{
"From": "node-2",
"Keys": [
"col0"
]
}
],
"ColumnConditions": [
"="
]
}
},
"node-2": {
"S3CatalogSource": {
"Database": "database",
"Input": ["node-1624994219677"],
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Compression": "none"
}
}
}
Updating jobs: Because updateJob also has a ‘codeGenConfigurationNodes’ field, the input format is the
same as for createJob. The get-job command returns a ‘codeGenConfigurationNodes’ field in the same
format as well.
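A hedged sketch of this round trip with the AWS SDK for Python (boto3) might look like the following; it assumes the same preview-enabled SDK as above and that a job named myjob1 already exists.

import boto3

glue = boto3.client("glue")

# get-job returns the DAG (if one exists) under CodeGenConfigurationNodes
job = glue.get_job(JobName="myjob1")["Job"]
dag = job.get("CodeGenConfigurationNodes", {})

# Modify the DAG as needed, then send the entire field back with update-job.
# A null value is ignored; an empty structure removes any previous DAG.
glue.update_job(
    JobName="myjob1",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "CodeGenConfigurationNodes": dag
    }
)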
API design and CRUD APIs
The codeGenConfigurationNodes field is a map of node IDs to node definitions, for example:
{
"Nodeid-1": {...},
"Nodeid-2": {...}
}
SDK Onboarding
To access the required files, go to the GitHub repository as described below.
CLI
Go to the GitHub repository to access the service-2.json file and download the file. If you're using macOS
or Linux, place this file in the folder ~/.aws/models/glue/2017-03-31. If .aws does not exist, that means
you have to configure the AWS CLI first. AWS CLI installation instructions can be found here. If you do not
have the other folders, you can create them manually. The CLI with this custom model can then be used in
the same way that the CLI is normally used.
Java SDK
For older Java clients, a JAR called AwsGlueJavaClient-1.12.x.jar is available on the GitHub
repository.
To use the newer AWS SDK for Java 2.x, a JAR called AwsJavaSdk-Glue-2.0.jar is available on the
GitHub repository.
Add the JAR to your class path in your preferred way. After the JAR is added to your class path, it can be
used in the same way as you are using the existing AWS Glue SDK.
Appendix: Visual Job Examples and Model Definitions
Examples
Sources
{
"Database": "database",
"Table": "table1",
"Name": "S3 bucket",
"IsCatalog": true
}
{
"Database": "database",
"Table": "rdsSource",
"Name": "MyRdsSource",
"IsCatalog": true
}
Data Targets
S3CatalogTarget
{
"Inputs": [
"node-1625147321253"
],
"Database": "dbl",
"Table": "s3Table",
"Name": "s3 bucket",
"Format": "json",
"PartitionKeys": [
"col1"
],
"UpdateCatalogOptions": "schemaAndPartitions",
"SchemaChangePolicy": {
"EnableUpdateCatalog": true,
"UpdateBehavior": "UPDATE_IN_DATABASE"
}
}
S3DirectTarget
{
"Path": "s3://mypath/",
"UpdateCatalogOptions": "none",
"Inputs": [
"node-2"
],
"SchemaChangePolicy": {
"EnableUpdateCatalog": false
},
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Classification": "DataSink",
"Compression": "none"
}
Transforms
Rename Field
{
"Inputs": [
"node-1"
],
"Name": "MyRenameField",
"SourcePath": "col3"
"TargetPath": "name"
}
Filter
{
"Name": "Filter",
"Inputs": [
"node-2"
],
"LogicalOperator": "AND",
"Filters": [
{
"Operation": "ISNULL",
"Negated": false,
"Values": [
{
"Type": "COLUMNEXTRACTED",
"Value": "col1"
}
]
},
{
"Operation": "REGEX",
"Negated": false,
"Values": [
{
"Type": "CONSTANT",
"Value": ".*"
},
{
"Type": "COLUMNEXTRACTED",
"Value": "col2"
}
]
}
]
}
Model Definitions
Sources
AthenaConnector
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"ConnectionTable": 256 character String Required,
"SchemaName": 256 character String Required
}
JDBCConnector
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"AdditionalOptions": JDBCConnectorOptions,
"ConnectionTable": 256 character String,
"Query": 256 character String
}
JDBCConnectorOptions:
{
"FilterPredicate": 256 character String,
"PartitionColumn": 256 character String,
"LowerBound": Non-Negative Long,
"UpperBound": Non-Negative Long,
"NumPartitions": Non-Negative Long,
"JobBookmarkKeys": List of Strings up to 100,
"JobBookmarkKeysSortOrder": ASC or DESC,
"DataTypeMapping": Map<DBCDataType, JDBCDataType>
SparkConnectorSource
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"AdditionalOptions": Map<256 character String, Object>
}
CatalogSource:
{
"Name": 100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required
}
CatalogKinesisSource
{
"Name": 100 character String Required,
"Database": 256 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions,
"Table": 256 character String Required
}
KinesisStreamingSourceOptions
{
"EndpointUrl":256 character String,
"StreamName":256 character String, "Classification":256 character
String,
"Delimiter":256 character String,
"StartingPosition": LATEST or TRIM_HORIZON or EARLIEST,
"MaxFetchTimeInMs": Non-negative Long,
"MaxFetchRecordsPerShard": non-negative Long,
"MaxRecordsPerRead": Non-negative Long,
"AddIdleTimeBetweenReads": Boolean,
"IdleTimeBetweenReadsInMs": Non-negative Long,
"DescribeShardInterval": Non-negative Long,
"NumRetries": Positive Integer,
"RetryIntervalInMs": Non-negative Long,
"MaxRetryIntervalMs": Non-negative Long,
"AvoidEmptyBatches": Boolean,
"StreamARN": 256 character String
"AwsSTSRoleARN": 256 character String,
"AwsSTSSessionName": 256 character String
}
DirectKinesisSource
{
"Name":100 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions
}
CatalogKafkaSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions,
"Table": 256 character String Required,
}
KafkaStreamingSourceOptions
{
"BootstrapServers":256 character String,
"SecurityProtocol":256 character String,
"ConnectionName":256 character String,
"TopicName":256 character String,
"Assign":256 character String,
"SubscribePattern":256 character String,
"Classification":256 character String,
"Delimiter":256 character String,
"StartingOffsets":256 character String,
"EndingOffsets":256 character String,
"PollTimeoutInMs": Non-negative long,
"NumRetries": Positive integer,
"RetryIntervalMs": Non-negative long,
"MaxOffsetsPerTrigger": Non-negative long,
"MinPartitions": Non-negative integer
}
DirectKafkaSource
{
"Name":100 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions
}
RedshiftSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required,
"RedshiftTmpDir":256 character String,
"TmpDirIAMRole":256 character String
}
S3CatalogSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required,
"S3SourceAdditionalOptions": {
//Only one can be specified, or neither
"BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long
}
}
S3CSVSource
S3JSONSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required,
"CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings,
"GroupFiles":256 character String,
"GroupSize":256 character String,
"Recurse":Boolean,
"MaxBand": Non negative Integer,
"MaxFilesInBand": Non negative Integer,
"S3SourceAdditionalOptions": {
//Only one can be specified, or neither
"BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long
},
"JsonPath":256 character String,
"Multiline":Boolean
}
S3ParquetSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required,
"CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings,
"GroupFiles":256 character String,
"GroupSize":256 character String,
"Recurse":Boolean,
"MaxBand": Non negative Integer,
"MaxFilesInBand": Non negative Integer,
"S3SourceAdditionalOptions": {
15
AWS Glue Studio User Guide
Model Definitions
Targets
JDBCConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"ConnectionName":256 character String Required,
"ConnectionTable":256 character String,
"ConnectorName":256 character String,
"ConnectionType":256 character String Required,
"ConnectionTypeSuffix":256 character String,
"AdditionalOptions":Map<256 character String,Object>
}
SparkConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"ConnectionName":256 character String Required,
"ConnectionTable":256 character String,
"ConnectorName":256 character String,
"ConnectionType":256 character String Required,
"ConnectionTypeSuffix": 256 character String,
AdditionalOptions":Map<256 character String,Object>
}
CatalogTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Database":256 character String Required,
"Table":256 character String Required
}
RedshiftTarget
{ "Name":100 character String Required, "Inputs": List of Strings. One 256 character String Required,
"Database":256 character String Required, "Table":256 character String Required, "RedshiftTmpDir":256
character String, "TmpDirIAMRole":256 character String }
S3CatalogTarget
{
"Name":100 character String Required,
S3DirectTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"PartitionKeys": List of Strings. Up to 100 256 character Strings,
"Path":256 character String Required,
"Compression": gzip or bzip2,
"Format":json, csv, avro, orc, or parquet Required,
"SchemaChangePolicy": DirectSchemaChangePolicy
}
DirectSchemaChangePolicy:
{
"EnableUpdateCatalog": Boolean,
"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE",
"Database":256 character String,
"Table":256 character String
}
Transforms
ApplyMapping
See the end of the document for the possible values of ApplyMappingType
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Mapping":List of up to 250 Mapping Required
}
Mapping:
{
"ToKey":256 character String Required,
"FromPath": List of Strings. One 256 character String Required,
"FromType":ApplyMappingType Required,
"ToType": ApplyMappingType Required,
"Dropped":Boolean,
"Children": List of up to 250 Mapping
}
SelectFields
{
"Name":100 character String Required,
DropFields
{
"Name":100 character String Required,
"Inputs: List of Strings. One 256 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required
}
RenameField
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"SourcePath":List of Strings. Up to 100 256 character Strings Required
"TargetPath":256 character String Required
}
Spigot
{
"name":100 character String Required,
"inputs": List of Strings. One 256 character String Required,
"path":256 character String Required,
"topk":Integer from 0 to 100,
"prob":Double from 0 to 1.0
}
Join
{
"Name": 100 character String Required
"Inputs": List of Strings. Two 256 character String Required
"JoinTYpe": equijoin, left, right, outer, leftsemi, or
leftanti Required
"Columns": List[Column] Required
}
Column:
{
"From": 256 character String Required
"Keys": List[String] Required
}
SplitFields
{
"Name":100 character String Required,
SelectFromCollection
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Index": Non Negative Integer Required
}
FillMissingValues
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String RRequired,
"ImputedPath":256 character String Required
"FilledPath":256 character String
}
Filter
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"LogicalOperator":String Required,
"Filters":List[FilterInstance] Required
}
FilterInstance:
{
"Operation": "EQ" | "LT" | "GT" | "LTE" | "GTE" | "REGEX" |
"ISNULL" Required,
"Negated":Boolean,
"Values":List[FilterValue] Required
}
FilterValue:
{
"Type": "COLUMNEXTRACTED" | "CONSTANT" Required,
"Value": Object Required,
CustomCode
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"Code":Up to 51,200 character string or 50 KB Required,
"ClassName":256 character String Required
}
SparkSQL
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"SqlQuery": Up to 51,200 character string or 50 KB Required,
"SqlAliases":List of Alias. Up to 256 Aliases Required
}
Alias:
{
"From":256 character String Required,
"Alias":256 character String Required
}
DropNullFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"Paths":List of Strings. Up to 100 256 character Strings Required
"NullCheckBoxList": NullCheckBoxList,
"NullTextList": List of NullValueField. Up to 50 NullValueField.
}
NullCheckBoxList
{
"IsEmpty": Boolean,
"IsNullString": Boolean,
"IsNegOne": Boolean
}
NullValueField
{
"Value": 256 character String,
"DataType": DataType
}
DataType
{
"Id": 256 character String,
"Label": 256 character String
}
Union
{
"Name":100 character String Required,
"Inputs":List of Strings. Two 256 character String Required,
"Sources":List of Strings. Two 256 character String,
"UnionType": ALL or DISTINCT Required
}
Enums
JDBCDataType
ARRAY, BIGINT, BINARY, BIT, BLOB, BOOLEAN, CHAR, CLOB, DATALINK, DATE, DECIMAL, DISTINCT, DOUBLE,
FLOAT, INTEGER, JAVA_OBJECT, LONGNVARCHAR, LONGVARBINARY, LONGVARCHAR, NCHAR, NCLOB, NULL, NUMERIC,
NVARCHAR, WITH_TIMEZONE, TIMESTAMP, TIMESTAMP_WITH_TIMEZONE, TINYINT, VARBINARY, VARCHAR
ApplyMappingType
bigint, binary, boolean, char, date, decimal, double, float, int, interval, long, smallint, string,
timestamp, tinyint, varchar
Detect PII (Preview)
Note
Using the Detect PII transform in AWS Glue Studio jobs requires AWS Glue 2.0.
The Detect PII transform identifies Personally Identifiable Information (PII) in your data source. You
choose the PII entities to identify, how you want the data to be scanned, and what to do with the PII
entities that the Detect PII transform identifies.
Topics
• Choosing how you want the data to be scanned (p. 22)
• Choosing the PII entities to take action on (p. 23)
• Choosing what to do with identified PII data (p. 24)
Choosing how you want the data to be scanned
When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a
comprehensive scan to ensure that PII entities are identified.
When you choose Detect fields containing PII, you’re choosing to scan a sample of rows for PII entities.
This is a way to keep costs and resources low while also identifying the fields where PII entities are found.
When you choose to detect fields that contain PII, you can reduce costs and improve performance by
sampling a portion of rows. Choosing this option will allow you to specify additional options:
• Sample portion: This allows you to specify the percentage of rows to sample. For example, if you
enter ‘50’, 50 percent of the rows are scanned for the PII entity.
• Detection threshold: This allows you to specify the percentage of rows that contain the PII entity
in order for the entire column to be identified as having the PII entity. For example, if you enter ‘10’,
you’re specifying that the number of the PII entity, US Phone, in the rows that are scanned must be
10 percent or greater in order for the field to be identified as having the PII entity, US Phone. If the
percentage of rows that contain the PII entity is less than 10 percent, that field will not be labeled as
having the PII entity, US Phone, in it.
Choosing the PII entities to take action on
You can choose one or more of the following PII entities to detect:
• ITIN (US)
• Email
• Passport Number (US)
• US Phone
• Credit Card
• Bank Account (US, Canada)
• US Driving License
• IP Address
• MAC Address
• DEA Number (US)
• HCPCS Code (US)
• National Provider Identifier (US)
• National Drug Code (US)
• Health Insurance Claim Number (US)
• Medicare Beneficiary Identifier (US)
• CPT Code (US)
Choosing what to do with identified PII data
• Enrich data with detection results: If you chose Detect PII in each cell, you can store the detected
entities into a new column.
• Redact detected text: You can replace the detected PII value with a string that you specify in the
optional Replacing text input field. If no string is specified, the detected PII entity is replaced with
'*******'.
If you chose to detect fields containing PII, you can choose to take the following actions:
• Output Detection Results: This creates a new dataframe with the detected PII information for each
column.
• Redact detected text: You can replace the detected PII value with a string that you specify. If no string
is specified, the detected PII entity is replaced with '*******'.
What is AWS Glue Studio?
AWS Glue Studio is designed not only for tabular data, but also for semi-structured data, which is
difficult to render in spreadsheet-like data preparation interfaces. Examples of semi-structured data
include application logs, mobile events, Internet of Things (IoT) event streams, and social feeds.
When creating a job in AWS Glue Studio, you can choose from a variety of data sources that are stored
in AWS services. You can quickly prepare that data for analysis in data warehouses and data lakes. AWS
Glue Studio also offers tools to monitor ETL workflows and validate that they are operating as intended.
You can preview the dataset for each node. This helps you to debug your ETL jobs by displaying a sample
of the data at each step of the job.
AWS Glue Studio provides a visual interface that makes it easy to:
• Run, monitor, and manage the jobs created in AWS Glue Studio.
Features of AWS Glue Studio
Notebook interface for interactively developing and debugging job scripts
The notebook editor interface in AWS Glue Studio offers the following features:
• Test in the exact same run environment that your AWS Glue ETL jobs run in.
Job script code editor
When creating a new job, you can choose to write scripts for Spark jobs or Python shell jobs. You can
code the job ETL script for Spark jobs using either Python or Scala. If you create a Python shell job, the
job ETL script uses Python 3.6.
The script editor interface in AWS Glue Studio offers the following features:
• Insert, modify, and delete sources, targets, and transforms in your script.
• Add or modify arguments for data sources, targets, and transforms.
• Syntax and keyword highlighting
• Auto-completion suggestions for local words, Python keywords, and code snippets.
Job performance dashboard
AWS Glue Studio provides a job performance dashboard with the following information:
• Jobs overview summary – A high-level overview showing total jobs, current runs, completed runs, and
failed jobs.
• Status summaries – Provides high level job metrics based on job properties, such as worker type and
job type.
• Job runs timeline – A bar graph summary of successful, failed, and total runs for the currently selected
time frame.
• Job run breakdown – A detailed list of job runs from the selected time frame.
When should I use AWS Glue Studio?
AWS Glue Studio makes it easy for ETL developers to create repeatable processes to move and transform
large-scale, semi-structured datasets, and load them into data lakes and data warehouses. It provides a
boxes-and-arrows style visual interface for developing and managing AWS Glue ETL workflows that you
can optionally customize with code. AWS Glue Studio combines the ease of use of traditional ETL tools
with the power and flexibility of AWS Glue’s big data processing engine.
AWS Glue Studio provides multiple ways to customize your ETL scripts, including adding nodes that
represent code snippets in the visual editor.
Use AWS Glue Studio for easier job management. AWS Glue Studio provides you with job and job run
management interfaces that make it clear how jobs relate to each other, and give an overall picture of
your job runs. The job management page makes it easy to do bulk operations on jobs (previously difficult
to do in the AWS Glue console). All job runs are available in a single interface where you can search and
filter. This gives you a constantly updated view of your ETL operations and the resources you use. You
can use the real-time dashboard in AWS Glue Studio to monitor your job runs and validate that they are
operating as intended.
Pricing for AWS Glue Studio
You also pay for the underlying AWS services that your jobs use or interact with, such as AWS Glue,
your data sources, and your data targets. For pricing information, see AWS Glue Pricing.
Setting up
Topics
• Complete initial AWS configuration tasks (p. 29)
• Review IAM permissions needed for the AWS Glue Studio user (p. 31)
• Review IAM permissions needed for ETL jobs (p. 34)
• Set up IAM permissions for AWS Glue Studio (p. 35)
• Configure a VPC for your ETL job (p. 37)
• Populate the AWS Glue Data Catalog (p. 38)
Complete initial AWS configuration tasks
You can either use the administrator user for creating and managing your ETL jobs, or you can create a
separate user for accessing AWS Glue Studio.
To create additional users for AWS Glue or AWS Glue Studio, follow the steps in Creating Your First IAM
Delegated User and Group in the IAM User Guide.
• Sign up for AWS (p. 29)
• Create an IAM administrator user (p. 29)
• Sign in as an IAM user (p. 30)
Sign up for AWS
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
Create an IAM administrator user
To create an administrator user for yourself and add the user to an administrators group
(console)
1. Sign in to the IAM console as the account owner by choosing Root user and entering your AWS
account email address. On the next page, enter your password.
Note
We strongly recommend that you adhere to the best practice of using the Administrator
IAM user that follows and securely lock away the root user credentials. Sign in as the root
user only to perform a few account and service management tasks.
2. In the navigation pane, choose Users and then choose Add user.
3. For User name, enter Administrator.
4. Select the check box next to AWS Management Console access. Then select Custom password, and
then enter your new password in the text box.
5. (Optional) By default, AWS requires the new user to create a new password when first signing in. You
can clear the check box next to User must create a new password at next sign-in to allow the new
user to reset their password after they sign in.
6. Choose Next: Permissions.
7. Under Set permissions, choose Add user to group.
8. Choose Create group.
9. In the Create group dialog box, for Group name enter Administrators.
10. Choose Filter policies, and then select AWS managed - job function to filter the table contents.
11. In the policy list, select the check box for AdministratorAccess. Then choose Create group.
Note
You must activate IAM user and role access to Billing before you can use the
AdministratorAccess permissions to access the AWS Billing and Cost Management
console. To do this, follow the instructions in step 1 of the tutorial about delegating access
to the billing console.
12. Back in the list of groups, select the check box for your new group. Choose Refresh if necessary to
see the group in the list.
13. Choose Next: Tags.
14. (Optional) Add metadata to the user by attaching tags as key-value pairs. For more information
about using tags in IAM, see Tagging IAM entities in the IAM User Guide.
15. Choose Next: Review to see the list of group memberships to be added to the new user. When you
are ready to proceed, choose Create user.
You can use this same process to create more groups and users and to give your users access to your AWS
account resources. To learn about using policies that restrict user permissions to specific AWS resources,
see Access management and Example policies.
Review IAM permissions needed for the AWS Glue Studio user
Job Actions
• GetJob
• CreateJob
• DeleteJob
• GetJobs
• UpdateJob
• *QueryJobs
• *SaveJob
• *CreateDag
• *UpdateDag
• *GetDag
• *DeleteDag
Database Actions
• GetDatabases
Plan Actions
• GetPlan
Job Run Actions
• StartJobRun
• GetJobRuns
• BatchStopJobRun
• GetJobRun
• *QueryJobRuns
• *QueryJobRunsAggregated
Schema Actions
• *GetSchema
• *GetInferredSchema
Table Actions
• SearchTables
• GetTables
• GetTable
Connection Actions
File Actions
• GetFile
Mapping Actions
• GetMapping
• *GetNextScheduledJobs
Repository Actions
• *ListRepositories
Branch Actions
• *ListBranches
• *GetBranches
Commit Actions
• *CreateCommit
• *GetCommit
S3 Proxy Actions
• *ListBuckets
• *ListObjectsV2
• *GetBucketLocation
• *CreateSchedule
• *GetSchedule
• *DeleteSchedule
• GetSecurityConfigurations
Script Actions
Notebook and data preview permissions
To ensure data previews and notebook commands work correctly, use a role that has a name that starts
with the string AWSGlueServiceRole. If you choose to use a different name for your role, then you
must add the iam:PassRole permission and configure a role trust policy for the role in IAM. Add the
AWS Glue service as a principal in this trust policy, as described in Create a trust policy for roles not
named "AWSGlueServiceRole*" (p. 36).
Warning
If a role grants the iam:PassRole permission for a notebook, and you implement role
chaining, a user could unintentionally gain access to the notebook. There is currently no auditing
implemented which would allow you to monitor which users have been granted access to the
notebook.
Amazon CloudWatch permissions
To access CloudWatch dashboards, the user accessing AWS Glue Studio needs one of the following:
For more information about changing permissions for an IAM user using policies, see Changing Permissions
for an IAM User in the IAM User Guide.
Review IAM permissions needed for ETL jobs
The name of the role that you create for the job must start with the string AWSGlueServiceRole
for it to be used correctly by AWS Glue Studio. For example, you might name your role
AWSGlueServiceRole-FlightDataJob.
If you choose Amazon Redshift as your data source, you can provide a role for cluster permissions. Jobs
that run against an Amazon Redshift cluster issue commands that access Amazon S3 for temporary
storage using temporary credentials. If your job runs for more than an hour, these credentials will expire,
causing the job to fail. To avoid this problem, you can assign a role to the Amazon Redshift cluster itself
that grants the necessary permissions to jobs using temporary credentials. For more information, see
Moving Data to and from Amazon Redshift in the AWS Glue Developer Guide.
If the job uses data sources or targets other than Amazon S3, then you must attach the necessary
permissions to the IAM role used by the job to access these data sources and targets. For more
information, see Setting Up Your Environment to Access Data Stores in the AWS Glue Developer Guide.
If you're using connectors and connections for your data store, you need additional permissions, as
described in the section called “Permissions required for using connectors” (p. 35).
AWS Key Management Service permissions
If your job reads or writes data that is encrypted with AWS KMS, the job role needs permissions that
enable the job to decrypt the data. The job role needs the kms:ReEncrypt, kms:GenerateDataKey,
and kms:DescribeKey permissions. Additionally, the job role needs the kms:Decrypt permission
to upload or download an Amazon S3 object that is encrypted with an AWS KMS customer master key
(CMK).
There are additional charges for using AWS KMS CMKs. For more information, see AWS Key Management
Service Concepts - Customer Master Keys (CMKs) and AWS Key Management Service Pricing in the AWS
Key Management Service Developer Guide.
If your AWS Glue ETL job runs within a virtual private cloud (VPC) using Amazon VPC, then the VPC must be configured as
described in the section called “Configure a VPC for your ETL job” (p. 37).
Set up IAM permissions for AWS Glue Studio
You can use the AWSGlueConsoleFullAccess AWS managed policy to provide the necessary permissions
for using the AWS Glue Studio console.
To create your own policy, follow the steps documented in Create an IAM Policy for the AWS Glue
Service in the AWS Glue Developer Guide. Include the IAM permissions described previously in Review IAM
permissions needed for the AWS Glue Studio user (p. 31).
Topics
• Create an IAM Role (p. 35)
• Attach policies to the AWS Glue Studio user (p. 36)
• Create a trust policy for roles not named "AWSGlueServiceRole*" (p. 36)
Create an IAM Role
You need to grant your IAM role permissions that AWS Glue Studio and AWS Glue can assume when
calling other services on your behalf. This includes access to Amazon S3 for storing scripts and temporary
files, and any other sources or targets that you use with AWS Glue Studio.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
Attach policies to the AWS Glue Studio user
If you choose a different name for your role, you must add a policy to allow your users the
iam:PassRole permission for IAM roles to match your naming convention.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the AWSGlueConsoleFullAccess. You can use the
Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.
6. Repeat the previous steps to attach additional policies to the user, as needed.
Create a trust policy for roles not named "AWSGlueServiceRole*"
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left-side navigation, choose Roles.
3. Locate the role used for data previews or your ETL job, and then choose the role name.
4. Choose the Trust Relationships tab, and then choose the Edit trust relationship button.
5. Copy and paste the following blocks into the policy under the "Statement" array.
{
"Action": ["iam:PassRole"],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": ["glue.amazonaws.com"]
}
}
},
{
"Effect": "Allow",
"Principal": {
"Service": ["glue.amazonaws.com"]
},
"Action": "sts:AssumeRole"
}
Here is the full example with the Version and Statement arrays included in the policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["iam:PassRole"],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": ["glue.amazonaws.com"]
}
}
},
{
"Effect": "Allow",
"Principal": {
"Service": ["glue.amazonaws.com"]
},
"Action": "sts:AssumeRole"
}
]
}
Configure a VPC for your ETL job
You can configure your AWS Glue ETL jobs to run within a VPC when using connectors. You must
configure your VPC for the following, as needed:
• Public network access for data stores not in AWS. All data stores that are accessed by the job must be
available from the VPC subnet.
• If your job needs to access both VPC resources and the public internet, the VPC needs to have a
network address translation (NAT) gateway inside the VPC.
For more information, see Setting Up Your Environment to Access Data Stores in the AWS Glue
Developer Guide.
Populate the AWS Glue Data Catalog
When reading from or writing to a data source, your ETL job needs to know the schema of the data. The
ETL job can get this information from a table in the AWS Glue Data Catalog. You can use a crawler, the
AWS Glue console, AWS CLI, or an AWS CloudFormation template file to add databases and tables to
the Data Catalog. For more information about populating the Data Catalog, see Data Catalog in the AWS
Glue Developer Guide.
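For example, because the AWS CLI and AWS SDKs call the same AWS Glue APIs, you could also register a database and table programmatically. The following sketch uses the AWS SDK for Python (boto3); the bucket, database, table, and column names are placeholders for your own data.

import boto3

glue = boto3.client("glue")

# Create a database and register a table that describes CSV files stored in Amazon S3
glue.create_database(DatabaseInput={"Name": "flights-db"})
glue.create_table(
    DatabaseName="flights-db",
    TableInput={
        "Name": "flightscsv",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "fl_date", "Type": "string"},
                {"Name": "airline_id", "Type": "bigint"}
            ],
            "Location": "s3://DOC-EXAMPLE-BUCKET/flights/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","}
            }
        }
    }
)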
When using connectors, you can use the schema builder to enter the schema information when you
configure the data source node of your ETL job in AWS Glue Studio. For more information, see the
section called “Authoring jobs with custom connectors” (p. 85).
For some data sources, AWS Glue Studio can automatically infer the schema of the data it reads from the
files at the specified location.
• For Amazon S3 data sources, you can find more information at Using files in Amazon S3 for the data
source (p. 51).
• For streaming data sources, you can find more information at Using a streaming data source (p. 53).
Tutorial: Getting started
Topics
• Prerequisites (p. 39)
• Step 1: Start the job creation process (p. 39)
• Step 2: Edit the data source node in the job diagram (p. 40)
• Step 3: Edit the transform node of the job (p. 41)
• Step 4: Edit the data target node of the job (p. 41)
• Step 5: View the job script (p. 42)
• Step 6: Specify the job details and save the job (p. 42)
• Step 7: Run the job (p. 43)
• Next steps (p. 43)
Prerequisites
This tutorial has the following prerequisites:
To create these components, you can complete the service tutorial Add a crawler, which populates
the AWS Glue Data Catalog with the necessary objects. This tutorial also creates an IAM role with
the necessary permissions. You can find the tutorial on the AWS Glue service page at https://
console.aws.amazon.com/glue/. The tutorial is located in the left-side navigation, under Tutorials.
Alternatively, you can use the documentation version of this tutorial, Tutorial: Adding an AWS Glue
crawler (p. 114).
Step 1: Start the job creation process
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. On the AWS Glue Studio landing page, choose View jobs under the heading Create and manage
jobs.
3. On the Jobs page, under the heading Create job, choose the Source and target added to the graph
option. Then, choose S3 for the Source and S3 for the Target.
4. Choose the Create button to start the job creation process.
The job editing page opens with a simple three-node job diagram displayed.
Step 2: Edit the data source node in the job diagram
1. On the Node properties tab in the node details pane, for Name, enter a name that is unique for this
job.
The value you enter is used as the label for the data source node in the job diagram. If you use
unique names for the nodes in your job, then it's easier to identify each node in the job diagram, and
also to select parent nodes. For this tutorial, enter the name S3 Flight Data.
2. Choose the Data source properties - S3 tab in the node details panel.
3. Choose the Data Catalog table option for the S3 source type.
4. For Database, choose the flights-db database from the list of available databases in your AWS Glue
Data Catalog.
5. For Table, enter flight in the search field, and then choose the flightscsv table from your AWS
Glue Data Catalog.
6. (Optional) Choose the Output schema tab in the node details panel to view the data schema.
7. (Optional) After configuring the node properties and data source properties, you can preview the
dataset from your data source by choosing the Data preview tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. There is a cost associated with using this feature, and billing starts as soon as you provide
an IAM role.
By default, the first 5 columns are selected for viewing in the Data preview tab. To view other
columns, choose Previewing 5 of 65 fields. For example, you can deselect the first 5 columns and
select fl_date, airline_id, fl_num, tail_num, and origin_airport_id. Scroll to the end of
the column list and choose Confirm to save your choices.
After you have provided the required information for the data source node, a green check mark appears
on the node in the job diagram.
Step 3: Edit the transform node of the job
When you edit the Transform - ApplyMapping node, the original schema for your data is shown in the
Source key column in the node details panel. This is the data property key name (column name) that is
obtained from the source data and stored in the table in the AWS Glue Data Catalog.
The Target key column shows the key name that will appear in the data target. You can use this field to
change the data property key name in the output. The Data type column shows the data type of the key
and allows you to change it to different data type for the target. The Drop column contains a check box.
This box allows you to choose a field to drop it from the target schema.
1. Choose the Transform - ApplyMapping node in the job diagram to edit the data transformation
properties.
2. In the node details panel, on the Node properties tab, review the information.
Change the data type for the month and day keys to tinyint. The tinyint data type stores integers
using 1 byte of storage, with a range of values from 0 to 255. When changing the data type, you
must verify that the data type is supported by your target.
6. (Optional) Choose the Output schema tab in the node details panel to view the modified schema.
7. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
By default, the first 5 columns are selected for the data preview, but the columns are no longer the
same as the columns viewed on the data source node because we dropped two of the columns and
renamed a third column.
Notice that the Transform - Apply Mapping node in the job diagram now has a green check mark,
indicating that the node has been edited and has all the required information.
Step 4: Edit the data target node of the job
1. Choose the Data target - S3 bucket node in the job diagram to edit the data target properties.
2. In the node details panel on the right, choose the Node properties tab. For Name, enter a unique
name for the node, such as Revised Flight Data.
3. Choose the Data target properties - S3 tab.
4. For Format, choose JSON.
For the S3 Target Location, choose the Browse S3 button to see the Amazon S3 buckets that you
have access to, and choose one as the target destination.
For the Data Catalog update options, keep the default setting of Do not update the Data Catalog.
For more information about the available options, see Overview of data target options (p. 73).
Step 5: View the job script
To view the script, choose the Script tab at the top of the job editing pane. Don’t click the Edit script
button, because this will take you out of visual editor mode.
If you clicked the Edit script button and confirmed your choice, you can reload the page (without saving
the job first), to reset the Script tab.
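The script generated for your job will differ in its details, but a Python script produced by the visual editor has roughly the following shape. The mapping list is abbreviated, the source column types shown are assumptions, and the output path is a placeholder.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source node: the S3 Flight Data table from the Data Catalog
S3FlightData = glueContext.create_dynamic_frame.from_catalog(
    database="flights-db",
    table_name="flightscsv",
    transformation_ctx="S3FlightData"
)

# Transform node: ApplyMapping (mappings abbreviated; month and day become tinyint)
ApplyMapping_node = ApplyMapping.apply(
    frame=S3FlightData,
    mappings=[
        ("fl_date", "string", "fl_date", "string"),
        ("month", "long", "month", "tinyint"),
        ("day", "long", "day", "tinyint")
    ],
    transformation_ctx="ApplyMapping_node"
)

# Data target node: write JSON files to the Amazon S3 location you chose
glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node,
    connection_type="s3",
    connection_options={"path": "s3://DOC-EXAMPLE-BUCKET/output/"},
    format="json",
    transformation_ctx="RevisedFlightData"
)

job.commit()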
Step 6: Specify the job details and save the job
If you have many roles to choose from, you can start entering part of the role name in the IAM role
search field, and the roles with the matching text string will be displayed. For example, you can
enter tutorial in the search field to find all roles with tutorial (case-insensitive) in the name.
The AWS Identity and Access Management (IAM) role is used to authorize access to resources that
are used to run the job. You can only choose roles that already exist in your account. The role you
choose must have permission to access your Amazon S3 sources, targets, temporary directory,
scripts, and any libraries used by the job, as well as access to AWS Glue service resources.
For the steps to create a role, see Create an IAM Role for AWS Glue in the AWS Glue Developer Guide.
4. For the remaining fields, use the default values.
You should see a notification at the top of the page that the job was successfully saved.
If you don't see a notification that your job was successfully saved, then there is most likely information
missing that prevents the job from being saved.
• Review the job in the visual editor, and choose any node that doesn't have a green check mark.
• If any of the tabs above the visual editor pane have a callout, choose that tab and look for any fields
that are highlighted in red.
You can choose either the link in the notification for Run Details, or choose the Runs tab to view the run
status of the job.
On the Runs tab, there is a card for each recent run of the job with information about that job run.
For more information about the job run information, see the section called “View information for recent
job runs” (p. 108).
Next steps
After you start the job run, you might want to try some of the following tasks:
• View the job monitoring dashboard – Accessing the job monitoring dashboard (p. 101).
• Try a different transform on the data – Editing the data transform node (p. 55).
• View the jobs that exist in your account – View your jobs (p. 108).
• Run the job using a time-based schedule – Schedule job runs (p. 106).
Start the job creation process
On the Jobs page, you can see all the jobs that you have created either with AWS Glue Studio or AWS
Glue. You can view, manage, and run your jobs on this page.
Topics
• Start the job creation process (p. 44)
• Create jobs that use a connector (p. 45)
• Next steps for creating a job in AWS Glue Studio (p. 45)
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. You can either choose Create and manage jobs from the AWS Glue Studio landing page, or you can
choose Jobs from the navigation pane.
3. Choose one of the following options for creating the job:
• Visual with a blank canvas – To create a job starting with an empty canvas
• Visual with a source and target – To create a job starting with a source node, or with a source,
transform, and target node
You then choose the data source type. You can also choose the data target type, or you can choose
the Choose later option from the Target drop-down list to start with only a data source node in
the graph.
• Spark script editor – For those familiar with programming and writing ETL scripts, choose this
option to create a new Spark ETL job. You then have the option of writing Python or Scala code
in a script editor window, or uploading an existing script from a local file. If you choose to use the
script editor, you can't use the visual job editor to design or edit your job.
A Spark job is run in an Apache Spark environment managed by AWS Glue. By default, new scripts
are coded in Python. To write a new Scala script, see Creating and editing Scala scripts in AWS
Glue Studio (p. 77).
• Python Shell script editor – For those familiar with programming and writing ETL scripts, choose
this option to create a new Python shell job. You write code in a script editor window starting with
a template (boilerplate), or you can upload an existing script from a local file. If you choose to use
the Python shell editor, you can't use the visual job editor to design or edit your job.
A Python shell job runs Python scripts as a shell and supports a Python version that depends on
the AWS Glue version you choose for the job. You can use these jobs to schedule and run tasks
that don't require an Apache Spark environment.
• Jupyter Notebook – For those familiar with programming and writing ETL scripts, choose this
option to create a new Python or Scala job script using a notebook interface based on Jupyter
notebook. You write code in a notebook. If you choose to use the notebook interface to create
your job, you can't use the visual job editor to design or edit your job.
You can also use a command line interface to easily configure a notebook for authoring jobs.
4. Choose Create to create a job in the editing interface that you selected.
5. If you chose the Jupyter notebook option, the Create job in Jupyter notebook page appears instead
of the job editor interface. You must provide additional information before creating a notebook
authoring session. For more information about how to specify this information, see Getting started
with notebooks in AWS Glue Studio (p. 2).
Create jobs that use a connector
For detailed instructions, see the section called “Authoring jobs with custom connectors” (p. 85).
Next steps for creating a job in AWS Glue Studio
The next steps for creating and managing your jobs are covered in the following sections.
Accessing the job diagram editor
Topics
• Accessing the job diagram editor (p. 47)
• Job editor features (p. 47)
• Editing the data source node (p. 50)
• Editing the data transform node (p. 55)
• Configuring data target nodes (p. 73)
• Editing or uploading a job script (p. 76)
• Adding nodes to the job diagram (p. 79)
• Changing the parent nodes for a node in the job diagram (p. 79)
• Deleting nodes from the job diagram (p. 80)
• Choose Jobs in the console navigation pane. On the Jobs page, locate the job in the Your jobs list. You
can then either:
• Choose the name of the job in the Name column to open the job editor for that job.
• Choose the job, and then choose Edit job from the Actions list.
• Choose Monitoring in the console navigation pane. On the Monitoring page, locate the job in the Job
runs list. You can filter the rows in the Job runs list, as described in Job runs view (p. 101). Choose
the job you want to edit, and then choose View job from the Actions menu.
• A visual diagram of your job, with a node for each job task: Data source nodes for reading the data;
transform nodes for modifying the data; data target nodes for writing the data.
You can view and configure the properties of each node in the job diagram. You can also view the
schema and sample data for each node in the job diagram. These features help you to verify that your
job is modifying and transforming the data in the right way, without having to run the job.
• A Script viewing and editing tab, where you can modify the code generated for your job.
• A Job details tab, where you can configure a variety of settings to customize the environment in which
your AWS Glue ETL job runs.
• A Runs tab, where you can view the current and previous runs of the job, view the status of the job run,
and access the logs for the job run.
• A Schedules tab, where you can configure the start time for your job, or set up recurring job runs.
Using schema previews in the visual job editor
Before you can see the schema, the job editor needs permissions to access the data source. You can
specify an IAM role on the Job details tab of the editor or on the Output schema tab for a node. If the
IAM role has all the necessary permissions to access the data source, you can then view the schema on
the Output schema tab for a node.
Before you can see the data sample, the job editor needs permissions to access the data source. The first
time you choose the Data preview tab, you are prompted to choose an IAM role to use. This can be the
same role that you plan to use for your job, or it can be a different role. The IAM role you choose must
have the necessary permissions to create the data previews.
After you choose an IAM role, it takes about 20 to 30 seconds before the data appears. You are charged
for data preview usage as soon as you choose the IAM role. The following features help you when
viewing the data.
• Choose the settings icon (a gear symbol) to configure your preferences for data previews. You can
change the sample size or you can choose to wrap the text from one line to the next. These settings
apply to all nodes in the job diagram.
• Choose the Previewing x of y fields button to select which columns (fields) to preview. When you
preview your data using the default settings, the job editor shows the first 5 columns of your dataset.
You can change this to show all or none (not recommended).
• You can scroll through the data preview window both horizontally and vertically.
• Use the split/whole screen button to expand the Data preview tab to the entire screen (overlaying the
job graph), to better view the data and data structures.
Data previews help you create and test your job, without having to repeatedly run the job.
• You can test an IAM role to make sure you have access to your data sources or data targets.
• You can check that the transform is modifying the data in the intended way. For example, if you use a
Filter transform, you can make sure that the filter is selecting the right subset of data.
• If your dataset contains columns with values of multiple types, the data preview shows a list of
tuples for these columns. Each tuple contains the data type and its value.
Restrictions when using data previews
• The first time you choose the Data preview tab you must choose an IAM role. This role must have the
necessary permissions to access the data and other resources needed to create the data previews.
• After you provide an IAM role, it takes a while before the data is available for viewing. For datasets
with less than 1 GB of data, it can take up to one minute. If you have a large dataset, you should
use partitions to improve the loading time. Loading data directly from Amazon S3 has the best
performance.
• If you have a very large dataset, and it takes more than 30 minutes to query the data for the data
preview, the request will time out. You can reduce the dataset size to use data previews.
• By default, you see the first 5 columns in the Data preview tab. If the columns have no data values, you
will get a message that there is no data to display. You can increase the number of rows sampled, or
select different columns to see data values.
• Data previews are currently not supported for streaming data sources, or for data sources that use
custom connectors.
• Errors on one node affect the entire job. If any one node has an error with data previews, the error will
show up on all nodes until you correct it.
• If you change a data source for the job, then the child nodes of that data source might need to be
updated to match the new schema. For example, if you have an ApplyMapping node that modifies a
column, and the column does not exist in the replacement data source, you will need to update the
ApplyMapping transform node.
• If you view the Data preview tab for a SQL query transform node, and the SQL query uses an incorrect
field name, the Data preview tab shows an error.
There are two forms of code generated by AWS Glue Studio: the original, or Classic version, and a newer,
streamlined version. By default, the new code generator is used to create the job script. You can generate
a job script using the classic code generator on the Script tab by choosing the Generate classic script toggle
button.
The new version of the generated code differs from the classic version in several ways. New features in
AWS Glue Studio require the new version of code generation, and do not work with classic scripts. You
are prompted to update these jobs when you attempt to run them.
Editing the data source node
• Name: (Optional) Enter a name to associate with the node in the job diagram. This name should
be unique among all the nodes for this job.
• Node type: The node type determines the action that is performed by the node. In the list of
options for Node type, choose one of the values listed under the heading Data source.
4. Configure the Data source properties information. For more information, see the following sections:
• Using Data Catalog tables for the data source (p. 51)
• Using a connector for the data source (p. 51)
• Using files in Amazon S3 for the data source (p. 51)
• Using a streaming data source (p. 53)
5. (Optional) After configuring the node properties and data source properties, you can view the
schema for your data source by choosing the Output schema tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. If you have not specified an IAM role on the Job details tab, you are prompted to enter an
IAM role here.
6. (Optional) After configuring the node properties and data source properties, you can preview the
dataset from your data source by choosing the Data preview tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. There is a cost associated with using this feature, and billing starts as soon as you provide
an IAM role.
Using Data Catalog tables for the data source
• S3 source type: (For Amazon S3 data sources only) Choose the option Select a Catalog table to
use an existing AWS Glue Data Catalog table.
• Database: Choose the database in the Data Catalog that contains the source table you want to use
for this job. You can use the search field to search for a database by its name.
• Table: Choose the table associated with the source data from the list. This table must already exist
in the AWS Glue Data Catalog. You can use the search field to search for a table by its name.
• Partition predicate: (For Amazon S3 data sources only) Enter a Boolean expression based on
Spark SQL that includes only the partitioning columns. For example: "(year=='2020' and
month=='04')"
• Temporary directory: (For Amazon Redshift data sources only) Enter a path for the location of a
working directory in Amazon S3 where your ETL job can write temporary intermediate results.
• Role associated with the cluster: (For Amazon Redshift data sources only) Enter a role for your
ETL job to use that contains permissions for Amazon Redshift clusters. For more information, see
the section called “Data source and data target permissions” (p. 34).
Using files in Amazon S3 for the data source
If you use an Amazon S3 bucket as your data source, AWS Glue Studio detects the schema of the data
at the specified location from one of the files, or by using the file you specify as a sample file. Schema
detection occurs when you use the Infer schema button. If you change the Amazon S3 location or the
sample file, then you must choose Infer schema again to perform the schema detection using the new
information.
To configure a data source node that reads directly from files in Amazon S3
• S3 source type: (For Amazon S3 data sources only) Choose the option S3 location.
• S3 URL: Enter the path to the Amazon S3 bucket, folder, or file that contains the data for your job.
You can choose Browse S3 to select the path from the locations available to your account.
• Recursive: Choose this option if you want AWS Glue Studio to read data from files in child folders
at the S3 location.
If the child folders contain partitioned data, AWS Glue Studio doesn't add any partition
information that's specified in the folder names to the Data Catalog. For example, consider the
following folders in Amazon S3:
S3://sales/year=2019/month=Jan/day=1
S3://sales/year=2019/month=Jan/day=2
If you choose Recursive and select the sales folder as your S3 location, then AWS Glue Studio
reads the data in all the child folders, but doesn't create partitions for year, month or day.
• Data format: Choose the format that the data is stored in. You can choose JSON, CSV, or Parquet.
The value you select tells the AWS Glue job how to read the data from the source file.
Note
If you don't select the correct format for your data, AWS Glue Studio might infer the
schema correctly, but the job won't be able to correctly parse the data from the source
file.
You can enter additional configuration options, depending on the format you choose.
• JSON (JavaScript Object Notation)
• JsonPath: Enter a JSON path that points to an object that is used to define a table schema.
JSON path expressions always refer to a JSON structure in the same way as XPath expressions
are used in combination with an XML document. The "root member object" in the JSON path
is always referred to as $, even if it's an object or array. The JSON path can be written in dot
notation or bracket notation.
For more information about the JSON path, see JsonPath on the GitHub website.
• Records in source files can span multiple lines: Choose this option if a single record can span
multiple lines in the JSON file.
• CSV (comma-separated values)
• Delimiter: Enter a character to denote what separates each column entry in the row, for
example, ; or ,.
• Escape character: Enter a character that is used as an escape character. This character
indicates that the character that immediately follows the escape character should be taken
literally, and should not be interpreted as a delimiter.
• Quote character: Enter the character that is used to group separate strings into a single
value. For example, you would choose Double quote (") if you have values such as "This is
a single value" in your CSV file.
• Records in source files can span multiple lines: Choose this option if a single record can span
multiple lines in the CSV file.
• First line of source file contains column headers: Choose this option if the first row in the
CSV file contains column headers instead of data.
• Parquet (Apache Parquet columnar storage)
There are no additional settings to configure for data stored in Parquet format.
• Partition predicate: To partition the data that is read from the data source, enter a Boolean
expression based on Spark SQL that includes only the partitioning columns. For example:
"(year=='2020' and month=='04')"
• Advanced options: Expand this section if you want AWS Glue Studio to detect the schema of your
data based on a specific file.
• Schema inference: Choose the option Choose a sample file from S3 if you want to use a
specific file instead of letting AWS Glue Studio choose a file.
• Auto-sampled file: Enter the path to the file in Amazon S3 to use for inferring the schema.
If you're editing a data source node and change the selected sample file, choose Reload schema to
detect the schema by using the new sample file.
4. Choose the Infer schema button to detect the schema from the sources files in Amazon S3. If you
change the Amazon S3 location or the sample file, you must choose Infer schema again to infer the
schema using the new information.
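For reference, a data source node configured this way corresponds roughly to a create_dynamic_frame call in the generated script. The following sketch assumes a hypothetical bucket path and a CSV file with column headers; the exact options your job uses depend on the settings you chose above.
# Sketch of reading CSV files directly from Amazon S3 into a DynamicFrame.
# The S3 path is a placeholder; format_options mirror the CSV settings described above.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

flights_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-example-bucket/flight-data/"],  # placeholder location
        "recurse": True,  # read files in child folders, like the Recursive option
    },
    format="csv",
    format_options={
        "withHeader": True,  # first line of the source file contains column headers
        "separator": ",",    # delimiter
        "quoteChar": '"',    # quote character
    },
    transformation_ctx="flights_source",
)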
Using a streaming data source
Kinesis
• Kinesis source type: Choose the option Stream details to use direct access to the streaming
source or choose Data Catalog table to use the information stored there instead.
AWS Glue Studio automatically detects the schema from the streaming data.
If you choose Data Catalog table, specify the following additional information.
• Database: (Optional) Choose the database in the AWS Glue Data Catalog that contains the
table associated with your streaming data source. You can use the search field to search for
a database by its name.
• Table: (Optional) Choose the table associated with the source data from the list. This table
must already exist in the AWS Glue Data Catalog. You can use the search field to search for
a table by its name.
• Detect schema: Choose this option to have AWS Glue Studio detect the schema from the
streaming data, rather than using the schema information in a Data Catalog table. This
option is enabled automatically if you choose the Stream details option.
• Starting position: By default, the ETL job uses the Earliest option, which means it reads
data starting with the oldest available record in the stream. You can instead choose Latest,
which indicates the ETL job should start reading from just after the most recent record in the
stream.
• Window size: By default, your ETL job processes and writes out data in 100-second windows.
This allows data to be processed efficiently and permits aggregations to be performed on
data that arrives later than expected. You can modify this window size to increase timeliness
or aggregation accuracy.
AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that
has been read.
• Connection options: Expand this section to add key-value pairs to specify additional
connection options. For information about what options you can specify here, see
"connectionType": "kinesis" in the AWS Glue Developer Guide.
Kafka
• Apache Kafka source: Choose the option Stream details to use direct access to the streaming
source or choose Data Catalog table to use the information stored there instead.
If you choose Data Catalog table, specify the following additional information.
• Database: (Optional) Choose the database in the AWS Glue Data Catalog that contains the
table associated with your streaming data source. You can use the search field to search for
a database by its name.
• Table: (Optional) Choose the table associated with the source data from the list. This table
must already exist in the AWS Glue Data Catalog. You can use the search field to search for
a table by its name.
• Detect schema: Choose this option to have AWS Glue Studio detect the schema from the
streaming data, rather than using the schema information in a Data Catalog table. This
option is enabled automatically if you choose the Stream details option.
AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that
has been read.
• Connection options: Expand this section to add key-value pairs to specify additional
connection options. For information about what options you can specify here, see
"connectionType": "kafka" in the AWS Glue Developer Guide.
Note
Data previews are not currently supported for streaming data sources.
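A Kafka source configured with Stream details translates to a similar set of connection options. The sketch below uses placeholder values for the connection name and topic; confirm the option names against "connectionType": "kafka" in the AWS Glue Developer Guide.
# Sketch of an Apache Kafka streaming source. Assumes glueContext is the job's
# GlueContext; the connection name and topic are placeholders.
kafka_frame = glueContext.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "my-kafka-connection",  # AWS Glue connection to the cluster
        "topicName": "flight-events",             # Kafka topic to read from
        "classification": "json",
        "startingOffsets": "earliest",
    },
    transformation_ctx="kafka_frame",
)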
Editing the data transform node
In the pre-populated diagram for a job, between the data source and data target nodes is the Transform
- ApplyMapping node. You can configure this transform node to modify your data, or you can use
additional transforms.
The following built-in transforms are available with AWS Glue Studio:
• ApplyMapping (p. 55): Map data property keys in the data source to data property keys in the data
target. You can rename keys, modify the data types for keys, and choose which keys to drop from the
dataset.
• SelectFields (p. 56): Choose the data property keys that you want to keep.
• DropFields (p. 57): Choose the data property keys that you want to drop.
• RenameField (p. 57): Rename a single data property key.
• Spigot (p. 58): Write samples of the data to an Amazon S3 bucket.
• Join (p. 58): Join two datasets into one dataset using a comparison phrase on the specified data
property keys. You can use inner, outer, left, right, left semi, and left anti joins.
• SplitFields (p. 60): Split data property keys into two DynamicFrames. Output is a collection of
DynamicFrames: one with selected data property keys, and one with the remaining data property
keys.
• SelectFromCollection (p. 60): Choose one DynamicFrame from a collection of DynamicFrames.
The output is the selected DynamicFrame.
• FillMissingValues (p. 62): Locate records in the dataset that have missing values and add a new
field with a suggested value that is determined by imputation.
• Filter (p. 62): Split a dataset into two, based on a filter condition.
• DropNullFields (p. 63): Removes columns from the dataset if all values in the column are
‘null’.
• SQL (p. 64): Enter SparkSQL code into a text entry field to use a SQL query to transform the data.
The output is a single DynamicFrame.
• Aggregate (p. 66): Performs a calculation (such as average, sum, min, max) on selected fields and
rows, and creates a new field with the newly calculated value(s).
• Custom transform (p. 68): Enter code into a text entry field to use custom transforms. The output
is a collection of DynamicFrames.
You can add additional ApplyMapping nodes to the job diagram as needed – for example, to modify
additional data sources or following a Join transform.
Note
The ApplyMapping transform is not case-sensitive.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
ApplyMapping to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Modify the input schema:
• To rename a data property key, enter the new name of the key in the Target key field.
• To change the data type for a data property key, choose the new data type for the key from the
Data type list.
• To remove a data property key from the target schema, choose the Drop check box for that key.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
Using SelectFields to remove most data property keys
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SelectFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is
not already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Under the heading SelectFields, choose the data property keys in the dataset that you want to keep.
Any data property keys not selected are dropped from the dataset.
You can also choose the check box next to the column heading Field to automatically choose all the
data property keys in the dataset. Then you can deselect individual data property keys to remove
them from the dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
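In the generated script, this node becomes a SelectFields call that keeps only the listed property keys. The parent frame and field names in the sketch below are illustrative.
# Sketch of a SelectFields transform: keep only the listed keys, drop everything else.
# ApplyMapping_node is the parent node's DynamicFrame from earlier in the script.
from awsglue.transforms import SelectFields

selected = SelectFields.apply(
    frame=ApplyMapping_node,
    paths=["flight_date", "carrier", "dest"],  # illustrative keys to keep
    transformation_ctx="selected",
)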
Using DropFields to keep most data property keys
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
DropFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Under the heading DropFields, choose the data property keys to drop from the data source.
You can also choose the check box next to the column heading Field to automatically choose all the
data property keys in the dataset. Then you can deselect individual data property keys so they are
retained in the dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
RenameField to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Under the heading Data field, choose a property key from the source data and then enter a new
name in the New field name field.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
Using Spigot to sample your dataset
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Spigot
to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Enter an Amazon S3 path or choose Browse S3 to choose a location in Amazon S3. This is the
location where the job writes the JSON file that contains the data sample.
5. Enter information for the sampling method. You can specify a value for Number of records to write
starting from the beginning of the dataset and a Probability threshold (entered as a decimal value
with a maximum value of 1) of picking any given record.
For example, to write the first 50 records from the dataset, you would set Number of records to 50
and Probability threshold to 1 (100%).
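In a script, the same sampling can be expressed with the Spigot transform. The S3 path below is a placeholder, and the option names (topk for the number of records, prob for the probability threshold) are assumptions based on typical Glue scripts.
# Sketch of a Spigot transform that writes a small JSON sample of the dataset to Amazon S3.
# "selected" is a DynamicFrame produced by an earlier node in the script.
from awsglue.transforms import Spigot

Spigot.apply(
    frame=selected,
    path="s3://my-example-bucket/samples/",  # placeholder output location
    options={"topk": 50, "prob": 1.0},       # first 50 records, 100% pick probability
)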
Joining datasets
The Join transform allows you to combine two datasets into one. You specify the key names in the
schema of each dataset to compare. The output DynamicFrame contains rows where keys meet the join
condition. The rows in each dataset that meet the join condition are combined into a single row in the
output DynamicFrame that contains all the columns found in either dataset.
1. If there is only one data source available, you must add a new data source node to the job diagram.
See Adding nodes to the job diagram (p. 79) for details.
2. Choose one of the source nodes for the join. Choose Transform in the toolbar at the top of the
visual editor, and then choose Join to add a new transform to your job diagram.
3. On the Node properties tab, enter a name for the node in the job diagram.
4. In the Node properties tab, under the heading Node parents, add a parent node so that there are
two datasets providing inputs for the join. The parent can be a data source node or a transform
node.
Note
A join can have only two parent nodes.
5. Choose the Transform tab.
If you see a message indicating that there are conflicting key names, you can either:
• Choose Resolve it to automatically add an ApplyMapping transform node to your job diagram. The
ApplyMapping node adds a prefix to any keys in the dataset that have the same name as a key in
the other dataset. For example, if you use the default value of right, then any keys in the right
dataset that have the same name as a key in the left dataset will be renamed to (right)key
name.
• Manually add a transform node earlier in the job diagram to remove or rename the conflicting
keys.
6. Choose the type of join in the Join type list.
• Inner join: Returns a row with columns from both datasets for every match based on the join
condition. Rows that don't satisfy the join condition aren't returned.
• Left join: All rows from the left dataset and only the rows from the right dataset that satisfy the
join condition.
• Right join: All rows from the right dataset and only the rows from the left dataset that satisfy the
join condition.
• Outer join: All rows from both datasets.
• Left semi join: All rows from the left dataset that have a match in the right dataset based on the
join condition.
• Left anti join: All rows in the left dataset that don't have a match in the right dataset based on
join condition.
7. On the Transform tab, under the heading Join conditions, choose Add condition. Choose a
property key from each dataset to compare. Property keys on the left side of the comparison
operator are referred to as the left dataset and property keys on the right are referred to as the right
dataset.
For more complex join conditions, you can add additional matching keys by choosing Add condition
more than once. If you accidentally add a condition, you can choose the delete icon ( ) to remove
it.
8. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
9. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
For an example of the join output schema, consider a join between two datasets that both contain id
and hire_date property keys. The join is configured to match on the id and hire_date keys using the =
comparison operator. Because both datasets contain id and hire_date keys, you chose Resolve it to
automatically add the prefix right to the keys in the right dataset.
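For an inner equijoin like this one, the generated script typically uses the Join transform; other join types may instead be expressed with Spark DataFrame joins. The following sketch assumes two parent DynamicFrames named left_frame and right_frame.
# Sketch of an inner join on the id and hire_date keys of two parent DynamicFrames.
from awsglue.transforms import Join

joined = Join.apply(
    frame1=left_frame,    # left dataset
    frame2=right_frame,   # right dataset
    keys1=["id", "hire_date"],
    keys2=["id", "hire_date"],
    transformation_ctx="joined",
)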
Using SplitFields to split a dataset into two
The SplitFields transform is case sensitive. Add an ApplyMapping transform as a parent node if you need
case-insensitive property key names.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SplitFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Choose which property keys you want to put into the first dataset. The keys that you do not choose
are placed in the second dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
7. Configure a SelectFromCollection transform node to process the resulting datasets.
Using SelectFromCollection to choose which dataset to keep
You must use this transform after you use a transform that creates a collection of DynamicFrames, such
as the SplitFields transform or a Custom transform.
If you don't add a SelectFromCollection transform node to your job diagram after any of these
transforms, you will get an error for your job.
The parent node for this transform must be a node that returns a collection of DynamicFrames. If you
choose a parent for this transform node that returns a single DynamicFrame, such as a Join transform,
your job returns an error.
Similarly, if you use a SelectFromCollection node in your job diagram as the parent for a transform that
expects a single DynamicFrame as input, your job returns an error.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SelectFromCollection to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Under the heading Frame index, choose the array index number that corresponds to the
DynamicFrame you want to select from the collection of DynamicFrames.
For example, if the parent node for this transform is a SplitFields transform, on the Output
schema tab of that node you can see the schema for each DynamicFrame. If you want to keep the
DynamicFrame associated with the schema for Output 2, you would select 1 for the value of Frame
index, which is the second value in the list.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
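Together, SplitFields and SelectFromCollection look roughly like the following in a script: SplitFields returns a collection of two DynamicFrames, and SelectFromCollection picks one of them by key. The frame and field names are illustrative.
# Sketch: split selected keys into their own DynamicFrame, then pick one frame from
# the resulting collection. ApplyMapping_node is the parent DynamicFrame.
from awsglue.transforms import SplitFields, SelectFromCollection

split_collection = SplitFields.apply(
    frame=ApplyMapping_node,
    paths=["carrier", "dest"],   # keys placed in the first output frame
    name1="selected_fields",
    name2="remaining_fields",
    transformation_ctx="split_collection",
)

# Frame index 1 corresponds to the second DynamicFrame in the collection.
remaining = SelectFromCollection.apply(
    dfc=split_collection,
    key=list(split_collection.keys())[1],
    transformation_ctx="remaining",
)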
Find and fill missing values in a dataset
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
FillMissingValues to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. For Data field, choose the column or field name from the source data that you want to analyze for
missing values.
5. (Optional) In the New field name field, enter a name for the field added to each record that will
hold the estimated replacement value for the analyzed field. If the analyzed field doesn't have a
missing value, the value in the analyzed field is copied into the new field.
If you don't specify a name for the new field, the default name is the name of the analyzed column
with _filled appended. For example, if you enter Age for Data field and don't specify a value for
New field name, a new field named Age_filled is added to each record.
6. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
7. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
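In generated scripts, this transform is usually imported from the awsglueml.transforms module; treat the module path and argument names in the sketch below as assumptions to verify for your AWS Glue version. It uses the Age example described above.
# Sketch of FillMissingValues: impute missing values in the Age column and write the
# result to a new Age_filled column. ApplyMapping_node is the parent DynamicFrame.
from awsglueml.transforms import FillMissingValues

filled = FillMissingValues.apply(
    frame=ApplyMapping_node,
    missing_values_column="Age",
    output_column="Age_filled",
    transformation_ctx="filled",
)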
The Filter transform filters the dataset based on one or more filter conditions:
• For string data types, you can filter rows where the key value matches a specified string.
• For numeric data types, you can filter rows by comparing the key value to a specified value using the
comparison operators <, >, =, !=, <=, and >=.
If you specify multiple filter conditions, the results are combined using an AND operator by default, but
you can choose OR instead.
The Filter transform is case sensitive. Add an ApplyMapping transform as a parent node if you need case-
insensitive property key names.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Filter to
add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Choose either Global AND or Global OR. This determines how multiple filter conditions are
combined. All conditions are combined using either AND or OR operations. If you have only a single
filter condition, then you can choose either one.
5. Choose the Add condition button in the Filter condition section to add a filter condition.
In the Key field, choose a property key name from the dataset. In the Operation field, choose the
comparison operator. In the Value field, enter the comparison value. Here are some examples of
filter conditions:
When you filter on string values, make sure that the comparison value uses a regular expression
format that matches the script language selected in the job properties (Python or Scala).
6. Add additional filter conditions, as needed.
7. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
8. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
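In code, a Filter node corresponds to a Filter.apply call that takes a predicate function. The sketch below keeps rows where a hypothetical year key equals 2020 and month is at most 6.
# Sketch of a Filter transform: keep only rows that satisfy the predicate.
# ApplyMapping_node is the parent DynamicFrame from earlier in the script.
from awsglue.transforms import Filter

filtered = Filter.apply(
    frame=ApplyMapping_node,
    f=lambda row: row["year"] == 2020 and row["month"] <= 6,  # Global AND of two conditions
    transformation_ctx="filtered",
)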
Using DropNullFields to remove fields with null values
You can choose which values should be treated as null values when deciding whether to remove a field:
• Empty String ("" or '') - fields that contain empty strings will be removed
• "null string" - fields that contain the string with the word 'null' will be removed
• -1 integer - fields that contain a -1 (negative one) integer will be removed
3. If needed, you can also specify custom null values. These are null values that may be unique to your
dataset. To add a custom null value, choose Add new value.
4. Enter the custom null value. For example, this can be zero, or any value that is being used to represent
a null in the dataset.
5. Choose the data type in the drop-down field. Data types can either be String or Integer.
Note
Custom null values and their data types must match exactly in order for the fields to be
recognized as null values and the fields removed. Partial matches, where only the custom
null value matches but the data type does not, will not result in the fields being removed.
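The basic form of this transform in a script is a single DropNullFields.apply call, as in the sketch below. How custom null values are represented in generated code may vary, so this shows only the default behavior.
# Sketch of DropNullFields: remove columns whose values are all null.
# ApplyMapping_node is the parent DynamicFrame from earlier in the script.
from awsglue.transforms import DropNullFields

without_nulls = DropNullFields.apply(
    frame=ApplyMapping_node,
    transformation_ctx="without_nulls",
)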
Using a SQL query to transform data
A SQL transform node can have multiple datasets as inputs, but produces only a single dataset as output.
It contains a text field, where you enter the Apache SparkSQL query. You can assign aliases to each
dataset used as input, to help simplify the SQL query. For more information about the SQL syntax, see the
Spark SQL documentation.
Note
If you use a Spark SQL transform with a data source located in a VPC, add an AWS Glue VPC
endpoint to the VPC that contains the data source. For more information about configuring
development endpoints, see Adding a Development Endpoint, Setting Up Your Environment for
Development Endpoints, and Accessing Your Development Endpoint in the AWS Glue Developer
Guide.
1. (Optional) Add a transform node to the job diagram, if needed. Choose Spark SQL for the node
type.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, or if you want multiple inputs for the SQL transform, choose a node from the Node
parents list to use as the input source for the transform. Add additional parent nodes as needed.
3. Choose the Transform tab in the node details panel.
4. The source datasets for the SQL query are identified by the names you specified in the Name field
for each node. If you do not want to use these names, or if the names are not suitable for a SQL
query, you can associate a name to each dataset. The console provides default aliases, such as
MyDataSource.
For example, if a parent node for the SQL transform node is named Rename Org PK field, you
might associate the name org_table with this dataset. This alias can then be used in the SQL query
in place of the node name.
5. In the text entry field under the heading Code block, paste or enter the SQL query. The text field
displays SQL syntax highlighting and keyword suggestions.
6. With the SQL transform node selected, choose the Output schema tab, and then choose Edit.
Provide the columns and data types that describe the output fields of the SQL query.
Specify the schema using the following actions in the Output schema section of the page:
• To rename a column, place the cursor in the Key text box for the column (also referred to as a field
or property key) and enter the new name.
• To change the data type for a column, select the new data type for the column from the drop-
down list.
• To add a new top-level column to the schema, choose the Overflow ( ) button, and then choose
Add root key. New columns are added at the top of the schema.
• To remove a column from the schema, choose the delete icon ( ) to the far right of the Key
name.
7. When you finish specifying the output schema, choose Apply to save your changes and exit the
schema editor. If you do not want to save your changes, choose Cancel to exit the schema editor.
8. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
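One way to express the same operation in code is to register each input DynamicFrame as a temporary view under its alias and run the query with Spark SQL, as in the following sketch. The alias org_table and the query itself are illustrative, and RenameOrgPK_node stands in for the parent node's DynamicFrame.
# Sketch of a SQL transform: expose a parent DynamicFrame under the alias org_table,
# run a SparkSQL query, and wrap the result back into a DynamicFrame.
from awsglue.dynamicframe import DynamicFrame

spark = glueContext.spark_session  # glueContext comes from the job's boilerplate

RenameOrgPK_node.toDF().createOrReplaceTempView("org_table")
result_df = spark.sql("SELECT org_id, org_name FROM org_table WHERE org_id IS NOT NULL")
sql_result = DynamicFrame.fromDF(result_df, glueContext, "sql_result")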
Using Aggregate to perform summary calculations on selected fields
When fields are selected, the name and datatype are shown. To remove a field, choose 'X' on the
field.
Creating a custom transformation
When using custom code, you must use a schema editor to indicate the changes made to the output
through the custom code. When editing the schema, you can perform the following actions:
You must use a SelectFromCollection transform to choose a single DynamicFrame from the result of your
Custom transform node before you can send the output to a target location.
Use the following tasks to add a custom transform node to your job diagram.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Custom
transform to add a custom transform to your job diagram.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, or if you want multiple inputs for the custom transform, then choose a node from
the Node parents list to use as the input source for the transform.
1. With the custom transform node selected in the job diagram, choose the Transform tab.
2. In the text entry field under the heading Code block, paste or enter the code for the transformation.
The code that you use must match the language specified for the job on the Job details tab.
When referring to the input nodes in your code, AWS Glue Studio names the DynamicFrames
returned by the job diagram nodes sequentially based on the order of creation. Use one of the
following naming methods in your code:
• Classic code generation – Use functional names to refer to the nodes in your job diagram.
• Data source nodes: DataSource0, DataSource1, DataSource2, and so on.
• Transform nodes: Transform0, Transform1, Transform2, and so on.
• New code generation – Use the name specified on the Node properties tab of a node, appended
with '_node1', '_node2', and so on. For example, S3bucket_node1, ApplyMapping_node2,
S3bucket_node2, MyCustomNodeName_node1.
For more information about the new code generator, see Script code generation (p. 49).
The following examples show the format of the code to enter in the code box:
Python
The following example takes the first DynamicFrame received, converts it to a DataFrame to apply
the native filter method (keeping only records that have over 1000 votes), then converts it back to a
DynamicFrame before returning it.
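A Python version of the transform might look like the following sketch. The function name MyTransform and the collection key CustomTransform0 follow the naming pattern AWS Glue Studio uses for custom transforms and are assumptions here.
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first DynamicFrame in the incoming collection and convert it to a DataFrame.
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Keep only records that have more than 1000 votes.
    df_filtered = df.filter(df["vote_count"] > 1000)
    # Convert back to a DynamicFrame and return it in a collection.
    dyf_filtered = DynamicFrame.fromDF(df_filtered, glueContext, "filter_votes")
    return DynamicFrameCollection({"CustomTransform0": dyf_filtered}, glueContext)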
Scala
The following example takes the first DynamicFrame received, converts it to a DataFrame to apply
the native filter method (keeping only records that have over 1000 votes), then converts it back to a
DynamicFrame before returning it.
object FilterHighVoteCounts {
  def execute(glueContext : GlueContext, input : Seq[DynamicFrame]) : Seq[DynamicFrame] = {
    val frame = input(0).toDF()
    val filtered = DynamicFrame(frame.filter(frame("vote_count") > 1000), glueContext)
    Seq(filtered)
  }
}
A custom code node can have any number of parent nodes, each providing a DynamicFrame as
input for your custom code. A custom code node returns a collection of DynamicFrames. Each
DynamicFrame that is used as input has an associated schema. You must add a schema that describes
each DynamicFrame returned by the custom code node.
Note
When you set your own schema on a custom transform, AWS Glue Studio does not inherit
schemas from previous nodes. To update the schema, select the Custom transform node, then
choose the Data preview tab. Once the preview is generated, choose 'Use Preview Schema'. The
schema will then be replaced by the schema using the preview data.
1. With the custom transform node selected in the job diagram, in the node details panel, choose the
Output schema tab.
2. Choose Edit to make changes to the schema.
If you have nested data property keys, such as an array or object, you can choose the Expand-Rows
icon ( ) on the top right of each schema panel to expand the list of child data property keys. After
you choose this icon, it changes to the Collapse-Rows icon ( ), which you can choose to collapse
the list of child property keys.
3. Modify the schema using the following actions in the section on the right side of the page:
• To rename a property key, place the cursor in the Key text box for the property key, then enter the
new name.
• To change the data type for a property key, use the list to choose the new data type for the
property key.
• To add a new top-level property key to the schema, choose the Overflow ( ) icon to the left of
the Cancel button, and then choose Add root key.
• To add a child property key to the schema, choose the Add-Key icon associated with the parent
key. Enter a name for the child key and choose the data type.
• To remove a property key from the schema, choose the Remove icon ( ) to the far right of the
key name.
4. If your custom transform code uses multiple DynamicFrames, you can add additional output
schemas.
• To add a new, empty schema, choose the Overflow ( ) icon, and then choose Add output
schema.
• To copy an existing schema to a new output schema, make sure the schema you want to copy is
displayed in the schema selector. Choose the Overflow ( ) icon, and then choose Duplicate.
If you want to remove an output schema, make sure the schema you want to copy is displayed in the
schema selector. Choose the Overflow ( ) icon, and then choose Delete.
5. Add new root keys to the new schema or edit the duplicated keys.
6. When you are modifying the output schemas, choose the Apply button to save your changes and
exit the schema editor.
If you do not want to save your changes, choose the Cancel button.
1. Add a SelectFromCollection transform node, which has the custom transform node as its
parent node. Update this transform to indicate which dataset you want to use. See Using
SelectFromCollection to choose which dataset to keep (p. 61) for more information.
2. Add additional SelectFromCollection transforms to the job diagram if you want to use additional
DynamicFrames produced by the custom transform node.
Consider a scenario in which you add a custom transform node to split a flight dataset into multiple
datasets, but duplicate some of the identifying property keys in each output schema, such as the
flight date or flight number. You add a SelectFromCollection transform node for each output schema,
with the custom transform node as its parent.
3. (Optional) You can then use each SelectFromCollection transform node as input for other nodes in
the job, or as a parent for a data target node.
Configuring data target nodes
• S3 – The job writes the data in a file in the Amazon S3 location you choose and in the format you
specify.
If you configure partition columns for the data target, then the job writes the dataset to Amazon S3
into directories based on the partition key.
• AWS Glue Data Catalog – The job uses the information associated with the table in the Data Catalog
to write the output data to a target location.
You can create the table manually or with the crawler. You can also use AWS CloudFormation
templates to create tables in the Data Catalog.
• A connector – A connector is a piece of code that facilitates communication between your data
store and AWS Glue. The job uses the connector and associated connection to write the output data
to a target location. You can either subscribe to a connector offered in AWS Marketplace, or you
can create your own custom connector. For more information, see Adding connectors to AWS Glue
Studio (p. 82)
You can choose to update the Data Catalog when your job writes to an Amazon S3 data target. Instead
of requiring a crawler to update the Data Catalog when the schema or partitions change, this option
makes it easy to keep your tables up to date. This option simplifies the process of making your data
available for analytics by optionally adding new tables to the Data Catalog, updating table partitions,
and updating the schema of your tables directly from the job.
Editing the data target node
1. (Optional) If you need to add a target node, choose Target in the toolbar at the top of the visual
editor, and then choose either S3 or Glue Data Catalog.
• If you choose S3 for the target, then the job writes the dataset to one or more files in the Amazon
S3 location you specify.
• If you choose AWS Glue Data Catalog for the target, then the job writes to a location described by
the table selected from the Data Catalog.
2. Choose a data target node in the job diagram. When you choose a node, the node details panel
appears on the right-side of the page.
3. Choose the Node properties tab, and then enter the following information:
• Name: Enter a name to associate with the node in the job diagram.
• Node type: A value should already be selected, but you can change it as needed.
• Node parents: The parent node is the node in the job diagram that provides the output data you
want to write to the target location. For a pre-populated job diagram, the target node should
already have the parent node selected. If there is no parent node displayed, then choose a parent
node from the list.
• Format: Choose a format from the list. The available format types for the data results are:
• JSON: JavaScript Object Notation.
• CSV: Comma-separated values.
• Avro: Apache Avro JSON binary.
• Parquet: Apache Parquet columnar storage.
• Glue Parquet: A custom Parquet writer type that is optimized for DynamicFrames as the data
format. Instead of requiring a precomputed schema for the data, it computes and modifies the
schema dynamically.
• ORC: Apache Optimized Row Columnar (ORC) format.
To learn more about these format options, see Format Options for ETL Inputs and Outputs in AWS
Glue in the AWS Glue Developer Guide.
• Compression Type: You can choose to optionally compress the data using either the gzip or
bzip2 format. The default is no compression, or None.
• S3 Target Location: The Amazon S3 bucket and location for the data output. You can choose the
Browse S3 button to see the Amazon S3 buckets that you have access to and choose one as the
target destination.
• Data catalog update options
• Do not update the Data Catalog: (Default) Choose this option if you don't want the job to
update the Data Catalog, even if the schema changes or new partitions are added.
• Create a table in the Data Catalog and on subsequent runs, update the schema and add new
partitions: If you choose this option, the job creates the table in the Data Catalog on the first
run of the job. On subsequent job runs, the job updates the Data Catalog table if the schema
changes or new partitions are added.
You must also select a database from the Data Catalog and enter a table name.
• Create a table in the Data Catalog and on subsequent runs, keep existing schema and add
new partitions: If you choose this option, the job creates the table in the Data Catalog on the
first run of the job. On subsequent job runs, the job updates the Data Catalog table only to add
new partitions.
You must also select a database from the Data Catalog and enter a table name.
• Partition keys: Choose which columns to use as partitioning keys in the output. To add more
partition keys, choose Add a partition key.
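As an illustration of the S3 target and partition key options described above, the generated script typically writes the output with a call along the following lines. This is a sketch only; the bucket path, partition column, and node names are placeholders:

DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=Transform0,  # the DynamicFrame produced by the parent node (placeholder name)
    connection_type="s3",
    connection_options={"path": "s3://<output-bucket>/<prefix>/",
        "partitionKeys": ["<partition_column>"]},
    format="parquet",
    transformation_ctx="DataSink0")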
To configure the data properties for a target that uses a Data Catalog table
• Database: Choose the database that contains the table you want to use as the target from the list.
This database must already exist in the Data Catalog.
• Table: Choose the table that defines the schema of your output data from the list. This table must
already exist in the Data Catalog.
A table in the Data Catalog consists of the names of columns, data type definitions, partition
information, and other metadata about the target dataset. Your job writes to a location described
by this table in the Data Catalog.
For more information about creating tables in the Data Catalog, see Defining Tables in the Data
Catalog in the AWS Glue Developer Guide.
• Data catalog update options
• Do not change table definition: (Default) Choose this option if you don't want the job to
update the Data Catalog, even if the schema changes, or new partitions are added.
• Update schema and add new partitions: If you choose this option, the job updates the Data
Catalog table if the schema changes or new partitions are added.
• Keep existing schema and add new partitions: If you choose this option, the job updates the
Data Catalog table only to add new partitions.
• Partition keys: Choose which columns to use as partitioning keys in the output. To add more
partition keys, choose Add a partition key.
Editing or uploading a job script
You can use the visual editor to edit job nodes only if the jobs were created with AWS Glue Studio. If the
job was created using the AWS Glue console, through API commands, or with the command line interface
(CLI), you can use the script editor in AWS Glue Studio to edit the job script, parameters, and schedule.
You can also edit the script for a job created in AWS Glue Studio by converting the job to script-only
mode.
1. If creating a new job, on the Jobs page, choose the Spark script editor option to create a Spark job
or choose the Python Shell script editor to create a Python shell job. You can either write a new
script, or upload an existing script. If you choose Spark script editor, you can write or upload either
a Scala or Python script. If you choose Python Shell script editor, you can only write or upload a
Python script.
After choosing the option to create a new job, in the Options section that appears, you can choose
to either start with a starter script (Create a new script with boilerplate code), or you can upload a
local file to use as the job script. A sketch of what a typical starter script looks like follows this procedure.
If you chose Spark script editor, you can upload either Python or Scala script files. Scala scripts
must have the file extension .scala. Python scripts must be recognized as files of type Python. If
you chose Python Shell script editor, you can upload only Python script files.
When you are finished making your choices, choose Create to create the job and open the visual
editor.
2. Go to the visual job editor for the new or saved job, and then choose the Script tab.
3. If you didn't create a new job using one of the script editor options, and you have never edited the
script for an existing job, the Script tab displays the heading Script (Locked). This means the script
editor is in read-only mode. Choose Edit script to unlock the script for editing.
To make the script editable, AWS Glue Studio converts your job from a visual job to a script-only job.
If you unlock the script for editing, you can't use the visual editor anymore for this job after you save
it.
In the confirmation window, choose Confirm to continue or Cancel to keep the job available for
visual editing.
If you choose Confirm, the Visual tab no longer appears in the editor. You can use AWS Glue Studio
to modify the script using the script editor, modify the job details or schedule, or view job runs.
Note
Until you save the job, the conversion to a script-only job is not permanent. If you refresh
the console web page, or close the job before saving it and reopen it in the visual editor, you
will still be able to edit the individual nodes in the visual editor.
4. Edit the script as needed.
When you are done editing the script, choose Save to save the job and permanently convert the job
from visual to script-only.
5. (Optional) You can download the script from the AWS Glue Studio console by choosing the
Download button on the Script tab. When you choose this button, a new browser window
opens, displaying the script from its location in Amazon S3. The Script filename and Script path
parameters in the Job details tab of the job determine the name and location of the script file in
Amazon S3.
When you save the job, AWS Glue saves the job script at the location specified by these fields. If you
modify the script file at this location within Amazon S3, AWS Glue Studio will load the modified
script the next time you edit the job.
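For orientation, a Spark (Python) job started with the Create a new script with boilerplate code option generally begins with a script similar to the following sketch. The exact boilerplate that AWS Glue Studio generates may differ:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job name passed by AWS Glue and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Your ETL logic goes here

job.commit()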
Creating and editing Scala scripts in AWS Glue Studio
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object MyScript {
  def main(args: Array[String]): Unit = {
    // Create the Spark and AWS Glue contexts used by the rest of the script
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
  }
}
6. Write your Scala job script in the editor. Add additional import statements as needed.
Creating and editing Python shell jobs in AWS Glue Studio
Refer to the instructions at Start the job creation process (p. 44).
The job properties that are supported for Python shell jobs are not the same as those supported for
Spark jobs. The following list describes the changes to the available job parameters for Python shell jobs
on the Job details tab.
• The Type property for the job is automatically set to Python Shell and can't be changed.
• Instead of Language, there is a Python version property for the job. Currently, Python shell jobs
created in AWS Glue Studio use Python 3.6.
• The Glue version property is not available, because it does not apply to Python shell jobs.
• Instead of Worker type and Number of workers, a Data processing units property is shown instead.
This job property determines how many data processing units (DPUs) are consumed by the Python
shell when running the job.
• The Job bookmark property is not available, because it is not supported for Python shell jobs.
• Under Advanced properties, the following properties are not available for Python shell jobs.
• Job metrics
• Continuous logging
Adding nodes to the job diagram
1. Go to the visual editor for a new or saved job and choose the Visual tab.
2. Use the toolbar buttons to add a node of a specific type: Source, Transform, or Target.
• For a source node, see Editing the data source node (p. 50).
• For a transform node, see Editing the data transform node (p. 55).
• For a data target node, see Editing the data target node (p. 74).
4. If you're inserting a node in between two nodes in the job diagram, then perform the following
actions:
a. Choose the node that will be the parent for the new node.
b. Choose one of the toolbar buttons to add a new node to the job diagram. The new node is
added as a child of the currently selected node.
c. Choose the node that will be the child of the newly added node and change its parent node to
point to the newly added node.
If you added a node by mistake, you can use the Undo button on the toolbar to reverse the action.
1. Choose the node in the job diagram that you want to modify.
2. In the node details panel, on the Node properties tab, under the heading Node parents remove the
current parent for the node.
3. Choose a new parent node from the list.
4. Modify the other properties of the node as needed to match the newly selected parent node.
If you modified a node by mistake, you can use the Undo button on the toolbar to reverse the action.
Deleting nodes from the job diagram
To remove a node
1. Go to the visual editor for a new or saved job and choose the Visual tab.
2. Choose the node you want to remove.
3. In the toolbar at the top of the visual editing pane, choose the Remove button.
4. If the node you removed had children nodes, modify the parents for those nodes as needed.
If you removed a node by mistake, you can use the Undo button on the toolbar to reverse the action.
Overview of using connectors and connections
A connector is an optional code package that assists with accessing data stores in AWS Glue Studio. You
can subscribe to several connectors offered in AWS Marketplace.
When creating ETL jobs, you can use a natively supported data store, a connector from AWS Marketplace,
or your own custom connectors. If you use a connector, you must first create a connection for the
connector. A connection contains the properties that are required to connect to a particular data
store. You use the connection with your data sources and data targets in the ETL job. Connectors and
connections work together to facilitate access to the data stores.
Topics
• Overview of using connectors and connections (p. 81)
• Adding connectors to AWS Glue Studio (p. 82)
• Creating connections for connectors (p. 85)
• Authoring jobs with custom connectors (p. 85)
• Managing connectors and connections (p. 90)
• Developing custom connectors (p. 92)
• Restrictions for using connectors and connections in AWS Glue Studio (p. 94)
You can subscribe to connectors for non-natively supported data stores in AWS Marketplace, and then
use those connectors when you're creating connections. Developers can also create their own connectors,
and you can use them when creating connections.
Note
Connections created using the AWS Glue console do not appear in AWS Glue Studio.
Connections created using custom or AWS Marketplace connectors in AWS Glue Studio appear in
the AWS Glue console with type set to UNKNOWN.
The following steps describe the overall process of using connectors in AWS Glue Studio:
1. Subscribe to a connector in AWS Marketplace, or develop your own connector and upload it to AWS
Glue Studio. For more information, see Adding connectors to AWS Glue Studio (p. 82).
2. Review the connector usage information. You can find this information on the Usage tab on the
connector product page. For example, if you click the Usage tab on this product page, AWS Glue
Connector for Google BigQuery, you can see in the Additional Resources section a link to a blog
about using this connector. Other connectors might contain links to the instructions in the Overview
section, as shown on the connector product page for Cloudwatch Logs connector for AWS Glue.
3. Create a connection. You choose which connector to use and provide additional information for the
connection, such as login credentials, URI strings, and virtual private cloud (VPC) information. For
more information, see Creating connections for connectors (p. 85).
4. Create an IAM role for your job. The job assumes the permissions of the IAM role that you specify
when you create it. This IAM role must have the necessary permissions to authenticate with, extract
data from, and write data to your data stores. For more information, see Review IAM permissions
needed for ETL jobs (p. 34) and Permissions required for using connectors (p. 35).
5. Create an ETL job and configure the data source properties for your ETL job. Provide the connection
options and authentication information as instructed by the custom connector provider. For more
information, see Authoring jobs with custom connectors (p. 85).
6. Customize your ETL job by adding transforms or additional data stores, as described in Editing ETL
jobs in AWS Glue Studio (p. 47).
7. If using a connector for the data target, configure the data target properties for your ETL job.
Provide the connection options and authentication information as instructed by the custom
connector provider. For more information, see the section called “Authoring jobs with custom
connectors” (p. 85).
8. Customize the job run environment by configuring job properties, as described in Modify the job
properties (p. 109).
9. Run the job.
Adding connectors to AWS Glue Studio
Topics
• Subscribing to AWS Marketplace connectors (p. 82)
• Creating custom connectors (p. 83)
Subscribing to AWS Marketplace connectors
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. On the Connectors page, choose Go to AWS Marketplace.
3. In AWS Marketplace, in Featured products, choose the connector you want to use. You can choose
one of the featured connectors, or use search. You can search on the name or type of connector, and
you can use options to refine the search results.
If you want to use one of the featured connectors, choose View product. If you used search to locate
a connector, then choose the name of the connector.
4. On the product page for the connector, use the tabs to view information about the connector. If you
decide to purchase this connector, choose Continue to Subscribe.
5. Provide the payment information, and then choose Continue to Configure.
6. On the Configure this software page, choose the method of deployment and the version of the
connector to use. Then choose Continue to Launch.
7. On the Launch this software page, you can review the Usage Instructions provided by the
connector provider. When you're ready to continue, choose Activate connection in AWS Glue
Studio.
After a small amount of time, the console displays the Create marketplace connection page in AWS
Glue Studio.
8. Create a connection that uses this connector, as described in Creating connections for
connectors (p. 85).
Alternatively, you can choose Activate connector only to skip creating a connection at this time. You
must create a connection at a later date before you can use the connector.
Creating custom connectors
Custom connectors are integrated into AWS Glue Studio through the AWS Glue Spark runtime API. The
AWS Glue Spark runtime allows you to plug in any connector that is compliant with the Spark, Athena,
or JDBC interface. It allows you to pass in any connection option that is available with the custom
connector.
You can encapsulate all your connection properties with AWS Glue Connections and supply the
connection name to your ETL job. Integration with Data Catalog connections allows you to use the same
connection properties across multiple calls in a single Spark application or across different applications.
You can specify additional options for the connection. The job script that AWS Glue Studio generates
contains a Datasource entry that uses the connection to plug in your connector with the specified
connection options. For example:
Datasource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"dbTable": "Account", "connectionName": "my-custom-jdbc-connection"},
    transformation_ctx = "DataSource0")
1. Create the code for your custom connector. For more information, see Developing custom
connectors (p. 92).
2. Add support for AWS Glue features to your connector. Here are some examples of these features and
how they are used within the job script generated by AWS Glue Studio:
• Data type mapping – Your connector can typecast the columns while reading them from
the underlying data store. For example, a dataTypeMapping of {"INTEGER":"STRING"}
converts all columns of type Integer to columns of type String when parsing the records and
constructing the DynamicFrame. This helps users to cast columns to types of their choice.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"dataTypeMapping": {"INTEGER": "STRING"},
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Partitioning for parallel reads – AWS Glue allows parallel data reads from the data store by
partitioning the data on a column. You must specify the partition column, the lower partition
bound, the upper partition bound, and the number of partitions. This feature enables you to make
use of data parallelism and multiple Spark executors allocated for the Spark application.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"upperBound": "200", "numPartitions": "4",
        "partitionColumn": "id", "lowerBound": "0",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Use AWS Secrets Manager for storing credentials – The Data Catalog connection can also
contain a secretId for a secret stored in AWS Secrets Manager. The AWS secret can securely
store authentication and credentials information and provide it to your ETL job at runtime.
Alternatively, you can specify the secretId from the Spark script as follows:
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"connectionName": "test-connection-jdbc",
        "secretId": "my-secret-id"},
    transformation_ctx = "DataSource0")
• Filtering the source data with row predicates and column projections – The AWS Glue Spark
runtime also allows users to push down SQL queries to filter data at the source with row
predicates and column projections. This allows your ETL job to load filtered data faster from data
stores that support push-downs. An example SQL query pushed down to a JDBC data source is:
SELECT id, name, department FROM department WHERE id < 200.
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"query": "SELECT id, name, department FROM department WHERE id < 200",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Job bookmarks – AWS Glue supports incremental loading of data from JDBC sources. AWS Glue
keeps track of the last processed record from the data store, and processes new data records
in the subsequent ETL job runs. Job bookmarks use the primary key as the default column for
the bookmark key, provided that this column increases or decreases sequentially. For more
information about job bookmarks, see Job Bookmarks in the AWS Glue Developer Guide.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"jobBookmarkKeys": ["empno"],
        "jobBookmarkKeysSortOrder": "asc",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
3. Package the custom connector as a JAR file and upload the file to Amazon S3.
4. Test your custom connector. For more information, see the instructions on GitHub at Glue Custom
Connectors: Local Validation Tests Guide.
5. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
6. On the Connectors page, choose Create custom connector.
7. On the Create custom connector page, enter the following information:
• The path to the location of the custom code JAR file in Amazon S3.
• A name for the connector that will be used by AWS Glue Studio.
• Your connector type, which can be one of JDBC, Spark, or Athena.
• The name of the entry point within your custom code that AWS Glue Studio calls to use the
connector.
• For JDBC connectors, this field should be the class name of your JDBC driver.
• For Spark connectors, this field should be the fully qualified data source class name, or its alias,
that you use when loading the Spark data source with the format operator.
• (JDBC only) The base URL used by the JDBC connection for the data store.
• (Optional) A description of the custom connector.
8. Choose Create connector.
9. From the Connectors page, create a connection that uses this connector, as described in Creating
connections for connectors (p. 85).
Creating connections for connectors
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector you want to create a connection for, and then choose Create connection.
3. On the Create connection page, enter a name for your connection, and optionally a description.
4. Enter the connection details. Depending on the type of connector you selected, you're prompted to
enter additional information:
• Enter the requested authentication information, such as a user name and password, or choose an
AWS secret.
• For connectors that use JDBC, enter the information required to create the JDBC URL for the data
store.
• If you use a virtual private cloud (VPC), then enter the network information for your VPC.
5. Choose Create connection.
You are returned to the Connectors page, and the informational banner indicates the connection
that was created. You can now use the connection in your AWS Glue Studio jobs, as described in the
section called “Create jobs that use a connector” (p. 45).
Authoring jobs with custom connectors
Topics
• Create jobs that use a connector for the data source (p. 85)
• Configure source properties for nodes that use connectors (p. 86)
• Configure target properties for nodes that use connectors (p. 89)
To create a job that uses connectors for the data source or data target
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. On the Connectors page, in the Your connections resource list, choose the connection you want to
use in your job, and then choose Create job.
Alternatively, on the AWS Glue Studio Jobs page, under Create job, choose Source and target
added to the graph. In the Source drop-down list, choose the custom connector that you want to
use in your job. You can also choose a connector for Target.
To configure the properties for a data source node that uses a connector
1. Choose the connector data source node in the job graph or add a new node and choose the
connector for the Node type. Then, on the right-side, in the node details panel, choose the Data
source properties tab, if it's not already selected.
2. In the Data source properties tab, choose the connection that you want to use for this job.
JDBC
• Data source input type: Choose to provide either a table name or a SQL query as the
data source. Depending on your choice, you then need to provide the following additional
information:
• Table name: The name of the table in the data source. If the data source does not use the
term table, then supply the name of an appropriate data structure, as indicated by the
custom connector usage information (which is available in AWS Marketplace).
• Filter predicate: A condition clause to use when reading the data source, similar to a WHERE
clause, which is used to retrieve a subset of the data.
• Query code: Enter a SQL query to use to retrieve a specific dataset from the data source. An
example of a basic SQL query is SELECT id, name, department FROM department WHERE id < 200.
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Partition column: (Optional) You can choose to partition the data reads by providing values
for Partition column, Lower bound, Upper bound, and Number of partitions.
The lowerBound and upperBound values are used to decide the partition stride, not for
filtering the rows in the table. All rows in the table are partitioned and returned.
Note
Column partitioning adds an extra partitioning condition to the query used to read
the data. When using a query instead of a table name, you should validate that the
query works with the specified partitioning condition. For example:
• If your query format is "SELECT col1 FROM table1", then test the query by
appending a WHERE clause at the end of the query that uses the partition column.
• If your query format is "SELECT col1 FROM table1 WHERE col2=val", then
test the query by extending the WHERE clause with AND and an expression that uses
the partition column.
• Data type casting: If the data source uses data types that are not available in JDBC, use this
section to specify how a data type from the data source should be converted into JDBC data
types. You can specify up to 50 different data type conversions. All columns in the data source
that use the same data type are converted in the same way.
For example, if you have three columns in the data source that use the Float data type, and
you indicate that the Float data type should be converted to the JDBC String data type,
then all three columns that use the Float data type are converted to String data types.
• Job bookmark keys: Job bookmarks help AWS Glue maintain state information and prevent
the reprocessing of old data. Specify one or more columns as bookmark keys.
AWS Glue Studio uses bookmark keys to track data that has already been processed during a
previous run of the ETL job. Any columns you use for custom bookmark keys must be strictly
monotonically increasing or decreasing, but gaps are permitted.
If you enter multiple bookmark keys, they're combined to form a single compound key. A
compound job bookmark key should not contain duplicate columns. If you don't specify
bookmark keys, AWS Glue Studio by default uses the primary key as the bookmark key,
provided that the primary key is sequentially increasing or decreasing (with no gaps). If the
table doesn't have a primary key, but the job bookmark property is enabled, you must provide
custom job bookmark keys. Otherwise, the search for primary keys to use as the default will
fail and the job run will fail.
• Job bookmark keys sorting order: Choose whether the key values are sequentially increasing
or decreasing.
Spark
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Connection options: Enter additional key-value pairs as needed to provide additional
connection information or options. For example, you might enter a database name, table
name, a user name, and password.
For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial:
Using the open-source Elasticsearch Spark Connector (p. 95):
• es.net.http.auth.user : username
• es.net.http.auth.pass : password
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true
For an example of the minimum connection options to use, see the sample test script
MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would
normally provide in a connection.
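To illustrate how these options flow into the generated script, a data source node that uses a Spark connector typically results in a call similar to the following sketch. The connection type (marketplace.spark versus custom.spark), the connection name, and the endpoint values are placeholders that depend on your connector and connection:

DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "marketplace.spark",
    connection_options = {"connectionName": "my-es-connection",
        "es.nodes": "https://<Elasticsearch endpoint>",
        "es.port": "443",
        "path": "<Elasticsearch resource>",
        "es.nodes.wan.only": "true"},
    transformation_ctx = "DataSource0")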
Athena
• Table name: The name of the table in the data source. If you're using a connector for reading
from Athena-CloudWatch logs, you would enter the table name all_log_streams.
• Athena schema name: Choose the schema in your Athena data source that corresponds to
the database that contains the table. If you're using a connector for reading from Athena-
CloudWatch logs, you would enter a schema name similar to /aws/glue/name.
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Additional connection options: Enter additional key-value pairs as needed to provide
additional connection information or options.
To configure the properties for a data target node that uses a connector
1. Choose the connector data target node in the job graph. Then, on the right-side, in the node details
panel, choose the Data target properties tab, if it's not already selected.
2. In the Data target properties tab, choose the connection to use for writing to the target.
JDBC
• Connection: Choose the connection to use with your connector. For information about how to
create a connection, see Creating connections for connectors (p. 85).
• Table name: The name of the table in the data target. If the data target does not use the
term table, then supply the name of an appropriate data structure, as indicated by the custom
connector usage information (which is available in AWS Marketplace).
• Batch size (Optional): Enter the number of rows or records to insert in the target table in a
single operation. The default value is 1000 rows.
Spark
• Connection: Choose the connection to use with your connector. If you did not create a
connection previously, choose Create connection to create one. For information about how to
create a connection, see Creating connections for connectors (p. 85).
• Connection options: Enter additional key-value pairs as needed to provide additional
connection information or options. You might enter a database name, table name, a user
name, and password.
For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial:
Using the open-source Elasticsearch Spark Connector (p. 95):
• es.net.http.auth.user : username
• es.net.http.auth.pass : password
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true
For an example of the minimum connection options to use, see the sample test script
MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would
normally provide in a connection.
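Similarly, for a data target node that uses a Spark connector, the generated script typically writes the output with a call along these lines. This is a sketch only; the connection type, connection name, and option values are placeholders that depend on your connector:

DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame = Transform0,  # the DynamicFrame produced by the parent node (placeholder name)
    connection_type = "marketplace.spark",
    connection_options = {"connectionName": "my-es-connection",
        "path": "<Elasticsearch resource>"},
    transformation_ctx = "DataSink0")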
3. After providing the required information, you can view the resulting data schema for your data
source by choosing the Output schema tab in the node details panel.
Managing connectors and connections
Topics
• Viewing connector and connection details (p. 90)
• Editing connectors and connections (p. 91)
• Deleting connectors and connections (p. 91)
• Cancel a subscription for a connector (p. 92)
Note
Connections created using the AWS Glue console do not appear in AWS Glue Studio.
Viewing connector and connection details
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection that you want to view detailed information for.
3. Choose Actions, and then choose View details to open the detail page for that connector or
connection.
4. On the detail page, you can choose to Edit or Delete the connector or connection.
• For connectors, you can choose Create connection to create a new connection that uses the
connector.
• For connections, you can choose Create job to create a job that uses the connection.
Editing connectors and connections
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection that you want to change.
3. Choose Actions, and then choose Edit.
You can also choose View details and on the connector or connection detail page, you can choose
Edit.
4. On the Edit connector or Edit connection page, update the information, and then choose Save.
Deleting connectors and connections
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection you want to delete.
3. Choose Actions, and then choose Delete.
You can also choose View details, and on the connector or connection detail page, you can choose
Delete.
4. Verify that you want to remove the connector or connection by entering Delete, and then choose
Delete.
When deleting a connector, any connections that were created for that connector are also deleted.
Any jobs that use a deleted connection will no longer work. You can either edit the jobs to use a different
data store, or remove the jobs. For information about how to delete a job, see Delete jobs (p. 113).
If you delete a connector, this doesn't cancel the subscription for the connector in AWS Marketplace.
To remove a subscription for a deleted connector, follow the instructions in Cancel a subscription for a
connector (p. 92).
Developing custom connectors
You will need a local development environment for creating your connector code. You can use any IDE
or even just a command line editor to write your connector. Examples of development environments
include:
• A local Scala environment with a local AWS Glue ETL Maven library, as described in Developing Locally
with Scala in the AWS Glue Developer Guide.
• IntelliJ IDE, by downloading the IDE from https://www.jetbrains.com/idea/.
Topics
• Developing Spark connectors (p. 92)
• Developing Athena connectors (p. 93)
• Developing JDBC connectors (p. 93)
• Examples of using custom connectors with AWS Glue Studio (p. 93)
• Developing AWS Glue connectors for AWS Marketplace (p. 94)
Developing Spark connectors
Follow the steps in the AWS Glue GitHub sample library for developing Spark connectors, which is
located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/
development/Spark/README.md.
Developing Athena connectors
Follow the steps in the AWS Glue GitHub sample library for developing Athena connectors, which is
located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/
development/Athena.
Developing JDBC connectors
1. Install the AWS Glue Spark runtime libraries in your local development environment. Refer to the
instructions in the AWS Glue GitHub sample library at https://github.com/aws-samples/aws-glue-
samples/tree/master/GlueCustomConnectors/development/GlueSparkRuntime/README.md.
2. Implement the JDBC driver that is responsible for retrieving the data from the data source. Refer to
the Java Documentation for Java SE 8.
Create an entry point within your code that AWS Glue Studio uses to locate your connector. The
Class name field should be the full path of your JDBC driver.
3. Use the GlueContext API to read data with the connector. Users can add more input options in the
AWS Glue Studio console to configure the connection to the data source, if necessary. For a code
example that shows how to read from and write to a JDBC database with a custom JDBC connector,
see Custom and AWS Marketplace connectionType values.
Examples of using custom connectors with AWS Glue Studio
• Developing, testing, and deploying custom connectors for your data stores with AWS Glue
• Apache Hudi: Writing to Apache Hudi tables using AWS Glue Custom Connector
• Google BigQuery: Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom
connectors
• Snowflake (JDBC): Performing data transformations using Snowflake and AWS Glue
• SingleStore: Building fast ETL using SingleStore and AWS Glue
• Salesforce: Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS
Glue
• MongoDB: Building AWS Glue Spark ETL jobs using Amazon DocumentDB (with MongoDB
compatibility) and MongoDB
• Amazon Relational Database Service (Amazon RDS): Building AWS Glue Spark ETL jobs by bringing
your own JDBC drivers for Amazon RDS
Developing AWS Glue connectors for AWS Marketplace
The process for developing the connector code is the same as for custom connectors, but the process
of uploading and verifying the connector code is more detailed. Refer to the instructions in Creating
Connectors for AWS Marketplace on the GitHub website.
Restrictions for using connectors and connections in AWS Glue Studio
• The testConnection API isn't supported with connections created for custom connectors.
• Data Catalog connection password encryption isn't supported with custom connectors.
• You can't use job bookmarks if you specify a filter predicate for a data source node that uses a JDBC
connector.
Tutorial: Using the open-source Elasticsearch Spark Connector
In this tutorial, we will show how to connect to your Amazon OpenSearch Service nodes with a minimal
number of steps.
Topics
• Prerequisites (p. 95)
• Step 1: (Optional) Create an AWS secret for your OpenSearch cluster information (p. 95)
• Step 2: Subscribe to the connector (p. 96)
• Step 3: Activate the connector in AWS Glue Studio and create a connection (p. 97)
• Step 4: Configure an IAM role for your ETL job (p. 97)
• Step 5: Create a job that uses the OpenSearch connection (p. 98)
• Step 6: Run the job (p. 100)
Prerequisites
To use this tutorial, you must have the following:
For more information about creating secrets, see Creating and Managing Secrets with AWS Secrets
Manager in the AWS Secrets Manager User Guide.
es.net.http.auth.user: username
5. Choose + Add row, and enter another key-value pair for the password. For example:
es.net.http.auth.pass: password
6. Choose Next.
7. Enter a secret name. For example: my-es-secret. You can optionally include a description.
Record the secret name, which is used later in this tutorial, and then choose Next.
8. Choose Next again, and then choose Store to create the secret.
Next step
Step 2: Subscribe to the connector (p. 96)
1. If you have not already configured your AWS account to use License Manager, do the following:
If you do not see this window, then you have already configured the necessary permissions.
2. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.
3. In the AWS Glue Studio console, expand the menu icon ( ), and then choose Connectors in the
navigation pane.
4. On the Connectors page, choose Go to AWS Marketplace.
5. In AWS Marketplace, in the Search AWS Glue Studio products section, enter elasticsearch
connector in the search field, and then press Enter.
6. Choose the name of the connector, ElasticSearch Spark connector for AWS Glue.
7. On the product page for the connector, use the tabs to view information about the connector. When
you're ready to continue, choose Continue to Subscribe.
8. Review and accept the terms of use for the software.
9. When the subscription process completes, choose Continue to Configuration.
10. Keep the default choices on the Configure this software page, and choose Continue to Launch.
Next step
Step 3: Activate the connector in AWS Glue Studio and create a connection (p. 97)
1. On the Launch this software page in the AWS Marketplace console, choose Usage Instructions, and
then choose the link in the window that appears.
Your browser is redirected to the AWS Glue Studio console Create marketplace connection page.
2. Enter a name for the connection. For example: my-es-connection.
3. In the Connection access section, for Connection credential type, choose User name and
password.
4. For the AWS secret, enter the name of your secret. For example: my-es-secret.
5. In the Network options section, enter the VPC information used to connect to the Elasticsearch cluster.
6. Choose Create connection and activate connector.
Next step
Step 4: Configure an IAM role for your ETL job (p. 97)
The assumed IAM role for the AWS Glue ETL job must also have access to the secret that was created in
the previous section. By default, the AWS managed role AWSGlueServiceRole does not have access
to the secret. To set up access control for your secrets, see Authentication and Access Control for AWS
Secrets Manager and Limiting Access to Specific Secrets.
1. Configure the permissions described in the section called “Review IAM permissions needed for ETL
jobs” (p. 34).
2. Configure the additional permissions needed when using connectors with AWS Glue Studio, as
described in the section called “Permissions required for using connectors” (p. 35).
Next step
Step 5: Create a job that uses the OpenSearch connection (p. 98)
Step 5: Create a job that uses the OpenSearch connection
If your job runs within an Amazon Virtual Private Cloud (Amazon VPC), make sure the VPC is configured
correctly. For more information, see the section called “Configure a VPC for your ETL job” (p. 37).
a. Choose Add schema and enter the schema of the data set in the data source. Connections do
not use tables stored in the Data Catalog, which means that AWS Glue Studio doesn't know the
schema of the data. You must manually provide this schema information. For instructions on
how to use the schema editor, see the section called “Editing the schema in a custom transform
node” (p. 70).
b. Expand Connection options.
c. Choose Add new option and enter the information needed for the connector that was not
entered in the AWS secret:
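For example, if the user name and password are already stored in the AWS secret, the remaining options are typically the endpoint-related key-value pairs listed earlier in this guide:
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true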
4. Add a target node to the graph as described in the section called “Adding nodes to the job
diagram” (p. 79) and the section called “Editing the data target node” (p. 74).
Your data target can be Amazon S3, or it can use information from an AWS Glue Data Catalog or
a connector to write data in a different location. For example, you can use a Data Catalog table to
write to a database in Amazon RDS, or you can use a connector as your data target to write to data
stores that are not natively supported in AWS Glue.
If you choose a connector for your data target, you must choose a connection created for that
connector. Also, if required by the connector provider, you must add options to provide additional
information to the connector. If you use a connection that contains information for an AWS secret,
then you don’t need to provide the user name and password authentication in the connection
options.
5. Optionally, add additional data sources and one or more transform nodes as described in the section
called “Editing the data transform node” (p. 55).
6. Configure the job properties as described in the section called “Modify the job properties” (p. 109),
starting with step 3, and save the job.
Next step
Step 6: Run the job (p. 100)
To run the job you created for the Elasticsearch Spark Connector
1. Using the AWS Glue Studio console, on the visual editor page, choose Run.
2. In the success banner, choose Run Details, or you can choose the Runs tab of the visual editor to
view information about the job run.
Monitoring ETL jobs in AWS Glue Studio
Topics
• Accessing the job monitoring dashboard (p. 101)
• Overview of the job monitoring dashboard (p. 101)
• Job runs view (p. 101)
• Viewing the job run logs (p. 103)
• Viewing the details of a job run (p. 103)
• Viewing Amazon CloudWatch metrics for a job run (p. 104)
Overview of the job monitoring dashboard
The graphs in the tiles are interactive. You can choose any block in a graph to run a filter that displays
only those jobs in the Job runs table at the bottom of the page.
You can change the date range for the information displayed on this page by using the Date range
selector. When you change the date range, the information tiles adjust to show the values for the
specified number of days before the current date. You can also use a specific date range if you choose
Custom from the date range selector.
You can filter the jobs on additional criteria, such as status, worker type, job type, and the job name.
In the filter box at the top of the table, you can enter the text to use as a filter. The table results are
updated with rows that contain matching text as you enter the text.
You can view a subset of the jobs by choosing elements from the graphs on the job monitoring
dashboard. For example, if you choose the number of running jobs in the Job runs summary tile, then
the Job runs list displays only the jobs that currently have a status of Running. If you choose one of the
bars in the Worker type breakdown bar chart, then only job runs with the matching worker type and
status are shown in the Job runs list.
The Job runs resource list displays the details for the job runs. You can sort the rows in the table by
choosing a column heading. The table contains the following information:
• Start time – The date and time at which this job run was started.
• End time – The date and time that this job run completed.
• Run status – The current state of the job run. Values can be STARTING, RUNNING, STOPPING, STOPPED, SUCCEEDED, FAILED, or TIMEOUT.
• Run time – The amount of time that the job run consumed resources.
• DPU hours – The estimated number of DPUs used for the job run. A DPU is a relative measure of processing power. DPUs are used to determine the cost of running your job. For more information, see the AWS Glue pricing page.
You can choose any job run in the list and view additional information. Choose a job run, and then do
one of the following:
• Choose the Actions menu and the View job option to view the job in the visual editor.
• Choose the Actions menu and the Stop run option to stop the current run of the job.
• Choose the View CloudWatch logs button to view the job run logs for that job.
• Choose View run details to view the job run details page.
Viewing the job run logs
• On the Monitoring page, in the Job runs table, choose a job run, and then choose View CloudWatch
logs.
• In the visual job editor, on the Runs tab for a job, choose the hyperlinks to view the logs:
• Logs – Links to the Apache Spark job logs written when continuous logging is enabled for a job run.
When you choose this link, it takes you to the Amazon CloudWatch logs in the /aws-glue/jobs/
logs-v2 log group. By default, the logs exclude non-useful Apache Hadoop YARN heartbeat and
Apache Spark driver or executor log messages. For more information about continuous logging, see
Continuous Logging for AWS Glue Jobs in the AWS Glue Developer Guide.
• Error logs – Links to the logs written to stderr for this job run. When you choose this link, it takes
you to the Amazon CloudWatch logs in the /aws-glue/jobs/error log group. You can use these
logs to view details about any errors that were encountered during the job run.
• Output logs – Links to the logs written to stdout for this job run. When you choose this link, it
takes you to the Amazon CloudWatch logs in the /aws-glue/jobs/output log group. You can use
these logs to see all the details about the tables that were created in the AWS Glue Data Catalog and
any errors that were encountered.
Viewing the details of a job run
• Run status – The current state of the job run. Values can be STARTING, RUNNING, STOPPING, STOPPED, SUCCEEDED, FAILED, or TIMEOUT.
• Glue version – The AWS Glue version used by the job run.
• Start time – The date and time at which this job run was started.
• End time – The date and time that this job run completed.
• Execution time – The amount of time spent running the job script.
• Trigger name – The name of the trigger associated with the job.
• Last modified on – The date when the job was last modified.
• Number of workers – The number of workers used for the job run.
Viewing Amazon CloudWatch metrics for a job run
AWS Glue reports metrics to Amazon CloudWatch every 30 seconds. The AWS Glue metrics represent
delta values from the previously reported values. Where appropriate, metrics dashboards aggregate
(sum) the 30-second values to obtain a value for the entire last minute. However, the Apache Spark
metrics that AWS Glue passes on to Amazon CloudWatch are generally absolute values that represent
the current state at the time they are reported.
Note
You must configure your account to access Amazon CloudWatch, as described in Amazon
CloudWatch permissions (p. 33).
The metrics provide information about your job run, such as:
• ETL Data Movement – The number of bytes read from or written to Amazon S3.
• Memory Profile: Heap used – The number of memory bytes used by the Java virtual machine (JVM)
heap.
• Memory Profile: heap usage – The fraction of memory (scale: 0–1), shown as a percentage, used by
the JVM heap.
• CPU Load – The fraction of CPU system load used (scale: 0–1), shown as a percentage.
Start a job run
You can initiate a job run in the following ways in AWS Glue Studio:
• On the Jobs page, choose the job you want to start, and then choose the Run job button.
• If you're viewing a job in the visual editor and the job has been saved, you can choose the Run button
to start a job run.
For more information about job runs, see Working with Jobs on the AWS Glue Console in the AWS Glue
Developer Guide.
Schedule job runs
• On the Jobs page, choose the job you want to create a schedule for, choose Actions, and then
choose Schedule job.
• If you're viewing a job in the visual editor and the job has been saved, choose the Schedules tab.
Then choose Create Schedule.
2. On the Schedule job run page, enter the following information:
Manage job schedules
On the Schedules tab of the visual editor, you can perform the following tasks:
• Create a new schedule.
Choose Create schedule, then enter the information for your schedule as described in the section
called “Schedule job runs” (p. 106).
• Edit an existing schedule.
Choose the schedule you want to edit, then choose Action followed by Edit schedule. When you
choose to edit an existing schedule, the Frequency shows as Custom, and the schedule is displayed
as a cron expression. You can either modify the cron expression, or specify a new schedule using the
Frequency button. When you finish with your changes, choose Update schedule.
• Pause a schedule.
Choose an active schedule, and then choose Action followed by Pause schedule. The schedule is
instantly deactivated. Choose the refresh (reload) button to see the updated job schedule status.
• Resume a paused schedule.
Choose a deactivated schedule, and then choose Action followed by Resume schedule. The schedule is
instantly activated. Choose the refresh (reload) button to see the updated job schedule status.
• Delete a schedule.
Choose the schedule you want to remove, and then choose Action followed by Delete schedule. The
schedule is instantly deleted. Choose the refresh (reload) button to see the updated job schedule list.
The schedule will show a status of Deleting until it has been completely removed.
Stop job runs
On the Monitoring page, in the Job runs list, choose the job that you want to stop, choose Actions, and
then choose Stop run.
On the Jobs page, you can see all the jobs that were created in your account. The Your jobs list shows
the job name, its type, the status of the last run of that job, and the dates on which the job was created
and last modified. You can choose the name of a job to see detailed information for that job.
You can also use the Monitoring dashboard to view all your jobs. You can access the dashboard by
choosing Monitoring in the navigation pane. For more information about using the dashboard, see
Monitoring ETL jobs in AWS Glue Studio (p. 101).
If you choose the settings icon in the Your jobs section, you can customize how AWS Glue Studio
displays the information in the table. You can choose to wrap the lines of text in the display, change the
number of jobs displayed on the page, and specify which columns to display.
• Choose the Runs tab of the visual editor to view the job run information for the currently displayed
job.
On the Runs tab (the Recent job runs page), there is a card for each job run. The information displayed
on the Runs tab includes:
• Job run ID
• Number of attempts to run this job
• Status of the job run
• Start and end time for the job run
• The runtime for the job run
• A link to the job log files
• A link to the job error log files
• The error returned for failed jobs
• In the navigation pane, choose Monitoring. Scroll down to the Job runs list. Choose the job and then
choose View run details.
The information displayed on the job run detail page accessed from the Monitoring page is more
comprehensive. The contents are described in Viewing the details of a job run (p. 103).
For more information about the job logs, see Viewing the job run logs (p. 103).
If you want to edit the job script, see Editing or uploading a job script (p. 76).
For more information about the job properties, see Defining Job Properties in the AWS Glue
Developer Guide.
5. Expand the Advanced properties section if you need to specify these additional job properties:
• Script filename – The name of the file that stores the job script in Amazon S3.
• Script path – The Amazon S3 location where the job script is stored.
• Job metrics – (Not available for Python shell jobs) Turns on the creation of Amazon CloudWatch
metrics when this job runs.
• Continuous logging – (Not available for Python shell jobs) Turns on continuous logging to
CloudWatch, so the logs are available for viewing before the job completes.
• Spark UI and Spark UI logs path – (Not available for Python shell jobs) Turns on the use of Spark
UI for monitoring this job and specifies the location for the Spark UI logs.
• Maximum concurrency – Sets the maximum number of concurrent runs that are allowed for this
job.
• Temporary path – The location of a working directory in Amazon S3 where temporary
intermediate results are written when AWS Glue runs the job script.
• Delay notification threshold (minutes) – Specify a delay threshold for the job. If the job runs for a
longer time than that specified by the threshold, then AWS Glue sends a delay notification for the
job to CloudWatch.
• Security configuration and Server-side encryption – Use these fields to choose the encryption
options for the job.
• Use Glue Data Catalog as the Hive metastore – Choose this option if you want to use the AWS
Glue Data Catalog as an alternative to Apache Hive Metastore.
• Additional network connection – For a data source in a VPC, you can specify a connection of type
Network to ensure your job accesses your data through the VPC.
• Python library path, Dependent jars path (Not available for Python shell jobs), or Referenced
files path – Use these fields to specify the location of additional files used by the job when it runs
the script.
• Job parameters – You can add a set of key-value pairs that are passed as named parameters to
the job script. In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name.
For more information about using parameters in a job script, see Passing and Accessing Python
Parameters in AWS Glue in the AWS Glue Developer Guide, and the sketch that follows this procedure.
• Tags – You can add tags to your jobs to help you organize and identify them.
6. After you have modified the job properties, save the job.
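Inside the job script, parameters added in the Job parameters field can be read with the getResolvedOptions helper. In this sketch, --target_bucket is a hypothetical parameter key used only for illustration.

import sys

from awsglue.utils import getResolvedOptions

# "JOB_NAME" is supplied by AWS Glue; "target_bucket" stands in for a
# hypothetical job parameter added as "--target_bucket" on the Job details tab.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_bucket"])

print(args["JOB_NAME"])
print(args["target_bucket"])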
Store Spark shuffle files on Amazon S3
1. On the Jobs page, in the Your jobs list, choose the name of the job you want to modify.
2. On the visual editor page, choose the Job details tab at the top of the job editing pane.
3. Expand the Advanced properties section, and in the Job parameters field, enter the following
key-value pairs:
• --write-shuffle-files-to-s3 — true
This is the main parameter that configures the shuffle manager in AWS Glue to use Amazon S3
buckets for writing and reading shuffle data. By default, this parameter has a value of false.
• (Optional) --write-shuffle-spills-to-s3 — true
This parameter allows you to offload spill files to Amazon S3 buckets, which provides additional
resiliency to your Spark job in AWS Glue. This is only required for large workloads that spill a lot of
data to disk. By default, this parameter has a value of false.
• (Optional) --conf spark.shuffle.glue.s3ShuffleBucket — s3://<shuffle-bucket>
This parameter specifies the Amazon S3 bucket to use when writing the shuffle files. If you do
not set this parameter, the location is the shuffle-data folder in the location specified for
Temporary path (--TempDir).
Note
Make sure the location of the shuffle bucket is in the same AWS Region in which the job
runs.
Also, the shuffle service does not clean the files after the job finishes running, so you
should configure the Amazon S3 storage life cycle policies on the shuffle bucket location.
For more information, see Managing your storage lifecycle in the Amazon S3 User Guide.
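The same key-value pairs can also be supplied as run-time arguments when you start the job, for example with boto3. This is a minimal sketch; the job name my-spark-job and the bucket my-shuffle-bucket are placeholders.

import boto3

glue = boto3.client("glue")

# The job name and bucket below are placeholders for this example.
glue.start_job_run(
    JobName="my-spark-job",
    Arguments={
        "--write-shuffle-files-to-s3": "true",
        "--write-shuffle-spills-to-s3": "true",
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket",
    },
)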
Save the job
1. Provide all the required information in the Visual and Job details tabs.
2. Choose the Save button.
After you save the job, the 'not saved' callout changes to display the time and date that the job was
last saved.
If you exit AWS Glue Studio before saving your job, the next time you sign in to AWS Glue Studio, a
notification appears. The notification indicates that there is an unsaved job, and asks if you want to
restore it. If you choose to restore the job, you can continue to edit it.
Troubleshooting errors when saving a job
• If a node in the visual editor isn't configured correctly, the Visual tab shows a red callout, and the node
with the error displays a warning symbol.
1. Choose the node. In the node details panel, a red callout appears on the tab where the missing or
incorrect information is located.
2. Choose the tab in the node details panel that shows a red callout, and then locate the problem
fields, which are highlighted. An error message below the fields provides additional information
about the problem.
• If there is a problem with the job properties, the Job details tab shows a red callout. Choose that tab
and locate the problem fields, which are highlighted. The error messages below the fields provide
additional information about the problem.
Clone a job
You can use the Clone job action to copy an existing job into a new job.
1. On the Jobs page, in the Your jobs list, choose the job that you want to duplicate.
2. From the Actions menu, choose Clone job.
3. Enter a name for the new job. You can then save or edit the job.
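There is no single clone API, but you can approximate the same action programmatically by reading an existing job definition and creating a new job from a subset of its properties. The following is a rough sketch only; the job names are placeholders, and a real copy would carry over more properties (connections, security configuration, and so on).

import boto3

glue = boto3.client("glue")

# Placeholder job names for illustration.
source = glue.get_job(JobName="existing-job")["Job"]

# Copy only a few of the fields that CreateJob accepts.
new_job = {
    "Name": "existing-job-copy",
    "Role": source["Role"],
    "Command": source["Command"],
    "DefaultArguments": source.get("DefaultArguments", {}),
    "GlueVersion": source.get("GlueVersion", "2.0"),
}
if "WorkerType" in source:
    new_job["WorkerType"] = source["WorkerType"]
    new_job["NumberOfWorkers"] = source["NumberOfWorkers"]

glue.create_job(**new_job)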
Delete jobs
You can remove jobs that are no longer needed. You can delete one or more jobs in a single operation.
1. On the Jobs page, in the Your jobs list, choose the jobs that you want to delete.
2. From the Actions menu, choose Delete job.
3. Verify that you want to delete the job by entering delete.
You can also delete a saved job when you're viewing the Job details tab for that job in the visual editor.
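The console actions above map to the DeleteJob and BatchDeleteJob APIs if you want to script the cleanup. The job names in this sketch are placeholders.

import boto3

glue = boto3.client("glue")

# Delete a single job (placeholder name).
glue.delete_job(JobName="old-job")

# Or delete several jobs in one call. The response lists any names
# that could not be deleted.
response = glue.batch_delete_job(JobNames=["old-job-1", "old-job-2"])
print(response.get("JobsNotFound", []))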
Tutorial: Adding an AWS Glue crawler
In this tutorial, let’s add a crawler that infers metadata from flight log data stored in a public Amazon S3
bucket and creates a table in your Data Catalog.
Topics
• Prerequisites (p. 114)
• Step 1: Add a crawler (p. 114)
• Step 2: Run the crawler (p. 115)
• Step 3: View AWS Glue Data Catalog objects (p. 115)
Prerequisites
This tutorial assumes that you have an AWS account and access to AWS Glue.
Step 1: Add a crawler
1. On the AWS Glue service console, on the left-side menu, choose Crawlers.
2. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the
crawler details.
3. In the Crawler name field, enter Flights Data Crawler, and choose Next.
Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier
for CSV by default.
4. For the crawler source type, choose Data stores and choose Next.
5. Now let's point the crawler to your data. On the Add a data store page, choose the Amazon S3 data
store. This tutorial doesn't use a connection, so leave the Connection field blank if it's visible.
For the option Crawl data in, choose Specified path in another account. Then, for the Include path,
enter the path where the crawler can find the flights data, which is s3://crawler-public-us-
east-1/flight/2016/csv. After you enter the path, the title of this field changes to Include
path. Choose Next.
6. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a
single data store, so choose No, and then choose Next.
7. The crawler needs permissions to access the data store and create objects in the AWS Glue Data
Catalog. To configure these permissions, choose Create an IAM role. The IAM role name starts
with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter
CrawlerTutorial, and then choose Next.
Note
To create an IAM role, your AWS user must have CreateRole, CreatePolicy, and
AttachRolePolicy permissions.
Next, enter flights for Prefix added to tables. Use the default values for the rest of the options,
and choose Next.
10. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back
to return to previous pages and make changes.
After you have reviewed the information, choose Finish to create the crawler.
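For reference, a crawler with the same settings can also be created with boto3 instead of the console wizard. This is a sketch under a few assumptions: the IAM role created above already exists, the placeholder account ID must be replaced with your own, and the target database is created first.

import boto3

glue = boto3.client("glue")

# Placeholder account ID; the role is the one created by the wizard above.
role_arn = "arn:aws:iam::123456789012:role/AWSGlueServiceRole-CrawlerTutorial"

# Create the database the crawler will write to.
glue.create_database(DatabaseInput={"Name": "test-flights-db"})

glue.create_crawler(
    Name="Flights Data Crawler",
    Role=role_arn,
    DatabaseName="test-flights-db",
    TablePrefix="flights",
    Targets={
        "S3Targets": [{"Path": "s3://crawler-public-us-east-1/flight/2016/csv"}]
    },
)

# The crawler can then be started with:
# glue.start_crawler(Name="Flights Data Crawler")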
Step 2: Run the crawler
1. The banner near the top of this page lets you know that the crawler was created, and asks if you
want to run it now. Choose Run it now? to run the crawler.
The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the
crawler starts running, the banner disappears, and the crawler display is updated to show a status of
Starting for your crawler. After a minute, you can click the Refresh icon to update the status of the
crawler that is displayed in the table.
2. When the crawler completes, a new banner appears that describes the changes made by the crawler.
You can choose the test-flights-db link to view the Data Catalog objects.
Step 3: View AWS Glue Data Catalog objects
1. In the left-side navigation, under Data catalog, choose Databases. Here you can view the
test-flights-db database that is created by the crawler.
2. In the left-side navigation, under Data catalog and below Databases, choose Tables. Here you can
view the flightscsv table created by the crawler. If you choose the table name, then you can view
the table settings, parameters, and properties. Scrolling down in this view, you can view the schema,
which is information about the columns and data types of the table.
3. If you choose View partitions on the table view page, you can see the partitions created for the
data. The first column is the partition key.
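The same Data Catalog objects can be listed programmatically. A brief boto3 sketch, assuming the tutorial database and table created above:

import boto3

glue = boto3.client("glue")

# Tables that the crawler created in the tutorial database.
tables = glue.get_tables(DatabaseName="test-flights-db")
for table in tables["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print("  ", column["Name"], column["Type"])

# Partitions created for the flightscsv table.
partitions = glue.get_partitions(DatabaseName="test-flights-db", TableName="flightscsv")
print(len(partitions["Partitions"]), "partitions")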
Document history
The following list describes the important changes in each revision of the AWS Glue Studio User Guide.
For notification about updates to this documentation, you can subscribe to an RSS feed.
• Glue Studio is now available in China (October 11, 2021) – AWS Glue Studio is now available in the China Beijing and Ningxia regions.
• Direct access to streaming sources now available (September 30, 2021) – When adding data sources to your ETL job in the visual editor, you can supply information to access the data stream instead of having to use a Data Catalog database and table.
• Custom connectors can now be used with data previews (September 24, 2021) – When editing a data source node that uses a custom connector, you can preview the dataset by choosing the Data preview tab. For more information, see Custom Connectors.
• AWS Glue Studio supports AWS Glue version 3.0 (August 18, 2021) – When creating jobs in AWS Glue Studio, you can choose Glue 3.0 as the version for your job in the Job details tab. If you do not choose a version for your ETL job, Glue 2.0 is used by default.
• AWS GovCloud (US) Region (August 18, 2021) – AWS Glue Studio is now available in the AWS GovCloud (US) Region.
• Python shell authoring available in AWS Glue Studio (August 13, 2021) – When creating a new job, you can now choose to create a Python shell job. For more information, see Start the job creation process and Editing Python shell jobs in AWS Glue Studio.
• Upload scripts to AWS Glue Studio (June 14, 2021) – In conjunction with the script editor feature, you can upload job scripts to AWS Glue Studio. For more information, see Start the job creation process and Editing or uploading a job script.
• View your job's dataset while creating and editing jobs (June 7, 2021) – You can use the new Data preview tab for a node in your job diagram to see a sample of the data processed by that node. For more information, see Using data previews in the visual job editor.
• Specify settings for your streaming ETL job in the visual job editor (June 4, 2021) – You can configure additional connection settings for streaming data sources in the visual job editor to optimize your streaming ETL jobs. For more information, see Using a streaming data source.
• Network connection support added (May 24, 2021) – If you want to access a data source located in your VPC, you can specify a network connection for the job. For more information, see Modify the job properties.
• Edit job scripts (May 24, 2021) – You can now edit scripts in the job editor. For more information, see Editing a job script.
• Delete jobs using the AWS Glue Studio console (May 24, 2021) – You can now delete jobs in AWS Glue Studio. To learn how, see Delete jobs.
• Read data from files in child folders in Amazon S3 (April 30, 2021) – You can specify a single folder in Amazon S3 as your data source and use the Recursive option to include all the child folders as part of the data source. For more information, see Using files in Amazon S3 for the data source.
• Delete connectors and connections functionality added (April 30, 2021) – You can now delete connectors and connections in AWS Glue Studio. For more information, see Deleting connectors and connections.
• Fill missing values transform added (March 29, 2021) – You can use the FillMissingValues transform in AWS Glue Studio to locate records in the dataset that have missing values and add a new field with an estimated value. For more information, see Editing the data transform node.
• SQL transform available (March 23, 2021) – You can use a SQL transform node to write your own transform in the form of a SQL query. For more information, see Using a SQL query to transform data.
• JDBC source nodes now support job bookmark keys (March 15, 2021) – Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. For more information, see Authoring jobs with custom connectors.
• Connectors can be used for data targets (March 15, 2021) – Using a custom or AWS Marketplace connector for your data target is now supported. For more information, see Authoring jobs with custom connectors.
• A new toolbar is available for the visual job editor (March 8, 2021) – A more streamlined and functional toolbar is available for the visual job editor of AWS Glue Studio. This feature makes it easier to add nodes to your graph.
• Read data from Amazon S3 without creating Data Catalog tables (February 5, 2021) – AWS Glue Studio now allows you to read data directly from Amazon S3 without first creating a table in the AWS Glue Data Catalog. For more information, see Editing the data source node.
• AWS Glue Studio jobs can now update Data Catalog tables (February 5, 2021) – AWS Glue Studio now supports updating the AWS Glue Data Catalog during job runs. This feature makes it easy to keep your tables up to date as your jobs write new data into Amazon S3. This makes the data immediately available for query from any analytics service that is compatible with the AWS Glue Data Catalog. For more information, see Configuring data target nodes.
• Job scheduling now available in AWS Glue Studio (December 21, 2020) – You can define a time-based schedule for your job runs in AWS Glue Studio. You can use the console to create a basic schedule, or define a more complex schedule using the Unix-like cron syntax. For more information, see Schedule job runs.
• AWS Glue Custom Connectors released (December 21, 2020) – AWS Glue Custom Connectors allow you to discover and subscribe to connectors in AWS Marketplace. We also released AWS Glue Spark runtime interfaces to plug in connectors built for Apache Spark Datasource, Athena federated query, and JDBC APIs. For more information, see Using Connectors and connections with AWS Glue Studio.
• Support for running streaming ETL jobs in AWS Glue version 2.0 (November 11, 2020) – AWS Glue Studio now supports running streaming ETL jobs using AWS Glue version 2.0. For more information, see Adding Streaming ETL Jobs in AWS Glue in the AWS Glue Developer Guide.
• Availability of AWS Glue Studio announced (September 23, 2020) – AWS Glue Studio provides a visual interface that simplifies the creation of jobs that prepare the data for analysis. The initial version of this guide was published on the same day AWS Glue Studio launched.
AWS glossary
For the latest AWS terminology, see the AWS glossary in the AWS General Reference.