AWS Glue Studio
User Guide
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Table of Contents
Using Notebooks (Preview) .................................................................................................................. 1
Overview of using notebooks ...................................................................................................... 1
Getting started with notebooks in AWS Glue Studio ....................................................................... 2
Creating an ETL job using notebooks in AWS Glue Studio ........................................................ 2
Notebook editor components .............................................................................................. 3
Saving your notebook and job script .................................................................................... 3
Managing notebook sessions ............................................................................................... 4
AWS Glue Visual Job API (Preview) ...................................................................................................... 5
API design and CRUD APIs ......................................................................................................... 5
Getting started ......................................................................................................................... 5
API design and CRUD APIs ......................................................................................................... 9
SDK Onboarding ....................................................................................................................... 9
Appendix: Visual Job Examples and Model Definitions ................................................................... 9
Examples ......................................................................................................................... 9
Model Definitions ............................................................................................................ 11
Detect PII (Preview) ......................................................................................................................... 22
Choosing how you want the data to be scanned ......................................................................... 22
Choosing the PII entities to take action on ................................................................................. 23
Choosing what to do with identified PII data .............................................................................. 24
What is AWS Glue Studio? ................................................................................................................. 25
Features of AWS Glue Studio ..................................................................................................... 26
Visual job editor ............................................................................................................... 26
Notebook interface for interactively developing and debugging job scripts ............................... 26
Job script code editor ....................................................................................................... 27
Job performance dashboard .............................................................................................. 27
Support for dataset partitioning ........................................................................................ 27
When should I use AWS Glue Studio? ......................................................................................... 27
Accessing AWS Glue Studio ....................................................................................................... 28
Pricing for AWS Glue Studio ...................................................................................................... 28
Setting up ....................................................................................................................................... 29
Complete initial AWS configuration tasks .................................................................................... 29
Sign up for AWS .............................................................................................................. 29
Create an IAM administrator user ....................................................................................... 29
Sign in as an IAM user ...................................................................................................... 30
Review IAM permissions needed for the AWS Glue Studio user ....................................................... 31
AWS Glue service permissions ............................................................................................ 31
Creating Custom IAM Policies for AWS Glue Studio ............................................................... 31
Notebook and data preview permissions ............................................................................. 33
Amazon CloudWatch permissions ....................................................................................... 33
Review IAM permissions needed for ETL jobs ............................................................................... 34
Data source and data target permissions ............................................................................. 34
Permissions required for deleting jobs ................................................................................ 34
AWS Key Management Service permissions .......................................................................... 34
Permissions required for using connectors ........................................................................... 35
Set up IAM permissions for AWS Glue Studio ............................................................................... 35
Create an IAM Role ........................................................................................................... 35
Attach policies to the AWS Glue Studio user ........................................................................ 36
Create a trust policy for roles not named "AWSGlueServiceRole*" ............................................ 36
Configure a VPC for your ETL job ............................................................................................... 37
Populate the AWS Glue Data Catalog .......................................................................................... 38
Tutorial: Getting started .................................................................................................................... 39
Prerequisites ............................................................................................................................ 39
Step 1: Start the job creation process ......................................................................................... 39
Step 2: Edit the data source node in the job diagram .................................................................... 40
Overview of using notebooks
Data engineers can author AWS Glue jobs faster and more easily than before using the new interactive
notebook interface in AWS Glue Studio or interactive sessions in AWS Glue.
Topics
• Overview of using notebooks (p. 1)
• Getting started with notebooks in AWS Glue Studio (p. 2)
When you start a notebook through AWS Glue Studio, all the configuration steps are done for you so
that you can explore your data and start developing your job script after only a few seconds. AWS Glue
Studio configures a Jupyter notebook with the AWS Glue Jupyter kernel. You don’t have to configure
VPCs, network connections, or development endpoints to use this notebook.
Getting started with notebooks in AWS Glue Studio
After your notebook is saved, your notebook is a full AWS Glue job. You can manage all aspects of the
job, such as scheduling job runs, setting job parameters, and viewing the job run history, right alongside
your notebook.
The following sections describe how to use AWS Glue Studio to create notebooks for ETL jobs.
Topics
• Creating an ETL job using notebooks in AWS Glue Studio (p. 2)
• Notebook editor components (p. 3)
• Saving your notebook and job script (p. 3)
• Managing notebook sessions (p. 4)
Creating an ETL job using notebooks in AWS Glue Studio
1. Attach AWS Identity and Access Management policies to the AWS Glue Studio user and create an
IAM role for your ETL job and notebook, as instructed in Set up IAM permissions for AWS Glue
Studio (p. 35).
2. Configure additional IAM security for notebooks, as described in Notebook and data preview
permissions (p. 33).
3. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.
4. Choose the Jobs link in the left-side navigation menu.
5. Choose Jupyter notebook and then choose Create to start a new notebook session.
6. On the Create job in Jupyter notebook page, provide the job name, the IAM role to use, and choose
which programming language you want to use within the notebook. Choose Create job.
When the notebook first opens, it contains a single cell with an example %%configure command
based on the information you provided on the Create job in Jupyter notebook page. You can
modify this cell to customize the notebook session.
7. Run the cell to start a new notebook session and generate a session ID.
8. Add cells, and enter code or markdown text. (A sample first code cell is sketched after this
procedure.)
For information about writing code using a Jupyter notebook interface, see The Jupyter Notebook
User Documentation.
9. To test your script, run the entire script, or individual cells. Any command output will be displayed in
the area beneath the cell.
10. After you have finished developing your script, you can save the job and then run it. For more
information about running jobs, see Start a job run (p. 106).
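For example, a first code cell for exploring a Data Catalog table might look like the following sketch. It assumes a Python (PySpark) session; the database and table names are placeholders that you would replace with objects from your own AWS Glue Data Catalog.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create the GlueContext and Spark session for this notebook session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read a table from the AWS Glue Data Catalog and inspect it
# ("database" and "table1" are placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="database",
    table_name="table1"
)
dyf.printSchema()
dyf.show(5)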
Notebook editor components
Although the AWS Glue Studio notebook is similar to Jupyter notebooks, it differs in a few key ways. The notebook editor includes the following tabs:
• Notebook – Use this tab to view the job script using the notebook interface.
• Job details – Configure the environment and properties for the job runs.
• Runs – View information about previous runs of this job.
• Schedules – Configure a schedule for running your job at specific times.
Saving your notebook and job script
When you choose Save, the job script and notebook file are saved in the locations you specified.
• The job script is saved to the Amazon S3 location indicated by the job property Script path, in the
Scripts folder.
• The notebook file (.ipynb) is saved to the Amazon S3 location indicated by the job property Script
path, in the Notebooks folder.
When you save the job, the job script contains only the code cells from the notebook. The Markdown
cells aren't included.
After you save the job, you can then run the job using the script that you created in the notebook.
Managing notebook sessions
To modify the default session timeout for notebooks in AWS Glue Studio, enter the %idle_timeout magic
in a cell and specify the timeout value in minutes. For example, %idle_timeout 15 changes the default
timeout from 60 minutes to 15 minutes. If the session is not used for 15 minutes, the session is
automatically stopped.
To view a list of the available Python modules, see Using Python Libraries with AWS Glue.
You can also specify the Number of workers with %number_of_workers. For example, to specify 40
workers: %number_of_workers 40.
If you navigate away from the notebook in the AWS console, you will receive a warning message where
you can choose to stop the session.
AWS Glue Visual Job API (Preview)
Topics
• API design and CRUD APIs (p. 5)
• Getting started (p. 5)
• API design and CRUD APIs (p. 9)
• SDK Onboarding (p. 9)
• Appendix: Visual Job Examples and Model Definitions (p. 9)
AWS Glue provides an API that allows customers to create data integration jobs using the AWS Glue API
from a JSON object that represents a DAG. Customers can then use the visual editor in AWS Glue Studio
to work with these jobs.
API design and CRUD APIs
Updates to the codeGenConfigurationNodes field are made through the update-job AWS Glue API, in the
same way as create-job. Specify the entire field in update-job with the DAG changed as desired. A null
value is ignored and no update to the DAG is performed. An empty structure or string causes
codeGenConfigurationNodes to be set to empty and any previous DAG to be removed. The get-job API
returns the DAG if one exists, and the delete-job API also deletes any associated DAG.
Getting started
Follow the SDK Onboarding (p. 9).
To create a job, use the createJob function. The CreateJobRequest input has an additional field,
‘codeGenConfigurationNodes’, where you can specify the DAG object in JSON. For example:
{
"Name":"myjob1",
"Role":"arn:aws:iam::253723508848:role/myrole",
"Description":"",
"GlueVersion":"2.0",
"Command":{
"Name":"glueetl",
"ScriptLocation":"s3://myscripts/myjob1.py",
"PythonVersion":"3"
},
"MaxRetries":3,
"Timeout":2880,
"ExecutionProperty":{
"MaxConcurrentRuns":1
},
"NotificationProperty":{},
"DefaultArguments":{
"--class":"GlueApp",
"--job-language":"python",
"--job-bookmark-option":"job-bookmark-enable",
"--TempDir":"s3://assets/temporary/",
"--enable-metrics":"true",
"--enable-continuous-cloudwatch-log":"true",
"--enable-spark-ui":"true",
"--spark-event-logs-path":"s3://assets/sparkHistoryLogs/",
"--encryption-type":"sse-s3",
"--enable-glue-datacatalog":"true"
},
"Tags":{},
"DeveloperMode":false,
"WorkerType":"G.1X",
"NumberOfWorkers":10,
"CodeGenConfigurationNodes":{
"node-1":{
"S3CatalogSource":{
"Database":"database",
"Name":"S3 bucket",
"Table":"table1"
}
},
"node-2":{
"ApplyMapping":{
"Mapping":[
{
"FromPath":[
"col0"
],
"ToKey":"col0",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col1"
],
"ToKey":"col1",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col2"
],
"ToKey":"col2",
"ToType":"string",
"FromType":"string",
"Dropped":false
},
{
"FromPath":[
"col3"
],
"ToKey":"col3",
"ToType":"string",
"FromType":"string",
"Dropped":false
}
],
"Inputs":[
"node-1"
],
"Name":"ApplyMapping"
}
},
"node-3":{
"S3CatalogTarget":{
"Path":"s3://mypath/",
"UpdateCatalogOptions":"none",
"Inputs":[
"node-1",
"node-2"
],
"SchemaChangePolicy":{
"enableUpdateCatalog":false
},
"Name":"S3 bucket",
"Format":"json",
"PartitionKeys":[],
"Compression":"none"
}
}
}
}
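As an illustration only, the following sketch shows how a request like this might be submitted with the AWS SDK for Python (boto3). It assumes an SDK build that already includes the CodeGenConfigurationNodes field (during the preview this may require the customized service model described in SDK Onboarding (p. 9)); the role ARN, bucket paths, and catalog names are placeholders.

import boto3

glue = boto3.client("glue")

# Minimal DAG: read a Data Catalog table backed by Amazon S3 and write it to S3 as JSON
dag = {
    "node-1": {
        "S3CatalogSource": {
            "Name": "S3 bucket",
            "Database": "database",
            "Table": "table1"
        }
    },
    "node-2": {
        "S3DirectTarget": {
            "Name": "S3 bucket",
            "Inputs": ["node-1"],
            "Path": "s3://mypath/",
            "Format": "json",
            "PartitionKeys": [],
            "Compression": "none",
            "SchemaChangePolicy": {"EnableUpdateCatalog": False}
        }
    }
}

response = glue.create_job(
    Name="myjob1",
    Role="arn:aws:iam::111122223333:role/myrole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://myscripts/myjob1.py",
        "PythonVersion": "3"
    },
    WorkerType="G.1X",
    NumberOfWorkers=10,
    CodeGenConfigurationNodes=dag
)
print(response["Name"])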
The following example creates a second job that joins two data sources. (The example is abbreviated; as
in the previous example, the node definitions belong under the CodeGenConfigurationNodes field of the
request.)
{
"Name": "myjob2",
"Role": "arn:aws:iam::253723508848:role/myrole",
"Description": "",
"GlueVersion": "2.0",
"Command": {
"Name": "glueetl",
"ScriptLocation": "s3://myscripts/myjob1.py",
"PythonVersion": "3"
},
"MaxRetries": 3,
"Timeout": 2880,
"ExecutionProperty": {
"MaxConcurrentRuns": 1
},
"node-3": {
"S3DirectTarget": {
"Path": "s3://mypath/",
"UpdateCatalogOptions": "none",
"Inputs": [
"node-1624994219677"
],
"SchemaChangePolicy": {
"EnableUpdateCatalog": false
},
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Compression": "none"
}
},
"node-1624994205115": { "
CatalogSource": {
"Name": "AWS Glue Data Catalog",
"Database": "database2",
"Table": "table2"
}
},
"node-1624994219677": {
"Join": {
"Name": "Join",
"Inputs": [
"node-1624994205115",
"node-2"
],
"JoinType": "equijoin",
"Columns": [
{
"From": "node-1624994205115",
"Keys": [
"firstname"
]
},
{
"From": "node-2",
"Keys": [
"col0"
]
}
],
"ColumnConditions": [
"="
]
}
},
"node-2": {
"S3CatalogSource": {
"Database": "database",
"Input": ["node-1624994219677"],
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Compression": "none"
}
}
}
Updating jobs: Because updateJob also has a ‘codeGenConfigurationNodes’ field, the input format is the
same as for createJob. The get-job command returns a ‘codeGenConfigurationNodes’ field in the same
format as well.
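A hedged sketch of this round trip with the AWS SDK for Python (boto3) might look like the following; it assumes the same preview-enabled SDK as above and that a job named myjob1 already exists.

import boto3

glue = boto3.client("glue")

# get-job returns the DAG (if one exists) under CodeGenConfigurationNodes
job = glue.get_job(JobName="myjob1")["Job"]
dag = job.get("CodeGenConfigurationNodes", {})

# Modify the DAG as needed, then send the entire field back with update-job.
# A null value is ignored; an empty structure removes any previous DAG.
glue.update_job(
    JobName="myjob1",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "CodeGenConfigurationNodes": dag
    }
)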
API design and CRUD APIs
The codeGenConfigurationNodes field is a map of node IDs to node definitions, for example:
{
"Nodeid-1": {...},
"Nodeid-2": {...}
}
SDK Onboarding
To access the required files, go to the GitHub repository as described below.
CLI
Go to the GitHub repository to access the service-2.json file and download the file. If you're using macOS
or Linux, place this file in the folder ~/.aws/models/glue/2017-03-31. If .aws does not exist, that means
you have to configure the AWS CLI first. AWS CLI installation instructions can be found here. If you do not
have the other folders, you can create them manually. The CLI with this custom model can then be used in
the same way that the CLI is normally used.
Java SDK
For older Java clients, a JAR called AwsGlueJavaClient-1.12.x.jar is available on the GitHub
repository.
To use the newer AWS SDK for Java 2.x, a JAR called AwsJavaSdk-Glue-2.0.jar is available on the
GitHub repository.
Add the JAR to your class path in your preferred way. After the JAR is added to your class path, it can be
used in the same way as you are using the existing AWS Glue SDK.
Appendix: Visual Job Examples and Model Definitions
Examples
Sources
{
"Database": "database",
"Table": "table1",
"Name": "S3 bucket",
"IsCatalog": true
}
{
"Database": "database",
"Table": "rdsSource",
"Name": "MyRdsSource",
"IsCatalog": true
}
Data Targets
S3CatalogTarget
{
"Inputs": [
"node-1625147321253"
],
"Database": "dbl",
"Table": "s3Table",
"Name": "s3 bucket",
"Format": "json",
"PartitionKeys": [
"col1"
],
"UpdateCatalogOptions": "schemaAndPartitions",
"SchemaChangePolicy": {
"EnableUpdateCatalog": true,
"UpdateBehavior": "UPDATE_IN_DATABASE"
}
}
S3DirectTarget
{
"Path": "s3://mypath/",
"UpdateCatalogOptions": "none",
"Inputs": [
"node-2"
],
"SchemaChangePolicy": {
"EnableUpdateCatalog": false
},
"Name": "S3 bucket",
"Format": "json",
"PartitionKeys": [],
"Classification": "DataSink",
"Compression": "none"
}
Transforms
Rename Field
{
"Inputs": [
"node-1"
],
"Name": "MyRenameField",
"SourcePath": "col3"
"TargetPath": "name"
}
Filter
{
"Name": "Filter",
"Inputs": [
"node-2"
],
"LogicalOperator": "AND",
"Filters": [
{
"Operation": "ISNULL",
"Negated": false,
"Values": [
{
"Type": "COLUMNEXTRACTED",
"Value": "col1"
}
]
},
{
"Operation": "REGEX",
"Negated": false,
"Values": [
{
"Type": "CONSTANT",
"Value": ".*"
},
{
"Type": "COLUMNEXTRACTED",
"Value": "col2"
}
]
}
]
}
Model Definitions
Sources
AthenaConnector
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"ConnectionTable": 256 character String Required,
"SchemaName": 256 character String Required
}
JDBCConnector
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"AdditionalOptions": JDBCConnectorOptions,
"ConnectionTable": 256 character String,
"Query": 256 character String
}
JDBCConnectorOptions:
{
"FilterPredicate": 256 character String,
"PartitionColumn": 256 character String,
"LowerBound": Non-Negative Long,
"UpperBound": Non-Negative Long,
"NumPartitions": Non-Negative Long,
"JobBookmarkKeys": List of Strings up to 100,
"JobBookmarkKeysSortOrder": ASC or DESC,
"DataTypeMapping": Map<DBCDataType, JDBCDataType>
SparkConnectorSource
{
"Name": 100 character String Required,
"ConnectionName": 256 character String Required,
"ConnectorName": 256 character String,
"ConnectionType": 256 character String Required,
"AdditionalOptions": Map<256 character String, Object>
}
CatalogSource:
{
"Name": 100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required
}
CatalogKinesisSource
{
"Name": 100 character String Required,
"Database": 256 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions,
"Table": 256 character String Required
}
KinesisStreamingSourceOptions
{
"EndpointUrl":256 character String,
"StreamName":256 character String, "Classification":256 character
String,
"Delimiter":256 character String,
"StartingPosition": LATEST or TRIM_HORIZON or EARLIEST,
"MaxFetchTimeInMs": Non-negative Long,
"MaxFetchRecordsPerShard": non-negative Long,
"MaxRecordsPerRead": Non-negative Long,
"AddIdleTimeBetweenReads": Boolean,
"IdleTimeBetweenReadsInMs": Non-negative Long,
"DescribeShardInterval": Non-negative Long,
"NumRetries": Positive Integer,
"RetryIntervalInMs": Non-negative Long,
"MaxRetryIntervalMs": Non-negative Long,
"AvoidEmptyBatches": Boolean,
"StreamARN": 256 character String
"AwsSTSRoleARN": 256 character String,
"AwsSTSSessionName": 256 character String
}
DirectKinesisSource
{
"Name":100 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KinesisStreamingSourceOptions
}
CatalogKafkaSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions,
"Table": 256 character String Required,
}
KafkaStreamingSourceOptions
{
"BootstrapServers":256 character String,
"SecurityProtocol":256 character String,
"ConnectionName":256 character String,
"TopicName":256 character String,
"Assign":256 character String,
"SubscribePattern":256 character String,
"Classification":256 character String,
"Delimiter":256 character String,
"StartingOffsets":256 character String,
"EndingOffsets":256 character String,
"PollTimeoutInMs": Non-negative long,
"NumRetries": Positive integer,
"RetryIntervalMs": Non-negative long,
"MaxOffsetsPerTrigger": Non-negative long,
"MinPartitions": Non-negative integer
}
DirectKafkaSource
{
"Name":100 character String Required,
"WindowSize": Positive Integer,
"DetectSchema": Boolean,
"StreamingOptions": KafkaStreamingSourceOptions
}
RedshiftSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required,
"RedshiftTmpDir":256 character String,
"TmpDirIAMRole":256 character String
}
S3CatalogSource
{
"Name":100 character String Required,
"Database": 256 character String Required,
"Table": 256 character String Required,
"S3SourceAdditionalOptions": {
//Only one can be specified, or neither
"BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long
}
}
S3CSVSource
S3JSONSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required,
"CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings,
"GroupFiles":256 character String,
"GroupSize":256 character String,
"Recurse":Boolean,
"MaxBand": Non negative Integer,
"MaxFilesInBand": Non negative Integer,
"S3SourceAdditionalOptions": {
//Only one can be specified, or neither
"BoundedSize":Nullable Long,
"BoundedFiles":Nullable Long
},
"JsonPath":256 character String,
"Multiline":Boolean
}
S3ParquetSource
{
"Name":100 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required,
"CompressionType":gzip or bzip2,
"Exclusions": List of Strings. Up to 100 256 character Strings,
"GroupFiles":256 character String,
"GroupSize":256 character String,
"Recurse":Boolean,
"MaxBand": Non negative Integer,
"MaxFilesInBand": Non negative Integer,
"S3SourceAdditionalOptions": {
15
AWS Glue Studio User Guide
Model Definitions
Targets
JDBCConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"ConnectionName":256 character String Required,
"ConnectionTable":256 character String,
"ConnectorName":256 character String,
"ConnectionType":256 character String Required,
"ConnectionTypeSuffix":256 character String,
"AdditionalOptions":Map<256 character String,Object>
}
SparkConnectorTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"ConnectionName":256 character String Required,
"ConnectionTable":256 character String,
"ConnectorName":256 character String,
"ConnectionType":256 character String Required,
"ConnectionTypeSuffix": 256 character String,
AdditionalOptions":Map<256 character String,Object>
}
CatalogTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Database":256 character String Required,
"Table":256 character String Required
}
RedshiftTarget
{ "Name":100 character String Required, "Inputs": List of Strings. One 256 character String Required,
"Database":256 character String Required, "Table":256 character String Required, "RedshiftTmpDir":256
character String, "TmpDirIAMRole":256 character String }
S3CatalogTarget
{
"Name":100 character String Required,
S3DirectTarget
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"PartitionKeys": List of Strings. Up to 100 256 character Strings,
"Path":256 character String Required,
"Compression": gzip or bzip2,
"Format":json, csv, avro, orc, or parquet Required,
"SchemaChangePolicy": DirectSchemaChangePolicy
}
DirectSchemaChangePolicy:
{
"EnableUpdateCatalog": Boolean,
"UpdateBehavior": "LOG" | "UPDATE_IN_DATABASE",
"Database":256 character String,
"Table":256 character String
}
Transforms
ApplyMapping
See the end of the document for the possible values of ApplyMappingType
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Mapping":List of up to 250 Mapping Required
}
Mapping:
{
"ToKey":256 character String Required,
"FromPath": List of Strings. One 256 character String Required,
"FromType":ApplyMappingType Required,
"ToType": ApplyMappingType Required,
"Dropped":Boolean,
"Children": List of up to 250 Mapping
}
SelectFields
{
"Name":100 character String Required,
DropFields
{
"Name":100 character String Required,
"Inputs: List of Strings. One 256 character String Required,
"Paths": List of Strings. Up to 100 256 character Strings Required
}
RenameField
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"SourcePath":List of Strings. Up to 100 256 character Strings Required
"TargetPath":256 character String Required
}
Spigot
{
"name":100 character String Required,
"inputs": List of Strings. One 256 character String Required,
"path":256 character String Required,
"topk":Integer from 0 to 100,
"prob":Double from 0 to 1.0
}
Join
{
"Name": 100 character String Required
"Inputs": List of Strings. Two 256 character String Required
"JoinTYpe": equijoin, left, right, outer, leftsemi, or
leftanti Required
"Columns": List[Column] Required
}
Column:
{
"From": 256 character String Required
"Keys": List[String] Required
}
SplitFields
{
"Name":100 character String Required,
SelectFromCollection
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"Index": Non Negative Integer Required
}
FillMissingValues
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String RRequired,
"ImputedPath":256 character String Required
"FilledPath":256 character String
}
Filter
{
"Name":100 character String Required,
"Inputs": List of Strings. One 256 character String Required,
"LogicalOperator":String Required,
"Filters":List[FilterInstance] Required
}
FilterInstance:
{
"Operation": "EQ" | "LT" | "GT" | "LTE" | "GTE" | "REGEX" |
"ISNULL" Required,
"Negated":Boolean,
"Values":List[FilterValue] Required
}
FilterValue:
{
"Type": "COLUMNEXTRACTED" | "CONSTANT" Required,
"Value": Object Required,
CustomCode
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"Code":Up to 51,200 character string or 50 KB Required,
"ClassName":256 character String Required
}
SparkSQL
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"SqlQuery": Up to 51,200 character string or 50 KB Required,
"SqlAliases":List of Alias. Up to 256 Aliases Required
}
Alias:
{
"From":256 character String Required,
"Alias":256 character String Required
}
DropNullFields
{
"Name":100 character String Required,
"Inputs": List of Strings. One to fifty 256 character String Required,
"Paths":List of Strings. Up to 100 256 character Strings Required
"NullCheckBoxList": NullCheckBoxList,
"NullTextList": List of NullValueField. Up to 50 NullValueField.
}
NullCheckBoxList
{
"IsEmpty": Boolean,
"IsNullString": Boolean,
"IsNegOne": Boolean
}
NullValueField
{
"Value": 256 character String,
"DataType": DataType
}
DataType
{
"Id": 256 character String,
"Label": 256 character String
}
Union
{
"Name":100 character String Required,
"Inputs":List of Strings. Two 256 character String Required,
"Sources":List of Strings. Two 256 character String,
"UnionType": ALL or DISTINCT Required
}
Enums
JDBCDataType
ARRAY, BIGINT, BINARY, BIT, BLOB, BOOLEAN, CHAR, CLOB, DATALINK, DATE, DECIMAL, DISTINCT, DOUBLE,
FLOAT, INTEGER, JAVA_OBJECT, LONGNVARCHAR, LONGVARBINARY, LONGVARCHAR, NCHAR, NCLOB, NULL, NUMERIC,
NVARCHAR, WITH_TIMEZONE, TIMESTAMP, TIMESTAMP_WITH_TIMEZONE, TINYINT, VARBINARY, VARCHAR
ApplyMappingType
bigint, binary, boolean, char, date, decimal, double, float, int, interval, long, smallint, string,
timestamp, tinyint, varchar
Detect PII (Preview)
Note
Using the Detect PII transform in AWS Glue Studio jobs requires AWS Glue 2.0.
The Detect PII transform identifies Personally Identifiable Information (PII) in your data source. You
choose the PII entities to identify, how you want the data to be scanned, and what to do with the PII
entities that the Detect PII transform identifies.
Topics
• Choosing how you want the data to be scanned (p. 22)
• Choosing the PII entities to take action on (p. 23)
• Choosing what to do with identified PII data (p. 24)
Choosing how you want the data to be scanned
When you choose Detect PII in each cell, you’re choosing to scan all rows in the data source. This is a
comprehensive scan to ensure that PII entities are identified.
When you choose Detect fields containing PII, you’re choosing to scan a sample of rows for PII entities.
This is a way to keep costs and resources low while also identifying the fields where PII entities are found.
When you choose to detect fields that contain PII, you can reduce costs and improve performance by
sampling a portion of rows. Choosing this option will allow you to specify additional options:
• Sample portion: This allows you to specify the percentage of rows to sample. For example, if you
enter ‘50’, 50 percent of the rows are scanned for the PII entity.
• Detection threshold: This allows you to specify the percentage of rows that contain the PII entity
in order for the entire column to be identified as having the PII entity. For example, if you enter ‘10’,
you’re specifying that the number of the PII entity, US Phone, in the rows that are scanned must be
10 percent or greater in order for the field to be identified as having the PII entity, US Phone. If the
percentage of rows that contain the PII entity is less than 10 percent, that field will not be labeled as
having the PII entity, US Phone, in it.
Choosing the PII entities to take action on
You can choose one or more of the following PII entities to detect:
• ITIN (US)
• Email
• Passport Number (US)
• US Phone
• Credit Card
• Bank Account (US, Canada)
• US Driving License
• IP Address
• MAC Address
• DEA Number (US)
• HCPCS Code (US)
• National Provider Identifier (US)
• National Drug Code (US)
• Health Insurance Claim Number (US)
• Medicare Beneficiary Identifier (US)
• CPT Code (US)
Choosing what to do with identified PII data
• Enrich data with detection results: If you chose Detect PII in each cell, you can store the detected
entities into a new column.
• Redact detected text: You can replace the detected PII value with a string that you specify in the
optional Replacing text input field. If no string is specified, the detected PII entity is replaced with
'*******'.
If you chose to detect fields containing PII, you can choose to take the following actions:
• Output Detection Results: This creates a new dataframe with the detected PII information for each
column.
• Redact detected text: You can replace the detected PII value with a string that you specify. If no string
is specified, the detected PII entity is replaced with '*******'.
What is AWS Glue Studio?
AWS Glue Studio is designed not only for tabular data, but also for semi-structured data, which is
difficult to render in spreadsheet-like data preparation interfaces. Examples of semi-structured data
include application logs, mobile events, Internet of Things (IoT) event streams, and social feeds.
When creating a job in AWS Glue Studio, you can choose from a variety of data sources that are stored
in AWS services. You can quickly prepare that data for analysis in data warehouses and data lakes. AWS
Glue Studio also offers tools to monitor ETL workflows and validate that they are operating as intended.
You can preview the dataset for each node. This helps you to debug your ETL jobs by displaying a sample
of the data at each step of the job.
AWS Glue Studio provides a visual interface that makes it easy to:
• Run, monitor, and manage the jobs created in AWS Glue Studio.
Features of AWS Glue Studio
Notebook interface for interactively developing and debugging job scripts
The notebook editor interface in AWS Glue Studio offers the following features:
• Test in the exact same run environment that your AWS Glue ETL jobs run in.
Job script code editor
When creating a new job, you can choose to write scripts for Spark jobs or Python shell jobs. You can
code the job ETL script for Spark jobs using either Python or Scala. If you create a Python shell job, the
job ETL script uses Python 3.6.
The script editor interface in AWS Glue Studio offers the following features:
• Insert, modify, and delete sources, targets, and transforms in your script.
• Add or modify arguments for data sources, targets, and transforms.
• Syntax and keyword highlighting
• Auto-completion suggestions for local words, Python keywords, and code snippets.
Job performance dashboard
AWS Glue Studio provides a job performance dashboard with the following information:
• Jobs overview summary – A high-level overview showing total jobs, current runs, completed runs, and
failed jobs.
• Status summaries – Provides high level job metrics based on job properties, such as worker type and
job type.
• Job runs timeline – A bar graph summary of successful, failed, and total runs for the currently selected
time frame.
• Job run breakdown – A detailed list of job runs from the selected time frame.
When should I use AWS Glue Studio?
AWS Glue Studio makes it easy for ETL developers to create repeatable processes to move and transform
large-scale, semi-structured datasets, and load them into data lakes and data warehouses. It provides a
boxes-and-arrows style visual interface for developing and managing AWS Glue ETL workflows that you
can optionally customize with code. AWS Glue Studio combines the ease of use of traditional ETL tools
with the power and flexibility of AWS Glue’s big data processing engine.
AWS Glue Studio provides multiple ways to customize your ETL scripts, including adding nodes that
represent code snippets in the visual editor.
Use AWS Glue Studio for easier job management. AWS Glue Studio provides you with job and job run
management interfaces that make it clear how jobs relate to each other, and give an overall picture of
your job runs. The job management page makes it easy to do bulk operations on jobs (previously difficult
to do in the AWS Glue console). All job runs are available in a single interface where you can search and
filter. This gives you a constantly updated view of your ETL operations and the resources you use. You
can use the real-time dashboard in AWS Glue Studio to monitor your job runs and validate that they are
operating as intended.
Pricing for AWS Glue Studio
You also pay for the underlying AWS services that your jobs use or interact with, such as AWS Glue,
your data sources, and your data targets. For pricing information, see AWS Glue Pricing.
Setting up
Topics
• Complete initial AWS configuration tasks (p. 29)
• Review IAM permissions needed for the AWS Glue Studio user (p. 31)
• Review IAM permissions needed for ETL jobs (p. 34)
• Set up IAM permissions for AWS Glue Studio (p. 35)
• Configure a VPC for your ETL job (p. 37)
• Populate the AWS Glue Data Catalog (p. 38)
Complete initial AWS configuration tasks
You can either use the administrator user for creating and managing your ETL jobs, or you can create a
separate user for accessing AWS Glue Studio.
To create additional users for AWS Glue or AWS Glue Studio, follow the steps in Creating Your First IAM
Delegated User and Group in the IAM User Guide.
• Sign up for AWS (p. 29)
• Create an IAM administrator user (p. 29)
• Sign in as an IAM user (p. 30)
Sign up for AWS
1. Open https://portal.aws.amazon.com/billing/signup.
2. Follow the online instructions.
Part of the sign-up procedure involves receiving a phone call and entering a verification code on the
phone keypad.
Create an IAM administrator user
To create an administrator user for yourself and add the user to an administrators group
(console)
1. Sign in to the IAM console as the account owner by choosing Root user and entering your AWS
account email address. On the next page, enter your password.
Note
We strongly recommend that you adhere to the best practice of using the Administrator
IAM user that follows and securely lock away the root user credentials. Sign in as the root
user only to perform a few account and service management tasks.
2. In the navigation pane, choose Users and then choose Add user.
3. For User name, enter Administrator.
4. Select the check box next to AWS Management Console access. Then select Custom password, and
then enter your new password in the text box.
5. (Optional) By default, AWS requires the new user to create a new password when first signing in. You
can clear the check box next to User must create a new password at next sign-in to allow the new
user to reset their password after they sign in.
6. Choose Next: Permissions.
7. Under Set permissions, choose Add user to group.
8. Choose Create group.
9. In the Create group dialog box, for Group name enter Administrators.
10. Choose Filter policies, and then select AWS managed - job function to filter the table contents.
11. In the policy list, select the check box for AdministratorAccess. Then choose Create group.
Note
You must activate IAM user and role access to Billing before you can use the
AdministratorAccess permissions to access the AWS Billing and Cost Management
console. To do this, follow the instructions in step 1 of the tutorial about delegating access
to the billing console.
12. Back in the list of groups, select the check box for your new group. Choose Refresh if necessary to
see the group in the list.
13. Choose Next: Tags.
14. (Optional) Add metadata to the user by attaching tags as key-value pairs. For more information
about using tags in IAM, see Tagging IAM entities in the IAM User Guide.
15. Choose Next: Review to see the list of group memberships to be added to the new user. When you
are ready to proceed, choose Create user.
You can use this same process to create more groups and users and to give your users access to your AWS
account resources. To learn about using policies that restrict user permissions to specific AWS resources,
see Access management and Example policies.
Review IAM permissions needed for the AWS Glue Studio user
Job Actions
• GetJob
• CreateJob
• DeleteJob
• GetJobs
• UpdateJob
• *QueryJobs
• *SaveJob
• *CreateDag
• *UpdateDag
• *GetDag
• *DeleteDag
Database Actions
• GetDatabases
Plan Actions
• GetPlan
Job Run Actions
• StartJobRun
• GetJobRuns
• BatchStopJobRun
• GetJobRun
• *QueryJobRuns
• *QueryJobRunsAggregated
Schema Actions
• *GetSchema
• *GetInferredSchema
Table Actions
• SearchTables
• GetTables
• GetTable
Connection Actions
File Actions
• GetFile
Mapping Actions
• GetMapping
• *GetNextScheduledJobs
Repository Actions
• *ListRepositories
Branch Actions
• *ListBranches
• *GetBranches
Commit Actions
• *CreateCommit
• *GetCommit
S3 Proxy Actions
• *ListBuckets
• *ListObjectsV2
• *GetBucketLocation
• *CreateSchedule
• *GetSchedule
• *DeleteSchedule
• GetSecurityConfigurations
Script Actions
Notebook and data preview permissions
To ensure data previews and notebook commands work correctly, use a role that has a name that starts
with the string AWSGlueServiceRole. If you choose to use a different name for your role, then you
must add the iam:PassRole permission and configure a role trust policy for the role in IAM. Add the
AWS Glue service as a principal in this trust policy, as described in Create a trust policy for roles not
named "AWSGlueServiceRole*" (p. 36).
Warning
If a role grants the iam:PassRole permission for a notebook, and you implement role
chaining, a user could unintentionally gain access to the notebook. There is currently no auditing
implemented which would allow you to monitor which users have been granted access to the
notebook.
Amazon CloudWatch permissions
To access CloudWatch dashboards, the user accessing AWS Glue Studio needs one of the following:
For more information about changing permissions for an IAM user using policies, see Changing Permissions
for an IAM User in the IAM User Guide.
Review IAM permissions needed for ETL jobs
The name of the role that you create for the job must start with the string AWSGlueServiceRole
for it to be used correctly by AWS Glue Studio. For example, you might name your role
AWSGlueServiceRole-FlightDataJob.
If you choose Amazon Redshift as your data source, you can provide a role for cluster permissions. Jobs
that run against an Amazon Redshift cluster issue commands that access Amazon S3 for temporary
storage using temporary credentials. If your job runs for more than an hour, these credentials will expire,
causing the job to fail. To avoid this problem, you can assign a role to the Amazon Redshift cluster itself
that grants the necessary permissions to jobs using temporary credentials. For more information, see
Moving Data to and from Amazon Redshift in the AWS Glue Developer Guide.
If the job uses data sources or targets other than Amazon S3, then you must attach the necessary
permissions to the IAM role used by the job to access these data sources and targets. For more
information, see Setting Up Your Environment to Access Data Stores in the AWS Glue Developer Guide.
If you're using connectors and connections for your data store, you need additional permissions, as
described in the section called “Permissions required for using connectors” (p. 35).
AWS Key Management Service permissions
If your job reads or writes data that is encrypted with AWS KMS, the job role needs permissions that
enable the job to decrypt the data. The job role needs the kms:ReEncrypt, kms:GenerateDataKey,
and kms:DescribeKey permissions. Additionally, the job role needs the kms:Decrypt permission
to upload or download an Amazon S3 object that is encrypted with an AWS KMS customer master key
(CMK).
There are additional charges for using AWS KMS CMKs. For more information, see AWS Key Management
Service Concepts - Customer Master Keys (CMKs) and AWS Key Management Service Pricing in the AWS
Key Management Service Developer Guide.
If your AWS Glue ETL job runs within a virtual private cloud (VPC) using Amazon VPC, then the VPC must be configured as
described in the section called “Configure a VPC for your ETL job” (p. 37).
Set up IAM permissions for AWS Glue Studio
You can use the AWSGlueConsoleFullAccess AWS managed policy to provide the necessary permissions
for using the AWS Glue Studio console.
To create your own policy, follow the steps documented in Create an IAM Policy for the AWS Glue
Service in the AWS Glue Developer Guide. Include the IAM permissions described previously in Review IAM
permissions needed for the AWS Glue Studio user (p. 31).
Topics
• Create an IAM Role (p. 35)
• Attach policies to the AWS Glue Studio user (p. 36)
• Create a trust policy for roles not named "AWSGlueServiceRole*" (p. 36)
Create an IAM Role
You need to grant your IAM role permissions that AWS Glue Studio and AWS Glue can assume when
calling other services on your behalf. This includes access to Amazon S3 for storing scripts and temporary
files, and any other sources or targets that you use with AWS Glue Studio.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
Attach policies to the AWS Glue Studio user
If you choose a different name for your role, you must add a policy to allow your users the
iam:PassRole permission for IAM roles to match your naming convention.
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane, choose Policies.
3. In the list of policies, select the check box next to the AWSGlueConsoleFullAccess. You can use the
Filter menu and the search box to filter the list of policies.
4. Choose Policy actions, and then choose Attach.
5. Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the
list of principal entities. After choosing the user to attach the policy to, choose Attach policy.
6. Repeat the previous steps to attach additional policies to the user, as needed.
Create a trust policy for roles not named "AWSGlueServiceRole*"
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the left-side navigation, choose Roles.
3. Locate the role used for data previews or your ETL job, and then choose the role name.
4. Choose the Trust Relationships tab, and then choose the Edit trust relationship button.
5. Copy and paste the following blocks into the policy under the "Statement" array.
{
"Action": ["iam:PassRole"],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": ["glue.amazonaws.com"]
}
}
},
{
"Effect": "Allow",
"Principal": {
"Service": ["glue.amazonaws.com"]
},
"Action": "sts:AssumeRole"
}
Here is the full example with the Version and Statement arrays included in the policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["iam:PassRole"],
"Effect": "Allow",
"Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
"Condition": {
"StringLike": {
"iam:PassedToService": ["glue.amazonaws.com"]
}
}
},
{
"Effect": "Allow",
"Principal": {
"Service": ["glue.amazonaws.com"]
},
"Action": "sts:AssumeRole"
}
]
}
Configure a VPC for your ETL job
You can configure your AWS Glue ETL jobs to run within a VPC when using connectors. You must
configure your VPC for the following, as needed:
• Public network access for data stores not in AWS. All data stores that are accessed by the job must be
available from the VPC subnet.
• If your job needs to access both VPC resources and the public internet, the VPC needs to have a
network address translation (NAT) gateway inside the VPC.
For more information, see Setting Up Your Environment to Access Data Stores in the AWS Glue
Developer Guide.
Populate the AWS Glue Data Catalog
When reading from or writing to a data source, your ETL job needs to know the schema of the data. The
ETL job can get this information from a table in the AWS Glue Data Catalog. You can use a crawler, the
AWS Glue console, AWS CLI, or an AWS CloudFormation template file to add databases and tables to
the Data Catalog. For more information about populating the Data Catalog, see Data Catalog in the AWS
Glue Developer Guide.
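For example, because the AWS CLI and AWS SDKs call the same AWS Glue APIs, you could also register a database and table programmatically. The following sketch uses the AWS SDK for Python (boto3); the bucket, database, table, and column names are placeholders for your own data.

import boto3

glue = boto3.client("glue")

# Create a database and register a table that describes CSV files stored in Amazon S3
glue.create_database(DatabaseInput={"Name": "flights-db"})
glue.create_table(
    DatabaseName="flights-db",
    TableInput={
        "Name": "flightscsv",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "fl_date", "Type": "string"},
                {"Name": "airline_id", "Type": "bigint"}
            ],
            "Location": "s3://DOC-EXAMPLE-BUCKET/flights/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","}
            }
        }
    }
)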
When using connectors, you can use the schema builder to enter the schema information when you
configure the data source node of your ETL job in AWS Glue Studio. For more information, see the
section called “Authoring jobs with custom connectors” (p. 85).
For some data sources, AWS Glue Studio can automatically infer the schema of the data it reads from the
files at the specified location.
• For Amazon S3 data sources, you can find more information at Using files in Amazon S3 for the data
source (p. 51).
• For streaming data sources, you can find more information at Using a streaming data source (p. 53).
Tutorial: Getting started
Topics
• Prerequisites (p. 39)
• Step 1: Start the job creation process (p. 39)
• Step 2: Edit the data source node in the job diagram (p. 40)
• Step 3: Edit the transform node of the job (p. 41)
• Step 4: Edit the data target node of the job (p. 41)
• Step 5: View the job script (p. 42)
• Step 6: Specify the job details and save the job (p. 42)
• Step 7: Run the job (p. 43)
• Next steps (p. 43)
Prerequisites
This tutorial has the following prerequisites:
To create these components, you can complete the service tutorial Add a crawler, which populates
the AWS Glue Data Catalog with the necessary objects. This tutorial also creates an IAM role with
the necessary permissions. You can find the tutorial on the AWS Glue service page at https://
console.aws.amazon.com/glue/. The tutorial is located in the left-side navigation, under Tutorials.
Alternatively, you can use the documentation version of this tutorial, Tutorial: Adding an AWS Glue
crawler (p. 114).
Step 1: Start the job creation process
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. On the AWS Glue Studio landing page, choose View jobs under the heading Create and manage
jobs.
3. On the Jobs page, under the heading Create job, choose the Source and target added to the graph
option. Then, choose S3 for the Source and S3 for the Target.
4. Choose the Create button to start the job creation process.
The job editing page opens with a simple three-node job diagram displayed.
Step 2: Edit the data source node in the job diagram
1. On the Node properties tab in the node details pane, for Name, enter a name that is unique for this
job.
The value you enter is used as the label for the data source node in the job diagram. If you use
unique names for the nodes in your job, then it's easier to identify each node in the job diagram, and
also to select parent nodes. For this tutorial, enter the name S3 Flight Data.
2. Choose the Data source properties - S3 tab in the node details panel.
3. Choose the Data Catalog table option for the S3 source type.
4. For Database, choose the flights-db database from the list of available databases in your AWS Glue
Data Catalog.
5. For Table, enter flight in the search field, and then choose the flightscsv table from your AWS
Glue Data Catalog.
6. (Optional) Choose the Output schema tab in the node details panel to view the data schema.
7. (Optional) After configuring the node properties and data source properties, you can preview the
dataset from your data source by choosing the Data preview tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. There is a cost associated with using this feature, and billing starts as soon as you provide
an IAM role.
By default, the first 5 columns are selected for viewing in the Data preview tab. To view other
columns, choose Previewing 5 of 65 fields. For example, you can deselect the first 5 columns and
select fl_date, airline_id, fl_num, tail_num, and origin_airport_id. Scroll to the end of
the column list and choose Confirm to save your choices.
After you have provided the required information for the data source node, a green check mark appears
on the node in the job diagram.
Step 3: Edit the transform node of the job
When you edit the Transform - ApplyMapping node, the original schema for your data is shown in the
Source key column in the node details panel. This is the data property key name (column name) that is
obtained from the source data and stored in the table in the AWS Glue Data Catalog.
The Target key column shows the key name that will appear in the data target. You can use this field to
change the data property key name in the output. The Data type column shows the data type of the key
and allows you to change it to different data type for the target. The Drop column contains a check box.
This box allows you to choose a field to drop it from the target schema.
1. Choose the Transform - ApplyMapping node in the job diagram to edit the data transformation
properties.
2. In the node details panel, on the Node properties tab, review the information.
Change the data type for the month and day keys to tinyint. The tinyint data type stores integers
using 1 byte of storage, with a range of values from 0 to 255. When changing the data type, you
must verify that the data type is supported by your target.
6. (Optional) Choose the Output schema tab in the node details panel to view the modified schema.
7. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
By default, the first 5 columns are selected for the data preview, but the columns are no longer the
same as the columns viewed on the data source node because we dropped two of the columns and
renamed a third column.
Notice that the Transform - Apply Mapping node in the job diagram now has a green check mark,
indicating that the node has been edited and has all the required information.
Step 4: Edit the data target node of the job
1. Choose the Data target - S3 bucket node in the job diagram to edit the data target properties.
2. In the node details panel on the right, choose the Node properties tab. For Name, enter a unique
name for the node, such as Revised Flight Data.
3. Choose the Data target properties - S3 tab.
4. For Format, choose JSON.
For the S3 Target Location, choose the Browse S3 button to see the Amazon S3 buckets that you
have access to, and choose one as the target destination.
For the Data Catalog update options, keep the default setting of Do not update the Data Catalog.
For more information about the available options, see Overview of data target options (p. 73).
Step 5: View the job script
To view the script, choose the Script tab at the top of the job editing pane. Don’t click the Edit script
button, because this will take you out of visual editor mode.
If you clicked the Edit script button and confirmed your choice, you can reload the page (without saving
the job first), to reset the Script tab.
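The script generated for your job will differ in its details, but a Python script produced by the visual editor has roughly the following shape. The mapping list is abbreviated, the source column types shown are assumptions, and the output path is a placeholder.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Data source node: the S3 Flight Data table from the Data Catalog
S3FlightData = glueContext.create_dynamic_frame.from_catalog(
    database="flights-db",
    table_name="flightscsv",
    transformation_ctx="S3FlightData"
)

# Transform node: ApplyMapping (mappings abbreviated; month and day become tinyint)
ApplyMapping_node = ApplyMapping.apply(
    frame=S3FlightData,
    mappings=[
        ("fl_date", "string", "fl_date", "string"),
        ("month", "long", "month", "tinyint"),
        ("day", "long", "day", "tinyint")
    ],
    transformation_ctx="ApplyMapping_node"
)

# Data target node: write JSON files to the Amazon S3 location you chose
glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node,
    connection_type="s3",
    connection_options={"path": "s3://DOC-EXAMPLE-BUCKET/output/"},
    format="json",
    transformation_ctx="RevisedFlightData"
)

job.commit()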
Step 6: Specify the job details and save the job
If you have many roles to choose from, you can start entering part of the role name in the IAM role
search field, and the roles with the matching text string will be displayed. For example, you can
enter tutorial in the search field to find all roles with tutorial (case-insensitive) in the name.
The AWS Identity and Access Management (IAM) role is used to authorize access to resources that
are used to run the job. You can only choose roles that already exist in your account. The role you
choose must have permission to access your Amazon S3 sources, targets, temporary directory,
scripts, and any libraries used by the job, as well as access to AWS Glue service resources.
For the steps to create a role, see Create an IAM Role for AWS Glue in the AWS Glue Developer Guide.
4. For the remaining fields, use the default values.
You should see a notification at the top of the page that the job was successfully saved.
If you don't see a notification that your job was successfully saved, then there is most likely information
missing that prevents the job from being saved.
• Review the job in the visual editor, and choose any node that doesn't have a green check mark.
• If any of the tabs above the visual editor pane have a callout, choose that tab and look for any fields
that are highlighted in red.
You can choose either the link in the notification for Run Details, or choose the Runs tab to view the run
status of the job.
On the Runs tab, there is a card for each recent run of the job with information about that job run.
For more information about the job run information, see the section called “View information for recent
job runs” (p. 108).
Next steps
After you start the job run, you might want to try some of the following tasks:
• View the job monitoring dashboard – Accessing the job monitoring dashboard (p. 101).
• Try a different transform on the data – Editing the data transform node (p. 55).
• View the jobs that exist in your account – View your jobs (p. 108).
• Run the job using a time-based schedule – Schedule job runs (p. 106).
Start the job creation process
On the Jobs page, you can see all the jobs that you have created either with AWS Glue Studio or AWS
Glue. You can view, manage, and run your jobs on this page.
Topics
• Start the job creation process (p. 44)
• Create jobs that use a connector (p. 45)
• Next steps for creating a job in AWS Glue Studio (p. 45)
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. You can either choose Create and manage jobs from the AWS Glue Studio landing page, or you can
choose Jobs from the navigation pane.
3. Choose one of the following options for creating the job:
• Visual with a blank canvas – To create a job starting with an empty canvas
• Visual with a source and target – To create a job starting with a source node, or with a source,
transform, and target node
You then choose the data source type. You can also choose the data target type, or you can choose
the Choose later option from the Target drop-down list to start with only a data source node in
the graph.
• Spark script editor – For those familiar with programming and writing ETL scripts, choose this
option to create a new Spark ETL job. You then have the option of writing Python or Scala code
in a script editor window, or uploading an existing script from a local file. If you choose to use the
script editor, you can't use the visual job editor to design or edit your job.
A Spark job is run in an Apache Spark environment managed by AWS Glue. By default, new scripts
are coded in Python. To write a new Scala script, see Creating and editing Scala scripts in AWS
Glue Studio (p. 77).
• Python Shell script editor – For those familiar with programming and writing ETL scripts, choose
this option to create a new Python shell job. You write code in a script editor window starting with
a template (boilerplate), or you can upload an existing script from a local file. If you choose to use
the Python shell editor, you can't use the visual job editor to design or edit your job.
A Python shell job runs Python scripts as a shell and supports a Python version that depends on
the AWS Glue version you choose for the job. You can use these jobs to schedule and run tasks
that don't require an Apache Spark environment.
• Jupyter Notebook – For those familiar with programming and writing ETL scripts, choose this
option to create a new Python or Scala job script using a notebook interface based on Jupyter
notebook. You write code in a notebook. If you choose to use the notebook interface to create
your job, you can't use the visual job editor to design or edit your job.
You can also use a command line interface to easily configure a notebook for authoring jobs.
4. Choose Create to create a job in the editing interface that you selected.
5. If you chose the Jupyter notebook option, the Create job in Jupyter notebook page appears instead
of the job editor interface. You must provide additional information before creating a notebook
authoring session. For more information about how to specify this information, see Getting started
with notebooks in AWS Glue Studio (p. 2).
Create jobs that use a connector
For detailed instructions, see the section called “Authoring jobs with custom connectors” (p. 85).
Next steps for creating a job in AWS Glue Studio
The next steps for creating and managing your jobs are covered in the following sections.
Accessing the job diagram editor
Topics
• Accessing the job diagram editor (p. 47)
• Job editor features (p. 47)
• Editing the data source node (p. 50)
• Editing the data transform node (p. 55)
• Configuring data target nodes (p. 73)
• Editing or uploading a job script (p. 76)
• Adding nodes to the job diagram (p. 79)
• Changing the parent nodes for a node in the job diagram (p. 79)
• Deleting nodes from the job diagram (p. 80)
• Choose Jobs in the console navigation pane. On the Jobs page, locate the job in the Your jobs list. You
can then either:
• Choose the name of the job in the Name column to open the job editor for that job.
• Choose the job, and then choose Edit job from the Actions list.
• Choose Monitoring in the console navigation pane. On the Monitoring page, locate the job in the Job
runs list. You can filter the rows in the Job runs list, as described in Job runs view (p. 101). Choose
the job you want to edit, and then choose View job from the Actions menu.
• A visual diagram of your job, with a node for each job task: Data source nodes for reading the data;
transform nodes for modifying the data; data target nodes for writing the data.
You can view and configure the properties of each node in the job diagram. You can also view the
schema and sample data for each node in the job diagram. These features help you to verify that your
job is modifying and transforming the data in the right way, without having to run the job.
• A Script viewing and editing tab, where you can modify the code generated for your job.
• A Job details tab, where you can configure a variety of settings to customize the environment in which
your AWS Glue ETL job runs.
• A Runs tab, where you can view the current and previous runs of the job, view the status of the job run,
and access the logs for the job run.
• A Schedules tab, where you can configure the start time for your job, or set up recurring job runs.
Using schema previews in the visual job editor
Before you can see the schema, the job editor needs permissions to access the data source. You can
specify an IAM role on the Job details tab of the editor or on the Output schema tab for a node. If the
IAM role has all the necessary permissions to access the data source, you can then view the schema on
the Output schema tab for a node.
Before you can see the data sample, the job editor needs permissions to access the data source. The first
time you choose the Data preview tab, you are prompted to choose an IAM role to use. This can be the
same role that you plan to use for your job, or it can be a different role. The IAM role you choose must
have the necessary permissions to create the data previews.
After you choose an IAM role, it takes about 20 to 30 seconds before the data appears. You are charged
for data preview usage as soon as you choose the IAM role. The following features help you when
viewing the data.
• Choose the settings icon (a gear symbol) to configure your preferences for data previews. You can
change the sample size or you can choose to wrap the text from one line to the next. These settings
apply to all nodes in the job diagram.
• Choose the Previewing x of y fields button to select which columns (fields) to preview. When you
preview your data using the default settings, the job editor shows the first 5 columns of your dataset.
You can change this to show all or none (not recommended).
• You can scroll through the data preview window both horizontally and vertically.
• Use the split/whole screen button to expand the Data preview tab to the entire screen (overlaying the
job graph), to better view the data and data structures.
Data previews help you create and test your job, without having to repeatedly run the job.
• You can test an IAM role to make sure you have access to your data sources or data targets.
• You can check that the transform is modifying the data in the intended way. For example, if you use a
Filter transform, you can make sure that the filter is selecting the right subset of data.
• If your dataset contains columns with values of multiple types, the data preview shows a list of
tuples for these columns. Each tuple contains the data type and its value.
Restrictions when using data previews
• The first time you choose the Data preview tab you must choose an IAM role. This role must have the
necessary permissions to access the data and other resources needed to create the data previews.
• After you provide an IAM role, it takes a while before the data is available for viewing. For datasets
with less than 1 GB of data, it can take up to one minute. If you have a large dataset, you should
use partitions to improve the loading time. Loading data directly from Amazon S3 has the best
performance.
• If you have a very large dataset, and it takes more than 30 minutes to query the data for the data
preview, the request will time out. You can reduce the dataset size to use data previews.
• By default, you see the first 5 columns in the Data preview tab. If the columns have no data values, you
will get a message that there is no data to display. You can increase the number of rows sampled, or
select different columns to see data values.
• Data previews are currently not supported for streaming data sources, or for data sources that use
custom connectors.
• Errors on one node affect the entire job. If any one node has an error with data previews, the error will
show up on all nodes until you correct it.
• If you change a data source for the job, then the child nodes of that data source might need to be
updated to match the new schema. For example, if you have an ApplyMapping node that modifies a
column, and the column does not exist in the replacement data source, you will need to update the
ApplyMapping transform node.
• If you view the Data preview tab for a SQL query transform node, and the SQL query uses an incorrect
field name, the Data preview tab shows an error.
There are two forms of code generated by AWS Glue Studio: the original, or Classic version, and a newer,
streamlined version. By default, the new code generator is used to create the job script. You can generate
a job script using the classic code generator on the Script tab by choosing the Generate classic script toggle
button.
The new version of the generated code differs from the classic version in several ways. New features in
AWS Glue Studio require the new version of code generation, and do not work with classic scripts. You
are prompted to update these jobs when you attempt to run them.
Editing the data source node
• Name: (Optional) Enter a name to associate with the node in the job diagram. This name should
be unique among all the nodes for this job.
• Node type: The node type determines the action that is performed by the node. In the list of
options for Node type, choose one of the values listed under the heading Data source.
4. Configure the Data source properties information. For more information, see the following sections:
• Using Data Catalog tables for the data source (p. 51)
• Using a connector for the data source (p. 51)
• Using files in Amazon S3 for the data source (p. 51)
• Using a streaming data source (p. 53)
5. (Optional) After configuring the node properties and data source properties, you can view the
schema for your data source by choosing the Output schema tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. If you have not specified an IAM role on the Job details tab, you are prompted to enter an
IAM role here.
6. (Optional) After configuring the node properties and data source properties, you can preview the
dataset from your data source by choosing the Data preview tab in the node details panel. The first
time you choose this tab for any node in your job, you are prompted to provide an IAM role to access
the data. There is a cost associated with using this feature, and billing starts as soon as you provide
an IAM role.
Using Data Catalog tables for the data source
• S3 source type: (For Amazon S3 data sources only) Choose the option Select a Catalog table to
use an existing AWS Glue Data Catalog table.
• Database: Choose the database in the Data Catalog that contains the source table you want to use
for this job. You can use the search field to search for a database by its name.
• Table: Choose the table associated with the source data from the list. This table must already exist
in the AWS Glue Data Catalog. You can use the search field to search for a table by its name.
• Partition predicate: (For Amazon S3 data sources only) Enter a Boolean expression based on
Spark SQL that includes only the partitioning columns. For example: "(year=='2020' and
month=='04')"
• Temporary directory: (For Amazon Redshift data sources only) Enter a path for the location of a
working directory in Amazon S3 where your ETL job can write temporary intermediate results.
• Role associated with the cluster: (For Amazon Redshift data sources only) Enter a role for your
ETL job to use that contains permissions for Amazon Redshift clusters. For more information, see
the section called “Data source and data target permissions” (p. 34).
Using files in Amazon S3 for the data source
If you use an Amazon S3 bucket as your data source, AWS Glue Studio detects the schema of the data
at the specified location from one of the files, or by using the file you specify as a sample file. Schema
detection occurs when you use the Infer schema button. If you change the Amazon S3 location or the
sample file, then you must choose Infer schema again to perform the schema detection using the new
information.
To configure a data source node that reads directly from files in Amazon S3
• S3 source type: (For Amazon S3 data sources only) Choose the option S3 location.
• S3 URL: Enter the path to the Amazon S3 bucket, folder, or file that contains the data for your job.
You can choose Browse S3 to select the path from the locations available to your account.
• Recursive: Choose this option if you want AWS Glue Studio to read data from files in child folders
at the S3 location.
If the child folders contain partitioned data, AWS Glue Studio doesn't add any partition
information that's specified in the folder names to the Data Catalog. For example, consider the
following folders in Amazon S3:
S3://sales/year=2019/month=Jan/day=1
S3://sales/year=2019/month=Jan/day=2
If you choose Recursive and select the sales folder as your S3 location, then AWS Glue Studio
reads the data in all the child folders, but doesn't create partitions for year, month or day.
• Data format: Choose the format that the data is stored in. You can choose JSON, CSV, or Parquet.
The value you select tells the AWS Glue job how to read the data from the source file.
Note
If you don't select the correct format for your data, AWS Glue Studio might infer the
schema correctly, but the job won't be able to correctly parse the data from the source
file.
You can enter additional configuration options, depending on the format you choose.
• JSON (JavaScript Object Notation)
• JsonPath: Enter a JSON path that points to an object that is used to define a table schema.
JSON path expressions always refer to a JSON structure in the same way as XPath expressions
are used in combination with an XML document. The "root member object" in the JSON path
is always referred to as $, even if it's an object or array. The JSON path can be written in dot
notation or bracket notation.
For more information about the JSON path, see JsonPath on the GitHub website.
• Records in source files can span multiple lines: Choose this option if a single record can span
multiple lines in the JSON file.
• CSV (comma-separated values)
• Delimiter: Enter a character to denote what separates each column entry in the row, for
example, ; or ,.
• Escape character: Enter a character that is used as an escape character. This character
indicates that the character that immediately follows the escape character should be taken
literally, and should not be interpreted as a delimiter.
• Quote character: Enter the character that is used to group separate strings into a single
value. For example, you would choose Double quote (") if you have values such as "This is
a single value" in your CSV file.
• Records in source files can span multiple lines: Choose this option if a single record can span
multiple lines in the CSV file.
• First line of source file contains column headers: Choose this option if the first row in the
CSV file contains column headers instead of data.
• Parquet (Apache Parquet columnar storage)
There are no additional settings to configure for data stored in Parquet format.
• Partition predicate: To partition the data that is read from the data source, enter a Boolean
expression based on Spark SQL that includes only the partitioning columns. For example:
"(year=='2020' and month=='04')"
• Advanced options: Expand this section if you want AWS Glue Studio to detect the schema of your
data based on a specific file.
• Schema inference: Choose the option Choose a sample file from S3 if you want to use a
specific file instead of letting AWS Glue Studio choose a file.
• Auto-sampled file: Enter the path to the file in Amazon S3 to use for inferring the schema.
If you're editing a data source node and change the selected sample file, choose Reload schema to
detect the schema by using the new sample file.
4. Choose the Infer schema button to detect the schema from the sources files in Amazon S3. If you
change the Amazon S3 location or the sample file, you must choose Infer schema again to infer the
schema using the new information.
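For reference, a data source node configured this way corresponds roughly to a create_dynamic_frame call in the generated script. The following sketch assumes a hypothetical bucket path and a CSV file with column headers; the exact options your job uses depend on the settings you chose above.
# Sketch of reading CSV files directly from Amazon S3 into a DynamicFrame.
# The S3 path is a placeholder; format_options mirror the CSV settings described above.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

flights_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-example-bucket/flight-data/"],  # placeholder location
        "recurse": True,  # read files in child folders, like the Recursive option
    },
    format="csv",
    format_options={
        "withHeader": True,  # first line of the source file contains column headers
        "separator": ",",    # delimiter
        "quoteChar": '"',    # quote character
    },
    transformation_ctx="flights_source",
)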
Using a streaming data source
Kinesis
• Kinesis source type: Choose the option Stream details to use direct access to the streaming
source or choose Data Catalog table to use the information stored there instead.
AWS Glue Studio automatically detects the schema from the streaming data.
If you choose Data Catalog table, specify the following additional information.
• Database: (Optional) Choose the database in the AWS Glue Data Catalog that contains the
table associated with your streaming data source. You can use the search field to search for
a database by its name.
• Table: (Optional) Choose the table associated with the source data from the list. This table
must already exist in the AWS Glue Data Catalog. You can use the search field to search for
a table by its name.
• Detect schema: Choose this option to have AWS Glue Studio detect the schema from the
streaming data, rather than using the schema information in a Data Catalog table. This
option is enabled automatically if you choose the Stream details option.
• Starting position: By default, the ETL job uses the Earliest option, which means it reads
data starting with the oldest available record in the stream. You can instead choose Latest,
which indicates the ETL job should start reading from just after the most recent record in the
stream.
• Window size: By default, your ETL job processes and writes out data in 100-second windows.
This allows data to be processed efficiently and permits aggregations to be performed on
data that arrives later than expected. You can modify this window size to increase timeliness
or aggregation accuracy.
AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that
has been read.
• Connection options: Expand this section to add key-value pairs to specify additional
connection options. For information about what options you can specify here, see
"connectionType": "kinesis" in the AWS Glue Developer Guide.
Kafka
• Apache Kafka source: Choose the option Stream details to use direct access to the streaming
source or choose Data Catalog table to use the information stored there instead.
If you choose Data Catalog table, specify the following additional information.
• Database: (Optional) Choose the database in the AWS Glue Data Catalog that contains the
table associated with your streaming data source. You can use the search field to search for
a database by its name.
• Table: (Optional) Choose the table associated with the source data from the list. This table
must already exist in the AWS Glue Data Catalog. You can use the search field to search for
a table by its name.
• Detect schema: Choose this option to have AWS Glue Studio detect the schema from the
streaming data, rather than using the schema information in a Data Catalog table. This
option is enabled automatically if you choose the Stream details option.
AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that
has been read.
• Connection options: Expand this section to add key-value pairs to specify additional
connection options. For information about what options you can specify here, see
"connectionType": "kafka" in the AWS Glue Developer Guide.
Note
Data previews are not currently supported for streaming data sources.
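A Kafka source configured with Stream details translates to a similar set of connection options. The sketch below uses placeholder values for the connection name and topic; confirm the option names against "connectionType": "kafka" in the AWS Glue Developer Guide.
# Sketch of an Apache Kafka streaming source. Assumes glueContext is the job's
# GlueContext; the connection name and topic are placeholders.
kafka_frame = glueContext.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "my-kafka-connection",  # AWS Glue connection to the cluster
        "topicName": "flight-events",             # Kafka topic to read from
        "classification": "json",
        "startingOffsets": "earliest",
    },
    transformation_ctx="kafka_frame",
)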
Editing the data transform node
In the pre-populated diagram for a job, between the data source and data target nodes is the Transform
- ApplyMapping node. You can configure this transform node to modify your data, or you can use
additional transforms.
The following built-in transforms are available with AWS Glue Studio:
• ApplyMapping (p. 55): Map data property keys in the data source to data property keys in the data
target. You can rename keys, modify the data types for keys, and choose which keys to drop from the
dataset.
• SelectFields (p. 56): Choose the data property keys that you want to keep.
• DropFields (p. 57): Choose the data property keys that you want to drop.
• RenameField (p. 57): Rename a single data property key.
• Spigot (p. 58): Write samples of the data to an Amazon S3 bucket.
• Join (p. 58): Join two datasets into one dataset using a comparison phrase on the specified data
property keys. You can use inner, outer, left, right, left semi, and left anti joins.
• SplitFields (p. 60): Split data property keys into two DynamicFrames. Output is a collection of
DynamicFrames: one with selected data property keys, and one with the remaining data property
keys.
• SelectFromCollection (p. 60): Choose one DynamicFrame from a collection of DynamicFrames.
The output is the selected DynamicFrame.
• FillMissingValues (p. 62): Locate records in the dataset that have missing values and add a new
field with a suggested value that is determined by imputation.
• Filter (p. 62): Split a dataset into two, based on a filter condition.
• DropNullFields (p. 63): Removes columns from the dataset if all values in the column are
‘null’.
• SQL (p. 64): Enter SparkSQL code into a text entry field to use a SQL query to transform the data.
The output is a single DynamicFrame.
• Aggregate (p. 66): Performs a calculation (such as average, sum, min, max) on selected fields and
rows, and creates a new field with the newly calculated value(s).
• Custom transform (p. 68): Enter code into a text entry field to use custom transforms. The output
is a collection of DynamicFrames.
You can add additional ApplyMapping nodes to the job diagram as needed – for example, to modify
additional data sources or following a Join transform.
Note
The ApplyMapping transform is not case-sensitive.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
ApplyMapping to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Modify the input schema:
• To rename a data property key, enter the new name of the key in the Target key field.
• To change the data type for a data property key, choose the new data type for the key from the
Data type list.
• To remove a data property key from the target schema, choose the Drop check box for that key.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
Using SelectFields to remove most data property keys
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SelectFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is
not already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Under the heading SelectFields, choose the data property keys in the dataset that you want to keep.
Any data property keys not selected are dropped from the dataset.
You can also choose the check box next to the column heading Field to automatically choose all the
data property keys in the dataset. Then you can deselect individual data property keys to remove
them from the dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
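In the generated script, this node becomes a SelectFields call that keeps only the listed property keys. The parent frame and field names in the sketch below are illustrative.
# Sketch of a SelectFields transform: keep only the listed keys, drop everything else.
# ApplyMapping_node is the parent node's DynamicFrame from earlier in the script.
from awsglue.transforms import SelectFields

selected = SelectFields.apply(
    frame=ApplyMapping_node,
    paths=["flight_date", "carrier", "dest"],  # illustrative keys to keep
    transformation_ctx="selected",
)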
Using DropFields to keep most data property keys
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
DropFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Under the heading DropFields, choose the data property keys to drop from the data source.
You can also choose the check box next to the column heading Field to automatically choose all the
data property keys in the dataset. Then you can deselect individual data property keys so they are
retained in the dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
RenameField to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Under the heading Data field, choose a property key from the source data and then enter a new
name in the New field name field.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
Using Spigot to sample your dataset
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Spigot
to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab in the node details panel.
4. Enter an Amazon S3 path or choose Browse S3 to choose a location in Amazon S3. This is the
location where the job writes the JSON file that contains the data sample.
5. Enter information for the sampling method. You can specify a value for Number of records to write
starting from the beginning of the dataset and a Probability threshold (entered as a decimal value
with a maximum value of 1) of picking any given record.
For example, to write the first 50 records from the dataset, you would set Number of records to 50
and Probability threshold to 1 (100%).
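In a script, the same sampling can be expressed with the Spigot transform. The S3 path below is a placeholder, and the option names (topk for the number of records, prob for the probability threshold) are assumptions based on typical Glue scripts.
# Sketch of a Spigot transform that writes a small JSON sample of the dataset to Amazon S3.
# "selected" is a DynamicFrame produced by an earlier node in the script.
from awsglue.transforms import Spigot

Spigot.apply(
    frame=selected,
    path="s3://my-example-bucket/samples/",  # placeholder output location
    options={"topk": 50, "prob": 1.0},       # first 50 records, 100% pick probability
)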
Joining datasets
The Join transform allows you to combine two datasets into one. You specify the key names in the
schema of each dataset to compare. The output DynamicFrame contains rows where keys meet the join
condition. The rows in each dataset that meet the join condition are combined into a single row in the
output DynamicFrame that contains all the columns found in either dataset.
1. If there is only one data source available, you must add a new data source node to the job diagram.
See Adding nodes to the job diagram (p. 79) for details.
2. Choose one of the source nodes for the join. Choose Transform in the toolbar at the top of the
visual editor, and then choose Join to add a new transform to your job diagram.
3. On the Node properties tab, enter a name for the node in the job diagram.
4. In the Node properties tab, under the heading Node parents, add a parent node so that there are
two datasets providing inputs for the join. The parent can be a data source node or a transform
node.
Note
A join can have only two parent nodes.
5. Choose the Transform tab.
If you see a message indicating that there are conflicting key names, you can either:
• Choose Resolve it to automatically add an ApplyMapping transform node to your job diagram. The
ApplyMapping node adds a prefix to any keys in the dataset that have the same name as a key in
the other dataset. For example, if you use the default value of right, then any keys in the right
dataset that have the same name as a key in the left dataset will be renamed to (right)key
name.
• Manually add a transform node earlier in the job diagram to remove or rename the conflicting
keys.
6. Choose the type of join in the Join type list.
• Inner join: Returns a row with columns from both datasets for every match based on the join
condition. Rows that don't satisfy the join condition aren't returned.
• Left join: All rows from the left dataset and only the rows from the right dataset that satisfy the
join condition.
• Right join: All rows from the right dataset and only the rows from the left dataset that satisfy the
join condition.
• Outer join: All rows from both datasets.
• Left semi join: All rows from the left dataset that have a match in the right dataset based on the
join condition.
• Left anti join: All rows in the left dataset that don't have a match in the right dataset based on
join condition.
7. On the Transform tab, under the heading Join conditions, choose Add condition. Choose a
property key from each dataset to compare. Property keys on the left side of the comparison
operator are referred to as the left dataset and property keys on the right are referred to as the right
dataset.
For more complex join conditions, you can add additional matching keys by choosing Add condition
more than once. If you accidentally add a condition, you can choose the delete icon ( ) to remove
it.
8. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
9. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
For an example of the join output schema, consider a join between two datasets that both contain id
and hire_date property keys. The join is configured to match on the id and hire_date keys using the =
comparison operator. Because both datasets contain id and hire_date keys, you chose Resolve it to
automatically add the prefix right to the keys in the right dataset.
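For an inner equijoin like this one, the generated script typically uses the Join transform; other join types may instead be expressed with Spark DataFrame joins. The following sketch assumes two parent DynamicFrames named left_frame and right_frame.
# Sketch of an inner join on the id and hire_date keys of two parent DynamicFrames.
from awsglue.transforms import Join

joined = Join.apply(
    frame1=left_frame,    # left dataset
    frame2=right_frame,   # right dataset
    keys1=["id", "hire_date"],
    keys2=["id", "hire_date"],
    transformation_ctx="joined",
)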
Using SplitFields to split a dataset into two
The SplitFields transform is case sensitive. Add an ApplyMapping transform as a parent node if you need
case-insensitive property key names.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SplitFields to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Choose which property keys you want to put into the first dataset. The keys that you do not choose
are placed in the second dataset.
5. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
7. Configure a SelectFromCollection transform node to process the resulting datasets.
Using SelectFromCollection to choose which dataset to keep
You must use this transform after you use a transform that creates a collection of DynamicFrames, such
as the SplitFields transform or a Custom transform.
If you don't add a SelectFromCollection transform node to your job diagram after any of these
transforms, you will get an error for your job.
The parent node for this transform must be a node that returns a collection of DynamicFrames. If you
choose a parent for this transform node that returns a single DynamicFrame, such as a Join transform,
your job returns an error.
Similarly, if you use a SelectFromCollection node in your job diagram as the parent for a transform that
expects a single DynamicFrame as input, your job returns an error.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
SelectFromCollection to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Under the heading Frame index, choose the array index number that corresponds to the
DynamicFrame you want to select from the collection of DynamicFrames.
For example, if the parent node for this transform is a SplitFields transform, on the Output
schema tab of that node you can see the schema for each DynamicFrame. If you want to keep the
DynamicFrame associated with the schema for Output 2, you would select 1 for the value of Frame
index, which is the second value in the list.
6. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
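Together, SplitFields and SelectFromCollection look roughly like the following in a script: SplitFields returns a collection of two DynamicFrames, and SelectFromCollection picks one of them by key. The frame and field names are illustrative.
# Sketch: split selected keys into their own DynamicFrame, then pick one frame from
# the resulting collection. ApplyMapping_node is the parent DynamicFrame.
from awsglue.transforms import SplitFields, SelectFromCollection

split_collection = SplitFields.apply(
    frame=ApplyMapping_node,
    paths=["carrier", "dest"],   # keys placed in the first output frame
    name1="selected_fields",
    name2="remaining_fields",
    transformation_ctx="split_collection",
)

# Frame index 1 corresponds to the second DynamicFrame in the collection.
remaining = SelectFromCollection.apply(
    dfc=split_collection,
    key=list(split_collection.keys())[1],
    transformation_ctx="remaining",
)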
Find and fill missing values in a dataset
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose
FillMissingValues to add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. For Data field, choose the column or field name from the source data that you want to analyze for
missing values.
5. (Optional) In the New field name field, enter a name for the field added to each record that will
hold the estimated replacement value for the analyzed field. If the analyzed field doesn't have a
missing value, the value in the analyzed field is copied into the new field.
If you don't specify a name for the new field, the default name is the name of the analyzed column
with _filled appended. For example, if you enter Age for Data field and don't specify a value for
New field name, a new field named Age_filled is added to each record.
6. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
7. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
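In generated scripts, this transform is usually imported from the awsglueml.transforms module; treat the module path and argument names in the sketch below as assumptions to verify for your AWS Glue version. It uses the Age example described above.
# Sketch of FillMissingValues: impute missing values in the Age column and write the
# result to a new Age_filled column. ApplyMapping_node is the parent DynamicFrame.
from awsglueml.transforms import FillMissingValues

filled = FillMissingValues.apply(
    frame=ApplyMapping_node,
    missing_values_column="Age",
    output_column="Age_filled",
    transformation_ctx="filled",
)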
The Filter transform filters the dataset based on one or more filter conditions:
• For string data types, you can filter rows where the key value matches a specified string.
• For numeric data types, you can filter rows by comparing the key value to a specified value using the
comparison operators <, >, =, !=, <=, and >=.
If you specify multiple filter conditions, the results are combined using an AND operator by default, but
you can choose OR instead.
The Filter transform is case sensitive. Add an ApplyMapping transform as a parent node if you need case-
insensitive property key names.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Filter to
add a new transform to your job diagram, if needed.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent isn't
already selected, then choose a node from the Node parents list to use as the input source for the
transform.
3. Choose the Transform tab.
4. Choose either Global AND or Global OR. This determines how multiple filter conditions are
combined. All conditions are combined using either AND or OR operations. If you have only a single
filter condition, then you can choose either one.
5. Choose the Add condition button in the Filter condition section to add a filter condition.
In the Key field, choose a property key name from the dataset. In the Operation field, choose the
comparison operator. In the Value field, enter the comparison value. Here are some examples of
filter conditions:
When you filter on string values, make sure that the comparison value uses a regular expression
format that matches the script language selected in the job properties (Python or Scala).
6. Add additional filter conditions, as needed.
7. (Optional) After configuring the transform node properties, you can view the modified schema for
your data by choosing the Output schema tab in the node details panel. The first time you choose
this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you
have not specified an IAM role on the Job details tab, you are prompted to enter an IAM role here.
8. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
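In code, a Filter node corresponds to a Filter.apply call that takes a predicate function. The sketch below keeps rows where a hypothetical year key equals 2020 and month is at most 6.
# Sketch of a Filter transform: keep only rows that satisfy the predicate.
# ApplyMapping_node is the parent DynamicFrame from earlier in the script.
from awsglue.transforms import Filter

filtered = Filter.apply(
    frame=ApplyMapping_node,
    f=lambda row: row["year"] == 2020 and row["month"] <= 6,  # Global AND of two conditions
    transformation_ctx="filtered",
)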
Using DropNullFields to remove fields with null values
You can choose which values should be treated as null values when deciding whether to remove a field:
• Empty String ("" or '') - fields that contain empty strings will be removed
• "null string" - fields that contain the string with the word 'null' will be removed
• -1 integer - fields that contain a -1 (negative one) integer will be removed
3. If needed, you can also specify custom null values. These are null values that may be unique to your
dataset. To add a custom null value, choose Add new value.
4. Enter the custom null value. For example, this can be zero, or any value that is being used to represent
a null in the dataset.
5. Choose the data type in the drop-down field. Data types can either be String or Integer.
Note
Custom null values and their data types must match exactly in order for the fields to be
recognized as null values and the fields removed. Partial matches, where only the custom
null value matches but the data type does not, will not result in the fields being removed.
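The basic form of this transform in a script is a single DropNullFields.apply call, as in the sketch below. How custom null values are represented in generated code may vary, so this shows only the default behavior.
# Sketch of DropNullFields: remove columns whose values are all null.
# ApplyMapping_node is the parent DynamicFrame from earlier in the script.
from awsglue.transforms import DropNullFields

without_nulls = DropNullFields.apply(
    frame=ApplyMapping_node,
    transformation_ctx="without_nulls",
)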
Using a SQL query to transform data
A SQL transform node can have multiple datasets as inputs, but produces only a single dataset as output.
It contains a text field, where you enter the Apache SparkSQL query. You can assign aliases to each
dataset used as input, to help simplify the SQL query. For more information about the SQL syntax, see the
Spark SQL documentation.
Note
If you use a Spark SQL transform with a data source located in a VPC, add an AWS Glue VPC
endpoint to the VPC that contains the data source. For more information about configuring
development endpoints, see Adding a Development Endpoint, Setting Up Your Environment for
Development Endpoints, and Accessing Your Development Endpoint in the AWS Glue Developer
Guide.
1. (Optional) Add a transform node to the job diagram, if needed. Choose Spark SQL for the node
type.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, or if you want multiple inputs for the SQL transform, choose a node from the Node
parents list to use as the input source for the transform. Add additional parent nodes as needed.
3. Choose the Transform tab in the node details panel.
4. The source datasets for the SQL query are identified by the names you specified in the Name field
for each node. If you do not want to use these names, or if the names are not suitable for a SQL
query, you can associate a name to each dataset. The console provides default aliases, such as
MyDataSource.
For example, if a parent node for the SQL transform node is named Rename Org PK field, you
might associate the name org_table with this dataset. This alias can then be used in the SQL query
in place of the node name.
5. In the text entry field under the heading Code block, paste or enter the SQL query. The text field
displays SQL syntax highlighting and keyword suggestions.
6. With the SQL transform node selected, choose the Output schema tab, and then choose Edit.
Provide the columns and data types that describe the output fields of the SQL query.
Specify the schema using the following actions in the Output schema section of the page:
• To rename a column, place the cursor in the Key text box for the column (also referred to as a field
or property key) and enter the new name.
• To change the data type for a column, select the new data type for the column from the drop-
down list.
• To add a new top-level column to the schema, choose the Overflow ( ) button, and then choose
Add root key. New columns are added at the top of the schema.
• To remove a column from the schema, choose the delete icon ( ) to the far right of the Key
name.
7. When you finish specifying the output schema, choose Apply to save your changes and exit the
schema editor. If you do not want to save your changes, choose Cancel to exit the schema editor.
8. (Optional) After configuring the node properties and transform properties, you can preview the
modified dataset by choosing the Data preview tab in the node details panel. The first time you
choose this tab for any node in your job, you are prompted to provide an IAM role to access the data.
There is a cost associated with using this feature, and billing starts as soon as you provide an IAM
role.
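One way to express the same operation in code is to register each input DynamicFrame as a temporary view under its alias and run the query with Spark SQL, as in the following sketch. The alias org_table and the query itself are illustrative, and RenameOrgPK_node stands in for the parent node's DynamicFrame.
# Sketch of a SQL transform: expose a parent DynamicFrame under the alias org_table,
# run a SparkSQL query, and wrap the result back into a DynamicFrame.
from awsglue.dynamicframe import DynamicFrame

spark = glueContext.spark_session  # glueContext comes from the job's boilerplate

RenameOrgPK_node.toDF().createOrReplaceTempView("org_table")
result_df = spark.sql("SELECT org_id, org_name FROM org_table WHERE org_id IS NOT NULL")
sql_result = DynamicFrame.fromDF(result_df, glueContext, "sql_result")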
Using Aggregate to perform summary calculations on selected fields
When fields are selected, the name and datatype are shown. To remove a field, choose 'X' on the
field.
Creating a custom transformation
When using custom code, you must use a schema editor to indicate the changes made to the output
through the custom code. When editing the schema, you can perform the following actions:
You must use a SelectFromCollection transform to choose a single DynamicFrame from the result of your
Custom transform node before you can send the output to a target location.
Use the following tasks to add a custom transform node to your job diagram.
1. (Optional) Choose Transform in the toolbar at the top of the visual editor, and then choose Custom
transform to add a custom transform to your job diagram.
2. On the Node properties tab, enter a name for the node in the job diagram. If a node parent is not
already selected, or if you want multiple inputs for the custom transform, then choose a node from
the Node parents list to use as the input source for the transform.
1. With the custom transform node selected in the job diagram, choose the Transform tab.
2. In the text entry field under the heading Code block, paste or enter the code for the transformation.
The code that you use must match the language specified for the job on the Job details tab.
When referring to the input nodes in your code, AWS Glue Studio names the DynamicFrames
returned by the job diagram nodes sequentially based on the order of creation. Use one of the
following naming methods in your code:
• Classic code generation – Use functional names to refer to the nodes in your job diagram.
• Data source nodes: DataSource0, DataSource1, DataSource2, and so on.
• Transform nodes: Transform0, Transform1, Transform2, and so on.
• New code generation – Use the name specified on the Node properties tab of a node, appended
with '_node1', '_node2', and so on. For example, S3bucket_node1, ApplyMapping_node2,
S3bucket_node2, MyCustomNodeName_node1.
For more information about the new code generator, see Script code generation (p. 49).
The following examples show the format of the code to enter in the code box:
Python
The following example takes the first DynamicFrame received, converts it to a DataFrame to apply
the native filter method (keeping only records that have over 1000 votes), then converts it back to a
DynamicFrame before returning it.
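A Python version of the transform might look like the following sketch. The function name MyTransform and the collection key CustomTransform0 follow the naming pattern AWS Glue Studio uses for custom transforms and are assumptions here.
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first DynamicFrame in the incoming collection and convert it to a DataFrame.
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Keep only records that have more than 1000 votes.
    df_filtered = df.filter(df["vote_count"] > 1000)
    # Convert back to a DynamicFrame and return it in a collection.
    dyf_filtered = DynamicFrame.fromDF(df_filtered, glueContext, "filter_votes")
    return DynamicFrameCollection({"CustomTransform0": dyf_filtered}, glueContext)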
Scala
The following example takes the first DynamicFrame received, converts it to a DataFrame to apply
the native filter method (keeping only records that have over 1000 votes), then converts it back to a
DynamicFrame before returning it.
object FilterHighVoteCounts {
  def execute(glueContext : GlueContext, input : Seq[DynamicFrame]) : Seq[DynamicFrame] = {
    val frame = input(0).toDF()
    val filtered = DynamicFrame(frame.filter(frame("vote_count") > 1000), glueContext)
    Seq(filtered)
  }
}
A custom code node can have any number of parent nodes, each providing a DynamicFrame as
input for your custom code. A custom code node returns a collection of DynamicFrames. Each
DynamicFrame that is used as input has an associated schema. You must add a schema that describes
each DynamicFrame returned by the custom code node.
Note
When you set your own schema on a custom transform, AWS Glue Studio does not inherit
schemas from previous nodes. To update the schema, select the Custom transform node, then
choose the Data preview tab. Once the preview is generated, choose 'Use Preview Schema'. The
schema will then be replaced by the schema using the preview data.
1. With the custom transform node selected in the job diagram, in the node details panel, choose the
Output schema tab.
2. Choose Edit to make changes to the schema.
If you have nested data property keys, such as an array or object, you can choose the Expand-Rows
icon ( ) on the top right of each schema panel to expand the list of child data property keys. After
you choose this icon, it changes to the Collapse-Rows icon ( ), which you can choose to collapse
the list of child property keys.
3. Modify the schema using the following actions in the section on the right side of the page:
• To rename a property key, place the cursor in the Key text box for the property key, then enter the
new name.
• To change the data type for a property key, use the list to choose the new data type for the
property key.
• To add a new top-level property key to the schema, choose the Overflow ( ) icon to the left of
the Cancel button, and then choose Add root key.
• To add a child property key to the schema, choose the Add-Key icon associated with the parent
key. Enter a name for the child key and choose the data type.
• To remove a property key from the schema, choose the Remove icon ( ) to the far right of the
key name.
4. If your custom transform code uses multiple DynamicFrames, you can add additional output
schemas.
• To add a new, empty schema, choose the Overflow ( ) icon, and then choose Add output
schema.
• To copy an existing schema to a new output schema, make sure the schema you want to copy is
displayed in the schema selector. Choose the Overflow ( ) icon, and then choose Duplicate.
If you want to remove an output schema, make sure the schema you want to copy is displayed in the
schema selector. Choose the Overflow ( ) icon, and then choose Delete.
5. Add new root keys to the new schema or edit the duplicated keys.
6. When you are modifying the output schemas, choose the Apply button to save your changes and
exit the schema editor.
If you do not want to save your changes, choose the Cancel button.
1. Add a SelectFromCollection transform node, which has the custom transform node as its
parent node. Update this transform to indicate which dataset you want to use. See Using
SelectFromCollection to choose which dataset to keep (p. 61) for more information.
2. Add additional SelectFromCollection transforms to the job diagram if you want to use additional
DynamicFrames produced by the custom transform node.
Consider a scenario in which you add a custom transform node to split a flight dataset into multiple
datasets, but duplicate some of the identifying property keys in each output schema, such as the
flight date or flight number. You add a SelectFromCollection transform node for each output schema,
with the custom transform node as its parent.
3. (Optional) You can then use each SelectFromCollection transform node as input for other nodes in
the job, or as a parent for a data target node.
Configuring data target nodes
• S3 – The job writes the data in a file in the Amazon S3 location you choose and in the format you
specify.
If you configure partition columns for the data target, then the job writes the dataset to Amazon S3
into directories based on the partition key.
• AWS Glue Data Catalog – The job uses the information associated with the table in the Data Catalog
to write the output data to a target location.
You can create the table manually or with the crawler. You can also use AWS CloudFormation
templates to create tables in the Data Catalog.
• A connector – A connector is a piece of code that facilitates communication between your data
store and AWS Glue. The job uses the connector and associated connection to write the output data
to a target location. You can either subscribe to a connector offered in AWS Marketplace, or you
can create your own custom connector. For more information, see Adding connectors to AWS Glue
Studio (p. 82)
You can choose to update the Data Catalog when your job writes to an Amazon S3 data target. Instead
of requiring a crawler to update the Data Catalog when the schema or partitions change, this option
makes it easy to keep your tables up to date. This option simplifies the process of making your data
available for analytics by optionally adding new tables to the Data Catalog, updating table partitions,
and updating the schema of your tables directly from the job.
Editing the data target node
1. (Optional) If you need to add a target node, choose Target in the toolbar at the top of the visual
editor, and then choose either S3 or Glue Data Catalog.
• If you choose S3 for the target, then the job writes the dataset to one or more files in the Amazon
S3 location you specify.
• If you choose AWS Glue Data Catalog for the target, then the job writes to a location described by
the table selected from the Data Catalog.
2. Choose a data target node in the job diagram. When you choose a node, the node details panel
appears on the right-side of the page.
3. Choose the Node properties tab, and then enter the following information:
• Name: Enter a name to associate with the node in the job diagram.
• Node type: A value should already be selected, but you can change it as needed.
• Node parents: The parent node is the node in the job diagram that provides the output data you
want to write to the target location. For a pre-populated job diagram, the target node should
already have the parent node selected. If there is no parent node displayed, then choose a parent
node from the list.
• Format: Choose a format from the list. The available format types for the data results are:
• JSON: JavaScript Object Notation.
• CSV: Comma-separated values.
• Avro: Apache Avro JSON binary.
• Parquet: Apache Parquet columnar storage.
• Glue Parquet: A custom Parquet writer type that is optimized for DynamicFrames as the data
format. Instead of requiring a precomputed schema for the data, it computes and modifies the
schema dynamically.
• ORC: Apache Optimized Row Columnar (ORC) format.
To learn more about these format options, see Format Options for ETL Inputs and Outputs in AWS
Glue in the AWS Glue Developer Guide.
• Compression Type: You can choose to optionally compress the data using either the gzip or
bzip2 format. The default is no compression, or None.
• S3 Target Location: The Amazon S3 bucket and location for the data output. You can choose the
Browse S3 button to see the Amazon S3 buckets that you have access to and choose one as the
target destination.
• Data catalog update options
• Do not update the Data Catalog: (Default) Choose this option if you don't want the job to
update the Data Catalog, even if the schema changes or new partitions are added.
• Create a table in the Data Catalog and on subsequent runs, update the schema and add new
partitions: If you choose this option, the job creates the table in the Data Catalog on the first
run of the job. On subsequent job runs, the job updates the Data Catalog table if the schema
changes or new partitions are added.
You must also select a database from the Data Catalog and enter a table name.
• Create a table in the Data Catalog and on subsequent runs, keep existing schema and add
new partitions: If you choose this option, the job creates the table in the Data Catalog on the
first run of the job. On subsequent job runs, the job updates the Data Catalog table only to add
new partitions.
You must also select a database from the Data Catalog and enter a table name.
• Partition keys: Choose which columns to use as partitioning keys in the output. To add more
partition keys, choose Add a partition key.
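As an illustration of the S3 target and partition key options described above, the generated script typically writes the output with a call along the following lines. This is a sketch only; the bucket path, partition column, and node names are placeholders:

DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame=Transform0,  # the DynamicFrame produced by the parent node (placeholder name)
    connection_type="s3",
    connection_options={"path": "s3://<output-bucket>/<prefix>/",
        "partitionKeys": ["<partition_column>"]},
    format="parquet",
    transformation_ctx="DataSink0")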
To configure the data properties for a target that uses a Data Catalog table
• Database: Choose the database that contains the table you want to use as the target from the list.
This database must already exist in the Data Catalog.
• Table: Choose the table that defines the schema of your output data from the list. This table must
already exist in the Data Catalog.
A table in the Data Catalog consists of the names of columns, data type definitions, partition
information, and other metadata about the target dataset. Your job writes to a location described
by this table in the Data Catalog.
For more information about creating tables in the Data Catalog, see Defining Tables in the Data
Catalog in the AWS Glue Developer Guide.
• Data catalog update options
• Do not change table definition: (Default) Choose this option if you don't want the job to
update the Data Catalog, even if the schema changes, or new partitions are added.
• Update schema and add new partitions: If you choose this option, the job updates the Data
Catalog table if the schema changes or new partitions are added.
• Keep existing schema and add new partitions: If you choose this option, the job updates the
Data Catalog table only to add new partitions.
• Partition keys: Choose which columns to use as partitioning keys in the output. To add more
partition keys, choose Add a partition key.
Editing or uploading a job script
You can use the visual editor to edit job nodes only if the jobs were created with AWS Glue Studio. If the
job was created using the AWS Glue console, through API commands, or with the command line interface
(CLI), you can use the script editor in AWS Glue Studio to edit the job script, parameters, and schedule.
You can also edit the script for a job created in AWS Glue Studio by converting the job to script-only
mode.
1. If creating a new job, on the Jobs page, choose the Spark script editor option to create a Spark job
or choose the Python Shell script editor to create a Python shell job. You can either write a new
script, or upload an existing script. If you choose Spark script editor, you can write or upload either
a Scala or Python script. If you choose Python Shell script editor, you can only write or upload a
Python script.
After choosing the option to create a new job, in the Options section that appears, you can choose
to either start with a starter script (Create a new script with boilerplate code), or you can upload a
local file to use as the job script. A sketch of what a typical starter script looks like follows this procedure.
If you chose Spark script editor, you can upload either Python or Scala script files. Scala scripts
must have the file extension .scala. Python scripts must be recognized as files of type Python. If
you chose Python Shell script editor, you can upload only Python script files.
When you are finished making your choices, choose Create to create the job and open the visual
editor.
2. Go to the visual job editor for the new or saved job, and then choose the Script tab.
3. If you didn't create a new job using one of the script editor options, and you have never edited the
script for an existing job, the Script tab displays the heading Script (Locked). This means the script
editor is in read-only mode. Choose Edit script to unlock the script for editing.
To make the script editable, AWS Glue Studio converts your job from a visual job to a script-only job.
If you unlock the script for editing, you can't use the visual editor anymore for this job after you save
it.
In the confirmation window, choose Confirm to continue or Cancel to keep the job available for
visual editing.
If you choose Confirm, the Visual tab no longer appears in the editor. You can use AWS Glue Studio
to modify the script using the script editor, modify the job details or schedule, or view job runs.
Note
Until you save the job, the conversion to a script-only job is not permanent. If you refresh
the console web page, or close the job before saving it and reopen it in the visual editor, you
will still be able to edit the individual nodes in the visual editor.
4. Edit the script as needed.
When you are done editing the script, choose Save to save the job and permanently convert the job
from visual to script-only.
5. (Optional) You can download the script from the AWS Glue Studio console by choosing the
Download button on the Script tab. When you choose this button, a new browser window
opens, displaying the script from its location in Amazon S3. The Script filename and Script path
parameters in the Job details tab of the job determine the name and location of the script file in
Amazon S3.
When you save the job, AWS Glue saves the job script at the location specified by these fields. If you
modify the script file at this location within Amazon S3, AWS Glue Studio will load the modified
script the next time you edit the job.
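For orientation, a Spark (Python) job started with the Create a new script with boilerplate code option generally begins with a script similar to the following sketch. The exact boilerplate that AWS Glue Studio generates may differ:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job name passed by AWS Glue and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Your ETL logic goes here

job.commit()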
Creating and editing Scala scripts in AWS Glue Studio
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object MyScript {
  def main(args: Array[String]): Unit = {
    // Create the Spark and AWS Glue contexts used by the rest of the script
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
  }
}
6. Write your Scala job script in the editor. Add additional import statements as needed.
Creating and editing Python shell jobs in AWS Glue Studio
Refer to the instructions at Start the job creation process (p. 44).
The job properties that are supported for Python shell jobs are not the same as those supported for
Spark jobs. The following list describes the changes to the available job parameters for Python shell jobs
on the Job details tab.
• The Type property for the job is automatically set to Python Shell and can't be changed.
• Instead of Language, there is a Python version property for the job. Currently, Python shell jobs
created in AWS Glue Studio use Python 3.6.
• The Glue version property is not available, because it does not apply to Python shell jobs.
• Instead of Worker type and Number of workers, a Data processing units property is shown instead.
This job property determines how many data processing units (DPUs) are consumed by the Python
shell when running the job.
• The Job bookmark property is not available, because it is not supported for Python shell jobs.
• Under Advanced properties, the following properties are not available for Python shell jobs.
• Job metrics
• Continuous logging
Adding nodes to the job diagram
1. Go to the visual editor for a new or saved job and choose the Visual tab.
2. Use the toolbar buttons to add a node of a specific type: Source, Transform, or Target.
• For a source node, see Editing the data source node (p. 50).
• For a transform node, see Editing the data transform node (p. 55).
• For a data target node, see Editing the data target node (p. 74).
4. If you're inserting a node in between two nodes in the job diagram, then perform the following
actions:
a. Choose the node that will be the parent for the new node.
b. Choose one of the toolbar buttons to add a new node to the job diagram. The new node is
added as a child of the currently selected node.
c. Choose the node that will be the child of the newly added node and change its parent node to
point to the newly added node.
If you added a node by mistake, you can use the Undo button on the toolbar to reverse the action.
1. Choose the node in the job diagram that you want to modify.
2. In the node details panel, on the Node properties tab, under the heading Node parents remove the
current parent for the node.
3. Choose a new parent node from the list.
4. Modify the other properties of the node as needed to match the newly selected parent node.
If you modified a node by mistake, you can use the Undo button on the toolbar to reverse the action.
Deleting nodes from the job diagram
To remove a node
1. Go to the visual editor for a new or saved job and choose the Visual tab.
2. Choose the node you want to remove.
3. In the toolbar at the top of the visual editing pane, choose the Remove button.
4. If the node you removed had children nodes, modify the parents for those nodes as needed.
If you removed a node by mistake, you can use the Undo button on the toolbar to reverse the action.
Overview of using connectors and connections
A connector is an optional code package that assists with accessing data stores in AWS Glue Studio. You
can subscribe to several connectors offered in AWS Marketplace.
When creating ETL jobs, you can use a natively supported data store, a connector from AWS Marketplace,
or your own custom connectors. If you use a connector, you must first create a connection for the
connector. A connection contains the properties that are required to connect to a particular data
store. You use the connection with your data sources and data targets in the ETL job. Connectors and
connections work together to facilitate access to the data stores.
Topics
• Overview of using connectors and connections (p. 81)
• Adding connectors to AWS Glue Studio (p. 82)
• Creating connections for connectors (p. 85)
• Authoring jobs with custom connectors (p. 85)
• Managing connectors and connections (p. 90)
• Developing custom connectors (p. 92)
• Restrictions for using connectors and connections in AWS Glue Studio (p. 94)
You can subscribe to connectors for non-natively supported data stores in AWS Marketplace, and then
use those connectors when you're creating connections. Developers can also create their own connectors,
and you can use them when creating connections.
Note
Connections created using the AWS Glue console do not appear in AWS Glue Studio.
Connections created using custom or AWS Marketplace connectors in AWS Glue Studio appear in
the AWS Glue console with type set to UNKNOWN.
The following steps describe the overall process of using connectors in AWS Glue Studio:
1. Subscribe to a connector in AWS Marketplace, or develop your own connector and upload it to AWS
Glue Studio. For more information, see Adding connectors to AWS Glue Studio (p. 82).
2. Review the connector usage information. You can find this information on the Usage tab on the
connector product page. For example, if you click the Usage tab on this product page, AWS Glue
Connector for Google BigQuery, you can see in the Additional Resources section a link to a blog
about using this connector. Other connectors might contain links to the instructions in the Overview
section, as shown on the connector product page for Cloudwatch Logs connector for AWS Glue.
3. Create a connection. You choose which connector to use and provide additional information for the
connection, such as login credentials, URI strings, and virtual private cloud (VPC) information. For
more information, see Creating connections for connectors (p. 85).
4. Create an IAM role for your job. The job assumes the permissions of the IAM role that you specify
when you create it. This IAM role must have the necessary permissions to authenticate with, extract
data from, and write data to your data stores. For more information, see Review IAM permissions
needed for ETL jobs (p. 34) and Permissions required for using connectors (p. 35).
5. Create an ETL job and configure the data source properties for your ETL job. Provide the connection
options and authentication information as instructed by the custom connector provider. For more
information, see Authoring jobs with custom connectors (p. 85).
6. Customize your ETL job by adding transforms or additional data stores, as described in Editing ETL
jobs in AWS Glue Studio (p. 47).
7. If using a connector for the data target, configure the data target properties for your ETL job.
Provide the connection options and authentication information as instructed by the custom
connector provider. For more information, see the section called “Authoring jobs with custom
connectors” (p. 85).
8. Customize the job run environment by configuring job properties, as described in Modify the job
properties (p. 109).
9. Run the job.
Adding connectors to AWS Glue Studio
Topics
• Subscribing to AWS Marketplace connectors (p. 82)
• Creating custom connectors (p. 83)
Subscribing to AWS Marketplace connectors
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. On the Connectors page, choose Go to AWS Marketplace.
3. In AWS Marketplace, in Featured products, choose the connector you want to use. You can choose
one of the featured connectors, or use search. You can search on the name or type of connector, and
you can use options to refine the search results.
If you want to use one of the featured connectors, choose View product. If you used search to locate
a connector, then choose the name of the connector.
4. On the product page for the connector, use the tabs to view information about the connector. If you
decide to purchase this connector, choose Continue to Subscribe.
5. Provide the payment information, and then choose Continue to Configure.
6. On the Configure this software page, choose the method of deployment and the version of the
connector to use. Then choose Continue to Launch.
7. On the Launch this software page, you can review the Usage Instructions provided by the
connector provider. When you're ready to continue, choose Activate connection in AWS Glue
Studio.
After a small amount of time, the console displays the Create marketplace connection page in AWS
Glue Studio.
8. Create a connection that uses this connector, as described in Creating connections for
connectors (p. 85).
Alternatively, you can choose Activate connector only to skip creating a connection at this time. You
must create a connection at a later date before you can use the connector.
Creating custom connectors
Custom connectors are integrated into AWS Glue Studio through the AWS Glue Spark runtime API. The
AWS Glue Spark runtime allows you to plug in any connector that is compliant with the Spark, Athena,
or JDBC interface. It allows you to pass in any connection option that is available with the custom
connector.
You can encapsulate all your connection properties with AWS Glue Connections and supply the
connection name to your ETL job. Integration with Data Catalog connections allows you to use the same
connection properties across multiple calls in a single Spark application or across different applications.
You can specify additional options for the connection. The job script that AWS Glue Studio generates
contains a Datasource entry that uses the connection to plug in your connector with the specified
connection options. For example:
Datasource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"dbTable": "Account", "connectionName": "my-custom-jdbc-connection"},
    transformation_ctx = "DataSource0")
1. Create the code for your custom connector. For more information, see Developing custom
connectors (p. 92).
2. Add support for AWS Glue features to your connector. Here are some examples of these features and
how they are used within the job script generated by AWS Glue Studio:
• Data type mapping – Your connector can typecast the columns while reading them from
the underlying data store. For example, a dataTypeMapping of {"INTEGER":"STRING"}
converts all columns of type Integer to columns of type String when parsing the records and
constructing the DynamicFrame. This helps users to cast columns to types of their choice.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"dataTypeMapping": {"INTEGER": "STRING"},
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Partitioning for parallel reads – AWS Glue allows parallel data reads from the data store by
partitioning the data on a column. You must specify the partition column, the lower partition
bound, the upper partition bound, and the number of partitions. This feature enables you to make
use of data parallelism and multiple Spark executors allocated for the Spark application.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"upperBound": "200", "numPartitions": "4",
        "partitionColumn": "id", "lowerBound": "0",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Use AWS Secrets Manager for storing credentials – The Data Catalog connection can also
contain a secretId for a secret stored in AWS Secrets Manager. The AWS secret can securely
store authentication and credentials information and provide it to your ETL job at runtime.
Alternatively, you can specify the secretId from the Spark script as follows:
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"connectionName": "test-connection-jdbc",
        "secretId": "my-secret-id"},
    transformation_ctx = "DataSource0")
• Filtering the source data with row predicates and column projections – The AWS Glue Spark
runtime also allows users to push down SQL queries to filter data at the source with row
predicates and column projections. This allows your ETL job to load filtered data faster from data
stores that support push-downs. An example SQL query pushed down to a JDBC data source is:
SELECT id, name, department FROM department WHERE id < 200.
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"query": "SELECT id, name, department FROM department WHERE id < 200",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
• Job bookmarks – AWS Glue supports incremental loading of data from JDBC sources. AWS Glue
keeps track of the last processed record from the data store, and processes new data records
in the subsequent ETL job runs. Job bookmarks use the primary key as the default column for
the bookmark key, provided that this column increases or decreases sequentially. For more
information about job bookmarks, see Job Bookmarks in the AWS Glue Developer Guide.
DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "custom.jdbc",
    connection_options = {"jobBookmarkKeys": ["empno"],
        "jobBookmarkKeysSortOrder": "asc",
        "connectionName": "test-connection-jdbc"},
    transformation_ctx = "DataSource0")
3. Package the custom connector as a JAR file and upload the file to Amazon S3.
4. Test your custom connector. For more information, see the instructions on GitHub at Glue Custom
Connectors: Local Validation Tests Guide.
5. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
6. On the Connectors page, choose Create custom connector.
7. On the Create custom connector page, enter the following information:
• The path to the location of the custom code JAR file in Amazon S3.
• A name for the connector that will be used by AWS Glue Studio.
• Your connector type, which can be one of JDBC, Spark, or Athena.
• The name of the entry point within your custom code that AWS Glue Studio calls to use the
connector.
• For JDBC connectors, this field should be the class name of your JDBC driver.
• For Spark connectors, this field should be the fully qualified data source class name, or its alias,
that you use when loading the Spark data source with the format operator.
• (JDBC only) The base URL used by the JDBC connection for the data store.
• (Optional) A description of the custom connector.
8. Choose Create connector.
9. From the Connectors page, create a connection that uses this connector, as described in Creating
connections for connectors (p. 85).
Creating connections for connectors
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector you want to create a connection for, and then choose Create connection.
3. On the Create connection page, enter a name for your connection, and optionally a description.
4. Enter the connection details. Depending on the type of connector you selected, you're prompted to
enter additional information:
• Enter the requested authentication information, such as a user name and password, or choose an
AWS secret.
• For connectors that use JDBC, enter the information required to create the JDBC URL for the data
store.
• If you use a virtual private cloud (VPC), then enter the network information for your VPC.
5. Choose Create connection.
You are returned to the Connectors page, and the informational banner indicates the connection
that was created. You can now use the connection in your AWS Glue Studio jobs, as described in the
section called “Create jobs that use a connector” (p. 45).
Authoring jobs with custom connectors
Topics
• Create jobs that use a connector for the data source (p. 85)
• Configure source properties for nodes that use connectors (p. 86)
• Configure target properties for nodes that use connectors (p. 89)
To create a job that uses connectors for the data source or data target
1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://
console.aws.amazon.com/gluestudio/.
2. On the Connectors page, in the Your connections resource list, choose the connection you want to
use in your job, and then choose Create job.
Alternatively, on the AWS Glue Studio Jobs page, under Create job, choose Source and target
added to the graph. In the Source drop-down list, choose the custom connector that you want to
use in your job. You can also choose a connector for Target.
To configure the properties for a data source node that uses a connector
1. Choose the connector data source node in the job graph or add a new node and choose the
connector for the Node type. Then, on the right-side, in the node details panel, choose the Data
source properties tab, if it's not already selected.
2. In the Data source properties tab, choose the connection that you want to use for this job.
JDBC
• Data source input type: Choose to provide either a table name or a SQL query as the
data source. Depending on your choice, you then need to provide the following additional
information:
• Table name: The name of the table in the data source. If the data source does not use the
term table, then supply the name of an appropriate data structure, as indicated by the
custom connector usage information (which is available in AWS Marketplace).
• Filter predicate: A condition clause to use when reading the data source, similar to a WHERE
clause, which is used to retrieve a subset of the data.
• Query code: Enter a SQL query to use to retrieve a specific dataset from the data source. An
example of a basic SQL query is SELECT id, name, department FROM department WHERE id < 200.
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Partition column: (Optional) You can choose to partition the data reads by providing values
for Partition column, Lower bound, Upper bound, and Number of partitions.
The lowerBound and upperBound values are used to decide the partition stride, not for
filtering the rows in the table. All rows in the table are partitioned and returned.
Note
Column partitioning adds an extra partitioning condition to the query used to read
the data. When using a query instead of a table name, you should validate that the
query works with the specified partitioning condition. For example:
• If your query format is "SELECT col1 FROM table1", then test the query by
appending a WHERE clause at the end of the query that uses the partition column.
• If your query format is "SELECT col1 FROM table1 WHERE col2=val", then
test the query by extending the WHERE clause with AND and an expression that uses
the partition column.
• Data type casting: If the data source uses data types that are not available in JDBC, use this
section to specify how a data type from the data source should be converted into JDBC data
types. You can specify up to 50 different data type conversions. All columns in the data source
that use the same data type are converted in the same way.
For example, if you have three columns in the data source that use the Float data type, and
you indicate that the Float data type should be converted to the JDBC String data type,
then all three columns that use the Float data type are converted to String data types.
• Job bookmark keys: Job bookmarks help AWS Glue maintain state information and prevent
the reprocessing of old data. Specify one or more columns as bookmark keys.
AWS Glue Studio uses bookmark keys to track data that has already been processed during a
previous run of the ETL job. Any columns you use for custom bookmark keys must be strictly
monotonically increasing or decreasing, but gaps are permitted.
If you enter multiple bookmark keys, they're combined to form a single compound key. A
compound job bookmark key should not contain duplicate columns. If you don't specify
bookmark keys, AWS Glue Studio by default uses the primary key as the bookmark key,
provided that the primary key is sequentially increasing or decreasing (with no gaps). If the
table doesn't have a primary key, but the job bookmark property is enabled, you must provide
custom job bookmark keys. Otherwise, the search for primary keys to use as the default will
fail and the job run will fail.
• Job bookmark keys sorting order: Choose whether the key values are sequentially increasing
or decreasing.
Spark
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Connection options: Enter additional key-value pairs as needed to provide additional
connection information or options. For example, you might enter a database name, table
name, a user name, and password.
For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial:
Using the open-source Elasticsearch Spark Connector (p. 95):
• es.net.http.auth.user : username
• es.net.http.auth.pass : password
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true
For an example of the minimum connection options to use, see the sample test script
MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would
normally provide in a connection.
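To illustrate how these options flow into the generated script, a data source node that uses a Spark connector typically results in a call similar to the following sketch. The connection type (marketplace.spark versus custom.spark), the connection name, and the endpoint values are placeholders that depend on your connector and connection:

DataSource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "marketplace.spark",
    connection_options = {"connectionName": "my-es-connection",
        "es.nodes": "https://<Elasticsearch endpoint>",
        "es.port": "443",
        "path": "<Elasticsearch resource>",
        "es.nodes.wan.only": "true"},
    transformation_ctx = "DataSource0")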
Athena
• Table name: The name of the table in the data source. If you're using a connector for reading
from Athena-CloudWatch logs, you would enter the table name all_log_streams.
• Athena schema name: Choose the schema in your Athena data source that corresponds to
the database that contains the table. If you're using a connector for reading from Athena-
CloudWatch logs, you would enter a schema name similar to /aws/glue/name.
• Schema: Because AWS Glue Studio is using information stored in the connection to access the
data source instead of retrieving metadata information from a Data Catalog table, you must
provide the schema metadata for the data source. Choose Add schema to open the schema
editor.
For instructions on how to use the schema editor, see Editing the schema in a custom
transform node (p. 70).
• Additional connection options: Enter additional key-value pairs as needed to provide
additional connection information or options.
To configure the properties for a data target node that uses a connector
1. Choose the connector data target node in the job graph. Then, on the right-side, in the node details
panel, choose the Data target properties tab, if it's not already selected.
2. In the Data target properties tab, choose the connection to use for writing to the target.
JDBC
• Connection: Choose the connection to use with your connector. For information about how to
create a connection, see Creating connections for connectors (p. 85).
• Table name: The name of the table in the data target. If the data target does not use the
term table, then supply the name of an appropriate data structure, as indicated by the custom
connector usage information (which is available in AWS Marketplace).
• Batch size (Optional): Enter the number of rows or records to insert in the target table in a
single operation. The default value is 1000 rows.
Spark
• Connection: Choose the connection to use with your connector. If you did not create a
connection previously, choose Create connection to create one. For information about how to
create a connection, see Creating connections for connectors (p. 85).
• Connection options: Enter additional key-value pairs as needed to provide additional
connection information or options. You might enter a database name, table name, a user
name, and password.
For example, for OpenSearch, you enter the following key-value pairs, as described in Tutorial:
Using the open-source Elasticsearch Spark Connector (p. 95):
• es.net.http.auth.user : username
• es.net.http.auth.pass : password
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true
For an example of the minimum connection options to use, see the sample test script
MinimalSparkConnectorTest.scala on GitHub, which shows the connection options you would
normally provide in a connection.
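Similarly, for a data target node that uses a Spark connector, the generated script typically writes the output with a call along these lines. This is a sketch only; the connection type, connection name, and option values are placeholders that depend on your connector:

DataSink0 = glueContext.write_dynamic_frame.from_options(
    frame = Transform0,  # the DynamicFrame produced by the parent node (placeholder name)
    connection_type = "marketplace.spark",
    connection_options = {"connectionName": "my-es-connection",
        "path": "<Elasticsearch resource>"},
    transformation_ctx = "DataSink0")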
3. After providing the required information, you can view the resulting data schema for your data
source by choosing the Output schema tab in the node details panel.
Managing connectors and connections
Topics
• Viewing connector and connection details (p. 90)
• Editing connectors and connections (p. 91)
• Deleting connectors and connections (p. 91)
• Cancel a subscription for a connector (p. 92)
Note
Connections created using the AWS Glue console do not appear in AWS Glue Studio.
Viewing connector and connection details
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection that you want to view detailed information for.
3. Choose Actions, and then choose View details to open the detail page for that connector or
connection.
4. On the detail page, you can choose to Edit or Delete the connector or connection.
• For connectors, you can choose Create connection to create a new connection that uses the
connector.
• For connections, you can choose Create job to create a job that uses the connection.
Editing connectors and connections
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection that you want to change.
3. Choose Actions, and then choose Edit.
You can also choose View details and on the connector or connection detail page, you can choose
Edit.
4. On the Edit connector or Edit connection page, update the information, and then choose Save.
Deleting connectors and connections
1. In the AWS Glue Studio console, choose Connectors in the console navigation pane.
2. Choose the connector or connection you want to delete.
3. Choose Actions, and then choose Delete.
You can also choose View details, and on the connector or connection detail page, you can choose
Delete.
4. Verify that you want to remove the connector or connection by entering Delete, and then choose
Delete.
When deleting a connector, any connections that were created for that connector are also deleted.
Any jobs that use a deleted connection will no longer work. You can either edit the jobs to use a different
data store, or remove the jobs. For information about how to delete a job, see Delete jobs (p. 113).
If you delete a connector, this doesn't cancel the subscription for the connector in AWS Marketplace.
To remove a subscription for a deleted connector, follow the instructions in Cancel a subscription for a
connector (p. 92).
Developing custom connectors
You will need a local development environment for creating your connector code. You can use any IDE
or even just a command line editor to write your connector. Examples of development environments
include:
• A local Scala environment with a local AWS Glue ETL Maven library, as described in Developing Locally
with Scala in the AWS Glue Developer Guide.
• IntelliJ IDE, by downloading the IDE from https://www.jetbrains.com/idea/.
Topics
• Developing Spark connectors (p. 92)
• Developing Athena connectors (p. 93)
• Developing JDBC connectors (p. 93)
• Examples of using custom connectors with AWS Glue Studio (p. 93)
• Developing AWS Glue connectors for AWS Marketplace (p. 94)
Developing Spark connectors
Follow the steps in the AWS Glue GitHub sample library for developing Spark connectors, which is
located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/
development/Spark/README.md.
Developing Athena connectors
Follow the steps in the AWS Glue GitHub sample library for developing Athena connectors, which is
located at https://github.com/aws-samples/aws-glue-samples/tree/master/GlueCustomConnectors/
development/Athena.
Developing JDBC connectors
1. Install the AWS Glue Spark runtime libraries in your local development environment. Refer to the
instructions in the AWS Glue GitHub sample library at https://github.com/aws-samples/aws-glue-
samples/tree/master/GlueCustomConnectors/development/GlueSparkRuntime/README.md.
2. Implement the JDBC driver that is responsible for retrieving the data from the data source. Refer to
the Java Documentation for Java SE 8.
Create an entry point within your code that AWS Glue Studio uses to locate your connector. The
Class name field should be the full path of your JDBC driver.
3. Use the GlueContext API to read data with the connector. Users can add more input options in the
AWS Glue Studio console to configure the connection to the data source, if necessary. For a code
example that shows how to read from and write to a JDBC database with a custom JDBC connector,
see Custom and AWS Marketplace connectionType values.
Examples of using custom connectors with AWS Glue Studio
• Developing, testing, and deploying custom connectors for your data stores with AWS Glue
• Apache Hudi: Writing to Apache Hudi tables using AWS Glue Custom Connector
• Google BigQuery: Migrating data from Google BigQuery to Amazon S3 using AWS Glue custom
connectors
• Snowflake (JDBC): Performing data transformations using Snowflake and AWS Glue
• SingleStore: Building fast ETL using SingleStore and AWS Glue
• Salesforce: Ingest Salesforce data into Amazon S3 using the CData JDBC custom connector with AWS
Glue
• MongoDB: Building AWS Glue Spark ETL jobs using Amazon DocumentDB (with MongoDB
compatibility) and MongoDB
• Amazon Relational Database Service (Amazon RDS): Building AWS Glue Spark ETL jobs by bringing
your own JDBC drivers for Amazon RDS
Developing AWS Glue connectors for AWS Marketplace
The process for developing the connector code is the same as for custom connectors, but the process
of uploading and verifying the connector code is more detailed. Refer to the instructions in Creating
Connectors for AWS Marketplace on the GitHub website.
Restrictions for using connectors and connections in AWS Glue Studio
• The testConnection API isn't supported with connections created for custom connectors.
• Data Catalog connection password encryption isn't supported with custom connectors.
• You can't use job bookmarks if you specify a filter predicate for a data source node that uses a JDBC
connector.
Tutorial: Using the open-source Elasticsearch Spark Connector
In this tutorial, we will show how to connect to your Amazon OpenSearch Service nodes with a minimal
number of steps.
Topics
• Prerequisites (p. 95)
• Step 1: (Optional) Create an AWS secret for your OpenSearch cluster information (p. 95)
• Step 2: Subscribe to the connector (p. 96)
• Step 3: Activate the connector in AWS Glue Studio and create a connection (p. 97)
• Step 4: Configure an IAM role for your ETL job (p. 97)
• Step 5: Create a job that uses the OpenSearch connection (p. 98)
• Step 6: Run the job (p. 100)
Prerequisites
To use this tutorial, you must have the following:
For more information about creating secrets, see Creating and Managing Secrets with AWS Secrets
Manager in the AWS Secrets Manager User Guide.
es.net.http.auth.user: username
5. Choose + Add row, and enter another key-value pair for the password. For example:
es.net.http.auth.pass: password
6. Choose Next.
7. Enter a secret name. For example: my-es-secret. You can optionally include a description.
Record the secret name, which is used later in this tutorial, and then choose Next.
8. Choose Next again, and then choose Store to create the secret.
Next step
Step 2: Subscribe to the connector (p. 96)
1. If you have not already configured your AWS account to use License Manager, do the following:
If you do not see this window, then you have already configured the necessary permissions.
2. Open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.
3. In the AWS Glue Studio console, expand the menu icon ( ), and then choose Connectors in the
navigation pane.
4. On the Connectors page, choose Go to AWS Marketplace.
5. In AWS Marketplace, in the Search AWS Glue Studio products section, enter elasticsearch
connector in the search field, and then press Enter.
6. Choose the name of the connector, ElasticSearch Spark connector for AWS Glue.
7. On the product page for the connector, use the tabs to view information about the connector. When
you're ready to continue, choose Continue to Subscribe.
8. Review and accept the terms of use for the software.
9. When the subscription process completes, choose Continue to Configuration.
10. Keep the default choices on the Configure this software page, and choose Continue to Launch.
Next step
Step 3: Activate the connector in AWS Glue Studio and create a connection (p. 97)
1. On the Launch this software page in the AWS Marketplace console, choose Usage Instructions, and
then choose the link in the window that appears.
Your browser is redirected to the AWS Glue Studio console Create marketplace connection page.
2. Enter a name for the connection. For example: my-es-connection.
3. In the Connection access section, for Connection credential type, choose User name and
password.
4. For the AWS secret, enter the name of your secret. For example: my-es-secret.
5. In the Network options section, enter the VPC information used to connect to the Elasticsearch cluster.
6. Choose Create connection and activate connector.
Next step
Step 4: Configure an IAM role for your ETL job (p. 97)
The assumed IAM role for the AWS Glue ETL job must also have access to the secret that was created in
the previous section. By default, the AWS managed role AWSGlueServiceRole does not have access
to the secret. To set up access control for your secrets, see Authentication and Access Control for AWS
Secrets Manager and Limiting Access to Specific Secrets.
1. Configure the permissions described in the section called “Review IAM permissions needed for ETL
jobs” (p. 34).
2. Configure the additional permissions needed when using connectors with AWS Glue Studio, as
described in the section called “Permissions required for using connectors” (p. 35).
Next step
Step 5: Create a job that uses the OpenSearch connection (p. 98)
Step 5: Create a job that uses the OpenSearch connection
If your job runs within an Amazon Virtual Private Cloud (Amazon VPC), make sure the VPC is configured
correctly. For more information, see the section called “Configure a VPC for your ETL job” (p. 37).
a. Choose Add schema and enter the schema of the data set in the data source. Connections do
not use tables stored in the Data Catalog, which means that AWS Glue Studio doesn't know the
schema of the data. You must manually provide this schema information. For instructions on
how to use the schema editor, see the section called “Editing the schema in a custom transform
node” (p. 70).
b. Expand Connection options.
c. Choose Add new option and enter the information needed for the connector that was not
entered in the AWS secret:
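For example, if the user name and password are already stored in the AWS secret, the remaining options are typically the endpoint-related key-value pairs listed earlier in this guide:
• es.nodes : https://<Elasticsearch endpoint>
• es.port : 443
• path: <Elasticsearch resource>
• es.nodes.wan.only : true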
4. Add a target node to the graph as described in the section called “Adding nodes to the job
diagram” (p. 79) and the section called “Editing the data target node” (p. 74).
Your data target can be Amazon S3, or it can use information from an AWS Glue Data Catalog or
a connector to write data in a different location. For example, you can use a Data Catalog table to
write to a database in Amazon RDS, or you can use a connector as your data target to write to data
stores that are not natively supported in AWS Glue.
If you choose a connector for your data target, you must choose a connection created for that
connector. Also, if required by the connector provider, you must add options to provide additional
information to the connector. If you use a connection that contains information for an AWS secret,
then you don’t need to provide the user name and password authentication in the connection
options.
5. Optionally, add additional data sources and one or more transform nodes as described in the section
called “Editing the data transform node” (p. 55).
6. Configure the job properties as described in the section called “Modify the job properties” (p. 109),
starting with step 3, and save the job.
Next step
Step 6: Run the job (p. 100)
To run the job you created for the Elasticsearch Spark Connector
1. Using the AWS Glue Studio console, on the visual editor page, choose Run.
2. In the success banner, choose Run Details, or you can choose the Runs tab of the visual editor to
view information about the job run.
Monitoring ETL jobs in AWS Glue Studio
Topics
• Accessing the job monitoring dashboard (p. 101)
• Overview of the job monitoring dashboard (p. 101)
• Job runs view (p. 101)
• Viewing the job run logs (p. 103)
• Viewing the details of a job run (p. 103)
• Viewing Amazon CloudWatch metrics for a job run (p. 104)
Overview of the job monitoring dashboard
The graphs in the tiles are interactive. You can choose any block in a graph to run a filter that displays
only those jobs in the Job runs table at the bottom of the page.
You can change the date range for the information displayed on this page by using the Date range
selector. When you change the date range, the information tiles adjust to show the values for the
specified number of days before the current date. You can also use a specific date range if you choose
Custom from the date range selector.
You can filter the jobs on additional criteria, such as status, worker type, job type, and the job name.
In the filter box at the top of the table, you can enter the text to use as a filter. The table results are
updated with rows that contain matching text as you enter the text.
You can view a subset of the jobs by choosing elements from the graphs on the job monitoring
dashboard. For example, if you choose the number of running jobs in the Job runs summary tile, then
the Job runs list displays only the jobs that currently have a status of Running. If you choose one of the
bars in the Worker type breakdown bar chart, then only job runs with the matching worker type and
status are shown in the Job runs list.
The Job runs resource list displays the details for the job runs. You can sort the rows in the table by
choosing a column heading. The table contains the following information:
• Start time – The date and time at which this job run was started.
• End time – The date and time that this job run completed.
• Run status – The current state of the job run. Values can be STARTING, RUNNING, STOPPING, STOPPED, SUCCEEDED, FAILED, or TIMEOUT.
• Run time – The amount of time that the job run consumed resources.
• DPU hours – The estimated number of DPUs used for the job run. A DPU is a relative measure of processing power. DPUs are used to determine the cost of running your job. For more information, see the AWS Glue pricing page.
You can choose any job run in the list and view additional information. Choose a job run, and then do
one of the following:
• Choose the Actions menu and the View job option to view the job in the visual editor.
• Choose the Actions menu and the Stop run option to stop the current run of the job.
• Choose the View CloudWatch logs button to view the job run logs for that job.
• Choose View run details to view the job run details page.
Viewing the job run logs
• On the Monitoring page, in the Job runs table, choose a job run, and then choose View CloudWatch
logs.
• In the visual job editor, on the Runs tab for a job, choose the hyperlinks to view the logs:
• Logs – Links to the Apache Spark job logs written when continuous logging is enabled for a job run.
When you choose this link, it takes you to the Amazon CloudWatch logs in the /aws-glue/jobs/
logs-v2 log group. By default, the logs exclude non-useful Apache Hadoop YARN heartbeat and
Apache Spark driver or executor log messages. For more information about continuous logging, see
Continuous Logging for AWS Glue Jobs in the AWS Glue Developer Guide.
• Error logs – Links to the logs written to stderr for this job run. When you choose this link, it takes
you to the Amazon CloudWatch logs in the /aws-glue/jobs/error log group. You can use these
logs to view details about any errors that were encountered during the job run.
• Output logs – Links to the logs written to stdout for this job run. When you choose this link, it
takes you to the Amazon CloudWatch logs in the /aws-glue/jobs/output log group. You can use
these logs to see all the details about the tables that were created in the AWS Glue Data Catalog and
any errors that were encountered.
Viewing the details of a job run
• Run status – The current state of the job run. Values can be STARTING, RUNNING, STOPPING, STOPPED, SUCCEEDED, FAILED, or TIMEOUT.
• Glue version – The AWS Glue version used by the job run.
• Start time – The date and time at which this job run was started.
• End time – The date and time that this job run completed.
• Execution time – The amount of time spent running the job script.
• Trigger name – The name of the trigger associated with the job.
• Last modified on – The date when the job was last modified.
• Number of workers – The number of workers used for the job run.
Viewing Amazon CloudWatch metrics for a job run
AWS Glue reports metrics to Amazon CloudWatch every 30 seconds. The AWS Glue metrics represent
delta values from the previously reported values. Where appropriate, metrics dashboards aggregate
(sum) the 30-second values to obtain a value for the entire last minute. However, the Apache Spark
metrics that AWS Glue passes on to Amazon CloudWatch are generally absolute values that represent
the current state at the time they are reported.
Note
You must configure your account to access Amazon CloudWatch, as described in Amazon
CloudWatch permissions (p. 33).
The metrics provide information about your job run, such as:
• ETL Data Movement – The number of bytes read from or written to Amazon S3.
• Memory Profile: Heap used – The number of memory bytes used by the Java virtual machine (JVM)
heap.
• Memory Profile: heap usage – The fraction of memory (scale: 0–1), shown as a percentage, used by
the JVM heap.
• CPU Load – The fraction of CPU system load used (scale: 0–1), shown as a percentage.
Start a job run
You can initiate a job run in the following ways in AWS Glue Studio:
• On the Jobs page, choose the job you want to start, and then choose the Run job button.
• If you're viewing a job in the visual editor and the job has been saved, you can choose the Run button
to start a job run.
For more information about job runs, see Working with Jobs on the AWS Glue Console in the AWS Glue
Developer Guide.
Schedule job runs
• On the Jobs page, choose the job you want to create a schedule for, choose Actions, and then
choose Schedule job.
• If you're viewing a job in the visual editor and the job has been saved, choose the Schedules tab.
Then choose Create Schedule.
2. On the Schedule job run page, enter the following information:
Manage job schedules
On the Schedules tab of the visual editor, you can perform the following tasks:
• Create a new schedule.
Choose Create schedule, then enter the information for your schedule as described in the section
called “Schedule job runs” (p. 106).
• Edit an existing schedule.
Choose the schedule you want to edit, then choose Action followed by Edit schedule. When you
choose to edit an existing schedule, the Frequency shows as Custom, and the schedule is displayed
as a cron expression. You can either modify the cron expression, or specify a new schedule using the
Frequency button. When you finish with your changes, choose Update schedule.
• Pause a schedule.
Choose an active schedule, and then choose Action followed by Pause schedule. The schedule is
instantly deactivated. Choose the refresh (reload) button to see the updated job schedule status.
• Resume a paused schedule.
Choose a deactivated schedule, and then choose Action followed by Resume schedule. The schedule is
instantly activated. Choose the refresh (reload) button to see the updated job schedule status.
• Delete a schedule.
Choose the schedule you want to remove, and then choose Action followed by Delete schedule. The
schedule is instantly deleted. Choose the refresh (reload) button to see the updated job schedule list.
The schedule will show a status of Deleting until it has been completely removed.
Stop job runs
On the Monitoring page, in the Job runs list, choose the job that you want to stop, choose Actions, and
then choose Stop run.
On the Jobs page, you can see all the jobs that were created in your account. The Your jobs list shows
the job name, its type, the status of the last run of that job, and the dates on which the job was created
and last modified. You can choose the name of a job to see detailed information for that job.
You can also use the Monitoring dashboard to view all your jobs. You can access the dashboard by
choosing Monitoring in the navigation pane. For more information about using the dashboard, see
Monitoring ETL jobs in AWS Glue Studio (p. 101).
If you choose the settings icon in the Your jobs section, you can customize how AWS Glue Studio
displays the information in the table. You can choose to wrap the lines of text in the display, change the
number of jobs displayed on the page, and specify which columns to display.
• Choose the Runs tab of the visual editor to view the job run information for the currently displayed
job.
On the Runs tab (the Recent job runs page), there is a card for each job run. The information displayed
on the Runs tab includes:
• Job run ID
• Number of attempts to run this job
• Status of the job run
• Start and end time for the job run
• The runtime for the job run
• A link to the job log files
• A link to the job error log files
• The error returned for failed jobs
• In the navigation pane, choose Monitoring. Scroll down to the Job runs list. Choose the job and then
choose View run details.
The information displayed on the job run detail page accessed from the Monitoring page is more
comprehensive. The contents are described in Viewing the details of a job run (p. 103).
For more information about the job logs, see Viewing the job run logs (p. 103).
If you want to edit the job script, see Editing or uploading a job script (p. 76).
For more information about the job properties, see Defining Job Properties in the AWS Glue
Developer Guide.
5. Expand the Advanced properties section if you need to specify these additional job properties:
• Script filename – The name of the file that stores the job script in Amazon S3.
• Script path – The Amazon S3 location where the job script is stored.
• Job metrics – (Not available for Python shell jobs) Turns on the creation of Amazon CloudWatch
metrics when this job runs.
• Continuous logging – (Not available for Python shell jobs) Turns on continuous logging to
CloudWatch, so the logs are available for viewing before the job completes.
• Spark UI and Spark UI logs path – (Not available for Python shell jobs) Turns on the use of Spark
UI for monitoring this job and specifies the location for the Spark UI logs.
• Maximum concurrency – Sets the maximum number of concurrent runs that are allowed for this
job.
• Temporary path – The location of a working directory in Amazon S3 where temporary
intermediate results are written when AWS Glue runs the job script.
• Delay notification threshold (minutes) – Specify a delay threshold for the job. If the job runs for a
longer time than that specified by the threshold, then AWS Glue sends a delay notification for the
job to CloudWatch.
• Security configuration and Server-side encryption – Use these fields to choose the encryption
options for the job.
• Use Glue Data Catalog as the Hive metastore – Choose this option if you want to use the AWS
Glue Data Catalog as an alternative to Apache Hive Metastore.
• Additional network connection – For a data source in a VPC, you can specify a connection of type
Network to ensure your job accesses your data through the VPC.
• Python library path, Dependent jars path (Not available for Python shell jobs), or Referenced
files path – Use these fields to specify the location of additional files used by the job when it runs
the script.
• Job parameters – You can add a set of key-value pairs that are passed as named parameters to
the job script. In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name.
For more information about using parameters in a job script, see Passing and Accessing Python
Parameters in AWS Glue in the AWS Glue Developer Guide, and the sketch that follows this procedure.
• Tags – You can add tags to your jobs to help you organize and identify them.
6. After you have modified the job properties, save the job.
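Inside the job script, parameters added in the Job parameters field can be read with the getResolvedOptions helper. In this sketch, --target_bucket is a hypothetical parameter key used only for illustration.

import sys

from awsglue.utils import getResolvedOptions

# "JOB_NAME" is supplied by AWS Glue; "target_bucket" stands in for a
# hypothetical job parameter added as "--target_bucket" on the Job details tab.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_bucket"])

print(args["JOB_NAME"])
print(args["target_bucket"])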
Store Spark shuffle files on Amazon S3
1. On the Jobs page, in the Your jobs list, choose the name of the job you want to modify.
2. On the visual editor page, choose the Job details tab at the top of the job editing pane.
3. Expand the Advanced properties section, and in the Job parameters field, enter the following
key-value pairs:
• --write-shuffle-files-to-s3 — true
This is the main parameter that configures the shuffle manager in AWS Glue to use Amazon S3
buckets for writing and reading shuffle data. By default, this parameter has a value of false.
• (Optional) --write-shuffle-spills-to-s3 — true
This parameter allows you to offload spill files to Amazon S3 buckets, which provides additional
resiliency to your Spark job in AWS Glue. This is only required for large workloads that spill a lot of
data to disk. By default, this parameter has a value of false.
• (Optional) --conf spark.shuffle.glue.s3ShuffleBucket — s3://<shuffle-bucket>
This parameter specifies the Amazon S3 bucket to use when writing the shuffle files. If you do
not set this parameter, the location is the shuffle-data folder in the location specified for
Temporary path (--TempDir).
Note
Make sure the location of the shuffle bucket is in the same AWS Region in which the job
runs.
Also, the shuffle service does not clean the files after the job finishes running, so you
should configure the Amazon S3 storage life cycle policies on the shuffle bucket location.
For more information, see Managing your storage lifecycle in the Amazon S3 User Guide.
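The same key-value pairs can also be supplied as run-time arguments when you start the job, for example with boto3. This is a minimal sketch; the job name my-spark-job and the bucket my-shuffle-bucket are placeholders.

import boto3

glue = boto3.client("glue")

# The job name and bucket below are placeholders for this example.
glue.start_job_run(
    JobName="my-spark-job",
    Arguments={
        "--write-shuffle-files-to-s3": "true",
        "--write-shuffle-spills-to-s3": "true",
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket",
    },
)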
Save the job
1. Provide all the required information in the Visual and Job details tabs.
2. Choose the Save button.
After you save the job, the 'not saved' callout changes to display the time and date that the job was
last saved.
If you exit AWS Glue Studio before saving your job, the next time you sign in to AWS Glue Studio, a
notification appears. The notification indicates that there is an unsaved job, and asks if you want to
restore it. If you choose to restore the job, you can continue to edit it.
Troubleshooting errors when saving a job
• If a node in the visual editor isn't configured correctly, the Visual tab shows a red callout, and the node
with the error displays a warning symbol.
1. Choose the node. In the node details panel, a red callout appears on the tab where the missing or
incorrect information is located.
2. Choose the tab in the node details panel that shows a red callout, and then locate the problem
fields, which are highlighted. An error message below the fields provides additional information
about the problem.
• If there is a problem with the job properties, the Job details tab shows a red callout. Choose that tab
and locate the problem fields, which are highlighted. The error messages below the fields provide
additional information about the problem.
Clone a job
You can use the Clone job action to copy an existing job into a new job.
1. On the Jobs page, in the Your jobs list, choose the job that you want to duplicate.
2. From the Actions menu, choose Clone job.
3. Enter a name for the new job. You can then save or edit the job.
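There is no single clone API, but you can approximate the same action programmatically by reading an existing job definition and creating a new job from a subset of its properties. The following is a rough sketch only; the job names are placeholders, and a real copy would carry over more properties (connections, security configuration, and so on).

import boto3

glue = boto3.client("glue")

# Placeholder job names for illustration.
source = glue.get_job(JobName="existing-job")["Job"]

# Copy only a few of the fields that CreateJob accepts.
new_job = {
    "Name": "existing-job-copy",
    "Role": source["Role"],
    "Command": source["Command"],
    "DefaultArguments": source.get("DefaultArguments", {}),
    "GlueVersion": source.get("GlueVersion", "2.0"),
}
if "WorkerType" in source:
    new_job["WorkerType"] = source["WorkerType"]
    new_job["NumberOfWorkers"] = source["NumberOfWorkers"]

glue.create_job(**new_job)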
Delete jobs
You can remove jobs that are no longer needed. You can delete one or more jobs in a single operation.
1. On the Jobs page, in the Your jobs list, choose the jobs that you want to delete.
2. From the Actions menu, choose Delete job.
3. Verify that you want to delete the job by entering delete.
You can also delete a saved job when you're viewing the Job details tab for that job in the visual editor.
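The console actions above map to the DeleteJob and BatchDeleteJob APIs if you want to script the cleanup. The job names in this sketch are placeholders.

import boto3

glue = boto3.client("glue")

# Delete a single job (placeholder name).
glue.delete_job(JobName="old-job")

# Or delete several jobs in one call. The response lists any names
# that could not be deleted.
response = glue.batch_delete_job(JobNames=["old-job-1", "old-job-2"])
print(response.get("JobsNotFound", []))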
Tutorial: Adding an AWS Glue crawler
In this tutorial, let’s add a crawler that infers metadata from flight log data stored in a public Amazon S3
bucket and creates a table in your Data Catalog.
Topics
• Prerequisites (p. 114)
• Step 1: Add a crawler (p. 114)
• Step 2: Run the crawler (p. 115)
• Step 3: View AWS Glue Data Catalog objects (p. 115)
Prerequisites
This tutorial assumes that you have an AWS account and access to AWS Glue.
Step 1: Add a crawler
1. On the AWS Glue service console, on the left-side menu, choose Crawlers.
2. On the Crawlers page, choose Add crawler. This starts a series of pages that prompt you for the
crawler details.
3. In the Crawler name field, enter Flights Data Crawler, and choose Next.
Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier
for CSV by default.
4. For the crawler source type, choose Data stores and choose Next.
5. Now let's point the crawler to your data. On the Add a data store page, choose the Amazon S3 data
store. This tutorial doesn't use a connection, so leave the Connection field blank if it's visible.
For the option Crawl data in, choose Specified path in another account. Then, for the Include path,
enter the path where the crawler can find the flights data, which is s3://crawler-public-us-
east-1/flight/2016/csv. After you enter the path, the title of this field changes to Include
path. Choose Next.
6. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a
single data store, so choose No, and then choose Next.
7. The crawler needs permissions to access the data store and create objects in the AWS Glue Data
Catalog. To configure these permissions, choose Create an IAM role. The IAM role name starts
with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter
CrawlerTutorial, and then choose Next.
Note
To create an IAM role, your AWS user must have CreateRole, CreatePolicy, and
AttachRolePolicy permissions.
Next, enter flights for Prefix added to tables. Use the default values for the rest of the options,
and choose Next.
10. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back
to return to previous pages and make changes.
After you have reviewed the information, choose Finish to create the crawler.
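For reference, a crawler with the same settings can also be created with boto3 instead of the console wizard. This is a sketch under a few assumptions: the IAM role created above already exists, the placeholder account ID must be replaced with your own, and the target database is created first.

import boto3

glue = boto3.client("glue")

# Placeholder account ID; the role is the one created by the wizard above.
role_arn = "arn:aws:iam::123456789012:role/AWSGlueServiceRole-CrawlerTutorial"

# Create the database the crawler will write to.
glue.create_database(DatabaseInput={"Name": "test-flights-db"})

glue.create_crawler(
    Name="Flights Data Crawler",
    Role=role_arn,
    DatabaseName="test-flights-db",
    TablePrefix="flights",
    Targets={
        "S3Targets": [{"Path": "s3://crawler-public-us-east-1/flight/2016/csv"}]
    },
)

# The crawler can then be started with:
# glue.start_crawler(Name="Flights Data Crawler")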
Step 2: Run the crawler
1. The banner near the top of this page lets you know that the crawler was created, and asks if you
want to run it now. Choose Run it now? to run the crawler.
The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the
crawler starts running, the banner disappears, and the crawler display is updated to show a status of
Starting for your crawler. After a minute, you can click the Refresh icon to update the status of the
crawler that is displayed in the table.
2. When the crawler completes, a new banner appears that describes the changes made by the crawler.
You can choose the test-flights-db link to view the Data Catalog objects.
Step 3: View AWS Glue Data Catalog objects
1. In the left-side navigation, under Data catalog, choose Databases. Here you can view the
test-flights-db database that is created by the crawler.
2. In the left-side navigation, under Data catalog and below Databases, choose Tables. Here you can
view the flightscsv table created by the crawler. If you choose the table name, then you can view
the table settings, parameters, and properties. Scrolling down in this view, you can view the schema,
which is information about the columns and data types of the table.
3. If you choose View partitions on the table view page, you can see the partitions created for the
data. The first column is the partition key.
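The same Data Catalog objects can be listed programmatically. A brief boto3 sketch, assuming the tutorial database and table created above:

import boto3

glue = boto3.client("glue")

# Tables that the crawler created in the tutorial database.
tables = glue.get_tables(DatabaseName="test-flights-db")
for table in tables["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print("  ", column["Name"], column["Type"])

# Partitions created for the flightscsv table.
partitions = glue.get_partitions(DatabaseName="test-flights-db", TableName="flightscsv")
print(len(partitions["Partitions"]), "partitions")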
Document history
The following list describes the important changes in each revision of the AWS Glue Studio User Guide.
For notification about updates to this documentation, you can subscribe to an RSS feed.
• Glue Studio is now available in China (October 11, 2021) – AWS Glue Studio is now available in the China Beijing and Ningxia regions.
• Direct access to streaming sources now available (September 30, 2021) – When adding data sources to your ETL job in the visual editor, you can supply information to access the data stream instead of having to use a Data Catalog database and table.
• Custom connectors can now be used with data previews (September 24, 2021) – When editing a data source node that uses a custom connector, you can preview the dataset by choosing the Data preview tab. For more information, see Custom Connectors.
• AWS Glue Studio supports AWS Glue version 3.0 (August 18, 2021) – When creating jobs in AWS Glue Studio, you can choose Glue 3.0 as the version for your job in the Job details tab. If you do not choose a version for your ETL job, Glue 2.0 is used by default.
• AWS GovCloud (US) Region (August 18, 2021) – AWS Glue Studio is now available in the AWS GovCloud (US) Region.
• Python shell authoring available in AWS Glue Studio (August 13, 2021) – When creating a new job, you can now choose to create a Python shell job. For more information, see Start the job creation process and Editing Python shell jobs in AWS Glue Studio.
• Upload scripts to AWS Glue Studio (June 14, 2021) – In conjunction with the script editor feature, you can upload job scripts to AWS Glue Studio. For more information, see Start the job creation process and Editing or uploading a job script.
• View your job's dataset while creating and editing jobs (June 7, 2021) – You can use the new Data preview tab for a node in your job diagram to see a sample of the data processed by that node. For more information, see Using data previews in the visual job editor.
• Specify settings for your streaming ETL job in the visual job editor (June 4, 2021) – You can configure additional connection settings for streaming data sources in the visual job editor to optimize your streaming ETL jobs. For more information, see Using a streaming data source.
• Network connection support added (May 24, 2021) – If you want to access a data source located in your VPC, you can specify a network connection for the job. For more information, see Modify the job properties.
• Edit job scripts (May 24, 2021) – You can now edit scripts in the job editor. For more information, see Editing a job script.
• Delete jobs using the AWS Glue Studio console (May 24, 2021) – You can now delete jobs in AWS Glue Studio. To learn how, see Delete jobs.
• Read data from files in child folders in Amazon S3 (April 30, 2021) – You can specify a single folder in Amazon S3 as your data source and use the Recursive option to include all the child folders as part of the data source. For more information, see Using files in Amazon S3 for the data source.
• Delete connectors and connections functionality added (April 30, 2021) – You can now delete connectors and connections in AWS Glue Studio. For more information, see Deleting connectors and connections.
• Fill missing values transform added (March 29, 2021) – You can use the FillMissingValues transform in AWS Glue Studio to locate records in the dataset that have missing values and add a new field with an estimated value. For more information, see Editing the data transform node.
• SQL transform available (March 23, 2021) – You can use a SQL transform node to write your own transform in the form of a SQL query. For more information, see Using a SQL query to transform data.
• JDBC source nodes now support job bookmark keys (March 15, 2021) – Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. For more information, see Authoring jobs with custom connectors.
• Connectors can be used for data targets (March 15, 2021) – Using a custom or AWS Marketplace connector for your data target is now supported. For more information, see Authoring jobs with custom connectors.
• A new toolbar is available for the visual job editor (March 8, 2021) – A more streamlined and functional toolbar is available for the visual job editor of AWS Glue Studio. This feature makes it easier to add nodes to your graph.
• Read data from Amazon S3 without creating Data Catalog tables (February 5, 2021) – AWS Glue Studio now allows you to read data directly from Amazon S3 without first creating a table in the AWS Glue Data Catalog. For more information, see Editing the data source node.
• AWS Glue Studio jobs can now update Data Catalog tables (February 5, 2021) – AWS Glue Studio now supports updating the AWS Glue Data Catalog during job runs. This feature makes it easy to keep your tables up to date as your jobs write new data into Amazon S3. This makes the data immediately available for query from any analytics service that is compatible with the AWS Glue Data Catalog. For more information, see Configuring data target nodes.
• Job scheduling now available in AWS Glue Studio (December 21, 2020) – You can define a time-based schedule for your job runs in AWS Glue Studio. You can use the console to create a basic schedule, or define a more complex schedule using the Unix-like cron syntax. For more information, see Schedule job runs.
• AWS Glue Custom Connectors released (December 21, 2020) – AWS Glue Custom Connectors allow you to discover and subscribe to connectors in AWS Marketplace. We also released AWS Glue Spark runtime interfaces to plug in connectors built for Apache Spark Datasource, Athena federated query, and JDBC APIs. For more information, see Using Connectors and connections with AWS Glue Studio.
• Support for running streaming ETL jobs in AWS Glue version 2.0 (November 11, 2020) – AWS Glue Studio now supports running streaming ETL jobs using AWS Glue version 2.0. For more information, see Adding Streaming ETL Jobs in AWS Glue in the AWS Glue Developer Guide.
• Availability of AWS Glue Studio announced (September 23, 2020) – AWS Glue Studio provides a visual interface that simplifies the creation of jobs that prepare the data for analysis. The initial version of this guide was published on the same day AWS Glue Studio launched.
AWS glossary
For the latest AWS terminology, see the AWS glossary in the AWS General Reference.