Learn how to run AWS Glue jobs locally using Visual Studio Code and Docker, achieving cost savings, faster feedback loops, step debugging, and enhanced data visualization with the Data Wrangler extension.
When starting with data engineering on AWS, the first service that comes to mind is undoubtedly AWS Glue. This managed service is essential for data engineering, serving as the backbone for ETL (Extract, Transform, Load) pipelines. AWS Glue simplifies the process of discovering, preparing, and combining data for analytics, machine learning, and application development.
With its serverless architecture, AWS Glue eliminates the need for managing infrastructure, allowing you to focus on writing and deploying your ETL scripts. It also integrates seamlessly with other AWS services such as S3, RDS, and Redshift, making it a powerful tool for building comprehensive data workflows in the cloud.
However, like other cloud service providers, AWS has a reputation for making it challenging to invoke their managed services locally, often leading to a sense of vendor lock-in and a lack of developer tooling. Take for example AWS Lambda, which is designed for seamless cloud-based execution but can be difficult to test and deploy locally without using additional tools or configurations.
Side note: if you want to learn more about testing Lambda functions locally, check my previous blog post here!
The same can be said for AWS Glue jobs, which are designed to run efficiently in the cloud but can pose challenges when it comes to local testing and development. However, you might ask me:
“Why should I locally run my AWS Glue jobs?”
Running AWS Glue jobs locally offers several benefits. Beyond just testing the jobs, local development provides access to a wide range of developer tools that can enhance productivity and streamline the development process.
Developing AWS Glue jobs locally helps avoid the costs associated with running jobs on AWS Glue. When developing on AWS, you need to consider two main costs: the cost of running AWS Glue jobs and the cost of interactive sessions. By running jobs locally, you can minimize these expenses.
Here are two sample pricing cases from the AWS Glue documentation:
“ETL job: Consider an AWS Glue Apache Spark job that runs for 15 minutes and uses 6 DPU. The price of 1 DPU-Hour is $0.44. Since your job ran for 1/4th of an hour and used 6 DPUs, AWS will bill you 6 DPU * 1/4 hour * $0.44, or $0.66.
AWS Glue Studio Job Notebooks and Interactive Sessions: Suppose you use a notebook in AWS Glue Studio to interactively develop your ETL code. An Interactive Session has 5 DPU by default. If you keep the session running for 24 minutes or 2/5th of an hour, you will be billed for 5 DPUs * 2/5 hour at $0.44 per DPU-Hour or $0.88.”
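To make the arithmetic explicit, here is a minimal sketch of the billing formula behind both examples; the numbers are the ones quoted above, and the helper function name is purely illustrative:

# Illustrative helper for the DPU-hour billing formula quoted above
def glue_cost(dpus, minutes, rate_per_dpu_hour=0.44):
    return dpus * (minutes / 60) * rate_per_dpu_hour

print(glue_cost(6, 15))  # ETL job example: 6 DPU * 0.25 h * $0.44 = $0.66
print(glue_cost(5, 24))  # Interactive session example: 5 DPU * 0.4 h * $0.44 = $0.88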
Developing locally is faster since you can quickly run, debug, and tweak your code locally without waiting for cloud resources to be provisioned and jobs to be executed in the AWS environment.
Additionally, immediate access to logs and error messages simplifies troubleshooting, enabling developers to address issues promptly and refine their code more efficiently.
Debugging locally allows you to use familiar tools and IDEs. You can set breakpoints, inspect variables, and step through your code, making it easier to identify and fix issues.
You can quickly iterate on your code, test changes in real-time, and ensure that your fixes are effective before deploying them to the cloud. This hands-on approach reduces the complexity and time involved in resolving issues, ultimately leading to a more robust and reliable ETL process.
Working locally allows for seamless integration with version control systems like Git. As a data engineer, you are likely working as part of a team. By developing locally and using Git, you can easily manage and collaborate on code changes with your team, ensuring that everyone stays synchronized and changes are tracked efficiently.
In this blog post, we will be demonstrating how to run AWS Glue jobs locally using VS Code. By leveraging the powerful features of VS Code, you can streamline your development workflow and take advantage of its extensive debugging and troubleshooting tools as well as its extensions ecosystem.
In a nutshell, we will be developing inside a Docker container that is pre-configured to mimic an AWS Glue production environment. By the end of this guide, you will be well-equipped to develop, test, and debug AWS Glue jobs locally with ease.
export AWS_PROFILE=<your-profile-name>
docker pull amazon/aws-glue-libs:glue_libs_4.0.0_image_01
To ensure your local testing environment matches the one your jobs run in, AWS provides and actively maintains an official Docker image. Later in this tutorial, we will use this image to run the container where we will develop, test, and run AWS Glue jobs.
In VS Code, open the directory or workspace where you will develop your AWS Glue jobs, then go to Settings.
From there, go to the Workspaces tab and select Open Settings (JSON).
Doing this will generate a .vscode folder containing a settings.json file.
In settings.json, copy the block of code below:
{
    "python.defaultInterpreterPath": "/usr/bin/python3",
    "python.analysis.extraPaths": [
        "/home/glue_user/aws-glue-libs/PyGlue.zip",
        "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip",
        "/home/glue_user/spark/python/"
    ]
}
This JSON configuration is intended for setting up VS Code to work seamlessly inside an AWS Glue Docker container, which we will be spinning up in the next step.
The field python.defaultInterpreterPath, which has a value of /usr/bin/python3, indicates the default Python interpreter that VS Code should use.
“Shouldn’t I use venv?”
“But my Python interpreter is not located there.”
Remember, we will be developing within a Docker container on our local machine, not directly in our local environment. This approach ensures that our development environment closely mirrors the AWS Glue production setup.
Moving forward, you can observe that the field python.analysis.extraPaths has three (3) values:
/home/glue_user/aws-glue-libs/PyGlue.zip
This zip file inside the Docker container contains the AWS Glue libraries that you need to run your Glue ETL jobs. Including it in the extraPaths allows VS Code to recognize and autocomplete AWS Glue-specific modules and functions.
/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip
This path inside the Docker container includes the Py4J library, which is used for communication between Python and the Java-based Apache Spark. Adding this path ensures that VS Code can provide code analysis for Spark-related Python code.
/home/glue_user/spark/python/
This path inside the Docker container is where the standard Python libraries for Apache Spark are located. Including this ensures that all the Spark Python APIs are recognized by the Python language server in VS Code.
Essentially, without configuring the .vscode/settings.json file, developing AWS Glue jobs locally within a Docker container using VS Code would lack IntelliSense, which could significantly hinder developer productivity.
Before running the Docker container, we must first declare the WORKSPACE_LOCATION and PROFILE_NAME environment variables.
export PROFILE_NAME=$AWS_PROFILE
export WORKSPACE_LOCATION="path-of-your-project-directory"
In the prerequisites, we were tasked with creating a project directory. To get the path of this directory, open the terminal in VS Code and enter the command pwd. In this example, the path of my project directory is
/Users/iggyyuson/personal/awsglue
Therefore, the command that I must execute is the following:
export WORKSPACE_LOCATION=/Users/iggyyuson/personal/awsglue
Moving forward, we can now run the Docker container. Execute the command below:
docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
IMPORTANT NOTE: If you encounter a “permission denied” error, it’s likely that your Docker daemon lacks the necessary permissions to bind mount the directory. Not all machines allow the Docker daemon to access the ~/.aws folder by default. You will need to configure these permissions via the CLI or Docker Desktop. See the guide below.
After running the Docker container, you will be greeted by an interactive PySpark shell. If you see something like the illustration below, the container is running as intended.
In VS Code, open your Dev Containers extension. If you are using the latest version of the Dev Containers extension, this appears as Remote Explorer in the left menu.
As you can see, the Docker container we ran a while ago is already listed. This view may vary depending on how many Docker containers are running on your local machine. Right-click the correct container and select Attach in Current Window.
“So, what happened?”
After clicking Attach in Current Window, your VS Code is now connected to the Docker container we ran a while ago. If you go to your terminal in VS Code and enter the command whoami, you will see a view similar to the one below (the command should return glue_user, the default user inside the AWS Glue image):
Create a Python script in your workspace and paste the block of code below:
# sys, getResolvedOptions, and Job are part of the standard Glue job boilerplate;
# they are not strictly needed for this local sample
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Spark and Glue contexts and a Job object
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Read a public sample dataset from S3 into a Glue DynamicFrame
input_df = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"]
    },
    format="json",
)

# Print a sample of the records
input_df.show()

# Convert the DynamicFrame to a Spark DataFrame, then to a pandas DataFrame
df = input_df.toDF()
df_pd = df.toPandas()
print(df_pd.head())

print('test script has successfully run')
If you are already familiar with PySpark and AWS Glue, you can construct your own ETL script for this part, but if you are new to these technologies, you can simply copy the sample above and follow along!
“If we are developing inside a Docker container, our newly created Python file will get deleted when the container gets terminated, right?”
If you were paying attention earlier when we ran the docker run command, you might have noticed the -v flag with the value of $WORKSPACE_LOCATION:/home/glue_user/workspace/.
This flag mounts your local workspace directory to the specified path inside the Docker container, ensuring that any changes made locally are reflected within the container, and vice versa.
For additional context, there is another -v flag in the docker run command mentioned earlier, with the value of ~/.aws:/home/glue_user/.aws. This flag mounts your AWS credentials from your local .aws directory into the Docker container. This step is crucial to ensure that the Docker container uses the same credentials as your IAM user, allowing it to access AWS services seamlessly depending on the permissions granted.
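If you want to confirm that the mounted credentials are actually picked up inside the container, a quick sanity check is to ask STS which identity is in use; this is a minimal sketch assuming boto3 is available in the Glue image, which it normally is:

# Sanity check: print the ARN of the identity the container is using
import boto3

sts = boto3.client("sts")
print(sts.get_caller_identity()["Arn"])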
Open your newly created sample script in VS Code. You can run this script by simply clicking the Run Python File button located in the upper right section.
When you run the Python file, VS Code will open the terminal and you can view the logs of your sample script there. ETL scripts might take a while to execute depending on how complex the data transformations are.
Congratulations! You have now successfully run your first AWS Glue ETL script locally!
“What did the script do?”
Since this was just a sample script, it did not really do any complex transformations. This script basically loaded a JSON document from a public Amazon S3 bucket and converted it into a Pandas data frame.
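In a real ETL job, you would typically also write the transformed data back out. As a minimal sketch, writing the same DynamicFrame back to S3 as JSON could look like the snippet below; note that s3://your-bucket/output/ is a placeholder you would replace with a bucket your credentials can write to:

# Hypothetical write step: persist the DynamicFrame back to S3 as JSON
glueContext.write_dynamic_frame.from_options(
    frame=input_df,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output/"},
    format="json",
)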
While running AWS Glue jobs locally helps optimize costs, the real advantage of local development lies in the ability to enable developer tooling. Step debugging is a powerful feature that allows you to meticulously examine and debug your code, making the development process more efficient and effective.
Gone are those console logs and print statements! This is how you debug like a pro!
In order to execute step debugging in VS Code, you must first add a breakpoint.
After adding a breakpoint, click the down arrow beside the Run Python File button and select Python Debugger: Debug Python File.
Notice how your code execution pauses at the line where you added a breakpoint. This demonstrates the power of step debugging. At this point, you can inspect the variables on the right side of your screen and continue running your script line by line using the step controls. The step debugging session will end either when the code has fully executed as intended or by selecting the stop button.
When practicing data engineering or any data-related work, it is crucial to visualize your data clearly. In the previous example, when the command df_pd.head() was executed, the terminal output was not very user-friendly. This is where the Data Wrangler VS Code extension comes in handy.
This VS Code extension provides an intuitive interface for viewing and interacting with your data frames, making it easier to understand and analyze your data. With features like data profiling, filtering, and visualization, it enhances your ability to work with data efficiently and effectively, which can significantly improve both your productivity and the quality of your data analysis.
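For comparison, here is roughly what you would otherwise do by hand in the terminal with plain pandas, using the df_pd frame from the sample script; Data Wrangler surfaces the same kind of overview interactively:

# Manual profiling of the pandas frame, for comparison with Data Wrangler's UI
print(df_pd.shape)       # number of rows and columns
print(df_pd.dtypes)      # column data types
print(df_pd.describe())  # basic summary statistics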
The only prerequisite for using this extension is to also install the Python and Jupyter extensions.
To use Data Wrangler, you must first run a step debugging session. Ensure your breakpoint is placed on a line after the data frame has been created. Then, in the Variables menu, right-click on the data frame. If you followed this tutorial and used the given sample script, you can right-click the df_pd variable and select View Value in Data Viewer.
You will then be redirected to this detailed data viewer interface, where you can explore, filter, and analyze your data frame in a more user-friendly and interactive manner.
To enhance the developer experience even more, since you are already testing your AWS Glue jobs locally, this setup also enables you to easily perform unit testing on your ETL pipelines.
By running your pipelines in a local environment, you can thoroughly test individual components and transformations, ensuring they work as expected before deploying to the cloud.
This approach not only helps in catching errors early but also improves the overall reliability of your data workflows. Incorporating unit tests into your development process will lead to more maintainable and higher-quality ETL pipelines.
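As a tiny preview of what that might look like, here is a minimal pytest sketch that checks a single transformation in isolation; the transformation and test names are hypothetical, and the fixture assumes a local SparkSession can be created inside the container:

# Hypothetical example: unit testing one transformation with pytest
import pytest
from pyspark.sql import SparkSession

def keep_adults(df):
    # Transformation under test: keep rows where age >= 18
    return df.filter(df.age >= 18)

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("glue-unit-tests").getOrCreate()

def test_keep_adults(spark):
    df = spark.createDataFrame([("Ana", 17), ("Ben", 30)], ["name", "age"])
    result = keep_adults(df).collect()
    assert [row.name for row in result] == ["Ben"]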
This blog post is getting quite lengthy, so I will discuss unit testing your AWS Glue jobs in a future blog post. Stay tuned!
Running AWS Glue jobs locally using VS Code provides significant advantages for data engineering workflows. By leveraging Docker containers to replicate the AWS Glue environment, you can optimize costs, accelerate your feedback loop, and enhance debugging capabilities.
The use of familiar tools and IDEs, combined with powerful extensions like Data Wrangler, allows for a more efficient and effective development process. This approach not only improves productivity but also ensures a seamless transition from local development to cloud deployment.
By following the steps outlined in this guide, you will be well-equipped to develop, test, and debug AWS Glue jobs locally, making your data engineering tasks more manageable and streamlined.