How to use a Lambda function to extract a ZIP file stored in an S3 bucket

Simplifying website or other file updates using AWS Lambda and S3 to automatically extract ZIP files.

Mert Alnuaimi
Level Up Coding



As part of the infrastructure in a recent project, we had a frontend portal website written in ReactJS that is built and zipped by a pipeline and pushed to an artifact repository. Terraform then uploads the ZIP file to the S3 bucket where the website is hosted.

Once the ZIP file is in the S3 bucket, it has to be extracted in place and any changed files replaced with the new ones.

This is where the Lambda function comes in.

The Lambda function

import zipfile
import urllib.parse
from io import BytesIO
from mimetypes import guess_type

import boto3

s3 = boto3.client('s3')


def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    zip_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        # Get the ZIP file from S3
        obj = s3.get_object(Bucket=bucket, Key=zip_key)
        z = zipfile.ZipFile(BytesIO(obj['Body'].read()))

        # Extract and upload each file in the ZIP file
        for filename in z.namelist():
            # Skip directory entries
            if filename.endswith('/'):
                continue
            # Fall back to a generic content type when the extension is unknown
            content_type = guess_type(filename, strict=False)[0] or 'application/octet-stream'
            s3.upload_fileobj(
                Fileobj=z.open(filename),
                Bucket=bucket,
                Key=filename,
                ExtraArgs={'ContentType': content_type}
            )
    except Exception as e:
        print(f'Error getting object {zip_key} from bucket {bucket}.')
        raise e
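
For a quick local test, the handler can be invoked directly with a minimal, hypothetical S3 event payload (a real S3 notification contains more fields; note that the object key arrives URL-encoded, which is why unquote_plus is used above):

# Minimal, hypothetical test event with a URL-encoded object key
test_event = {
    'Records': [
        {
            's3': {
                'bucket': {'name': 'BUCKET_NAME'},
                'object': {'key': 'site%20build.zip'}  # decoded to 'site build.zip' by unquote_plus
            }
        }
    ]
}
lambda_handler(test_event, None)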

The Lambda function is triggered by an event, such as the upload of the ZIP file to the S3 bucket.

The following is a JSON representation of the S3 event notification configuration that invokes the Lambda function:

{
  "LambdaFunctionConfigurations": [
    {
      "Id": "ID",
      "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "Events": [
        "s3:ObjectCreated:*"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "suffix",
              "Value": ".zip"
            }
          ]
        }
      }
    }
  ]
}
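
If the bucket notification is not managed through Terraform or the console, a minimal boto3 sketch for attaching this configuration could look like the following (BUCKET_NAME and the ARNs are placeholders):

import boto3

s3 = boto3.client('s3')

# Attach the Lambda notification configuration to the bucket
s3.put_bucket_notification_configuration(
    Bucket='BUCKET_NAME',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'ID',
                'LambdaFunctionArn': 'arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [{'Name': 'suffix', 'Value': '.zip'}]
                    }
                }
            }
        ]
    }
)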

Once a new .zip file is uploaded to the bucket, the Lambda function will begin its execution.

The following resource-based policy allows the S3 bucket to invoke the Lambda function:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ToInvokeLambda",
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:FUNCTION_NAME",
      "Principal": {
        "Service": "s3.amazonaws.com"
      },
      "Condition": {
        "ArnLike": {
          "AWS:SourceArn": "arn:aws:s3:::BUCKET_NAME"
        }
      }
    }
  ]
}
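
This permission can also be created with Terraform (aws_lambda_permission) or the AWS CLI; a hedged boto3 equivalent would be along these lines (the ARNs and statement ID are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# Allow the S3 bucket to invoke the function
lambda_client.add_permission(
    FunctionName='FUNCTION_NAME',
    StatementId='AllowS3ToInvokeLambda',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::BUCKET_NAME'
)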

Once the Lambda function is invoked, it starts by extracting the necessary information from the event object. It then uses the boto3 library to retrieve the ZIP file from the S3 bucket, and the ZIP file is loaded into memory using the zipfile.ZipFile class.

The function then iterates over the files in the ZIP archive using the namelist method. For each file, it uses the guess_type function from the mimetypes library to determine the content type, which is passed to upload_fileobj through the ExtraArgs parameter as ContentType.

This is important because the content type set on upload determines how the browser handles the file when it is requested. A common problem is that, without the correct content type, visiting a page such as index.html causes the browser to download it as a file instead of rendering the webpage.
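
As a quick illustration of what guess_type returns (the exact strings can vary slightly between platforms, so treat these values as examples):

from mimetypes import guess_type

print(guess_type('index.html', strict=False)[0])       # e.g. 'text/html'
print(guess_type('styles.css', strict=False)[0])       # e.g. 'text/css'
print(guess_type('video.mp4', strict=False)[0])        # e.g. 'video/mp4'
print(guess_type('file.unknownext', strict=False)[0])  # None, hence the fallback in the handler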

As the Lambda function iterates through the files in the ZIP archive, it uploads each one to the S3 bucket with the s3.upload_fileobj method, overwriting any existing object with the same key and effectively updating the website or other files hosted on the bucket.

For the Lambda function to interact with the S3 bucket, its execution role will need a policy similar to the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME",
        "arn:aws:s3:::BUCKET_NAME/*"
      ]
    }
  ]
}
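
If the policy is not already managed by Terraform, one way to attach it is as an inline policy on the function's execution role; a sketch assuming a role named LAMBDA_ROLE_NAME (a placeholder):

import json

import boto3

iam = boto3.client('iam')

# Attach the S3 access policy as an inline policy on the Lambda execution role
iam.put_role_policy(
    RoleName='LAMBDA_ROLE_NAME',
    PolicyName='AllowS3ZipExtraction',
    PolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Action': ['s3:GetObject', 's3:GetObjectVersion', 's3:PutObject', 's3:ListBucket'],
                'Resource': ['arn:aws:s3:::BUCKET_NAME', 'arn:aws:s3:::BUCKET_NAME/*']
            }
        ]
    })
)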

It’s worth mentioning that the function also has error handling in place: a try-except block catches any exceptions raised while interacting with the S3 service and logs the error with print(), so you can trace it in CloudWatch Logs, for example, and fix it.
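
If you prefer proper log levels over print(), a minimal sketch using Python's standard logging module (which Lambda also forwards to CloudWatch Logs):

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        ...  # same extraction logic as above
    except Exception:
        # logger.exception records the message together with the stack trace
        logger.exception('Error extracting the ZIP file')
        raise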

Overall, this Lambda function provides an automated and efficient way to update your website or other files hosted on S3, saving time and resources and eliminating the need for manual intervention.

BONUS: Multipart Upload and Selective Extraction

The Lambda function above works well for smaller files inside the ZIP archive. If the archive contains large files, such as website assets that include large video files, it is better to use multipart upload for better performance.

Adding multipart upload to the Lambda function:

from boto3.s3.transfer import TransferConfig

# Multipart upload configuration for large files
transfer_config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,  # use multipart upload for files larger than 25 MB
    max_concurrency=10,
    use_threads=True,
)

The above config uses multipart_threshold to detect files larger than 25 MB and switches to multipart upload, dividing the file into smaller parts that are then uploaded in parallel. The max_concurrency setting specifies the number of parts that can be uploaded concurrently.
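
The config only takes effect when it is passed to the upload call; a minimal sketch, reusing the upload loop from the handler above:

# Inside the for loop of the handler, pass the transfer config to upload_fileobj
s3.upload_fileobj(
    Fileobj=z.open(filename),
    Bucket=bucket,
    Key=filename,
    ExtraArgs={'ContentType': content_type},
    Config=transfer_config  # files above the threshold are uploaded in parts
)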

While multipart upload greatly improves performance for larger files, we might also want to extract only specific files from the ZIP archive, for example just the .mp4 assets, and upload only those to the S3 bucket.

Note that S3 Select is not an option here: it can only query the contents of CSV, JSON, and Parquet objects, not the members of a ZIP archive. The ZIP object still has to be downloaded, but the selection can be done inside the Lambda function by filtering the archive's file list:

# Extract and upload only the .mp4 files from the archive
for filename in z.namelist():
    if not filename.endswith('.mp4'):
        continue
    s3.upload_fileobj(
        Fileobj=z.open(filename),
        Bucket=bucket,
        Key=filename,
        ExtraArgs={'ContentType': guess_type(filename, strict=False)[0] or 'video/mp4'},
        Config=transfer_config
    )

In summary, the Lambda function filters the file list of the ZIP archive stored in the S3 bucket, extracts only the .mp4 files, and uploads them using the multipart configuration, which allows for more efficient, parallel uploads of large files. The multipart upload process kicks in for any file larger than 25 MB.

Thanks for reading!
