
AWS, Azure, GCP: Object Storage Services

Petteri Kivimäki
Published in Level Up Coding
14 min read · Jan 6, 2020


The aim of this article is to provide a high-level summary of the object storage services of the three biggest cloud providers: Amazon Web Services (AWS), Microsoft Azure and Google Cloud Platform (GCP). The aim is not to rate the platforms, rank them or recommend one over the others. Different features are discussed at a high level, and links to more detailed information are provided.

The table below contains an overview of the object storage services of AWS, Azure and GCP. All the platforms provide more or less the same features, but the way they have been designed and implemented varies between the platforms. Later in the article, the different features are discussed in more detail.

Image 1. An overview of the object storage services of AWS, Azure and GCP

Object Storage Service

All three platforms provide globally accessible, fully managed object storage services with nearly unlimited storage capacity and no administrative overhead. The services can be used to store binary objects and unstructured data inside a named unit of storage. The data is highly available and durable, and it is replicated based on the selected redundancy level. All the platforms support multiple redundancy levels and access tiers, encryption in transit and at rest, object lifecycle management, retention policies, static website hosting and more.

The user experience of these object storage services has many similarities to filesystems, and all three platforms provide filesystem-like APIs. However, it must be noted that an object storage service is not a filesystem. Inside a storage unit, objects are stored as key-value pairs, and each object has a key, data and metadata. The key, or key name, must be unique within a storage unit. The metadata contains information about the object and may include both system-generated and custom metadata. Object keys often look like paths (e.g. “/dir/file.txt”) even though the underlying storage structure is flat and there is no real hierarchy of subfolders.

AWS

Simple Storage Service (S3) is Amazon's distributed object storage service. In S3 the unit of deployment is called a bucket and the stored items are called objects. The amount of data that can be stored in a single bucket is unlimited, but the maximum size of a single object is 5 TB. Buckets are identified with a globally unique key, and object names must be unique within a bucket.

Data stored in an S3 bucket is redundant at a regional level. A region is a single geographic area, e.g. Ireland, that consists of multiple isolated and physically separate data centers. The region is selected when a new bucket is created, and the data stored in the bucket never leaves the selected region unless it is explicitly transferred to another region. The region of an existing bucket cannot be changed. If objects must be replicated to multiple regions, cross-region replication can be used to copy objects across buckets in different regions.
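For illustration, here's a minimal sketch using the boto3 SDK for Python that creates a bucket in a fixed region and uploads an object (the bucket name and region are hypothetical placeholders, and locally configured AWS credentials are assumed):

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# The region is fixed at creation time and cannot be changed afterwards.
s3.create_bucket(
    Bucket="my-example-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Object keys may look like paths even though the namespace is flat.
s3.put_object(Bucket="my-example-bucket", Key="dir/file.txt", Body=b"Hello S3")
```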

S3 provides strong read-after-write consistency for creating new objects and related metadata. This means that the next read operation after a write operation returns the newly created object. However, if a read request to the key name has been made before creating the object, S3 provides eventual read-after-write consistency. This means that a subsequent read might or might not return the newly created object. For update and delete operations S3 provides eventual consistency. This means that a subsequent read might return the old data or the updated data. Nevertheless, corrupted or partial data is never returned.

An object's availability and pricing model are set through the access tier. In S3 the property that defines the access tier is called the storage class. The default storage class is fixed by S3 and cannot be changed at the bucket level. Instead, the storage class can be set at the object level when a new object is created, and the storage class of an existing object can be changed at any point during the object's lifecycle, either manually or automatically through lifecycle management.

The supported storage classes are STANDARD, STANDARD_IA, INTELLIGENT_TIERING, ONEZONE_IA, GLACIER and DEEP_ARCHIVE. STANDARD is a general-purpose storage class for frequently accessed data, and it's also the S3 default. It has the lowest access cost, but a higher storage cost than the other tiers. STANDARD_IA and ONEZONE_IA are for long-lived, less frequently accessed data, and they have a minimum storage duration of 30 days. They have lower storage costs and higher access costs compared to STANDARD storage. GLACIER and DEEP_ARCHIVE are for long-term archiving, and they have minimum storage durations of 90 (GLACIER) and 180 (DEEP_ARCHIVE) days. They have the lowest storage costs, but the highest data retrieval costs. Data retrieval times from GLACIER range from minutes to hours, and the default data retrieval time from DEEP_ARCHIVE is 12 hours. INTELLIGENT_TIERING optimizes costs by automatically moving data to the most cost-effective access tier. It does this by monitoring access patterns and moving objects between a frequent access tier and a lower-cost infrequent access tier.
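As a sketch of how storage classes are set in practice, the boto3 snippet below uploads an object directly into STANDARD_IA and then changes its storage class by copying the object onto itself (the bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Create a new object directly in the infrequent-access storage class.
s3.put_object(
    Bucket="my-example-bucket",
    Key="logs/2019.log",
    Body=b"...",
    StorageClass="STANDARD_IA",
)

# Change the storage class of an existing object with an in-place copy.
s3.copy_object(
    Bucket="my-example-bucket",
    Key="logs/2019.log",
    CopySource={"Bucket": "my-example-bucket", "Key": "logs/2019.log"},
    StorageClass="GLACIER",
)
```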

Azure

In Azure, multiple data storage services, including object storage, are organized under a storage account that provides a unique namespace for the data. There are several storage account types that support different features and have their own pricing models. The account type defines the supported storage services, redundancy levels, access tiers and encryption alternatives. Here I'm going to concentrate on the storage account type that is recommended for most scenarios using Azure storage: general-purpose v2 accounts.

Azure's object storage service is Blob storage, and it's one of the services grouped under a storage account. The unit of deployment is called a container, and binary objects are called blobs, which come in three types: block blobs, append blobs and page blobs. Block blobs are for storing binary data and text. Append blobs are like block blobs but are optimized for append operations. Page blobs serve as disks for Azure virtual machines. The maximum amount of data that can be stored in a single container is 2 PB, but the limit is applied at the storage account level — if a storage account contains multiple containers and/or other storage services, the sum of their sizes cannot exceed 2 PB. The maximum size of a single blob varies by blob type, from 195 GB (append blobs) up to 8 TB (page blobs). Containers are identified with an account-level unique key, and blob names must be unique within a container. Blob storage provides strong read-after-write consistency — when an object is changed or deleted, subsequent operations always return the latest version of the object.
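For illustration, a minimal sketch using the azure-storage-blob v12 SDK for Python that creates a container and uploads a block blob (the connection string and names are hypothetical placeholders):

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Containers live under the storage account's unique namespace.
container = service.create_container("my-container")

# upload_blob creates a block blob by default.
container.upload_blob(name="dir/file.txt", data=b"Hello Blob storage")
```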

Data redundancy is defined at the storage account level, which means that all objects stored in containers under the same storage account have the same redundancy level. The replication strategy is selected when a new storage account is created, and it is possible to change the replication strategy of an existing storage account without any downtime. The supported redundancy levels are locally-redundant (LRS), zone-redundant (ZRS), geo-redundant (GRS), read-access geo-redundant (RA-GRS), geo-zone-redundant (GZRS) and read-access geo-zone-redundant (RA-GZRS). LRS replicates data three times within a data center, ZRS replicates data across three zonal data centers within a region, GRS and RA-GRS replicate data across two regions, and GZRS and RA-GZRS replicate data across three zones within a region and across two regions. When GRS and GZRS are used, the data in the secondary region is available to be read only if Microsoft initiates a failover from the primary region to the secondary region. When RA-GRS and RA-GZRS are used, the data in the secondary region is available to be read at all times.

An object's availability and pricing model are set through the access tier and redundancy level. A default access tier can be defined at the storage account level, and it's applied to all blobs that don't have a tier explicitly set at the object level. When the access tier is set at the blob level, the default tier does not apply. The default access tier can be changed at any time.

Blob storage supports three access tiers: hot, cool and archive. Hot is for frequently accessed data, and it has the lowest access cost but a higher storage cost than the other tiers. Cool is for infrequently accessed data that is stored for at least 30 days, and it has lower storage costs and higher access costs compared to hot storage. Compared to the hot tier, data in the cool tier also has slightly lower availability. Archive is for rarely accessed data that is stored for at least 180 days, and it has the lowest storage cost but the highest data retrieval costs. Data in the archive tier is stored offline, and therefore it can take several hours to retrieve. If a blob (cool or archive) is deleted or moved to another tier before it has reached the minimum storage duration, an early deletion fee is charged.
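For example, an individual blob can be moved between tiers with the v12 SDK; a minimal sketch (the names are hypothetical):

```python
from azure.storage.blob import BlobServiceClient, StandardBlobTier

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="my-container", blob="dir/file.txt")

# Move the blob to the cool tier; moving a blob out of the archive tier
# works the same way, but the rehydration can take hours.
blob.set_standard_blob_tier(StandardBlobTier.COOL)
```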

GCP

Google Cloud Storage is GCP's distributed object storage service. In Cloud Storage the unit of deployment is called a bucket and the stored items are called objects. The amount of data that can be stored in a single bucket is unlimited, but the maximum size of a single object is 5 TB. Buckets are identified with a globally unique key, and object names must be unique within a bucket.

Data redundancy is defined at the bucket level, which means that all the objects in the same bucket have the same redundancy level. The supported redundancy levels are regional, dual-regional and multi-regional. A region is a single geographic area, e.g. London, that consists of multiple isolated and physically separate data centers. A dual-region is a pair of regions, e.g. Finland and the Netherlands, and a multi-region is a large geographic area, e.g. the European Union, containing at least two regions. The location of a bucket is selected when the bucket is created, and it cannot be changed afterwards. When data is uploaded to a bucket, it is always replicated to multiple data centers inside the selected location.
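For illustration, a minimal sketch using the google-cloud-storage library for Python that creates a bucket in the EU multi-region and uploads an object (the bucket name is a hypothetical placeholder, and application default credentials are assumed):

```python
from google.cloud import storage

client = storage.Client()

# The location is fixed at creation time and cannot be changed afterwards.
bucket = storage.Bucket(client, "my-example-bucket")
bucket.location = "EU"  # a multi-region
bucket.create()

# Object keys may look like paths even though the namespace is flat.
bucket.blob("dir/file.txt").upload_from_string("Hello Cloud Storage")
```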

Cloud Storage provides strong consistency for operations that write and delete objects. In practice this means that read operations after a successful write or delete operation always return the latest version of the object and its metadata. In contrast, operations related to access rights management are eventually consistent: after a successful update, subsequent read operations may return stale data, and it may take some time for the change to take effect.

An object's availability and pricing model are set through the access tier. In addition, the availability is affected by the redundancy level of the bucket where the object is stored. In Cloud Storage the property that defines the access tier is called the storage class. It's possible to define a default storage class for a bucket, which is inherited by individual objects unless explicitly set otherwise. The default storage class of a bucket can be changed at any time, but the change does not affect objects that already exist in the bucket. The storage class of an existing object can be changed at any time, either manually or automatically through lifecycle management.

The supported storage classes are standard, nearline and coldline. Standard storage is best suited for frequently accessed data and/or data that is stored for only a short period of time. The standard storage class has no minimum storage period, and it provides the lowest data access cost and the highest at-rest storage cost. Nearline storage is for infrequently accessed data, and it has a minimum storage duration of 30 days. It is ideal for data that is accessed on average once per month or less. Compared to standard storage, at-rest storage costs are lower, but data access costs are higher and availability is slightly lower. Coldline storage is for data archiving, and it has a minimum storage duration of 90 days. Despite the data being cold, it is available within milliseconds. Coldline storage is ideal for data that is accessed on average once per year or less. Compared to standard and nearline storage, at-rest storage costs are low, but data access and per-operation costs are higher, and the availability is slightly lower. If an object (nearline or coldline) is deleted before it has reached the minimum storage duration, an early deletion charge applies.
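As a sketch, the snippet below changes a bucket's default storage class and rewrites a single object into another class (the names are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")

# Change the bucket's default storage class; existing objects keep theirs.
bucket.storage_class = "NEARLINE"
bucket.patch()

# Rewrite an individual object into the coldline storage class.
blob = bucket.blob("dir/file.txt")
blob.update_storage_class("COLDLINE")
```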

Lifecycle Management, Retention Policies and Versioning

AWS

Lifecycle configuration rules are defined at the bucket level, and they may apply to all or a subset of objects in the bucket. A lifecycle configuration consists of conditions, actions and filters — the conditions and filters define the group of objects to which the actions are applied. A filter defines whether a rule applies to all or a subset of objects in a bucket. In addition, each action in a lifecycle rule has an age condition. The action is applied to all objects that meet the age condition and match the filter criteria. Therefore, a single lifecycle configuration rule may define multiple actions that are applied to objects that meet the same filter criteria but have different ages. The supported action types are transition actions and expiration actions, and if versioning is enabled in the bucket, they can be applied to both current and noncurrent objects. Transition actions are used to transition objects from one storage class to another, and expiration actions are used to delete expired objects. There are some constraints on transitioning objects between storage classes, but in general, transitioning objects to a less frequently accessed storage tier is supported.
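For illustration, a boto3 sketch that defines one lifecycle rule with a prefix filter and several age-based actions (the names and ages are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Objects under "logs/" transition to STANDARD_IA after 30 days, to
# GLACIER after 90 days, and expire after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```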

Object versioning can be used to protect objects from unintended overwrites and deletes by keeping multiple versions of an object in a bucket. Versioning is enabled at the bucket level and is applied to all the objects in the bucket. When versioning is enabled, a noncurrent version of an object is created every time the latest version is overwritten or deleted. Different versions of an object are identified by a version ID. An object version can be permanently deleted by specifying the version ID in the delete request. When no version ID is specified in the delete request, the current object becomes a noncurrent object, and a delete marker becomes the new current object. Noncurrent object versions can be permanently deleted manually or automatically using lifecycle management.
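A minimal boto3 sketch of the versioning behaviour described above (the names and version ID are hypothetical placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning for the whole bucket.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Without a version ID, the delete only inserts a delete marker.
s3.delete_object(Bucket="my-example-bucket", Key="dir/file.txt")

# With a version ID, that specific version is permanently deleted.
s3.delete_object(Bucket="my-example-bucket", Key="dir/file.txt",
                 VersionId="<version-id>")
```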

Object lock can be used to prevent objects from being overwritten or deleted for a fixed time period or permanently. Object lock is enabled at the bucket level, and it requires that object versioning is enabled in the bucket — an object lock applies to an individual object version, not to all versions of the object. Enabling object lock is only possible for new buckets, and once a bucket is created with object lock enabled, it's not possible to disable the lock or suspend versioning for the bucket. If a default retention period is configured for a bucket, it automatically applies to all new objects placed in the bucket. Object lock supports two types of locks: retention periods and legal holds. A retention period locks an object version for a fixed period of time, and a legal hold locks an object version until the hold is explicitly removed. An object version can have both locks, only one of them or neither. Object lock supports two retention modes: governance mode and compliance mode. In governance mode, users with special permissions can alter the lock settings and update/delete object versions. In compliance mode, object versions cannot be updated/deleted, the retention mode cannot be changed and the retention period cannot be shortened. In addition, a bucket that contains object versions protected by an object lock cannot be deleted.
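As a sketch, object lock is turned on at bucket creation time; the boto3 snippet below also sets a default compliance-mode retention period (the names and period are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Object lock must be enabled when the bucket is created.
s3.create_bucket(Bucket="my-locked-bucket", ObjectLockEnabledForBucket=True)

# A default retention period that applies to all new object versions.
s3.put_object_lock_configuration(
    Bucket="my-locked-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```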

Azure

Lifecycle management rules are defined at the storage account level and they may apply to containers and/or a subset of blobs. A lifecycle management rule consists of an action set and a filter set. An action set consists of action-condition pairs and the only supported condition is the age of a blob. Supported actions are transitioning blobs to a cooler storage tier and deleting blobs. Filters can be used to target the rule to specific containers and/or a subset of blobs. A lifecycle management rule may contain multiple action-condition pairs, and each action in the rule is applied to all containers and/or blobs that match the conditions. In other words, the same lifecycle management rule may apply different actions to different containers and/or blobs depending on their age.
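Azure lifecycle management rules are expressed as a JSON policy document; for illustration, the sketch below shows the same structure as a Python dict that could be applied through the Azure portal, CLI or management API (the container and prefix names are hypothetical):

```python
# One rule: block blobs under "my-container/logs/" move to cool after
# 30 days, to archive after 90 days, and are deleted after 365 days.
lifecycle_policy = {
    "rules": [
        {
            "name": "archive-logs",
            "enabled": True,
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    "prefixMatch": ["my-container/logs/"],
                },
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 90},
                        "delete": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}
```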

Immutable storage for Blob storage provides retention policies at the container level, and a policy applies to all existing and future blobs in the container. A retention policy defines a minimum time that blobs must stay in the container — blobs cannot be deleted or updated during the retention period. However, the storage class can be changed despite the retention period. In addition, immutable storage supports legal holds that can be used to store immutable data until the legal hold is cleared, in case the retention interval is unknown. Once a blob's age is greater than the retention period and/or a legal hold is cleared, the blob can be updated or deleted. Deleting a storage account or a container is not permitted if they contain blobs that are protected by an immutable policy. A time-based retention policy can be locked, which means that it cannot be removed and a maximum of five increases to the effective retention period are allowed.

Blob storage does not directly support versioning of blobs, but a similar outcome can be achieved by creating a blob snapshot. A snapshot is a read-only version of a blob at a particular moment in time. A snapshot can be read, copied and deleted, but not modified.
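A minimal v12 SDK sketch of creating a snapshot (the names are hypothetical):

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="my-container", blob="dir/file.txt")

# Create a read-only, point-in-time snapshot of the blob.
snapshot = blob.create_snapshot()
print(snapshot["snapshot"])  # the timestamp that identifies the snapshot
```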

GCP

Lifecycle management rules are assigned to a bucket, and they may apply to all or a subset of objects in the bucket. A lifecycle management rule consists of a set of conditions and an action — the action is applied to objects that match all the conditions defined in the rule. Lifecycle rules support two types of actions: delete and set storage class. The delete action is used to delete objects, and the set storage class action is used to change the storage class of objects. If a single object is subject to multiple actions, only one of the actions is performed, and the object is re-evaluated before any other actions are taken. The supported conditions are an object's age, creation date, version and current storage class. All conditions are optional, but at least one condition is required.
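For illustration, the google-cloud-storage library can attach lifecycle rules directly to a bucket; a minimal sketch (the names and ages are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")

# Move objects to nearline after 30 days and delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()
```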

A retention policy is defined at the bucket level, and it applies to all existing and future objects in the bucket. A retention policy defines a minimum time that objects must stay in the bucket — objects cannot be deleted or updated during the retention period. However, the storage class can be changed despite the retention period. Once an object's age is greater than the retention period, the object can be updated or deleted. If the object is updated, it becomes subject to the retention period again and cannot be deleted until its age once again exceeds the retention period. A retention policy can be locked to permanently set it on a bucket. This means that the policy cannot be removed, and removing the bucket is possible only if every object in the bucket meets the requirements of the retention period.
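A minimal sketch of setting a retention policy (the bucket name and period are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")

# A 90-day retention period for all current and future objects.
bucket.retention_period = 90 * 24 * 60 * 60  # value in seconds
bucket.patch()

# Locking is permanent: the policy can no longer be removed or reduced.
# bucket.lock_retention_policy()
```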

Object holds are another way to prevent objects from being deleted. Holds are metadata flags placed at the object level — an object with a hold placed on it cannot be deleted. Cloud Storage supports two types of holds: event-based and temporary holds. Both types behave in the same way when a bucket does not have a retention policy. When a bucket does have a retention policy, the difference between the two types is that releasing an event-based hold resets the object's time in the bucket, whereas releasing a temporary hold does not.
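For example, the holds can be toggled as object metadata flags (the names are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")
blob = bucket.get_blob("dir/file.txt")

# While either hold is set, the object cannot be deleted.
blob.event_based_hold = True  # resets the retention clock when released
blob.temporary_hold = True    # does not affect the retention clock
blob.patch()
```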

Object versioning can be used to protect objects from being overwritten or deleted accidentally. When versioning is enabled for a bucket, a noncurrent version of an object is created every time the latest version is overwritten or deleted. Noncurrent versions are identified by a generation number. An object version can be permanently deleted by specifying the generation number in the delete request. Older object versions can be permanently deleted manually or automatically using lifecycle management.
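A minimal sketch of enabling versioning and deleting one specific generation (the bucket name, key and generation number are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")

# Enable object versioning for the bucket.
bucket.versioning_enabled = True
bucket.patch()

# Permanently delete one specific, noncurrent generation of an object.
blob = storage.Blob("dir/file.txt", bucket, generation=1579000000000000)
blob.delete()
```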

Encryption

All three platforms provide data encryption in transit and at rest. By default, data at rest is encrypted using platform-managed encryption keys. If more control over encryption is needed, all three platforms provide additional alternatives for data encryption and key management.

AWS
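S3 supports server-side encryption with S3-managed keys (SSE-S3), AWS KMS managed keys (SSE-KMS) and customer-provided keys (SSE-C), as well as client-side encryption. As a sketch, SSE-KMS can be requested per object (the names and key ID are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Server-side encryption with a customer-managed AWS KMS key (SSE-KMS).
s3.put_object(
    Bucket="my-example-bucket",
    Key="secret/report.pdf",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="<kms-key-id>",
)
```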

Azure
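Azure Storage always encrypts data at rest with Microsoft-managed keys, and it additionally supports customer-managed keys stored in Azure Key Vault as well as customer-provided keys supplied on individual requests. A sketch of the latter with the v12 SDK (all key material and names are hypothetical placeholders):

```python
from azure.storage.blob import BlobServiceClient, CustomerProvidedEncryptionKey

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="my-container", blob="secret.bin")

# key_value is a base64-encoded AES-256 key and key_hash its
# base64-encoded SHA-256 digest; both are placeholders here.
cpk = CustomerProvidedEncryptionKey(
    key_value="<base64-key>", key_hash="<base64-sha256>"
)
blob.upload_blob(b"...", cpk=cpk)
```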

GCP
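Cloud Storage encrypts data at rest with Google-managed keys by default, and it also supports customer-managed keys in Cloud KMS (CMEK) and customer-supplied keys (CSEK) that Google never stores. A minimal sketch (the key bytes and names are hypothetical):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-example-bucket")

# Customer-supplied encryption key (CSEK): a raw 32-byte AES-256 key.
blob = bucket.blob("secret.bin", encryption_key=b"\x00" * 32)
blob.upload_from_string(b"...")

# Alternatively, reference a Cloud KMS key (CMEK) instead of raw key bytes:
# bucket.blob("secret.bin", kms_key_name="projects/.../cryptoKeys/my-key")
```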

Static Website Hosting

Object storage services can also be used to host static websites consisting of static content, such as HTML pages, CSS stylesheets, JavaScript files and images. However, server-side scripting, such as PHP, Python or JSP, is not supported. All three platforms provide other services for hosting dynamic websites, but those are out of the scope of this article.
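For illustration, enabling website hosting on an S3 bucket with boto3 (Azure and GCP expose equivalent settings; the names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Serve index.html as the default document and error.html for errors.
s3.put_bucket_website(
    Bucket="my-example-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)
```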
