Ceph — What is Multipart upload, OMAP, and Resharding?

Avi Mor · Level Up Coding · 5 min read · Mar 21, 2020

What is Multipart?

Generally, with multipart you can upload large object files in parts. A multipart upload consists of three steps, as specified below.

The benefits of multipart uploads are the following:

  • Pause and resume the upload, re-sending only the necessary parts;
  • Improved performance through better throughput (parts can be uploaded in parallel);
  • Start uploading an object while it is still being created.

The three steps (a command sketch follows this list):

  • Multipart Upload Initiation: when a request to upload an object arrives, the first thing you get back is the Upload ID. This is a unique identifier for your upload.
  • Parts Upload: besides the Upload ID, each part needs a Part ID (part number); every part upload carries both an Upload ID and a Part ID. Please note that if you upload a part with an existing Part ID, that part is overwritten.
  • Multipart Upload Completion or Abort: to complete the multipart process, we need to finish uploading all the parts. Only once every part has been acknowledged can we mark the upload as completed. Please note that if the upload is interrupted and never completed or aborted, the leftover parts linger indefinitely, unless a lifecycle rule cleans them up or you run the multipart upload again.
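
Here is a minimal sketch of the three steps using aws s3api commands. The bucket name (test), the object key (testfile), and the endpoint URL are assumptions taken from the examples later in this post; the Upload ID and ETag are placeholders you would copy from the previous responses.

# 1. Initiation: returns an UploadId for this upload.
aws s3api create-multipart-upload --bucket test --key testfile --endpoint-url http://localhost:8080

# 2. Parts upload: each part carries the UploadId and a PartNumber.
aws s3api upload-part --bucket test --key testfile --part-number 1 --body part1.bin --upload-id "<UploadId>" --endpoint-url http://localhost:8080

# 3a. Completion: assemble the acknowledged parts into one object.
aws s3api complete-multipart-upload --bucket test --key testfile --upload-id "<UploadId>" --multipart-upload '{"Parts": [{"ETag": "<ETag of part 1>", "PartNumber": 1}]}' --endpoint-url http://localhost:8080

# 3b. Or abort, which removes the parts uploaded so far.
aws s3api abort-multipart-upload --bucket test --key testfile --upload-id "<UploadId>" --endpoint-url http://localhost:8080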

Let’s practice:

First, you can use the Ceph-Nano project to create a Ceph cluster. You can find more details about it here: https://github.com/ceph/cn.

Once the cluster is working and ready, follow these steps to understand what multipart looks like in the Ceph pool.

Let’s have a look at how an object looks in the pool after it gets uploaded. Start by uploading an object:

aws s3 cp avi_test s3://test --endpoint-url http://localhost:8080

Then, let’s look at the pool called default.rgw.buckets.data with the command:

rados ls -p default.rgw.buckets.data

By default, this pool contains all the data users have uploaded to the cluster via RGW. The output should be something like:

cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1_avi_test

Please note that cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1 is the bucket marker.
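
As a quick cross-check (a sketch; the bucket name test is an assumption based on the upload above), the same marker shows up in the bucket stats:

radosgw-admin bucket stats --bucket=test | grep marker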

To trigger a multipart upload, we can use the awscli tool. Keep in mind that the multipart threshold and chunk size are set on the client side (a sketch for adjusting them follows the file-creation commands below). First we need a file large enough to cross that threshold, so let’s create one with the dd command:

dd if=/dev/zero of=testfile bs=1024 count=10240

You can also use the truncate command:

truncate -s 20M text.txt
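
Optionally, the client-side multipart threshold and chunk size can be lowered so that even small test files are split into parts. A minimal sketch using the standard awscli s3 settings (the 5MB values are arbitrary assumptions):

aws configure set default.s3.multipart_threshold 5MB
aws configure set default.s3.multipart_chunksize 5MB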

Let’s upload this file. The awscli tool performs multipart uploads automatically once the file size crosses the multipart threshold.

aws s3 cp testfile s3://test --endpoint-url http://localhost:8080

Run rados ls -p default.rgw.buckets.data again. Please note that this time, the object name looks like:

cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1__multipart_testfile.2~Ebr8ghHTE-SGwufDxS_GaDp6WnZ9AA7.1

As I mentioned earlier, we have an Upload ID and a Part ID:

Upload ID: 2~Ebr8ghHTE-SGwufDxS_GaDp6WnZ9AA7 | Part ID: 1 (the .1 at the end of the line)

Also, if we run the command:

radosgw-admin bucket stats | grep "bucket name" -A 15

We will see that the bucket id is unchanged:

"id": "cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1"

When multipart is used, objects are also written to a pool called default.rgw.buckets.non-ec. This pool stays empty when multipart is not in use; only while a multipart upload is running will we see objects in it.
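
While a multipart upload is in progress, this can be checked with the same rados command as before (a quick sketch):

rados ls -p default.rgw.buckets.non-ec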

To see the status of all in-progress multipart uploads, run the command:

aws s3api list-multipart-uploads --bucket my-bucket
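
To inspect the parts of one specific in-progress upload, there is also list-parts. A sketch, where the bucket, key, and upload ID are placeholders you would take from the previous output:

aws s3api list-parts --bucket my-bucket --key testfile --upload-id "<UploadId>" --endpoint-url http://localhost:8080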

The GC (Garbage Collector):

When users delete objects, or overwrite them by uploading files with the same name (this also happens when a multipart upload is redone), the old data is not removed right away; Ceph hands it to something called the GC.

Ceph does not remove the data immediately. We can list all the objects scheduled for removal with these commands:

radosgw-admin gc list

And:

radosgw-admin gc list --include-all

Specify --include-all to list all entries, including unexpired ones.

By default, Ceph waits at least 2 hours before a deleted object becomes eligible for garbage collection. To run the GC deletion process manually, use:

radosgw-admin gc process --include-all
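
These intervals can be tuned in /etc/ceph/ceph.conf under the rgw section; the values below are, to my understanding, the upstream defaults and are shown only as a sketch:

rgw_gc_obj_min_wait = 7200        # seconds a deleted object must wait before it may be collected
rgw_gc_processor_period = 3600    # seconds between GC processing cycles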

Ceph index and resharding

What is OMAP?

The bucket index object’s OMAP contains a key-value entry for every object in the bucket.

The important pool here is default.rgw.buckets.index. Basically, it holds the bucket indexes. Let’s run the command:

rados ls -p default.rgw.buckets.index

We will see the bucket ID, beginning with ".dir.":

.dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1

To list the objects inside this bucket, run the following command:

rados listomapkeys -p default.rgw.buckets.index .dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1

Output:

avi_test

There are situations where the OMAP grows bigger and bigger, for example 10 million objects in one OMAP, and this will likely hurt performance: every write request has to read the index and find the right place where the object’s entry should be written.

Because of that, we can use the resharding option: it takes a large OMAP and cuts it into pieces, so one bucket ends up with several OMAPs (shards). The best practice is about 100,000 objects per shard.
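
To get a feel for how large an index OMAP already is, you can count its keys. A sketch reusing the listomapkeys command from above, with the bucket marker from this post’s examples:

rados listomapkeys -p default.rgw.buckets.index .dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14119.1 | wc -l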

To check the status of every bucket, we can use the command:

radosgw-admin bucket limit check | grep "OVER 100.000000%"

This command finds the buckets whose index shards are over the limit, i.e. the buckets with large OMAPs.

Partial output:

"fill_status": "OVER 100.000000%"

This means the bucket is over the limit and we need to run the resharding process.

Important: please note that while the resharding process is running, there is no I/O to this bucket, although for buckets that are not too big it is a rather quick process. My suggestion is to consult with support before every reshard.

To list all buckets, use the following command:

radosgw-admin bucket list

Once we have found the problematic buckets, we can use the command:

radosgw-admin bucket reshard --num-shards 2 --bucket=test

We will see that we now have two OMAPs:

rados ls -p default.rgw.buckets.index
.dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14270.1.0
.dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14270.1.1

Basically, some of the objects’ index entries are located in the first OMAP and the rest are in the second one.
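
This can be verified by counting the keys in each shard; a quick sketch using the shard names from the output above:

rados listomapkeys -p default.rgw.buckets.index .dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14270.1.0 | wc -l
rados listomapkeys -p default.rgw.buckets.index .dir.cdeb898f-18fb-4509-b886-5bd67c627abb.14270.1.1 | wc -l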

To change the maximum number of objects per shard, edit the following value (please consult with support beforehand):

rgw_max_objects_per_shard = 100000

What is RGW dynamic bucket index resharding?

Basically, in a large environment, running the resharding process by hand can be very discouraging. This feature helps with that: it detects the problematic buckets, runs the process in the background, and creates the required shards for each bucket.

How does it work?

The bucket is checked when a new object is added to it, and there is also an automatic background process that runs periodically. Once a bucket is found to require resharding, it is added to the reshard queue.

To list all the buckets currently in the reshard queue, type the command:

radosgw-admin reshard list

To check the status of the process:

radosgw-admin reshard status --bucket <bucket_name>
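
If you do not want to wait for the background thread, the entries in the reshard queue can also be processed manually; a sketch:

radosgw-admin reshard process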

To enable or disable dynamic bucket index resharding, make the change in /etc/ceph/ceph.conf, under the rgw section:

rgw_dynamic_resharding = true
