S3 API


Alluxio supports a RESTful API that is compatible with the basic operations of the Amazon S3 API.

The Alluxio S3 API should be used by applications that are designed to communicate with S3-like storage and that would benefit from the other features provided by Alluxio, such as data caching, data sharing with file system based applications, and storage system abstraction (e.g., using Ceph instead of S3 as the backing store). For example, a simple application that downloads reports generated by analytic tasks can use the S3 API instead of the more complex file system API.

Limitations and Disclaimers

Alluxio Filesystem Limitations

Only top-level Alluxio directories are treated as buckets by the S3 API.

  • Hence the root directory of the Alluxio filesystem is not treated as an S3 bucket. Any root-level objects (e.g. alluxio://file) are inaccessible through the Alluxio S3 API.
  • To treat a sub-directory as a bucket, the separator : must be used in the bucket name (e.g. s3://sub:directory:bucket/file).
    • Note that this is purely a convenience feature; such buckets are not returned by API actions such as ListBuckets.
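As an illustration, the mapping between an Alluxio directory path and the bucket name expected by the S3 API can be sketched as follows (the helper name is hypothetical, not part of any Alluxio client library):

```python
def alluxio_dir_to_s3_bucket(path):
    """Map an Alluxio directory path such as '/sub/directory/bucket'
    to the bucket name expected by the S3 API: 'sub:directory:bucket'."""
    return path.strip("/").replace("/", ":")
```

For example, the Alluxio directory /sub/directory/bucket is addressed as the bucket sub:directory:bucket.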

Alluxio uses / as a reserved separator. Therefore, any S3 paths with objects or folders named / (e.g. s3://example-bucket//) will cause undefined behavior. For additional limitations on object key names please check this page: Alluxio limitations
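A client-side guard against such keys can be sketched as follows (an illustrative helper, not part of the Alluxio API):

```python
def has_empty_path_component(key):
    """Return True if an object key contains an empty path component
    (a leading, trailing, or doubled '/'), which the Alluxio S3 API
    treats as undefined behavior."""
    return any(part == "" for part in key.split("/"))
```

Applications can reject such keys before issuing a request rather than relying on undefined server-side behavior.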

No Bucket Virtual Hosting

Virtual hosting of buckets is not supported in the Alluxio S3 API. Therefore, S3 clients must use path-style requests (i.e. http://s3.amazonaws.com/{bucket}/{object} and NOT http://{bucket}.s3.amazonaws.com/{object}).
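As a sketch, a path-style request URL places the bucket in the URL path rather than the hostname. The endpoint below matches the default proxy host, port, and path used in the Python examples later in this document; adjust it for your deployment:

```python
def path_style_url(bucket, key, endpoint="http://localhost:39999/api/v1/s3"):
    # The bucket appears in the URL path, never in the hostname:
    # virtual-hosted-style addressing is not supported.
    return "{}/{}/{}".format(endpoint, bucket, key)
```

Clients such as boto must therefore be configured for path-style (a.k.a. "ordinary") calling, as shown in the Python examples below.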

S3 Writes Implicitly Overwrite

As described in the AWS S3 docs for PutObject:

Amazon S3 is a distributed system. If it receives multiple write requests for the same object simultaneously, it overwrites all but the last object written. Amazon S3 does not provide object locking; if you need this, make sure to build it into your application layer or use versioning instead.

  • Note that the Alluxio S3 API does not currently support object versioning.

The Alluxio S3 API will overwrite any existing object at the key, as well as the temporary directory used for a multipart upload.

Folders in ListObjects(V2)

All sub-directories in Alluxio are returned by ListObjects(V2) as 0-byte folders. This matches the behavior of the AWS S3 console when it creates all parent folders for each object.

Tagging & Metadata Limits

User-defined tags on buckets & objects are limited to 10 and obey the S3 tag restrictions.

  • Set the property key alluxio.proxy.s3.tagging.restrictions.enabled=false to disable this behavior.

The maximum size for user-defined metadata in PUT-requests is 2KB by default in accordance with S3 object metadata restrictions.

  • Set the property key alluxio.proxy.s3.header.metadata.max.size to change this limit.
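Both settings above are applied through Alluxio's site properties, for example (the 4KB value is only an illustration):

```properties
# conf/alluxio-site.properties
# Disable the 10-tag limit and S3 tag restrictions
alluxio.proxy.s3.tagging.restrictions.enabled=false
# Raise the user-defined metadata limit from the 2KB default
alluxio.proxy.s3.header.metadata.max.size=4KB
```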

Performance Implications

The S3 API leverages the Alluxio REST proxy, introducing an additional network hop for Alluxio clients. For optimal performance, it is recommended to run the proxy server and an Alluxio worker on each compute node, and to put all the proxy servers behind a load balancer.

Global request headers

| Header | Content |
|---|---|
| Authorization | AWS4-HMAC-SHA256 Credential={user}/..., SignedHeaders=..., Signature=... |

There is currently no support for access & secret keys in the Alluxio S3 API. The only supported authentication scheme is the SIMPLE authentication type. By default, operations are performed as the user that launched the Alluxio proxy process.

Therefore, this header is used exclusively to specify the Alluxio ACL username to perform an operation as. In order to remain compatible with other S3 clients, the header is still expected to follow the AWS Signature Version 4 format.

When supplying an access key to an S3 client, put the intended Alluxio ACL username. The secret key is unused so you may use any dummy value.
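Concretely, the header only needs a well-formed Signature Version 4 shape; Alluxio reads just the username from the Credential scope, so the date, region, and signature below are dummy placeholders:

```python
user = "testuser"  # the Alluxio ACL username to act as
# Build an Authorization header in AWS SigV4 format; only the
# access-key portion of the Credential is read by Alluxio.
authorization = (
    "AWS4-HMAC-SHA256 "
    "Credential={user}/20240101/us-east-1/s3/aws4_request, "
    "SignedHeaders=host;x-amz-date, "
    "Signature={sig}"
).format(user=user, sig="0" * 64)
```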

Supported S3 API Actions

The following table describes the support status for current S3 API Actions:

| S3 API Action | Supported Headers | Supported Query Parameters |
|---|---|---|
| AbortMultipartUpload | None | N/A |
| CompleteMultipartUpload | None | N/A |
| CopyObject | Content-Type, x-amz-copy-source, x-amz-metadata-directive, x-amz-tagging-directive, x-amz-tagging | N/A |
| CreateBucket | None | N/A |
| CreateMultipartUpload | Content-Type, x-amz-tagging | N/A |
| DeleteBucket | None | N/A |
| DeleteBucketTagging | None | N/A |
| DeleteObject | None | N/A |
| DeleteObjects | None | N/A |
| DeleteObjectTagging | None | None |
| GetBucketTagging | None | N/A |
| GetObject | Range | None |
| GetObjectTagging | None | None |
| HeadBucket | None | None |
| HeadObject | None | None |
| ListBuckets | N/A | N/A |
| ListMultipartUploads | None | None |
| ListObjects | None | delimiter, encoding-type, marker, max-keys, prefix |
| ListObjectsV2 | None | continuation-token, delimiter, encoding-type, max-keys, prefix, start-after |
| ListParts | None | None |
| PutBucketTagging | None | N/A |
| PutObject | Content-Length, Content-MD5, Content-Type, x-amz-tagging | N/A |
| PutObjectTagging | None | None |
| UploadPart | Content-Length, Content-MD5 | N/A |
| UploadPartCopy | x-amz-copy-source | N/A |

Property Keys

The following table contains the configurable Alluxio property keys which pertain to the Alluxio S3 API.

Property Name | Default | Description

Example Usage

S3 API Actions


AbortMultipartUpload

See AbortMultipartUpload on AWS


CompleteMultipartUpload

See CompleteMultipartUpload on AWS


CopyObject

See CopyObject on AWS


CreateBucket

See CreateBucket on AWS


CreateMultipartUpload

See CreateMultipartUpload on AWS


DeleteBucket

See DeleteBucket on AWS


DeleteBucketTagging

See DeleteBucketTagging on AWS


DeleteObject

See DeleteObject on AWS


DeleteObjects

See DeleteObjects on AWS


DeleteObjectTagging

See DeleteObjectTagging on AWS


GetBucketTagging

See GetBucketTagging on AWS


GetObject

See GetObject on AWS


GetObjectTagging

See GetObjectTagging on AWS


HeadBucket

See HeadBucket on AWS


HeadObject

See HeadObject on AWS


ListBuckets

See ListBuckets on AWS


ListObjects

See ListObjects on AWS


ListMultipartUploads

See ListMultipartUploads on AWS


ListObjectsV2

See ListObjectsV2 on AWS


ListParts

See ListParts on AWS


PutBucketTagging

See PutBucketTagging on AWS


PutObject

See PutObject on AWS


PutObjectTagging

See PutObjectTagging on AWS


UploadPart

See UploadPart on AWS


UploadPartCopy

See UploadPartCopy on AWS


Python S3 Client

Tested for Python 2.7.

Create a connection

Note that you have to install the boto package first:

$ pip install boto
import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id = '',
    aws_secret_access_key = '',
    host = 'localhost',
    port = 39999,
    path = '/api/v1/s3',
    is_secure=False,
    calling_format = boto.s3.connection.OrdinaryCallingFormat(),
)

Authenticating as a user

By default, authenticating with an empty aws_access_key_id performs the file system actions as the user that launched the proxy.

Set the aws_access_key_id to a different username to perform the actions under a different user.

Create a bucket

bucketName = 'bucket-for-testing'
bucket = conn.create_bucket(bucketName)

List all buckets owned by the user

Authenticating as a user is necessary to have buckets returned by this operation.

conn = boto.connect_s3(
    aws_access_key_id = 'testuser',
    aws_secret_access_key = '',
    host = 'localhost',
    port = 39999,
    path = '/api/v1/s3',
    is_secure=False,
    calling_format = boto.s3.connection.OrdinaryCallingFormat(),
)

conn.get_all_buckets()

Put a small object

smallObjectKey = 'small.txt'
smallObjectContent = 'Hello World!'

key = bucket.new_key(smallObjectKey)
key.set_contents_from_string(smallObjectContent)

Get the small object

assert smallObjectContent == key.get_contents_as_string()

Upload a large object

Create an 8 MB file on the local file system:

$ dd if=/dev/zero of=8mb.data bs=1048576 count=8
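If dd is unavailable, an equivalent file can be created with a short Python snippet (not part of the original walkthrough):

```python
import os

# Create an 8 MB file of zero bytes, equivalent to the dd command above
with open("8mb.data", "wb") as f:
    f.write(b"\0" * (8 * 1048576))

size = os.stat("8mb.data").st_size
```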

Then use the Python S3 client to upload the file as an object:

largeObjectKey = 'large.txt'
largeObjectFile = '8mb.data'

key = bucket.new_key(largeObjectKey)
with open(largeObjectFile, 'rb') as f:
    key.set_contents_from_file(f)
with open(largeObjectFile, 'rb') as f:
    largeObject = f.read()

Get the large object

assert largeObject == key.get_contents_as_string()

Delete the objects

bucket.delete_key(smallObjectKey)
bucket.delete_key(largeObjectKey)

Initiate a multipart upload

mp = bucket.initiate_multipart_upload(largeObjectKey)

Upload parts

import math, os

from filechunkio import FileChunkIO

# Use a chunk size of 1MB (feel free to change this)
sourceSize = os.stat(largeObjectFile).st_size
chunkSize = 1048576
chunkCount = int(math.ceil(sourceSize / float(chunkSize)))

for i in range(chunkCount):
    offset = chunkSize * i
    bytes = min(chunkSize, sourceSize - offset)
    with FileChunkIO(largeObjectFile, 'r', offset=offset, bytes=bytes) as fp:
        mp.upload_part_from_file(fp, part_num=i + 1)

Complete the multipart upload

mp.complete_upload()

Abort the multipart upload

Non-completed uploads can be aborted.

mp.cancel_upload()

Delete the bucket

The bucket must be empty before it can be deleted, so first remove any remaining objects:

bucket.delete_key(largeObjectKey)
conn.delete_bucket(bucketName)