
Managing Large Numbers of Small Files in S3: Best Practices for Buckets and Prefixes

It’s common for teams to misunderstand how Amazon S3 works internally. As a result, they may encounter issues such as slow object listing, 503 Service Unavailable errors, degraded performance, and other operational challenges.

Let’s walk through a practical scenario:

  • 600 million small objects in a single S3 bucket (with ongoing growth)
  • Frequent listing and access operations

We’ll explore how to design your object key structure to ensure stable performance and scalability.

Key Concepts

To solve this, we rely on:

  • Sharding
  • Prefix distribution
  • Hash-based key design

How S3 Handles Prefixes

A critical concept:

Amazon S3 is not a file system or a traditional database.
It is a distributed key-value store where the object key is simply a string.

S3 automatically scales by distributing requests across prefixes (e.g., data/, logs/2024/05/). Each partitioned prefix handles requests independently and can sustain at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second, so total throughput grows with the number of active prefixes.

Key takeaway

  • The more evenly distributed your prefixes are, the higher your total throughput.
  • If most requests hit the same prefix, you’ll eventually face:
    • Increased latency
    • 503 errors
    • Bottlenecks during bulk operations

Scenario Overview

  • 600 million objects
  • Small object size
  • Continuous growth
  • High read/list activity

Let’s evaluate different design approaches.

1. No Prefix Structure (All Objects in Root)

All objects are stored without any prefix hierarchy.

Problems

  • Effectively a single prefix
  • All traffic hits one partition
  • Quickly reaches throughput limits
  • Leads to 503 errors

Verdict

Worst possible design — avoid entirely

2. One Prefix per Object (Millions of Unique Prefixes)

Each object gets its own unique “folder.”

Advantages

  • Perfect load distribution
  • No hot prefixes

Problems

  • Listing becomes impractical
  • Navigation via console/API is difficult
  • S3 internals (metadata, billing, indexing) aren’t optimized for this
  • High operational complexity

Verdict

Overengineered — not recommended in practice

3. Controlled Sharding (~600 Prefixes)

Split objects across a fixed number of prefixes (~1 million objects per prefix).

Advantages

  • Even load distribution
  • Parallel processing across prefixes
  • Manageable structure
  • Aligns with S3 best practices
  • Scales without redesign

Verdict

Recommended approach
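A minimal sketch of this approach, assuming a fixed shard count of 600 and a helper name of our own choosing (`shard_prefix` is not from the original text). Note that although the prefix names look sequential, the assignment is hash-based, so traffic still spreads evenly:

```python
import hashlib

def shard_prefix(object_id: str, n_shards: int = 600) -> str:
    """Deterministically map an object id to one of n_shards prefixes."""
    # Interpret the MD5 digest as a big integer and reduce it modulo
    # the shard count; the same id always lands in the same shard.
    digest = hashlib.md5(object_id.encode()).hexdigest()
    shard = int(digest, 16) % n_shards
    return f"shard-{shard:03d}"

# Full object key: e.g. "shard-417/user_12345.pdf" (shard number depends on the hash)
key = f"{shard_prefix('user_12345.pdf')}/user_12345.pdf"
```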

Prefix Naming Strategy

Avoid sequential naming like:

data-1/, data-2/, data-3/

Especially if object keys are also sequential — this creates hot partitions.
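The difference is easy to demonstrate. A small sketch, assuming 1,000 sequentially numbered ids and single-character shards:

```python
import hashlib
from collections import Counter

ids = [f"{i:04d}" for i in range(1000)]  # "0000" ... "0999"

# Sequential scheme: prefix taken directly from the id itself
sequential = Counter(obj_id[0] for obj_id in ids)

# Hashed scheme: prefix is the first hex character of the MD5 digest
hashed = Counter(hashlib.md5(obj_id.encode()).hexdigest()[0] for obj_id in ids)

print(len(sequential))  # 1  -> every request hits the same partition
print(len(hashed))      # 16 -> requests spread across all shards
```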

Best practice: Use hashing

Algorithm:

  1. Take a unique object identifier (e.g., user_id, file_id)
  2. Compute a hash of it (MD5, SHA-1, or SHA-256 — cryptographic strength is not required, only even distribution)
  3. Use the first N characters as the prefix

Why Hex Prefixes?

Hexadecimal (0–9, a–f) is widely used because:

  • 16 possible values per character → even distribution
  • Hash digests are conventionally rendered as hex strings, so the prefix is just a substring of the digest
  • Easy to implement
  • Predictable scaling: N characters give exactly 16^N prefixes

Sharding Depth Options

1 Hex Character → 16 Prefixes

0/, 1/, ..., f/
  • ~37.5M objects per prefix

Use when:

  • Low traffic
  • Prototypes

Downside:

  • Too many objects per prefix

2 Hex Characters → 256 Prefixes

00/, 01/, ..., ff/
  • ~2.34M objects per prefix

Use when:

  • Moderate load (thousands of RPS)
  • Typical production workloads

3 Hex Characters → 4096 Prefixes

000/, 001/, ..., fff/
  • ~146K objects per prefix

Use when:

  • High-load systems
  • Tens of thousands of RPS
  • Critical services

Best default choice for most production systems
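As a sanity check, the per-prefix figures above follow directly from dividing the 600 million objects in our scenario by 16^N:

```python
TOTAL_OBJECTS = 600_000_000

for chars in (1, 2, 3):
    prefixes = 16 ** chars
    per_prefix = TOTAL_OBJECTS / prefixes
    print(f"{chars} hex char(s): {prefixes:>4} prefixes, ~{per_prefix:,.0f} objects each")
```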

Example Implementation (Python)

import hashlib

def get_s3_path(object_id: str, prefix_level: int = 3) -> str:
    """
    Generate an S3 object key using hex-based prefix sharding.

    Args:
        object_id: Unique object identifier
        prefix_level: Number of hex characters to shard on (1–4)
    """
    if not 1 <= prefix_level <= 4:
        raise ValueError("prefix_level must be between 1 and 4")

    # MD5 is used purely for even distribution, not for security
    hash_hex = hashlib.md5(object_id.encode()).hexdigest()
    prefix = hash_hex[:prefix_level]

    # Turn a prefix like "a1b" into the hierarchy a/1/b/object_id
    prefix_path = "/".join(prefix)

    return f"{prefix_path}/{object_id}"


# Examples (the prefix characters depend on the MD5 digest):
print(get_s3_path("user_12345.pdf", 2))   # e.g. "x/y/user_12345.pdf"
print(get_s3_path("image_67890.jpg", 3))  # e.g. "x/y/z/image_67890.jpg"
 

Additional Recommendations

  • Consider multiple buckets if storing more than ~500 million objects
  • Always apply key sharding at scale
  • Use multi-level hash prefixes (e.g., f4c/3a/9/...)
  • Avoid monotonically increasing keys at the beginning of object names
  • Never generate prefixes manually — always derive them via hashing
  • Use at least 3 hex characters (~4096 prefixes) for production
  • Avoid full-bucket LIST operations — rely on prefixes and delimiters
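For the last point, one common pattern (sketched here assuming the 3-character layout above) is to enumerate all 4,096 shard prefixes and issue a separate, delimiter-bounded LIST per prefix, which also parallelizes naturally:

```python
from itertools import product

HEX = "0123456789abcdef"

# All 4,096 three-character shard prefixes: "0/0/0/" ... "f/f/f/"
prefixes = ["/".join(chars) + "/" for chars in product(HEX, repeat=3)]

# Each prefix can then be listed independently (and in parallel), e.g. with
# boto3: s3.list_objects_v2(Bucket=bucket, Prefix=p, Delimiter="/")
print(len(prefixes))   # 4096
print(prefixes[0])     # 0/0/0/
print(prefixes[-1])    # f/f/f/
```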
