CloudTrail-optimized polling #86

@brandond


A very common use case for S3 polling is ingesting CloudTrail logs, which have a fixed key format within a bucket:
/AWSLogs/<AccountId>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/<AccountId>_CloudTrail_<region>_<ISODate>_<random>.json.gz
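To make the key layout concrete, here is a minimal sketch of a parser for it. The function name `parse_cloudtrail_key` and the group names are hypothetical, and the pattern assumes a 12-digit account ID and treats the leading slash as optional; it only covers the layout quoted above.

```python
import re
from typing import Optional

# Mirrors the key layout quoted above. The back-references (?P=account) and
# (?P=region) require the file name to repeat the values from the path.
CLOUDTRAIL_KEY = re.compile(
    r"^/?AWSLogs/(?P<account>\d{12})/CloudTrail/(?P<region>[a-z0-9-]+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P=account)_CloudTrail_(?P=region)_"
    r"(?P<timestamp>[0-9TZ]+)_(?P<random>[^_/]+)\.json\.gz$"
)


def parse_cloudtrail_key(key: str) -> Optional[dict]:
    """Return the path components of a CloudTrail log key, or None if it does not match."""
    match = CLOUDTRAIL_KEY.match(key)
    return match.groupdict() if match else None
```

Keys that do not match (digest files, for example) come back as `None`, so a poller could skip them cheaply.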

Given this fixed structure, ingest and incremental polling can be optimized by relying on two properties:

  • Objects are not rewritten or appended to once created.
  • Within a given account and region, only one sub-prefix (the current date) is written to at a time.

The process would look something like:

  • Walk the prefix tree to build an initial list of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes
  • For each prefix in the list, spawn a poller thread:
    • Walk the prefix tree to the first <YYYY>/<MM>/<DD>/ sub-prefix
    • List objects within this prefix, paging through results using max_keys, next_continuation_token, and start_after until no further objects are returned
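    • When no further objects are returned, remove the <DD> token from current_prefix to get parent_prefix and call list_objects_v2({prefix: parent_prefix, delimiter: '/', start_after: current_prefix}); the delimiter is what makes sub-prefixes come back as common prefixes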
    • If a new common prefix is returned, update current_prefix and begin listing objects
    • If no new prefix is returned, repeat for <MM> and <YYYY> tokens
    • If no new sub-prefix is discovered, store last object key as start_after and sleep for a period of time
    • Re-start polling loop
  • Periodically check to see if new /AWSLogs/<AccountId>/CloudTrail/<region> prefixes are present and spawn new poller threads as necessary
  • If a poller thread's /AWSLogs/<AccountId>/CloudTrail/<region> prefix disappears, it should terminate.
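The per-prefix loop above could be sketched roughly as follows. This is not the plugin's code: it assumes a boto3-style client whose `list_objects_v2` accepts `Bucket`, `Prefix`, `Delimiter`, `StartAfter` and `ContinuationToken`, and the helper names (`list_subprefixes`, `descend`, `next_day_prefix`, `list_new_keys`, `poll_once`) and the `state` dict are hypothetical. Threading, sleeping and discovery of new account/region prefixes are left out.

```python
from typing import Iterator, List, Optional


def list_subprefixes(client, bucket: str, parent: str, after: str = "") -> List[str]:
    """Immediate child prefixes of `parent` that sort strictly after `after`."""
    kwargs = {"Bucket": bucket, "Prefix": parent, "Delimiter": "/"}
    if after:
        kwargs["StartAfter"] = after
    found: List[str] = []
    while True:
        page = client.list_objects_v2(**kwargs)
        found += [cp["Prefix"] for cp in page.get("CommonPrefixes", [])]
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    # Defensive filter: do not rely on StartAfter skipping the current prefix itself.
    return [p for p in found if p > after]


def descend(client, bucket: str, prefix: str, depth: int) -> Optional[str]:
    """Follow the first child prefix `depth` levels down; None if a level is empty."""
    for _ in range(depth):
        children = list_subprefixes(client, bucket, prefix)
        if not children:
            return None
        prefix = children[0]
    return prefix


def next_day_prefix(client, bucket: str, root: str, current: str) -> Optional[str]:
    """Climb from <DD> to <MM> to <YYYY> looking for a later sibling, then descend again."""
    parent, climbed = current, 0
    while parent.startswith(root) and parent != root:
        after = parent
        parent = parent.rstrip("/").rsplit("/", 1)[0] + "/"
        climbed += 1
        siblings = list_subprefixes(client, bucket, parent, after)
        if siblings:
            return descend(client, bucket, siblings[0], climbed - 1)
    return None


def list_new_keys(client, bucket: str, prefix: str, start_after: str = "") -> Iterator[str]:
    """Yield object keys under `prefix` that sort after `start_after`."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    if start_after:
        kwargs["StartAfter"] = start_after
    while True:
        page = client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            return
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def poll_once(client, bucket: str, root: str, state: dict) -> List[str]:
    """One pass of the loop: drain the current day, then advance while later prefixes exist."""
    keys: List[str] = []
    if state.get("current_prefix") is None:
        # First run: walk <YYYY>/<MM>/<DD>/ down from the account/region root.
        state["current_prefix"] = descend(client, bucket, root, 3)
    while state["current_prefix"]:
        for key in list_new_keys(client, bucket, state["current_prefix"],
                                 state.get("start_after", "")):
            keys.append(key)
            state["start_after"] = key
        nxt = next_day_prefix(client, bucket, root, state["current_prefix"])
        if nxt is None:
            break  # nothing newer yet; the caller sleeps and calls poll_once again
        state["current_prefix"], state["start_after"] = nxt, ""
    return keys
```

A poller thread would call `poll_once` in a loop with a sleep in between, passing `root` as its own `/AWSLogs/<AccountId>/CloudTrail/<region>/` prefix (with a trailing slash). Each call only lists the current day plus at most three delimiter listings, rather than the whole bucket.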

Using the above logic, the lastdb file only needs to persist a small amount of information:

  • List of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes with:
    • current_prefix (<YYYY>/<MM>/<DD>/)
    • next_continuation_token (opaque)
    • start_after (last object key processed)
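As an illustration of how little needs to be persisted, here is a sketch that stores that state as JSON. The actual lastdb format is not specified here, so the JSON layout and the helper names `save_state` and `load_state` are assumptions for illustration only.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write the poller state atomically so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_state(path: str) -> dict:
    """Return the saved state, or an empty dict on first run."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

The state would be a dict keyed by the account/region prefix, each value holding `current_prefix`, `next_continuation_token` and `start_after`, so the file grows with the number of account/region pairs rather than the number of objects.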

I am happy to work on this with an optimized poller class that could be selected via configuration option. Not sure if I should fork the current master branch, or the WIP threading branch?
