Parallelization factor: AWS Kinesis data streams to Lambda

Question

I'm very confused with the concept of ParallelizationFactor.

My understanding

https://stackoverflow.com/a/57534322/13000229
In the past, one KDS shard could send data to only one Lambda instance/invocation at a time. Multiple Lambda instances could not process data from the same KDS shard concurrently.

https://aws.amazon.com/blogs/compute/new-aws-lambda-scaling-controls-for-kinesis-and-dynamodb-event-sources/
In Nov 2019, a new parameter ParallelizationFactor (Concurrent batches per shard) came out.

The default factor of one exhibits normal behavior. A factor of two allows up to 200 concurrent invocations on 100 Kinesis data shards.
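The arithmetic in that quote is just shards multiplied by the factor. A minimal sketch, assuming the documented ParallelizationFactor range of 1 to 10 (the function name `max_concurrent_invocations` is made up for illustration):

```python
def max_concurrent_invocations(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent Lambda invocations for one Kinesis event source."""
    # Lambda accepts a ParallelizationFactor between 1 and 10.
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # Each shard can feed up to `parallelization_factor` batches at once.
    return shard_count * parallelization_factor


print(max_concurrent_invocations(100))     # 100 (default factor of one)
print(max_concurrent_invocations(100, 2))  # 200
```

This is only an upper bound; the actual concurrency also depends on how many distinct partition keys each shard carries.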

Questions

1. By using ParallelizationFactor, can more than one Lambda instance get different data from the same KDS shard concurrently?
For example, the shard has data d1, d2, d3, d4, d5 and d6, and we assume BatchSize = 2 and ParallelizationFactor = 2. Lambda instance A can consume d1 and d2, while Lambda instance B can consume d3 and d4 at the same time. Then once Lambda instance A finishes the first batch, it starts processing d5 and d6, and so on.

2. If the answer to Question 1 is yes, what might be sacrificed? (e.g. the order within the same shard, or one piece of data being processed more than once)
3. If the answer to Question 1 is no, how will data in KDS shards be processed by Lambda concurrently?

ghostrider · Accepted Answer · 2024-01-04 13:20:18Z

Yes, when using ParallelizationFactor, more than one Lambda invocation can process records from the same shard concurrently. Order is maintained per partition key, because records with the same partition key are never processed concurrently.
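For reference, the setting lives on the event source mapping. A rough sketch of the keyword arguments you could pass to boto3's `create_event_source_mapping`; the stream ARN, the function name, and the helper `build_mapping_params` are hypothetical placeholders, and the actual AWS call is left commented out so the snippet runs without credentials:

```python
def build_mapping_params(stream_arn, function_name, batch_size=2, parallelization_factor=2):
    """Return keyword arguments for the Lambda create_event_source_mapping call."""
    # Lambda accepts a ParallelizationFactor between 1 and 10.
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        "ParallelizationFactor": parallelization_factor,
    }


params = build_mapping_params(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",  # hypothetical ARN
    "example-consumer",  # hypothetical function name
)
print(params)

# To apply it (requires boto3 and AWS credentials):
# import boto3
# boto3.client("lambda").create_event_source_mapping(**params)
```

The values BatchSize = 2 and ParallelizationFactor = 2 simply mirror the example in the question.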

For example, let’s say that you have a stream with two shards: Shard1 and Shard2.

Scenario 1: all of your records share only two partition keys: PartitionKey1 and PartitionKey2. Assume all records with PartitionKey1 end up in Shard1 and all records with PartitionKey2 end up in Shard2. Setting ParallelizationFactor will not result in any records from the same shard being processed concurrently, because each shard holds a single partition key and records with the same partition key are processed in order.

Scenario 2: your records have 20 different partition keys: PartitionKey1…PartitionKey20. Ideally Shard1 will contain around half of your records and Shard2 will contain the other half (if they are evenly distributed across the two shards). Setting ParallelizationFactor in this case will result in records being processed concurrently: records within a shard that have different partition keys can be processed concurrently.
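To make the two scenarios concrete, here is a toy model of one shard. It is not AWS's actual implementation: the word "lane" and the MD5-modulo assignment are assumptions made purely for illustration. The only property it is meant to show is that a given partition key always maps to the same lane, so its records stay in order.

```python
import hashlib
from collections import defaultdict


def assign_lanes(records, parallelization_factor):
    """Group (partition_key, data) records of one shard into concurrent lanes.

    Records keep their arrival order inside each lane.
    """
    lanes = defaultdict(list)
    for partition_key, data in records:
        # Assumption: a stable hash of the partition key picks the lane.
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        lane = int.from_bytes(digest, "big") % parallelization_factor
        lanes[lane].append((partition_key, data))
    return dict(lanes)


# Scenario 1: the shard only holds one partition key, so only one lane is used.
single_key = [("PartitionKey1", i) for i in range(6)]
print(len(assign_lanes(single_key, 2)))  # 1

# Scenario 2: the shard holds ten partition keys, so records can spread over lanes.
many_keys = [(f"PartitionKey{i % 10 + 1}", i) for i in range(40)]
for lane, batch in sorted(assign_lanes(many_keys, 2).items()):
    print(lane, sorted({key for key, _ in batch}))
```

In this model, different lanes can be handled by different Lambda invocations at the same time, which is why ordering holds per partition key but not across the whole shard.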

Do you know if "Bisect on Function Error" behaves the same?
– omer blechman, commented May 11, 2023 at 10:53
