Seeking suggestions: SQS-based multi-stage pipeline for processing vendor catalog files (CSV/XLSX) + Amazon SP-API enrichment #185854
Replies: 3 comments
Hi @ihtishamtanveer! 👋 Your architecture is logically sound, but chaining raw SQS queues for a multi-stage dependency workflow often leads to "orchestration hell" (handling partial failures, zombie states, and complex retries). Here are specific recommendations for your pain points, based on similar high-scale ingestion pipelines:

1. Workflow Orchestration

Instead of managing state and retries manually across 4 SQS queues, consider AWS Step Functions (Distributed Map) or Temporal.io.
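For a concrete feel, here's a rough sketch of registering a Distributed Map state machine that streams the catalog CSV straight from S3 and fans items out to a per-item worker. All names (bucket/key, Lambda function, role ARN, concurrency) are placeholders, not taken from your setup:

```typescript
// Sketch only: a Distributed Map that reads a CSV from S3 and processes rows in parallel.
// Bucket/key, function name, role ARN, and MaxConcurrency are illustrative placeholders.
import { SFNClient, CreateStateMachineCommand } from "@aws-sdk/client-sfn";

const definition = {
  StartAt: "ProcessCatalog",
  States: {
    ProcessCatalog: {
      Type: "Map",
      MaxConcurrency: 50, // cap fan-out so SP-API throttles aren't blown through
      ItemReader: {
        Resource: "arn:aws:states:::s3:getObject",
        ReaderConfig: { InputType: "CSV", CSVHeaderLocation: "FIRST_ROW" },
        Parameters: { "Bucket.$": "$.bucket", "Key.$": "$.key" },
      },
      ItemProcessor: {
        ProcessorConfig: { Mode: "DISTRIBUTED", ExecutionType: "STANDARD" },
        StartAt: "EnrichItem",
        States: {
          EnrichItem: {
            Type: "Task",
            Resource: "arn:aws:states:::lambda:invoke",
            Parameters: { FunctionName: "enrich-catalog-item", "Payload.$": "$" },
            End: true,
          },
        },
      },
      End: true,
    },
  },
};

const sfn = new SFNClient({});
await sfn.send(
  new CreateStateMachineCommand({
    name: "catalog-ingest",
    roleArn: process.env.STATES_ROLE_ARN!, // needs s3:GetObject + lambda:InvokeFunction
    definition: JSON.stringify(definition),
  })
);
```

Step Functions then owns retries, partial-failure tracking, and per-file run history, which is exactly the coordination you're otherwise hand-rolling across queues.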
2. Addressing your specific questions

- On Token Management & Rate Limiting (SP-API)
- On Progress Tracking
- On Idempotency (FIFO vs Standard)

Summary Recommendation:
Thanks for posting in the GitHub Community, @ihtishamtanveer! We're happy you're here. You are more likely to get a useful response if you post your question in the applicable category; the Discussions category is solely for conversations around the GitHub product Discussions. This question should be in the
This is a solid shape for a "long pipe with strict external throttles." The main thing I'd change: stop treating SQS + in-memory maps as the coordinator. Let SQS move work, but let your DB/Redis be the source of truth for state, idempotency, and completion.
Token pool + waves is fine, but "1 token = 1 worker" won't hold once you scale out. What works better in practice:
Key point: assign work to tokens by availability, not by instance. Instances are cattle; the limiter is the brain.
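A minimal sketch of what that limiter can look like, assuming Redis via ioredis; the token ids, key names, and 5 req/sec rate are made up for illustration:

```typescript
// Sketch: pick whichever SP-API token currently has capacity instead of pinning
// one token to one worker. Token ids, key names, and the rate are illustrative.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

const TOKEN_IDS = ["token-a", "token-b", "token-c"]; // hypothetical pool
const MAX_PER_SECOND = 5;                            // per-token throttle (placeholder)

// Reserve one request slot on any available token.
// Returns the token id to use, or null if the whole pool is saturated.
export async function acquireToken(): Promise<string | null> {
  const second = Math.floor(Date.now() / 1000);
  for (const id of TOKEN_IDS) {
    const key = `ratelimit:${id}:${second}`;
    const count = await redis.incr(key);         // atomic across every worker/instance
    if (count === 1) await redis.expire(key, 2); // fixed window cleans itself up
    if (count <= MAX_PER_SECOND) return id;      // slot reserved on this token
    // this token is saturated for the current second; try the next one
  }
  return null; // caller backs off, e.g. let the SQS message reappear after its visibility timeout
}
```

Because the counters live in Redis, adding or removing worker instances doesn't change how hard you hit SP-API; only the limiter's numbers do.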
"In-memory UPC→ASIN completion per catalog” is the first thing that will bite you. A simple durable model:
Flow:
This avoids the "last batch" guessing game.
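Roughly what I'd sketch for that, assuming Postgres only because I don't know your DB; the table, column, and status names are illustrative:

```typescript
// Sketch: durable per-item state instead of an in-memory UPC→ASIN map.
// Assumes Postgres via "pg"; schema and status values are illustrative.
//
// CREATE TABLE catalog_items (
//   catalog_id text,
//   upc        text,
//   asin       text,
//   stage      text NOT NULL DEFAULT 'pending',  -- pending → asin_resolved → ... → persisted
//   PRIMARY KEY (catalog_id, upc)
// );
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Stage 1 worker records its result durably, alongside whatever else it persists.
export async function markAsinResolved(catalogId: string, upc: string, asin: string) {
  await pool.query(
    `UPDATE catalog_items
        SET asin = $3, stage = 'asin_resolved'
      WHERE catalog_id = $1 AND upc = $2`,
    [catalogId, upc, asin]
  );
}

// Completion is a query over durable state, not a guess about the "last batch".
export async function isCatalogDone(catalogId: string): Promise<boolean> {
  const { rows } = await pool.query(
    `SELECT count(*)::int AS remaining
       FROM catalog_items
      WHERE catalog_id = $1 AND stage <> 'persisted'`,
    [catalogId]
  );
  return rows[0].remaining === 0;
}
```

Any worker can safely ask "is this catalog done?" at any time, and a crashed instance loses nothing because nothing important ever lived only in its memory.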
For SQS: I'd keep Standard unless you *must* preserve order. FIFO reduces throughput and you'll feel it. Instead, do idempotency at the edges:
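For example (same Postgres assumption as above; the dedup table and key are illustrative):

```typescript
// Sketch: claim the unit of work in the DB before doing it, so Standard SQS's
// at-least-once redeliveries become cheap no-ops. Table/key names are illustrative.
//
// CREATE TABLE processed_work (
//   catalog_id text,
//   upc        text,
//   stage      text,
//   done_at    timestamptz NOT NULL DEFAULT now(),
//   PRIMARY KEY (catalog_id, upc, stage)
// );
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Returns true if this worker won the claim and should do the work;
// false means another delivery already handled it and the message can just be deleted.
export async function claimWork(catalogId: string, upc: string, stage: string): Promise<boolean> {
  const res = await pool.query(
    `INSERT INTO processed_work (catalog_id, upc, stage)
     VALUES ($1, $2, $3)
     ON CONFLICT DO NOTHING`,
    [catalogId, upc, stage]
  );
  return res.rowCount === 1;
}
```

In practice you'd do the claim and the stage's own write in the same transaction, so a crash mid-work can't strand an item; the point is that a duplicate delivery costs one cheap insert attempt instead of a duplicate SP-API call or a double-write.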
Yes to "DLQ per stage". Different stages fail for different reasons.
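If you create the queues programmatically, a per-stage DLQ is just a RedrivePolicy attribute; a rough sketch with the AWS SDK v3 (queue names and maxReceiveCount are placeholders):

```typescript
// Sketch: each stage queue gets its own dead-letter queue via RedrivePolicy,
// so ASIN-resolution failures don't pile up next to fee-estimation failures.
// Queue names and maxReceiveCount are placeholders.
import {
  SQSClient,
  CreateQueueCommand,
  GetQueueAttributesCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export async function createStageQueue(stage: string) {
  const dlq = await sqs.send(new CreateQueueCommand({ QueueName: `${stage}-dlq` }));
  const dlqAttrs = await sqs.send(
    new GetQueueAttributesCommand({ QueueUrl: dlq.QueueUrl!, AttributeNames: ["QueueArn"] })
  );

  return sqs.send(
    new CreateQueueCommand({
      QueueName: stage,
      Attributes: {
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn: dlqAttrs.Attributes!.QueueArn,
          maxReceiveCount: "5", // tune per stage: throttling-heavy stages may deserve more retries
        }),
      },
    })
  );
}
```

That also gives you a natural place to alarm per stage (DLQ depth > 0) instead of one noisy catch-all.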
You may not need 4 queues forever. A common simplification:
Why: you reduce cross-queue coordination and state merging. Keep the stages split only if you truly need independent scaling knobs.
If you need "auditable runs, pause/resume, partial reruns, and clean visibility", a workflow engine is worth it.
If you share your current message schema (what fields you pass between stages) and what DB you're on, I can suggest a clean set of tables + idempotency keys that make "safe retries + exact completion" basically automatic. I'm speaking from my own experience here, thanks.
Discussion Type: Question

Discussion Content
I’m working on a Node.js service that ingests vendor catalog files (CSV/XLSX), enriches items via Amazon SP-API, computes profitability metrics, and persists results to a database.
The pipeline is distributed across multiple SQS queues to support horizontal scaling and better handling of Amazon SP-API rate limits.
I’m looking for feedback on architecture, token management, progress tracking, idempotency, and retry strategies.
High-Level Architecture
API Trigger
Endpoint: POST /api/process-csv
Payload:
Download & Parse
Multi-Stage SQS Pipeline
Stage 1: UPC → ASIN Resolution
Stage 2: Offers Fetch
Stage 3: Fees Estimation
Stage 4: Merge + Compute + Persist
Token Pool Strategy
Progress Tracking
Constraints & Pain Points
What I’m Looking for Feedback On
Token Management
Progress & Completion Tracking
Idempotency & Deduplication
Error Handling & Retries
Open to Suggestions
If you’ve built similar pipelines (file ingestion → enrichment APIs → persistence), I’d appreciate guidance on: