Seeking suggestions: SQS-based multi-stage pipeline for processing vendor catalog files (CSV/XLSX) + Amazon SP-API enrichment #185854
Replies: 3 comments
Hi @ihtishamtanveer! 👋 Your architecture is logically sound, but chaining raw SQS queues for a multi-stage dependency workflow often leads to "orchestration hell" (handling partial failures, zombie states, and complex retries). Here are specific recommendations for your pain points, based on similar high-scale ingestion pipelines:

1. Workflow Orchestration

Instead of managing state and retries manually across 4 SQS queues, consider AWS Step Functions (Distributed Map) or Temporal.io.
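For a concrete feel, here's a rough sketch of registering a Distributed Map state machine that streams the catalog CSV straight from S3 and fans items out to a per-item worker. All names (bucket/key, Lambda function, role ARN, concurrency) are placeholders, not taken from your setup:

```typescript
// Sketch only: a Distributed Map that reads a CSV from S3 and processes rows in parallel.
// Bucket/key, function name, role ARN, and MaxConcurrency are illustrative placeholders.
import { SFNClient, CreateStateMachineCommand } from "@aws-sdk/client-sfn";

const definition = {
  StartAt: "ProcessCatalog",
  States: {
    ProcessCatalog: {
      Type: "Map",
      MaxConcurrency: 50, // cap fan-out so SP-API throttles aren't blown through
      ItemReader: {
        Resource: "arn:aws:states:::s3:getObject",
        ReaderConfig: { InputType: "CSV", CSVHeaderLocation: "FIRST_ROW" },
        Parameters: { "Bucket.$": "$.bucket", "Key.$": "$.key" },
      },
      ItemProcessor: {
        ProcessorConfig: { Mode: "DISTRIBUTED", ExecutionType: "STANDARD" },
        StartAt: "EnrichItem",
        States: {
          EnrichItem: {
            Type: "Task",
            Resource: "arn:aws:states:::lambda:invoke",
            Parameters: { FunctionName: "enrich-catalog-item", "Payload.$": "$" },
            End: true,
          },
        },
      },
      End: true,
    },
  },
};

const sfn = new SFNClient({});
await sfn.send(
  new CreateStateMachineCommand({
    name: "catalog-ingest",
    roleArn: process.env.STATES_ROLE_ARN!, // needs s3:GetObject + lambda:InvokeFunction
    definition: JSON.stringify(definition),
  })
);
```

Step Functions then owns retries, partial-failure tracking, and per-file run history, which is exactly the coordination you're otherwise hand-rolling across queues.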
2. Addressing your specific questions

- On Token Management & Rate Limiting (SP-API)
- On Progress Tracking
- On Idempotency (FIFO vs Standard)

Summary Recommendation:
Thanks for posting in the GitHub Community, @ihtishamtanveer! We're happy you're here. You are more likely to get a useful response if you post your question in the applicable category; the Discussions category is solely for conversations around the GitHub product Discussions. This question should be in the
This is a solid shape for a "long pipe with strict external throttles." The main thing I'd change: stop treating SQS + in-memory maps as the coordinator. Let SQS move work, but let your DB/Redis be the source of truth for state, idempotency, and completion.
Token pool + waves is fine, but "1 token = 1 worker" won't hold once you scale out. What works better in practice:
Key point: assign work to tokens by availability, not by instance. Instances are cattle; the limiter is the brain.
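A minimal sketch of what that limiter can look like, assuming Redis via ioredis; the token ids, key names, and 5 req/sec rate are made up for illustration:

```typescript
// Sketch: pick whichever SP-API token currently has capacity instead of pinning
// one token to one worker. Token ids, key names, and the rate are illustrative.
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

const TOKEN_IDS = ["token-a", "token-b", "token-c"]; // hypothetical pool
const MAX_PER_SECOND = 5;                            // per-token throttle (placeholder)

// Reserve one request slot on any available token.
// Returns the token id to use, or null if the whole pool is saturated.
export async function acquireToken(): Promise<string | null> {
  const second = Math.floor(Date.now() / 1000);
  for (const id of TOKEN_IDS) {
    const key = `ratelimit:${id}:${second}`;
    const count = await redis.incr(key);         // atomic across every worker/instance
    if (count === 1) await redis.expire(key, 2); // fixed window cleans itself up
    if (count <= MAX_PER_SECOND) return id;      // slot reserved on this token
    // this token is saturated for the current second; try the next one
  }
  return null; // caller backs off, e.g. let the SQS message reappear after its visibility timeout
}
```

Because the counters live in Redis, adding or removing worker instances doesn't change how hard you hit SP-API; only the limiter's numbers do.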
"In-memory UPC→ASIN completion per catalog” is the first thing that will bite you. A simple durable model:
Flow:
This avoids the "last batch" guessing game.
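Roughly what I'd sketch for that, assuming Postgres only because I don't know your DB; the table, column, and status names are illustrative:

```typescript
// Sketch: durable per-item state instead of an in-memory UPC→ASIN map.
// Assumes Postgres via "pg"; schema and status values are illustrative.
//
// CREATE TABLE catalog_items (
//   catalog_id text,
//   upc        text,
//   asin       text,
//   stage      text NOT NULL DEFAULT 'pending',  -- pending → asin_resolved → ... → persisted
//   PRIMARY KEY (catalog_id, upc)
// );
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Stage 1 worker records its result durably, alongside whatever else it persists.
export async function markAsinResolved(catalogId: string, upc: string, asin: string) {
  await pool.query(
    `UPDATE catalog_items
        SET asin = $3, stage = 'asin_resolved'
      WHERE catalog_id = $1 AND upc = $2`,
    [catalogId, upc, asin]
  );
}

// Completion is a query over durable state, not a guess about the "last batch".
export async function isCatalogDone(catalogId: string): Promise<boolean> {
  const { rows } = await pool.query(
    `SELECT count(*)::int AS remaining
       FROM catalog_items
      WHERE catalog_id = $1 AND stage <> 'persisted'`,
    [catalogId]
  );
  return rows[0].remaining === 0;
}
```

Any worker can safely ask "is this catalog done?" at any time, and a crashed instance loses nothing because nothing important ever lived only in its memory.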
For SQS: I'd keep Standard unless you *must* preserve order. FIFO reduces throughput and you'll feel it. Instead, do idempotency at the edges:
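For example (same Postgres assumption as above; the dedup table and key are illustrative):

```typescript
// Sketch: claim the unit of work in the DB before doing it, so Standard SQS's
// at-least-once redeliveries become cheap no-ops. Table/key names are illustrative.
//
// CREATE TABLE processed_work (
//   catalog_id text,
//   upc        text,
//   stage      text,
//   done_at    timestamptz NOT NULL DEFAULT now(),
//   PRIMARY KEY (catalog_id, upc, stage)
// );
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Returns true if this worker won the claim and should do the work;
// false means another delivery already handled it and the message can just be deleted.
export async function claimWork(catalogId: string, upc: string, stage: string): Promise<boolean> {
  const res = await pool.query(
    `INSERT INTO processed_work (catalog_id, upc, stage)
     VALUES ($1, $2, $3)
     ON CONFLICT DO NOTHING`,
    [catalogId, upc, stage]
  );
  return res.rowCount === 1;
}
```

In practice you'd do the claim and the stage's own write in the same transaction, so a crash mid-work can't strand an item; the point is that a duplicate delivery costs one cheap insert attempt instead of a duplicate SP-API call or a double-write.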
Yes to "DLQ per stage". Different stages fail for different reasons.
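If you create the queues programmatically, a per-stage DLQ is just a RedrivePolicy attribute; a rough sketch with the AWS SDK v3 (queue names and maxReceiveCount are placeholders):

```typescript
// Sketch: each stage queue gets its own dead-letter queue via RedrivePolicy,
// so ASIN-resolution failures don't pile up next to fee-estimation failures.
// Queue names and maxReceiveCount are placeholders.
import {
  SQSClient,
  CreateQueueCommand,
  GetQueueAttributesCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export async function createStageQueue(stage: string) {
  const dlq = await sqs.send(new CreateQueueCommand({ QueueName: `${stage}-dlq` }));
  const dlqAttrs = await sqs.send(
    new GetQueueAttributesCommand({ QueueUrl: dlq.QueueUrl!, AttributeNames: ["QueueArn"] })
  );

  return sqs.send(
    new CreateQueueCommand({
      QueueName: stage,
      Attributes: {
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn: dlqAttrs.Attributes!.QueueArn,
          maxReceiveCount: "5", // tune per stage: throttling-heavy stages may deserve more retries
        }),
      },
    })
  );
}
```

That also gives you a natural place to alarm per stage (DLQ depth > 0) instead of one noisy catch-all.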
You may not need 4 queues forever. A common simplification:
Why: you reduce cross-queue coordination and state merging. Keep the stages split only if you truly need independent scaling knobs.
If you need "auditable runs, pause/resume, partial reruns, and clean visibility", a workflow engine is worth it.
If you share your current message schema (what fields you pass between stages) and what DB you're on, I can suggest a clean set of tables + idempotency keys that make "safe retries + exact completion" basically automatic. I'm speaking from my own experience here, thanks.
Discussion Type: Question

Discussion Content
I’m working on a Node.js service that ingests vendor catalog files (CSV/XLSX), enriches items via Amazon SP-API, computes profitability metrics, and persists results to a database.
The pipeline is distributed across multiple SQS queues to support horizontal scaling and better handling of Amazon SP-API rate limits.
I’m looking for feedback on architecture, token management, progress tracking, idempotency, and retry strategies.
High-Level Architecture
API Trigger
Endpoint: POST /api/process-csv
Payload:
Download & Parse
Multi-Stage SQS Pipeline
Stage 1: UPC → ASIN Resolution
Stage 2: Offers Fetch
Stage 3: Fees Estimation
Stage 4: Merge + Compute + Persist
Token Pool Strategy
Progress Tracking
Constraints & Pain Points
What I’m Looking for Feedback On
Token Management
Progress & Completion Tracking
Idempotency & Deduplication
Error Handling & Retries
Open to Suggestions
If you’ve built similar pipelines (file ingestion → enrichment APIs → persistence), I’d appreciate guidance on: