My question is: where in this design is scalability defined? Or rather, if we scale up the volume of data that goes in and out, which part of this design would start giving issues?
Scalability approaches are usually separated into horizontal and vertical scaling.
"Horizontal scaling" for a data pipeline usually refers to the ability of partitioning the input data into chunks which can be processed in independently independently from each other in parallel. If that's possible in your case depends heavily on the nature of the data and the specific transformation steps. One also has to evaluate whether the input and output stages of the pipeline may become a bottleneck. As mentioned here, several NoSQL databases support horizontal scaling through sharding.
"Vertical scaling" may either refer to introducing more intermediate processing steps into the pipeline, which only brings an improvement when those steps can run interleaved (but often for the price of a higher latency). Or it can refer to increasing the computing power of the hardware (or software) for the individual processing steps (for example, by using CPUs with higher clock speeds or, systems with higher memory or network speed, or an improved compiler/interpreter). However, since the maximum affordable CPU clock speed hasn't really increased any more over the last decade for the major CPU types, this option has lost itsit's practical importancerelevance to some degree. Today, improvements in hardware come mostly through parallelization.
In essence, vertical scaling is a restricted approach - it offers you a finite number of options, and once you have taken them, you may quickly reach their limits. Horizontal scaling, however, allows you to scale up and down far more dynamically (under the precondition that you are able to split the data into independently processable pieces).