Newest 'spark' Questions - Software Engineering Stack Exchange

0 votes

0 answers

93 views

How to connect to SFTP using Apache Spark 3.5 with Scala 2.12 for parallel file transfers?

I am working on a project where I need to transfer thousands of files (each between 50 and 60 MB) every hour from an SFTP server to local storage or AWS S3. I am using Apache Spark 3.5 with Scala 2....

Abhishek

11

asked Jul 8 at 10:44

2 votes

3 answers

282 views

Method naming conventions "setX" vs "withX"

While learning about Fluent Interfaces, I came across this post which states that using set hints that one is mutating the object, whereas with is returning a new object. I have seen this pattern firsthand ...

Ezequiel Castaño

151

asked Oct 30, 2022 at 23:57

1 vote

3 answers

979 views

Python: Is returning self in method chaining a violation of Demeter's law?

In Python it is very common to see code that uses method chaining; the main difference from code elsewhere is that this is also combined with returning an object of the same type but modified. This ...

Ezequiel Castaño

151

asked Oct 29, 2022 at 4:16

2 votes

1 answer

166 views

Data Ingest Architecture Advice

I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this. FINAL REQUIREMENTS The end goal of the ...

Sriram R

29

asked Feb 26, 2022 at 4:43

5 votes

1 answer

1k views

How do you perform accumulation on large data sets and pass the results as a response to REST API?

I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...

Namah

61

asked Feb 19, 2021 at 22:08

2 votes

1 answer

260 views

How (whether to?) include Apache Spark in my Architecture

Brief overview of general data flow The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic ...

foxtrotuniform6969

859

asked Feb 2, 2021 at 21:52

-2 votes

1 answer

695 views

Export huge Excel file

I develop a web application in Angular (frontend) and Scala (backend) for a big data team. Because they use large files for export/import, I built a module which is a copy of Microsoft Excel. So, what ...

AlleXyS

117

asked Jul 8, 2020 at 9:39

-2 votes

1 answer

97 views

What are the benefits of running Apache Spark on Kubernetes?

When running Apache Spark, one submits jobs to a cluster manager. The cluster manager is delegated the task of accepting or declining requests for resources. The cluster manager could either be ...

vi_ral

121

asked Jun 1, 2020 at 21:49

2 votes

0 answers

72 views

How to manage scheduled ETL jobs that are time sensitive?

We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter ...

Igneous01

2,333

asked Dec 12, 2019 at 14:41

-3 votes

1 answer

112 views

Can accessing the same API from different languages be more performant?

I've just started my first proper internship in industry (not learning to code but learning to write software that does stuff). My employer makes use of Apache Spark, as they do a lot of Big Data ...

HenryPrickettMorgan

13

asked Jun 5, 2019 at 18:19

2 votes

0 answers

37 views

How to design a report processing model using Spark in the most efficient way

I have a reporting system which gets time-series data from numerous meters (here I am referring to it as raw_data). I need to generate several reports based on different combinations of the incoming ...

Remis Haroon - رامز

121

asked May 2, 2019 at 7:02

0 votes

1 answer

143 views

Where do you put tests that are not unit tests in a Maven project?

I'm building a Spark-based, text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together ...

kingledion

119

asked Dec 28, 2018 at 19:57

-1 votes

1 answer

173 views

How to incrementally update value of features in a machine learning pipeline?

I am working on a machine learning pipeline where we have to compute certain measures on streaming data. Every day, new raw data enters our pipeline. To update our features, we have to run an ETL that ...

spoderman

7

asked Dec 18, 2018 at 9:38

1 vote

1 answer

3k views

Processing only once the same message produced by two producers

If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed? Is the only way to have an ...

Syed Jafri

43

asked Sep 6, 2018 at 9:57

0 votes

1 answer

77 views

Could Apache Spark be an option?

Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for the views, there is too long a delay. I have no experience with Spark, so the question is: Can we input ...

Mr Zach

269

asked Aug 8, 2018 at 11:16
