Questions tagged [spark]

0 votes
0 answers
93 views

I am working on a project where I need to transfer thousands of files (each sized between 50-60 MB) every hour from an SFTP server to local storage or AWS S3. I am using Apache Spark 3.5 with Scala 2....
Abhishek's user avatar
2 votes
3 answers
282 views

While learning about Fluent Interfaces, I came across this post, which states that using "set" hints that one is mutating the object, whereas "with" is returning a new object. I have seen this pattern firsthand ...
Ezequiel Castaño's user avatar
1 vote
3 answers
979 views

In Python it is very common to see code that uses method chaining; the main difference from code elsewhere is that this is also combined with returning an object of the same type but modified. This ...
Ezequiel Castaño's user avatar
2 votes
1 answer
166 views

I have a requirement where we need to collect N different events and store them for analysis. I am having trouble coming up with a general architecture for this. FINAL REQUIREMENTS The end goal of the ...
Sriram R's user avatar
5 votes
1 answer
1k views

I have around 125 million event records on S3. The S3 bucket structure is: year/month/day/hour/*. Inside each hour directory, we have files for every minute. A typical filename looks like this: ...
Namah's user avatar
  • 61
2 votes
1 answer
260 views

Brief overview of general data flow The general goal of my system is to allow users to upload many different types of files containing data (PDF, CSV, ZIP, etc.), then index it and perform some basic ...
foxtrotuniform6969's user avatar
-2 votes
1 answer
695 views

I am developing a web application in Angular (frontend) and Scala (backend) for a big data team. Because they use large files for export/import, I built a module that is a copy of Microsoft Excel. So, what ...
AlleXyS's user avatar
  • 117
-2 votes
1 answer
97 views

When running Apache Spark, one submits jobs to a cluster manager. The cluster manager is delegated the task of accepting or declining requests for resources. The cluster manager could either be ...
vi_ral's user avatar
  • 121
2 votes
0 answers
72 views

We have some ETL jobs that are scheduled to run every day, and some that are scheduled to run every week via Control-M. These types of jobs tag data with the date the job was run and perform filter ...
Igneous01's user avatar
  • 2,333
-3 votes
1 answer
112 views

I've just started my first proper internship in industry (not learning to code but learning to write software that does stuff). My employer makes use of Apache Spark, as they do a lot of Big Data ...
HenryPrickettMorgan's user avatar
2 votes
0 answers
37 views

I have a reporting system which gets time-series data from numerous meters (referred to here as raw_data). I need to generate several reports based on different combinations of the incoming ...
Remis Haroon - رامز's user avatar
0 votes
1 answer
143 views

I'm building a Spark-based, text analysis package using both Java and Scala. I have a series of transform functions, which take in one dataframe and spit out another, and that can be chained together ...
kingledion's user avatar
-1 votes
1 answer
173 views

I am working on a machine learning pipeline where we have to compute certain measures on streaming data. Every day, new raw data enters our pipeline. To update our features, we have to run an ETL that ...
spoderman's user avatar
1 vote
1 answer
3k views

If I have two different producers that could produce the same message for a Kafka broker, how can I ensure that only one of the two message occurrences gets processed? Is the only way to have an ...
Syed Jafri's user avatar
0 votes
1 answer
77 views

Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for a view, there is too long a delay. I have no experience with Spark, so the question is: Can we input ...
Mr Zach's user avatar
  • 269
