Questions tagged [apache-spark]
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write, originally developed in the AMPLab at UC Berkeley.
239 questions
0
votes
0
answers
13
views
Java heap space error even though only 1.5 GB of 5.8 GB is used and the df size is 3 GB
Why am I getting java.lang.OutOfMemoryError: Java heap space even when I have plenty of memory?
My simple code creates a dataframe from the input data, so no ...
1
vote
2
answers
69
views
Stuck on recursively loading parquet files of varying size with Spark
I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying size. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
0
votes
1
answer
104
views
Any interface/library that can take Python ML code and run it on a Spark cluster without learning PySpark?
I have been working with Python for machine learning and have a fair amount of code written in Python using libraries such as scikit-learn, pandas, and numpy. Recently, I’ve been faced with larger ...
0
votes
1
answer
214
views
Hadoop, Spark and Cloud
It seems Hadoop, Spark, and various cloud platforms offer facilities to store and analyze big data. There are some articles comparing Hadoop and Spark (for example, this article). There are also ...
0
votes
1
answer
371
views
IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) when training an ALS implementation of spark in scala
I was following this tutorial, trying to write a collaborative recommender system using the alternating least squares algorithm in Spark. I am using the MovieLens dataset, which can be found here.
My ...
1
vote
0
answers
74
views
Not able to read data from MongoDB for the schema below [closed]
I am trying to read a very complex JSON document from MongoDB. I have tried multiple approaches but with no luck. Sample schema below:
...
1
vote
1
answer
89
views
What is the difference between Data Modeling and Data Processing?
When discussing big data, it is sometimes mentioned that data modeling can be done using a tool like MapReduce, while data processing may be performed by Apache Spark. What is the difference ...
1
vote
0
answers
52
views
Working with massive data: what is the right approach?
Let's say I have a database with massive data (millions of rows).
Additionally, let's say 26 million rows are entered every day.
I want to build a fraud model to check these 26 million rows every day.
As ...
0
votes
1
answer
406
views
Group a spark dataframe by a starting event to an ending event
Given a series of events (with datetime) such as:
failed, failed, passed, failed, passed, passed
I want to retrieve the time from when it first "failed" to when it first "passed," ...
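Before reaching for Spark window functions, the interval logic itself can be sketched in plain Python (this is a minimal illustration of the "first failed to first passed" computation, not the asker's code; the sample timestamps are made up):

```python
from datetime import datetime

def first_fail_to_pass(events):
    """Given (timestamp, status) pairs sorted by time, return the interval
    from the first 'failed' event to the first 'passed' event that follows
    it, or None if either event is missing."""
    first_fail = None
    for ts, status in events:
        if status == "failed" and first_fail is None:
            first_fail = ts
        elif status == "passed" and first_fail is not None:
            return ts - first_fail
    return None

events = [
    (datetime(2020, 1, 1, 10, 0), "failed"),
    (datetime(2020, 1, 1, 10, 5), "failed"),
    (datetime(2020, 1, 1, 10, 20), "passed"),
    (datetime(2020, 1, 1, 10, 30), "failed"),
    (datetime(2020, 1, 1, 10, 40), "passed"),
]

print(first_fail_to_pass(events))  # 0:20:00
```

In Spark, the same idea would typically be expressed with a window ordered by the datetime column, but the sequential logic above is what any such query has to reproduce.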
0
votes
1
answer
545
views
Storage of N-dimensional matrices (tensors) as part of machine learning pipelines
I'm an infra person working on a storage product. I've been googling quite a bit to find an answer to the following question but have been unable to find one. Hence, I am attempting to ask it here.
I am ...
0
votes
0
answers
306
views
Generalized Additive Modeling Apache Spark implementation
Does Spark MLlib support Generalized Additive Modeling? How does one go about implementing GAM models in Spark?
I want to implement a GAM (generalized additive model) in Spark. Based on my ...
0
votes
1
answer
133
views
CREATE TABLE USING Oracle DATA_SOURCE
I am trying to create a table using Oracle as a data source via a Spark SQL query, but I am getting an error.
%sql
CREATE TABLE TEST
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:oracle:thin:@...
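For reference, a complete statement of this form might look like the following sketch; the connection URL, credentials, and table names here are placeholders, not values from the question:

```sql
CREATE TABLE TEST
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1",  -- placeholder host and service name
  dbtable "SCHEMA_NAME.SOURCE_TABLE",              -- placeholder source table
  user "db_user",                                  -- placeholder credentials
  password "db_password",
  driver "oracle.jdbc.driver.OracleDriver"
)
```

The Oracle JDBC driver jar must be on the cluster's classpath for this to work.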
0
votes
0
answers
339
views
Creating table in Databricks using the table from Oracle
I am trying to create a table in Databricks using an object in an Oracle DB, but I am getting an error
...
1
vote
1
answer
308
views
Outlier Elimination in Spark With InterQuartileRange Results in Error
I have the following function that is supposed to calculate the outliers for a given dataset.
...
1
vote
1
answer
43
views
Would it be possible/practical to build a distributed deep learning engine by tapping into ordinary PCs' unused resources?
I started thinking about this in the context of Apple's new line of desktop CPUs with dedicated neural engines. From what I hear, these chips are quite adept at solving deep learning problems (as the ...