Questions tagged [bigdata]
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.
454 questions
4
votes
0
answers
43
views
Optimizing S3 Data Lake for Low-Frequency Individual Record Lookups Prioritizing Simplicity
We have a large-scale S3 data lake with the following characteristics:
Source: AWS Flink application writing Parquet files directly to S3
Volume: ~4000 Parquet files per hour, ranging from 200GB to ...
1
vote
0
answers
37
views
Big data analytics questions [closed]
Assignment 1:
Suppose you have a 100 TB file to be stored in HDFS, divided into blocks of 10 TB each.
Draw how you can store this data, and what is the required number ...
3
votes
0
answers
55
views
Scalable Clustering Strategies for 300M Address Variants: Validation and Deduplication
I need to cluster 300 million unstructured addresses for validation, ensuring variants (e.g., "55 Tower F. EST City" vs. "Tower F 55, EST City, SINGA ROAD") map to a group similar ...
2
votes
1
answer
59
views
Calculating LOF for big data
I have a big dataset (hundreds of millions of records, dozens of GB in size) and I would like to run LOF (Local Outlier Factor) for anomaly detection (testing different methods for academic purposes) ...
1
vote
2
answers
69
views
Stuck on loading parquet files recursively of varying size with Spark
I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying sizes. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
2
votes
1
answer
50
views
How to process big data in my case?
I have 54 data files generated by a simulation. Each file has 10 million rows, and each file is several GB in size.
I need to read each file, compute its autocorrelation, and fit the curve.
What is ...
5
votes
2
answers
637
views
How to deal with high data volumes? (Tools, techniques, concepts, etc.)
I have some questions about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and ...
1
vote
0
answers
26
views
How to find own way in data field? [closed]
I need help finding my path in the data field. I just completed my MSc in Big Data at university. During my studies, I gained roughly junior-level Python and SQL programming skills, worked with ...
0
votes
2
answers
84
views
Data imputation for heavily missing features
I am currently working on the IEEE-CIS Fraud Detection dataset from Kaggle, which has around 350 features and around 600k instances. However, some features are missing large amounts of values, ...
0
votes
1
answer
60
views
Can I update the source of Data found in a Data Lake or Data Blob
Is it possible to update the source of data found in a Data Lake or Data Blob? What about while using HDInsight or Azure Databricks?
0
votes
1
answer
188
views
Gensim: create a dictionary from a large corpus without loading it in RAM?
The topic modelling library Gensim offers the ability to stream a large document instead of storing it in memory.
Streaming is possible for the stage of converting the corpus to BOW, but the ...
0
votes
2
answers
80
views
What kinds of ML models should I use when the outcome variable does not vary with time but only varies across individuals and groups?
I am trying to predict individuals’ income in 2018 using 18 years' worth of data for people who were born in 1978, 1979, and 1980 on many variables such as family income, location, family members’ ...
1
vote
1
answer
73
views
What is the name of my problem - distribution of counts of elements having certain attribute
I have the following problem:
There is a large set of records. Each record in the set has an attribute. For some values of the attribute, there is only one record; for other values there are many ...
2
votes
3
answers
105
views
Is big data a fallacy if most phenomena can be mostly described by few variables?
Is big data a fallacy if most phenomena can be mostly described by few variables?
This has confused me. Surely there are big data sets, but there are also cases when the set of significant or ...
0
votes
2
answers
121
views
How to manage large datasets (approx 95GB)
I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...