
Questions tagged [bigdata]

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

4 votes
0 answers
43 views

We have a large-scale S3 data lake with the following characteristics. Source: an AWS Flink application writing Parquet files directly to S3. Volume: ~4000 Parquet files per hour, ranging from 200GB to ...
stackiee
  • 141
1 vote
0 answers
37 views

Assignment 1: If you have a file of 100 TB required to be stored by HDFS, where this file is divided into blocks of 10 TB each. Draw how can you store these data, and what are the required number ...
king king
3 votes
0 answers
55 views

I need to cluster 300 million unstructured addresses for validation, ensuring variants (e.g., "55 Tower F. EST City" vs. "Tower F 55, EST City, SINGA ROAD") map to a group similar ...
IAIMT2024
2 votes
1 answer
59 views

I have a big dataset (hundreds of millions of records, dozens of GB in size) and I would like to perform LOF for the problem of anomaly detection (testing different methods for academic purposes) ...
Asic
  • 21
1 vote
2 answers
69 views

I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying size. I have a single worker with 10 cores and a memory allowance of 10GB. When I execute the ...
Ícaro Lorran
2 votes
1 answer
50 views

I have 54 data files generated by a simulation. Each file has 10 million rows and is several GB in size. I need to read each file, compute its autocorrelation, and fit the curve. What is ...
user366312
5 votes
2 answers
637 views

I have some doubts about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and ...
tms
  • 61
1 vote
0 answers
26 views

I need help finding my path in the data field. I just completed my MSc in Big Data at university. During my studies, I gained roughly junior-level Python and SQL programming skills, worked with ...
Gleb
  • 11
0 votes
2 answers
84 views

I am currently working on the IEEE-CIS Fraud Detection dataset, provided via Kaggle, which has around 350 features and around 600k instances. However, some features are missing large amounts of values, ...
Hai Nguyen
0 votes
1 answer
60 views

Is it possible to update the source of data found in a Data Lake or Data Blob? What about while using HDInsight or Azure Databricks?
JF0001
  • 101
0 votes
1 answer
188 views

The topic modelling library Gensim offers the ability to stream a large document instead of storing it in memory. Streaming is possible for the stage of converting the corpus to BOW, but the ...
Erwan
  • 27.8k
0 votes
2 answers
80 views

I am trying to predict individuals’ income in 2018 using 18 years’ worth of data for people who were born in 1978, 1979, and 1980 on many variables such as family income, location, family members’ ...
Aman Desai
1 vote
1 answer
73 views

I have the following problem: There is a large set of records. Each record in the set has an attribute. For some values of the attribute, there is only one record, for other values there are many ...
danatel
  • 111
2 votes
3 answers
105 views

Is big data a fallacy if most phenomena can largely be described by a few variables? This has confused me. Surely there are big data sets, but there are also cases when the set of significant or ...
mavavilj
  • 426
0 votes
2 answers
121 views

I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...
apvn
  • 1
