Newest 'sampling' Questions - Data Science Stack Exchange

3 votes

1 answer

45 views

Is there a fast method for sampling from document embeddings to maximize pairwise distances?

I have a large set of document embeddings, and I would like to sample a subset where the median or average pairwise distance is maximized. The idea here is to get a more balanced sample set where long ...

Layman

291

asked Nov 11, 2025 at 23:52
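
One common heuristic for spreading a sample across an embedding space is greedy farthest-point (max-min) selection. The sketch below is an illustration only, not an answer from the thread: it maximizes the minimum distance to already-chosen points, which is a proxy for, and not the same as, maximizing the median or average pairwise distance. The function name `farthest_point_sample` and the choice of Euclidean distance are assumptions.

```python
import numpy as np

def farthest_point_sample(embeddings, k, seed=0):
    """Greedy max-min selection (hypothetical helper).

    Each new pick is the point farthest from everything chosen so far.
    Cost is O(n * k) distance evaluations for n points.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    chosen = [int(rng.integers(n))]  # random starting point
    # Distance from every point to its nearest chosen point so far.
    min_dist = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen
```

Because already-chosen points have distance zero to themselves, they are not picked twice as long as the data contains at least `k` distinct points. For cosine distance, normalizing the rows first would be one reasonable adaptation.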

4 votes

1 answer

68 views

Bootstrap iterations in Orange data mining

In Orange data mining (GUI), what is the default number of iterations for the data sampler bootstrap? And is there a way to increase it?

lala

41

asked Sep 29, 2025 at 11:39

1 vote

1 answer

325 views

Best way to generate synthetic data

I want to hear y'all's opinions on synthetic data generation: which methods and tools you use, and why.

haneulkim

497

asked Sep 12, 2025 at 6:13

3 votes

1 answer

81 views

Should data be sent to Learner algorithm also in Orange?

I see that two different arrangements work in Orange to give a score for a model. Both work, but which of the two is the correct method? Does the selection of model (Tree, ...

rnso

1,648

asked Jul 14, 2025 at 12:43

4 votes

1 answer

73 views

How do I downsample huge datasets with sparse asymptotes?

I'm rendering charts for timeseries data composed of millions of records. The charts need to be interactive and have lots of feature support, so I need to downsample them. The problem I've encountered ...

nathan-w

41

asked Jun 21, 2025 at 6:14

2 votes

0 answers

50 views

Need help with model architecture and sampling negative edges

I am currently training a graph transformer model in order to develop an AI that can generate edges on an unseen graph (link dependencies between text with historical data). I divided my ...

lili

371

asked May 12, 2025 at 8:49

2 votes

0 answers

86 views

How do I train a model on data where there should be a statistical difference but it can't find it?

I'm trying to create a predictive model for a dataset with continuous input variables and a binary/probability output. The inputs are sensors (up to 400 columns, some of them irrelevant) which are ...

user46124

21

asked Apr 3, 2025 at 8:53

1 vote

0 answers

42 views

Sampling multiple masked tokens through Metropolis–Hastings

I'm trying to replicate the finding of the publication "Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis-Hastings" for obtaining the joint distribution ...

Chris

11

asked Aug 4, 2024 at 16:21

1 vote

0 answers

18 views

Optimizing Sampling Strategy to Enhance Uniformity Under Conditional Constraints

I am facing a challenge in a project that involves sampling from a design space defined by 10 variables. I use Latin Hypercube Sampling (LHS) and/or Sobol sequences, and initially, the samples are ...

Chris

11

asked Apr 23, 2024 at 14:16

4 votes

1 answer

88 views

Algorithm for picking N random uniformly distributed samples, in irregular polygon?

Say I want to pick a fixed number of samples from a large 2D dataset, such that they are relatively evenly distributed over the whole sample area. Imagine places in a country - so the border of the data is ...

barryhunter

171

asked Mar 5, 2024 at 17:02

1 vote

1 answer

305 views

Top_p parameter in langchain

I am trying to understand the top_p parameter in langchain (nucleus sampling) but I can't seem to grasp it. Based on this we sort the probabilities and select a ...

Labyrinthian

13

asked Mar 1, 2024 at 16:16
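
The "sort the probabilities and select" step the excerpt refers to can be sketched generically as follows. This is an illustration of nucleus (top-p) filtering under stated assumptions, not langchain's or any model provider's actual implementation; the function name `top_p_filter` is hypothetical, and real libraries may differ in how they treat ties and the boundary token.

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total mass
    reaches top_p, then renormalize (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # indices, most probable first
    cumulative = np.cumsum(probs[order])
    # First position where the running mass reaches top_p, as a count.
    cutoff = min(int(np.searchsorted(cumulative, top_p)) + 1, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()           # sample from this distribution
```

With probabilities `[0.5, 0.3, 0.15, 0.05]` and `top_p=0.7`, the first two tokens are kept (their mass, 0.8, is the first running total to reach 0.7) and the rest are zeroed before sampling.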

1 vote

1 answer

292 views

Correct way to take a subset of a dataset?

I am attempting a binary classification problem (using Weka). My dataset has 100,000 rows, 14 attributes (1 output variable). It already takes too long just to open the dataset in Excel so I just know ...

FlexMcMurphy

113

asked Dec 17, 2023 at 23:53

1 vote

1 answer

4k views

Why is 0.7, in general, the default value of temperature for LLMs?

I have recently read through a lot of documentation and articles about Large Language Models (LLMs), and I have come to the conclusion that 0.7 is, most of the time, the default value for the ...

jmpion

11

asked Nov 14, 2023 at 15:47

0 votes

1 answer

70 views

How to evaluate a model on our data when the model is imported from a library and thus not trained by us?

The company I work for has deployed VADER, a trained rule-based sentiment analyzer model, to make predictions on customers' attitudes. We import the model directly from the nltk library, so we didn't train ...

Shelby

3

asked Sep 22, 2023 at 13:28

1 vote

0 answers

43 views

Calculating an integral with as few grid points as possible

Suppose I have a function $f\colon [0,1] \to \mathbb{R}$ which is maybe continuous (it's at least in $L^1$). I have a sample of $N$ points $\{x_i\}$ taken from the domain $[0,1]$ randomly from some ...

math_guy

111

asked Jul 27, 2023 at 23:17
