
Questions tagged [sequence-modeling]

Questions about modeling sequential data, e.g. for audio analysis or time-series prediction.

2 votes
1 answer
34 views

Recurrent architectures such as LSTMs and GRUs were originally designed to address the vanishing gradient problem and capture long-range dependencies in sequential data. However, in recent years ...
Avalon Brooks
0 votes
0 answers
32 views

[fig-1] Notation & Assumptions: assume that for $t=1$ we feed the zero vector as $a^{<0>}$; $\hat{X}_i$ is the $i$th token, where $\hat{X}_i \in \mathbb{R}^{2}$; $a^{<t>[L]}_{j}$ is the $j$th node or ...
Rambal heart remo
1 vote
1 answer
162 views

I'm designing a neural network that takes input of shape (batch_size, seq_len, features), where seq_len can vary between samples....
bliu • 11
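A common way to handle the variable `seq_len` asked about above is to pad every sequence in a batch to the batch maximum and carry a boolean mask so downstream layers can ignore the padded steps. A minimal NumPy sketch (the function name and zero-padding convention are illustrative assumptions, not from the question):

```python
import numpy as np

def pad_batch(seqs, pad_value=0.0):
    """Pad a list of (seq_len_i, features) arrays to a common length.

    Returns:
        batch: (batch_size, max_len, features) padded array
        mask:  (batch_size, max_len) boolean array, True at real timesteps
    """
    max_len = max(s.shape[0] for s in seqs)
    features = seqs[0].shape[1]
    batch = np.full((len(seqs), max_len, features), pad_value, dtype=np.float32)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s
        mask[i, : s.shape[0]] = True
    return batch, mask

seqs = [np.ones((3, 2)), np.ones((5, 2))]
batch, mask = pad_batch(seqs)
# batch.shape == (2, 5, 2); mask[0] is [True, True, True, False, False]
```

In frameworks like PyTorch the same idea is packaged as `pad_sequence` plus `pack_padded_sequence`, but the mask-alongside-padding pattern is the core of it.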
0 votes
0 answers
53 views

I would like to generate sequences of tuples using a neural network algorithm such that the model trains on a dataset of sequences of tuples and generates synthetic sequences of tuples. Each tuple ...
Ben Bost • 101
1 vote
2 answers
524 views

I was recently brushing up on my deep-learning basics and came back to RNNs. LSTMs/GRUs and the Transformer architecture were invented to solve the RNN's vanishing/exploding-gradient problem. I was at ...
Vladislav Korecký
0 votes
0 answers
53 views

I have a time series forecasting problem whose data consists of date, item number, and quantity columns. I have defined a function which takes as input a data frame and a forecasting period (Daily, Weekly, Monthly, ...
Rohit • 1
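For a date/item/quantity frame like the one described above, the usual first step is to aggregate quantity per item at the requested frequency before forecasting. A pandas sketch, assuming hypothetical column names `date`, `item_no`, `quantity` (the question's real schema is not shown):

```python
import pandas as pd

# Map the question's period labels to pandas offset aliases (an assumption).
FREQ = {"Daily": "D", "Weekly": "W", "Monthly": "M"}

def aggregate(df, period):
    """Resample total quantity per item at the requested frequency."""
    return (df.set_index("date")
              .groupby("item_no")["quantity"]
              .resample(FREQ[period])
              .sum()
              .reset_index())

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-09"]),
    "item_no": [1, 1, 1],
    "quantity": [2, 3, 5],
})
weekly = aggregate(df, "Weekly")
# two weekly buckets for item 1, quantities 5 and 5
```

The forecasting model itself (per item or global) would then consume the regularly-spaced output of this step.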
1 vote
1 answer
619 views

Is it theoretically possible to use a transformer architecture to autoregressively generate a sequence of embedding vectors, instead of discrete tokens? For example, if I were to provide an input of a ...
Theo Coombes
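On the question above: nothing in an autoregressive decoder loop requires discrete tokens; you can feed each output vector straight back in as the next input. A toy sketch of that continuous loop, with an arbitrary linear-plus-tanh map standing in for a trained model (purely illustrative; a real transformer would attend over the whole prefix, and training such a model well is the hard part):

```python
import numpy as np

def generate_embeddings(step_fn, prompt, n_steps):
    """Autoregressively extend a sequence of embedding vectors.

    step_fn: maps the sequence so far, shape (t, d), to the next vector (d,)
    prompt:  (t0, d) array of seed embeddings
    """
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(step_fn(np.stack(seq)))   # feed outputs back in
    return np.stack(seq)

d = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d)) * 0.1
step_fn = lambda seq: np.tanh(seq[-1] @ W)   # stand-in "model"
out = generate_embeddings(step_fn, rng.normal(size=(2, d)), n_steps=3)
# out has 2 prompt vectors + 3 generated ones
```

The practical caveat, often raised with this idea, is error accumulation: regression targets in continuous space drift in a way that discrete sampling does not.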
1 vote
2 answers
2k views

I have been looking for the answer in other questions, but none tackled this. How is the padding mask accounted for in the formula of attention? The attention formula taking into ...
Daviiid • 605
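On the padding-mask question above: in standard implementations the mask enters additively, as large negative values on the masked score positions before the softmax, so padded keys receive (near-)zero attention weight. A minimal NumPy sketch of that convention:

```python
import numpy as np

def masked_attention(Q, K, V, key_mask):
    """Scaled dot-product attention with a key padding mask.

    key_mask: (n_keys,) boolean, True for real tokens, False for padding.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k)
    scores = np.where(key_mask, scores, -1e9)         # mask padded keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(2, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
mask = np.array([True, True, True, False])  # last key is padding
out, w = masked_attention(Q, K, V, mask)
# w[:, 3] is ~0: the padded position gets no attention
```

This is the same mechanism exposed as `src_key_padding_mask` in PyTorch's transformer layers; the `-1e9` (or `-inf`) stands in for "this position does not exist".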
2 votes
1 answer
180 views

Hadamard defines (Well-posed problem (Wikipedia)) a well-posed problem as one for which: a solution exists, the solution is unique, the solution depends continuously on the data (e.g. it is stable) ...
aren't eistert
1 vote
0 answers
76 views

I am practicing machine translation using a seq2seq model (more specifically, with GRU/LSTM units). The following is my first model: This model first achieved about a 0.03 accuracy score and gradually ...
Đạt Trần
4 votes
1 answer
1k views

As far as I know, attention was first introduced in Learning To Align And Translate. There, the core mechanism which is able to disregard the sequence length, is a dynamically-built matrix, of shape ...
Gulzar • 799
-1 votes
1 answer
293 views

While trying to understand transformers by reading "Attention Is All You Need", I noticed the authors constantly refer to "self attention" without explaining it. The original attention ...
Gulzar • 799
0 votes
0 answers
342 views

When exploring the Twitter Sentiment Analysis dataset on Kaggle, I came up with a model that looks like this: ...
Tran Khanh
1 vote
1 answer
128 views

I'm new to LSTMs, and I'm trying to do a basic timeseries prediction using stock prices. However, I'm a bit confused as to how the LSTM is supposed to remember outputs from previous timesteps when it ...
Krusty the Clown
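On the "how does the LSTM remember previous timesteps" question above: the recurrence itself carries the memory; at every step the cell receives its own previous hidden (and cell) state alongside the new input. A stripped-down recurrent loop makes the threaded state visible (a vanilla RNN cell rather than the full LSTM gates, purely to show the mechanism):

```python
import numpy as np

def run_rnn(xs, W_x, W_h, b):
    """Unroll a vanilla RNN: h_t = tanh(x_t W_x + h_{t-1} W_h + b)."""
    h = np.zeros(W_h.shape[0])        # h_0: initial state
    states = []
    for x in xs:                      # one iteration per timestep
        h = np.tanh(x @ W_x + h @ W_h + b)   # h carries the past forward
        states.append(h)
    return np.stack(states)           # (T, hidden)

rng = np.random.default_rng(2)
T, d_in, d_h = 6, 3, 5
xs = rng.normal(size=(T, d_in))
W_x = rng.normal(size=(d_in, d_h)) * 0.5
W_h = rng.normal(size=(d_h, d_h)) * 0.5
b = np.zeros(d_h)
H = run_rnn(xs, W_x, W_h, b)
# H[t] depends on all of xs[:t+1] through the carried state h
```

An LSTM replaces the single `tanh` update with gated updates to an additional cell state, which is what lets the influence of early inputs survive many steps.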
0 votes
2 answers
260 views

I need to predict a binary vector given a sequential dataset, meaning the current data point depends on its predecessors as well as its (known) successors. So it looks something like this: Given the ...
toom • 101
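When both predecessors and successors are known at prediction time, as in the question above, one baseline (short of a bidirectional RNN) is to hand each timestep a window of past and future context as its features. A NumPy sketch, with zero-padding at the sequence boundaries (the function name and window convention are illustrative):

```python
import numpy as np

def context_windows(x, k):
    """For each timestep t, gather [x[t-k], ..., x[t], ..., x[t+k]] as one
    feature row, zero-padding at the boundaries, so a per-step classifier
    can see both past and future context."""
    T, d = x.shape
    padded = np.vstack([np.zeros((k, d)), x, np.zeros((k, d))])
    return np.stack([padded[t : t + 2 * k + 1].ravel() for t in range(T)])

x = np.arange(10, dtype=float).reshape(5, 2)   # toy sequence, T=5, d=2
feats = context_windows(x, k=1)
# each of the 5 rows holds previous, current, and next timestep features
```

A bidirectional LSTM/GRU generalizes this by replacing the fixed window with learned forward and backward states.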
