0 votes
0 answers
87 views

In the torch example provided here https://github.com/pytorch/examples/tree/main/word_language_model, the transformer only uses torch.nn.TransformerEncoder, and torch.nn.TransformerDecoder is overwritten with a ...
cuneyttyler • 1,395
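For context, the repo's TransformerModel keeps nn.TransformerEncoder and replaces the "decoder" with a plain linear projection to the vocabulary, decoded under a causal mask. A minimal sketch of that pattern (not the repo's exact code; positional encoding omitted and hyperparameters illustrative):

    import math
    import torch
    import torch.nn as nn

    class EncoderOnlyLM(nn.Module):
        """Sketch: nn.TransformerEncoder plus a linear 'decoder' that projects
        hidden states to vocabulary logits (no nn.TransformerDecoder)."""
        def __init__(self, ntoken, d_model=200, nhead=2, nhid=200, nlayers=2, dropout=0.2):
            super().__init__()
            self.embed = nn.Embedding(ntoken, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, nhid, dropout)
            self.encoder = nn.TransformerEncoder(layer, nlayers)
            self.decoder = nn.Linear(d_model, ntoken)   # "decoder" is just a projection
            self.d_model = d_model

        def forward(self, src):                          # src: [seq_len, batch] token ids
            sz = src.size(0)
            # causal mask: each position may only attend to earlier positions
            mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
            x = self.embed(src) * math.sqrt(self.d_model)
            h = self.encoder(x, mask=mask)
            return self.decoder(h)                       # [seq_len, batch, ntoken] logits

    logits = EncoderOnlyLM(ntoken=1000)(torch.randint(0, 1000, (35, 8)))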
0 votes
0 answers
157 views

The complete code and data are available at: Google Disk. I'm working on a high-dimensional regression problem and have built a Transformer-based model in PyTorch. While the model trains, I'm observing ...
氢氰酸
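For readers without the linked files: a minimal sketch of a Transformer-encoder regressor for a high-dimensional target, assuming batch-first float features (layer sizes and mean pooling are illustrative assumptions, not the asker's model):

    import torch
    import torch.nn as nn

    class TransformerRegressor(nn.Module):
        """Sketch: project features, run a Transformer encoder, pool over the
        sequence, and regress to a high-dimensional target."""
        def __init__(self, in_dim, target_dim, d_model=128, nhead=4, nlayers=3):
            super().__init__()
            self.proj = nn.Linear(in_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, nlayers)
            self.head = nn.Linear(d_model, target_dim)

        def forward(self, x):                       # x: [batch, seq_len, in_dim]
            h = self.encoder(self.proj(x))          # [batch, seq_len, d_model]
            return self.head(h.mean(dim=1))         # mean-pool -> [batch, target_dim]

    model = TransformerRegressor(in_dim=64, target_dim=500)
    loss = nn.MSELoss()(model(torch.randn(4, 32, 64)), torch.randn(4, 500))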
1 vote
1 answer
115 views

In the paper “Using Prior Knowledge to Guide BERT’s Attention in Semantic Textual Matching Tasks”, they multiply a similarity matrix with the attention scores inside the attention layer. I want to ...
Blockchain Kid
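A minimal sketch of what "multiplying a similarity matrix into the attention scores" can look like, assuming the prior is combined with the raw scores before the softmax (the paper's exact combination rule may differ):

    import torch
    import torch.nn.functional as F

    def attention_with_prior(q, k, v, prior):
        """Sketch: fold a prior token-similarity matrix into raw attention scores
        before the softmax. q, k, v: [batch, heads, seq, head_dim];
        prior: [batch, 1, seq, seq] (broadcast over heads)."""
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # [b, h, s, s]
        scores = scores * prior              # one possible way to apply the prior
        weights = F.softmax(scores, dim=-1)
        return weights @ v

    b, h, s, d = 2, 8, 16, 32
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    prior = torch.rand(b, 1, s, s)
    out = attention_with_prior(q, k, v, prior)   # [2, 8, 16, 32]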
0 votes
1 answer
44 views

I read that a function f is equivariant if f(P(x)) = P(f(x)), where P is a permutation. So, to check what equivariant and permutation invariant mean, I wrote the following code: import torch import torch....
fenaux • 47
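Since the code in the excerpt is truncated, here is a minimal check of both definitions, assuming P acts by permuting the rows (token order):

    import torch

    torch.manual_seed(0)
    x = torch.randn(5, 3)                      # 5 "tokens", 3 features each
    perm = torch.randperm(5)
    P = lambda t: t[perm]                      # P permutes the rows (token order)

    f_equi = lambda t: t * 2 + 1               # per-row op -> permutation equivariant
    f_inv = lambda t: t.sum(dim=0)             # reduction over rows -> permutation invariant

    print(torch.allclose(f_equi(P(x)), P(f_equi(x))))   # True: f(P(x)) == P(f(x))
    print(torch.allclose(f_inv(P(x)), f_inv(x)))        # True: f(P(x)) == f(x)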
0 votes
0 answers
66 views

In transformer models, I've noticed that token embeddings and positional embeddings are added together before being passed into the attention layers: import torch import torch.nn as nn class ...
Yilmaz • 51k
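For reference, a minimal sketch of the pattern the excerpt describes, with learned positional embeddings added elementwise to token embeddings before any attention layer (sizes are illustrative):

    import torch
    import torch.nn as nn

    class Embeddings(nn.Module):
        """Token and positional embeddings are summed elementwise, so every
        position carries both 'what the token is' and 'where it sits'."""
        def __init__(self, vocab_size=1000, max_len=512, d_model=64):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)

        def forward(self, ids):                           # ids: [batch, seq_len]
            positions = torch.arange(ids.size(1), device=ids.device)
            return self.tok(ids) + self.pos(positions)    # broadcast add -> [batch, seq_len, d_model]

    emb = Embeddings()(torch.randint(0, 1000, (2, 10)))   # [2, 10, 64]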
0 votes
0 answers
96 views

I am trying to fine-tune a transformer/encoder-based pose estimation model, available here: https://huggingface.co/docs/transformers/en/model_doc/vitpose. When passing the "labels" attribute to ...
Soham Bhaumik
2 votes
1 answer
84 views

The problem: the similarity scores are almost the same for texts describing a photo of a cat and a photo of a dog (the photo is of a cat). Cat similarity: tensor([[-3.5724]], grad_fn=<MulBackward0>) ...
Yousef • 51
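Raw scores such as tensor([[-3.5724]]) are unnormalized logits; with CLIP-style models the comparable quantity is usually the softmax over logits_per_image across the candidate texts. A sketch using the Hugging Face CLIP classes (the checkpoint and the image path are assumptions):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("cat.jpg")                       # hypothetical local photo of a cat
    texts = ["a photo of a cat", "a photo of a dog"]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(**inputs).logits_per_image       # [1, 2] image-text similarity logits
    probs = logits.softmax(dim=-1)                      # relative probabilities over the two texts
    print(probs)                                        # the cat caption should dominate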
0 votes
0 answers
78 views

I’m new to AWS and struggling with an architecture involving AWS Lambda and a SageMaker real-time endpoint. I’m trying to process large batches of data rows efficiently, but I’m running into timeout ...
Kabir Juneja
0 votes
0 answers
41 views

I am working on a project to pre-train a custom transformer model I developed and then fine-tune it for a downstream task. I am pre-training the model on an H100 cluster and this is working great. ...
Martin Weiss
0 votes
0 answers
164 views

I am trying to use the bitsandbytes library for 4-bit quantization in my model loading function, but I keep encountering an ImportError. The error message says, "Using bitsandbytes 4-bit ...
from • 1
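That ImportError usually points at a missing or mismatched bitsandbytes/accelerate install rather than the model code itself. For reference, a minimal 4-bit loading sketch with BitsAndBytesConfig (the model name is an assumption; a CUDA GPU and recent bitsandbytes/accelerate are required):

    # Assumes: pip install -U transformers accelerate bitsandbytes  (and a CUDA GPU)
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model_name = "mistralai/Mistral-7B-v0.1"            # hypothetical example model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )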
0 votes
0 answers
34 views

When fine-tuning HuBERT to detect phonemes, I started from a fine-tuned ASR HuBERT model, removed the last two layers, and added a linear layer sized to the phoneme vocab_size in the config. What is ...
Ngoc Anh
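One common recipe for this is to keep the ASR checkpoint's encoder and re-initialize only the CTC head with the phoneme vocabulary size. A sketch assuming HubertForCTC and an example checkpoint (the phoneme vocab size is a placeholder):

    from transformers import HubertForCTC

    phoneme_vocab_size = 44           # hypothetical number of phoneme symbols (incl. blank/pad)

    # Reuse the ASR checkpoint's encoder; the CTC head (lm_head) changes shape,
    # so allow it to be re-initialized for the new vocabulary.
    model = HubertForCTC.from_pretrained(
        "facebook/hubert-large-ls960-ft",     # example fine-tuned ASR checkpoint
        vocab_size=phoneme_vocab_size,
        ignore_mismatched_sizes=True,
    )
    print(model.lm_head)                      # Linear(..., out_features=44)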
0 votes
1 answer
181 views

In appendix B of the PaLM paper (https://arxiv.org/pdf/2204.02311) a metric called "Model FLOPs Utilization (MFU)" is described, along with the formula for estimating it. Its computation makes ...
cangozpi • 159
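For reference, appendix B estimates model FLOPs per token as roughly 6N plus a self-attention term (12·L·H·Q·T), and MFU as achieved FLOPs per second divided by the hardware's peak. A small worked sketch; the throughput and hardware numbers below are illustrative assumptions, not measurements:

    def mfu(tokens_per_sec, n_params, n_layers, d_model, seq_len, peak_flops):
        """MFU estimate in the spirit of PaLM appendix B:
        FLOPs/token ~= 6*N (matmuls) + 12*L*H*Q*T (self-attention), with H*Q = d_model."""
        flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
        achieved_flops_per_sec = tokens_per_sec * flops_per_token
        return achieved_flops_per_sec / peak_flops

    # Illustrative numbers only: a ~1.3B model at 20k tokens/s on one A100 (BF16 peak ~312 TFLOPs)
    print(mfu(tokens_per_sec=20_000, n_params=1.3e9, n_layers=24,
              d_model=2048, seq_len=2048, peak_flops=312e12))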
2 votes
1 answer
457 views

General question (hopefully useful for people coming from Google): what to do when the gradient explodes? When working with transformers and deep NNs (with PyTorch), do you have a mental checklist of ...
Nicholas Kryger-Nelson
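One item that is on nearly every such checklist is monitoring and clipping the gradient norm; a minimal sketch:

    import torch

    def training_step(model, batch, loss_fn, optimizer, max_norm=1.0):
        """Training step that monitors and clips the gradient norm.
        Returns the pre-clipping total norm so exploding gradients are visible."""
        optimizer.zero_grad()
        inputs, targets = batch
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # clip_grad_norm_ returns the total norm computed *before* clipping
        total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        return loss.item(), total_norm.item()

    model = torch.nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    batch = (torch.randn(8, 10), torch.randn(8, 1))
    print(training_step(model, batch, torch.nn.MSELoss(), opt))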
2 votes
0 answers
316 views

Problem (see distil-large-v3#sequential-long-form): I'm using distil-whisper through the 🤗 Transformers pipeline for speech recognition. When setting return_timestamps=True, the timestamps reset to 0 every ...
Martin Zhu
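For reference, one way to call the pipeline with timestamps (the audio path is an assumption; whether timestamps are stitched across windows depends on the chunked vs. sequential long-form decoding path):

    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-large-v3",
        chunk_length_s=30,          # chunked long-form; omit to use sequential decoding
    )

    result = asr("long_audio.wav", return_timestamps=True)   # hypothetical local file
    for chunk in result["chunks"]:
        print(chunk["timestamp"], chunk["text"])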
0 votes
0 answers
12 views

I am working on an end-to-end (E2E) project for websites that involves: Capturing Tight Screenshots of Data Tables: The project automatically detects and takes precise screenshots of all the data ...
Michael Dzwinel
1 vote
1 answer
449 views

I have been trying to run TFBertModel from Transformers, but it keeps throwing this error: ValueError Traceback (most recent call last) Cell In[9], line 1 ----> 1 ...
Faiz khan
0 votes
1 answer
681 views

I’m working on an audio recognition task using a Transformer-based model in PyTorch. My input features are generated by a CNN-based embedding layer and have the shape [batch_size, d_model, n_token], ...
MuxAte • 43
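A CNN front-end emits channels-first features of shape [batch, d_model, n_token], while PyTorch transformer layers with batch_first=True expect [batch, n_token, d_model], so a permute is needed in between. A minimal sketch (shapes from the excerpt, layer sizes illustrative):

    import torch
    import torch.nn as nn

    batch_size, d_model, n_token = 4, 256, 100
    cnn_features = torch.randn(batch_size, d_model, n_token)   # CNN output: channels first

    # nn.TransformerEncoder with batch_first=True wants [batch, seq_len, d_model]
    x = cnn_features.permute(0, 2, 1)                           # -> [4, 100, 256]

    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)
    out = encoder(x)                                            # [4, 100, 256]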
0 votes
0 answers
191 views

Does anyone know how to fine-tune the Tacotron2 and WaveGlow models from the NVIDIA pre-trained Tacotron2 and WaveGlow checkpoints? The first thing I did was create my own dataset in the same format as the LJSpeech ...
Izukishi
0 votes
1 answer
49 views

I'm reading a file using a Sequential File stage in DataStage and doing some transformations on the data using a Transformer stage. I want to compare the current row with the previous row, to check a value of ...
Chaimaa Emily
0 votes
0 answers
41 views

I am busy with a forecasting model, and have turned to Transformers to see if they will be able to perform better than other sequence models. I keep getting the error: TypeError ...
Tayla Corney
0 votes
1 answer
36 views

I am trying to budget for setting up an LLM-based RAG application which will serve a dynamic number of users (anything from 100 to 2000). I am able to figure out the GPU requirement to host a certain llm[...
Bing • 631
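Beyond the weights, concurrency mostly costs KV-cache memory, which grows with context length and the number of simultaneous requests. A back-of-the-envelope sketch; all shapes and counts below are illustrative assumptions:

    def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, concurrent_requests, bytes_per_elem=2):
        """KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
        Total = per-token bytes * context length * concurrent requests."""
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token * seq_len * concurrent_requests / 1e9

    # Illustrative 7B-class model shapes, fp16 cache, 50 concurrent 4k-token requests:
    print(kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=4096, concurrent_requests=50))   # ~107 GB -> paging/batching needed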
1 vote
0 answers
34 views

I am trying to write a simple quantized tensor linear multiplication. Assume a weight matrix w3 of shape (14336, 4096) and an input tensor x of shape (2, 512, 4096), where the first dim is ...
hafezmg48
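A minimal reference for those shapes: per-output-row symmetric int8 quantization of the weight, then a dequantize-and-matmul. This is a correctness sketch, not any library's optimized kernel:

    import torch

    x = torch.randn(2, 512, 4096)                 # [batch, seq, in_features]
    w3 = torch.randn(14336, 4096)                 # [out_features, in_features]

    # Per-output-row symmetric int8 quantization of the weight.
    scale = w3.abs().amax(dim=1, keepdim=True) / 127.0                 # [14336, 1]
    w3_q = torch.clamp((w3 / scale).round(), -127, 127).to(torch.int8)

    # "Quantized" linear: dequantize on the fly, then matmul (reference implementation).
    y = x @ (w3_q.float() * scale).t()            # [2, 512, 14336]

    # Error vs. full precision stays small for well-behaved weights.
    print((y - x @ w3.t()).abs().max())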
0 votes
1 answer
102 views

I am trying to do some structured text extraction using some KV caching tricks. For this example I will use the following model and data: model_name = "Qwen/Qwen2.5-0.5B-Instruct" model = ...
sachinruk • 10k
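The usual KV-caching trick here is to run the shared prompt once, keep past_key_values, and reuse it for every continuation. A sketch with the model named in the excerpt (the prompt text and the short greedy loop are assumptions):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Encode the shared prefix once and cache its keys/values.
    prefix = tokenizer("Extract the fields from: John Doe, 42, Berlin.\n", return_tensors="pt")
    with torch.no_grad():
        out = model(**prefix, use_cache=True)
    past = out.past_key_values

    # Reuse the cached prefix and continue greedily, one token at a time.
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    generated = [next_id]
    with torch.no_grad():
        for _ in range(10):
            out = model(input_ids=next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(-1, keepdim=True)
            generated.append(next_id)

    print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))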
0 votes
1 answer
98 views

I am trying to develop a transformer sequence-to-vector model but am encountering performance issues. I am working with a Tesla V100-PCIE-16GB. Whenever the model encounters an unseen sequence length, the (...
D. E. • 1
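A common mitigation for slowdowns on unseen sequence lengths is padding inputs up to a small set of bucket lengths, so kernels or compiled graphs are reused instead of re-tuned for every new shape. A minimal sketch (bucket sizes are illustrative):

    import torch
    import torch.nn.functional as F

    BUCKETS = [32, 64, 128, 256, 512]    # the only lengths the model will actually see

    def pad_to_bucket(x, pad_value=0.0):
        """Pad [batch, seq_len, d_model] up to the next bucket length and return a
        padding mask, so every batch hits one of a few fixed shapes."""
        seq_len = x.size(1)
        target = next(b for b in BUCKETS if b >= seq_len)
        padded = F.pad(x, (0, 0, 0, target - seq_len), value=pad_value)
        mask = torch.zeros(x.size(0), target, dtype=torch.bool)
        mask[:, seq_len:] = True          # True marks padded (ignored) positions
        return padded, mask

    x = torch.randn(4, 100, 256)
    padded, key_padding_mask = pad_to_bucket(x)    # padded: [4, 128, 256]
    # pass key_padding_mask as src_key_padding_mask to an nn.TransformerEncoderLayer
    print(padded.shape, key_padding_mask.shape)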
0 votes
1 answer
261 views

This multi-head self-attention code causes the training loss and validation loss to become NaN, but when I remove this part, everything goes back to normal. I know that when the training loss and ...
Fuji • 117
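Since the attention code itself is not shown in the excerpt, here is a numerically careful reference for scaled dot-product attention: scores are scaled by sqrt(head_dim) and masked positions get a large negative finite value, so the softmax never sees an all -inf row (a common source of NaNs):

    import torch
    import torch.nn.functional as F

    def attention(q, k, v, mask=None):
        """Reference scaled dot-product attention with the usual NaN traps avoided:
        scores scaled by sqrt(head_dim); masked positions filled with a large
        negative finite value so even a fully masked row stays finite."""
        scale = q.size(-1) ** 0.5
        scores = q @ k.transpose(-2, -1) / scale          # [..., seq_q, seq_k]
        if mask is not None:
            scores = scores.masked_fill(mask, -1e4)       # finite, safe for fp16 as well
        return F.softmax(scores, dim=-1) @ v

    q = k = v = torch.randn(2, 8, 16, 32)                 # [batch, heads, seq, head_dim]
    mask = torch.zeros(2, 1, 16, 16, dtype=torch.bool)    # True means "mask out"
    out = attention(q, k, v, mask)
    print(torch.isnan(out).any())                         # tensor(False)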
