Newest 'data-preprocessing' Questions

0 votes

0 answers

26 views

Assistance with Data Processing Insurance Premiums

I have been set a task by my manager to try and predict insurance premiums based on some categories such as job description, number of people employed and turnover. I am comparing between K-Nearest ...

Red_bull

19

asked Jul 29 at 13:57

0 votes

1 answer

58 views

Multivalued column cannot be transformed

I'm working with the Stack Overflow 2024 survey. In the CSV file there are several multivalued variables (separated by ;). I want to apply one-hot encoding to the variables Employment and LanguageAdmire by ...

Lev

843

asked Jun 3 at 10:42
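
For the multivalued-column question above, one common approach is to split each cell on the separator and build indicator columns. The sketch below is only an illustration: the sample rows are made up, the helper name `one_hot_multivalued` is hypothetical, and it assumes the survey columns are plain strings separated by `;`.

```python
import pandas as pd

# Made-up sample rows; the real survey data is assumed to have the same shape.
df = pd.DataFrame({
    "Employment": [
        "Employed, full-time;Student, part-time",
        "Student, part-time",
        None,  # missing answer
    ],
})

def one_hot_multivalued(frame, column, sep=";"):
    """Return one indicator column per distinct value found in `column`."""
    # Missing answers become empty strings, which produce an all-zero row.
    dummies = frame[column].fillna("").str.get_dummies(sep=sep).astype(int)
    return dummies.add_prefix(f"{column}_")

encoded = one_hot_multivalued(df, "Employment")
print(encoded)
```

The result can be joined back to the original frame with `df.join(encoded)` if the other columns are still needed.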

0 votes

0 answers

15 views

Does Modifying an Attribute of a Custom Dataset Affect Both Subsets After random_split in PyTorch?

I am working on a binary classification task using an audio dataset, which is already divided into training and testing sets. However, I also need a validation set, so I split the training set into ...

GauravGiri

21

asked Mar 1 at 4:52

0 votes

1 answer

47 views

Is there a way to set the data_min and the data_max in MinMaxScaler()?

I'm currently using MinMaxScaler() on my dataset. However, because my dataset is large, I'm doing a first pass in batches to compute the min and max values for my scaler. I'm using ...

Saffy

13

asked Feb 5 at 21:57
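
For the batch-wise MinMaxScaler question above, scikit-learn's `MinMaxScaler.partial_fit` updates the running minimum and maximum one batch at a time, which may avoid having to set `data_min_` and `data_max_` by hand. This is a minimal sketch with synthetic data; the array shape and the batch size of 100 are assumptions, not details from the question.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a dataset too large to fit in one pass.
rng = np.random.default_rng(0)
data = rng.uniform(-5.0, 20.0, size=(1000, 3))

scaler = MinMaxScaler()
batch_size = 100  # assumed batch size
for start in range(0, len(data), batch_size):
    # Each call widens the stored per-feature min/max if the batch requires it.
    scaler.partial_fit(data[start:start + batch_size])

# Second pass: transform (here done in one call for brevity).
scaled = scaler.transform(data)
print(scaler.data_min_, scaler.data_max_)
```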

0 votes

0 answers

17 views

How to combine columns with nested lists with each other using pandas? [duplicate]

I'm working on a pandas DataFrame that contains columns with lists and am currently trying the explode method, but I'm not getting the desired output. Instead, it does a Cartesian product, combining all ...

buzzo

1

asked Jan 14 at 14:58

2 votes

0 answers

65 views

kernel died when I run : dataset = Dataset.from_dict(data_dict)

I am fine-tuning the SAM model for my dataset containing train_images and train_masks. I am able to create the dict, but when calling the last command, i.e. to load the dataset from the dict, the kernel dies. It happened ...

Sanju

21

asked Dec 11, 2024 at 10:39

0 votes

1 answer

62 views

Varying embedding dim due to changing padding in batch size

I want to train a simple neural network, which has embedding_dim as a parameter: class BoolQNN(nn.Module): def __init__(self, embedding_dim): super(BoolQNN, self).__init__() self....

samuel gast

392

asked Oct 18, 2024 at 15:54

-1 votes

1 answer

189 views

Capitalized words in sentiment analysis

I'm currently working with data from customer reviews of products from Sephora. My task is to classify them by sentiment: negative, neutral, positive. A common technique of text preprocessing is to ...

read data

3

asked Aug 30, 2024 at 13:49

1 vote

0 answers

23 views

how can I transform the categorical data entered by the user using Target Encoding?

When fitting the model in Google Colab there doesn't seem to be any problem. However, when I try to create an interface using Streamlit and pickle, the target encoder doesn't work and I am unable to solve ...

user25546188

11

asked Aug 22, 2024 at 22:19

0 votes

0 answers

52 views

How can I preprocess a feature that contains a list of number codes?

I have to preprocess a feature which is basically a list of number codes encoded as a string, and I want to encode it such that the output is an array of frequencies of each of these numbers. The ...

AKHIL GOPIKUMAR

1

asked Jul 27, 2024 at 15:30
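
For the number-code question above, one possible sketch is to count the codes in each string and read the counts out in a fixed order. The excerpt does not show the actual string format, so the comma separator, the example codes, and the helper name `code_frequencies` are all assumptions made for illustration.

```python
from collections import Counter

def code_frequencies(raw, vocabulary, sep=","):
    """Count how often each code in `vocabulary` appears in the string `raw`."""
    # Strip whitespace around each token and skip empty tokens.
    counts = Counter(token.strip() for token in raw.split(sep) if token.strip())
    # Fixed output order: one slot per known code, 0 when the code is absent.
    return [counts.get(code, 0) for code in vocabulary]

vocabulary = ["101", "205", "309"]  # assumed set of known codes
print(code_frequencies("101, 205, 101", vocabulary))  # prints [2, 1, 0]
```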

1 vote

2 answers

682 views

How can I create a custom sigmoid function?

I am trying to build a custom sigmoid-shaped function because I want to scale my data during preprocessing. Basically, the goal is to obtain a sigmoid shaped function that outputs from 0 to 1 and only ...

cercio

89

asked Jun 25, 2024 at 10:32

1 vote

0 answers

85 views

How do I ensure unique non-overlapping values in each column?

I have the following input: data = { 'Group_A': ['0&1', '1&5', '0&5', '1&7', '3&8', '4&8', '3&5', '4&4'], 'Group_B': ['1&0', '5&7', '0&5'...

deepcurious

19

asked Jun 13, 2024 at 7:16

0 votes

1 answer

838 views

SageMaker Processing Job permission denied to save csv file under /opt/ml/processing/<folder>

I am working on a project involving Step Functions with SageMaker. I have an existing Step Function that I need to integrate SageMaker into, and I tried adding steps such as processing, model training,...

Gwenda Thomas

101

asked May 29, 2024 at 19:34

-4 votes

1 answer

64 views

Is there an excel function to assign a binary result to a predefine data cell?

Sorry for the title, I know it might be fairly broad and not very informative. I am facing a problem regarding the analysis of a data set. The participants of my experiments were randomly assigned ...

taboulet

1

asked May 20, 2024 at 14:53

0 votes

1 answer

374 views

Filtering Pandas DataFrame by Substring Match at Start of Strings [duplicate]

I'm trying to filter out rows in which the value of a specific column starts with a given substring. I have a pandas.DataFrame as shown below (simplified): price DRUG_CODE 123 A12D958 234 B564F3C ... ... I'm ...

Warren Chen

51

asked May 3, 2024 at 10:02
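
For the prefix-filtering question above, pandas offers `Series.str.startswith`, which returns a boolean mask that can keep or drop matching rows. In this sketch the first two rows come from the excerpt; the third row and the prefix `"A12"` are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [123, 234, 345],
    "DRUG_CODE": ["A12D958", "B564F3C", "A12X000"],  # third row is made up
})

prefix = "A12"  # assumed prefix
# na=False treats missing codes as non-matching instead of propagating NaN.
mask = df["DRUG_CODE"].str.startswith(prefix, na=False)

matching = df[mask]       # rows whose code starts with the prefix
remaining = df[~mask]     # rows with the prefix filtered out
print(remaining)
```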

0 votes

1 answer

33 views

Sklearn Column Transformer not working for mixed data types

from sklearn.compose import ColumnTransformer from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder from sklearn.pipeline import Pipeline from sklearn.model_selection import ...

s213439

1

asked Apr 30, 2024 at 8:20

0 votes

0 answers

39 views

Failed to convert a NumPy array to a Tensor for LSTM

I'm trying to run an LSTM model where the data is separated into a few columns in CSV files, and I'm trying to prepare data from such CSVs. Getting the error ValueError: Failed to convert a NumPy array to a ...

Athul Srinivas

36

asked Apr 25, 2024 at 15:22

1 vote

1 answer

2k views

Why is my GPU not being used despite having turned it on in Kaggle?

I've uploaded a dataset to Kaggle (approx. 73 GB), and I'm trying to preprocess this data for model training purposes. This dataset has a large number of missing values, which I am trying to interpolate ...

54m4gr4

13

asked Apr 23, 2024 at 12:54

0 votes

1 answer

620 views

Issue when padding and packing sequences in LSTM networks using PyTorch

I'm trying to make a simple LSTM neural network. I've got time series data which I am splitting into sequences and batches using PyTorch's Dataset and DataLoader. To account for the variable lengths ...

D Danne

17

asked Apr 18, 2024 at 19:35

0 votes

0 answers

55 views

TypeError: Cannot do positional indexing on RangeIndex with these indexers of type DataFrame

I'm new to Python, so I'm sorry if this is a basic one. However, after I ran the code, I got this: TypeError: cannot do positional indexing on RangeIndex with these indexers [ Year Average of PM ...

Sofia

1

asked Apr 6, 2024 at 10:14

0 votes

1 answer

102 views

Feature Scaling with MinMaxScaler()

I have 31 features to be input into an ML algorithm. Of these, 22 features already have values in the range 0 to 1. The remaining 9 features vary between 0 and 750. My doubt is if I choose to apply ...

rekha

7

asked Mar 19, 2024 at 5:40

1 vote

1 answer

38 views

Using sklearn where the label a combination of multiple inputs [closed]

I'm performing data analysis on a dataset whose categorical labels are interrelated. My labels track experimental conditions. In my case, labels track concentrations of combinations of two chemicals ...

WoolyThomas

47

asked Mar 12, 2024 at 22:39

0 votes

0 answers

95 views

Sklearn inverse_transformation does not work as expected, any alternatives?

from sklearn.preprocessing import MinMaxScaler values = df[['Close']] #values is floats ranging from 0.06 to 190.08 sc = MinMaxScaler() scaled_values = sc.fit_transform(values) descaled_values = sc....

haintaki

11

asked Mar 8, 2024 at 0:43

0 votes

0 answers

57 views

Is there a faster method to process pandas list of string values

There are approximately 13,000 values for a given column. The function below takes a list of strings as input and does the NER tagging for each word in the list. On average there ...

srinivas muralidharan

39

asked Feb 14, 2024 at 10:38

0 votes

0 answers

93 views

Worse performance with increased direct_num_workers when running preprocessing of DLRM with Apache Beam

I am now trying to run preprocessing tasks of DLRM with Apache Beam https://github.com/tensorflow/models/tree/master/official/recommendation/ranking/preprocessing. The dataset is Criteo Kaggle 10GB ...

Eric

1

asked Jan 29, 2024 at 8:36
