2,345 questions
Advice
0
votes
4
replies
83
views
How to create stable person identifiers when names vary across years
I am working with a university faculty salary dataset where the same person appears across many years, but their name strings are inconsistent. The dataset has about 8,000 unique people and years from ...
0
votes
2
answers
84
views
Automatically map messy column names to a standard schema in pandas
I'm working with many tabular datasets (Excel, CSV) that contain inconsistent or messy column names due to typos, different naming conventions, spacing, punctuation, etc.
I have a standard schema (as ...
-2
votes
1
answer
116
views
How to match German province names between 2 data sets in R?
I'm working with two datasets for German NUTS-3 level regions:
A shapefile from Eurostat via the giscoR package:
> library(giscoR)
> nuts3_germany <- gisco_get_nuts(country = "Germany...
2
votes
3
answers
121
views
Pandas DataFrame column partial match and extract matching value
I have a column in a pandas DataFrame (Names) with a large collection of names. Another DataFrame (Title) has a text column in which the names from the Names frame appear within the text. What would be the ...
1
vote
1
answer
71
views
Match similar names [duplicate]
I have a database with three columns: name, occupation, and organization. In these columns, I have duplicates with slightly different names. For example, Anne Sue Frank and Anne S. Frank refer to the ...
1
vote
3
answers
94
views
Find str.contains in two large Pandas DataFrames
I have large pandas DataFrames like the ones below.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
("1", "Dixon Street", "Auckland"),
("2...
1
vote
1
answer
78
views
How to match a function but exclude object methods without negative lookbehind
I'm trying to write a regex that matches every occurrence of some_function(...), but it should not match when it's part of an object method like my.some_function(...) or if it is a substring of ...
1
vote
0
answers
75
views
Trying to fix names in my database with fuzzywuzzy
What I'm trying to do is find and correct similar names in my database, like 'Patrick Maxwell' and 'Patrick Maxwel.' However, the issue I'm facing is that the best match for each name is often itself, ...
1
vote
1
answer
2k
views
How can I check if a string contains another string in PowerShell?
I have a string returned from an API call; the string is something like
".\controllers\myaction c:\test\path"
I want to use PowerShell to check if the string contains c:\
...
0
votes
2
answers
71
views
Dynamic approach to identify and standardize similar names automatically in pandas or data cleaning
I have a DataFrame with a column of publisher names that contains various minor variations of the same publisher. For example, entries such as "Harlequin Romance", "Harlequin Blaze"...
0
votes
1
answer
62
views
Renaming dataframe column in Python with a string value in another dataframe by matching column/index names
Major edit:
Apparently it is difficult to understand my question, so I'll do my best to concretize it.
I have two dataframes, "df1" and "df2". These are quite large, larger than in ...
0
votes
2
answers
75
views
Is there a way to obtain a list separated by comma as the output of str_extract_all instead of the default output in R?
I have searched high and low and nobody seems to have asked that exact question, so I'm at a loss.
I have a data frame with a couple of columns. One of these columns contains various sentences that don't ...
2
votes
6
answers
133
views
Matching the start of a sequence in R
I have a series of strings in a vector and need to remove the matching starting pattern from each string. However, I don't know the pattern or how long it is.
stringa <- c("apple_tart", ...
2
votes
2
answers
335
views
How can I find all exact occurrences of a string, or close matches of it, in a longer string in Python?
Goal:
I'd like to find all exact occurrences of a string, or close matches of it, in a longer string in Python.
I'd also like to know the location of these occurrences in the longer string.
To define ...
0
votes
0
answers
80
views
How to efficiently compute similarity scores for prefixes of a string with another string in C?
I'm working on a problem involving string matching where I need to compute the similarity scores for each prefix of a string C against another string S. The similarity score for a prefix P of C and S ...
0
votes
1
answer
635
views
How to do fuzzy merge with 2 large pandas dataframes?
I have 2 pandas dataframes that both contain company names. I want to merge these 2 dataframes on company names using a fuzzy match. But the problem is 1 dataframe contains 5m rows and the other 1 ...
2
votes
3
answers
167
views
How to Compare Hierarchy in 2 Pandas DataFrames? (New Sample Data Updated)
I have 2 dataframes that captured the hierarchy of the same dataset. Df1 is more complete compared to Df2, so I want to use Df1 as the standard to analyze if the hierarchy in Df2 is correct. However, ...
-1
votes
1
answer
85
views
How do I find the first # after an even number of "?
Reading a text file with the format:
e2c=["(vsim-86)" ,'kkk', "pppp",
"bbbbbb", #"old", "uio",
" sds # sds", #"old2",
" sds #...
0
votes
0
answers
61
views
String Matching Function Not Matching Strings Despite Threshold Set to 0
I have implemented a string matching function in Python utilizing n-grams and similarity ratios. The function signature is as follows:
# concise version of the function
def match_strings(...
1
vote
1
answer
62
views
Is there a way to recode a vector of strings based on two key words or phrases that appear in every value into new vector with those two values?
As my question indicates, I would like to convert a vector of strings into a new vector containing one of two values that appear in every string. Here is an example of a very simple data frame I have:
data <-...
0
votes
0
answers
701
views
Google Sheets - Count if two cells have the same text
I'm trying to create a code to see if my predictions for games and the actual result of the games are the same. I was going to create a point value, like March Madness has, but I can't actually get ...
1
vote
1
answer
319
views
module 'thefuzz' has no attribute 'partial_ratio' and other odd errors
Been trying to use thefuzz to compare two different lists, and got the above error, which doesn't seem right. I've commented everything else out in my code except the below two test lines and still ...
0
votes
0
answers
34
views
PowerShell -ilike operator not returning true [duplicate]
PS C:\Users\Administrator> $string = "hello world"
PS C:\Users\Administrator> $string -ilike "hello"
False
The above outputs False, not True. Not sure what I am ...
0
votes
2
answers
102
views
Is there a way in R to join between two columns based on whether a string in column 1 is contained within the string in column 2?
I am trying to join several messy datasets together without using "fuzzy matching".
In the core dataset (example dataset1 below), I have simple names for companies. In the datasets I would ...
1
vote
1
answer
573
views
Split full address to contain only street name
I have a table with address1, city, state, and postal code. However, some address1 values also contain the city, state, and postal code (separated by a comma, a space, or both). Example:
Address1: 9999 ...