I am using hospital data. I want to write a regular expression in R, and I am struggling to do this within a single regex, without any string manipulation outside of it.
The string I want to search is:
"W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941".
The string represents procedures, which are described in groups.
The groups are written in the generic form:
(Procedure code, joint code, laterality code)
Procedure codes are "[A-Z]\d{3}", the joint code is "W84\d", and the laterality code is "Z94\d".
This format can be repeated multiple times. In some circumstances the codes may be written: (Procedure code1, joint code1), (Procedure code2, joint code2), lateralityALL. This is done when the laterality applies to every group.
I want to capture the codes up to and including the laterality code, if present.
If there is only one laterality code at the end of all the string groups, this should be appended to each group.
# Example data:
string = c("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
Desired output:
Group 1: "W779 Y767 W835 W848 Y189 Z846 Z941"
Group 2: "T625 Z843 Z941"
df <- data.frame(string = "W779 Y767 W835 W848 Y189 Z846 Z941")
What I have done: I have taken an inefficient approach that identifies strings with shared laterality and those without. When the laterality is shared, I manually append it to the first group.
library(dplyr)
library(stringr)

df <- data.frame(string = c("W779 Y767 W835 W848 Y189 Z846 Z941",
                            "Y189 Z846 Z941",
                            "W779 Y767 W835 W848"))
df %>%
  mutate(
    # how many joint codes and laterality codes each string contains
    joint_count = str_count(string, "(W84[1-9]{1})|(Z84[1-9]{1})"),
    laterality_count = str_count(string, "Z94[1-9]{1}"),
    laterality = str_extract_all(string, "Z94[1-9]{1}"),
    joint_laterality_count = str_count(string, "(W84[1-9]{1}|Z84[1-9]{1}) Z94[1-9]{1}"),
    laterality_end = str_detect(string, "Z94[1-9]{1}$"),
    # more joints than lateralities, with a laterality at the end: treat it as shared
    shared_laterality = case_when(joint_count > laterality_count & laterality_end == TRUE ~ 1,
                                  .default = 0),
    single_joint_laterality = case_when(joint_count == 1 & joint_laterality_count == 1 & laterality_end == TRUE ~ TRUE,
                                        .default = FALSE),
    # first group: either the whole single group, or the first part plus the shared laterality
    op_group_1 = case_when(
      single_joint_laterality == TRUE ~ str_extract(string, "^.*(W84[1-9]{1}|Z84[1-9]{1}) Z94[1-9]{1}"),
      shared_laterality == 1 ~ paste(str_extract(string, "^.*(W84[1-9]{1})|(Z84[1-9]{1})"), laterality)
    ),
    # second group: everything after the first joint code
    op_group_2 = case_when(
      shared_laterality == 1 ~ str_extract(string, "(?<=(W84[1-9]{1})|(Z84[1-9]{1})).*")
    )
  )
My data have hundreds of millions of rows, so I want the most efficient approach, and this probably is not it.
W94 and laterality codes with Z94, but then we see Ws and Zs that may not be the right thing. (For instance, in df, it does not end on a Z94 laterality code.)

W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941 is split into W779 Y767 W835 W848 Y189 Z846 Z941 and T625 Z843 Z941 - why not into W779 Y767 W835 W848 Z941 and Y189 Z846 Z941? I'd strsplit by W848 and append Z941 if it appears at the end

W779 Y767 W835 W848 Y189 Z846 Z941 as Group 1 since there is some text in between W779 Y767 W835 W848 Y189 Z846 and Z941. So, the answer is "it is impossible". Match parts and combine with a "post-process" action, or just use something else.

unlist(strsplit(df$string, " ")) and then some vector grouping (e.g., cumsum(.) on logical sequences). This might be hasty, esp if each row of df is logically grouped and/or helps with classification.
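The last suggestion above (split into tokens, then group with cumsum) could look roughly like the sketch below. It is only a sketch under stated assumptions, not a confirmed solution. The function name split_groups is hypothetical. It assumes that a group closes at a "Z84" code, which is what the desired output implies even though the question text names "W84" as the joint code. It also assumes that a laterality code, when present, directly follows the closing code of its group or sits at the very end of the string.

```r
# Hypothetical helper: split one procedure string into groups (base R only).
# ASSUMPTION: a group closes at a "Z84x" code (inferred from the desired output).
# ASSUMPTION: a "Z94x" laterality code directly follows its group's closing code,
# or sits at the end of the string and is shared by groups without their own.
split_groups <- function(x) {
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
  is_lat <- grepl("^Z94[0-9]$", tokens)   # laterality codes
  is_end <- grepl("^Z84[0-9]$", tokens)   # codes that close a group

  # Number of closing codes seen strictly before each token.
  closed_before <- cumsum(is_end) - is_end

  # Ordinary codes belong to the group currently open; a laterality code
  # belongs to the group that was just closed (or group 1 if none closed yet).
  grp <- ifelse(is_lat, pmax(closed_before, 1), closed_before + 1)
  groups <- split(tokens, grp)

  # A laterality code at the very end is appended to every group lacking one.
  n <- length(tokens)
  if (n > 0 && is_lat[n]) {
    shared <- tokens[n]
    groups <- lapply(groups, function(g) {
      if (any(grepl("^Z94[0-9]$", g))) g else c(g, shared)
    })
  }

  vapply(groups, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

result <- split_groups("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
print(result)
```

For the example string this returns the two groups listed under "Desired output". Applied to every row (for instance with lapply over df$string) it yields a list of character vectors. Whether this is fast enough for hundreds of millions of rows has not been measured here.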