String manipulation in R, conditional capture and conditional append. ?regex solution

Question

I am using hospital data. I want to write a regular expression in R, and I am struggling to do this without string manipulation outside of a single regex.

The string I want to search is: "W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941".

The string represents procedures, which are described in groups.

The groups are written in the generic form (procedure code, joint code, laterality code). Procedure codes match "[A-Z]\d{3}", the joint code matches "W84\d", and the laterality code matches "Z94\d".

This format can be repeated multiple times. In some circumstances the codes may be written as (procedure code 1, joint code 1), (procedure code 2, joint code 2), lateralityALL. This is done when the laterality applies to each group.

I want to capture the codes up to and including the laterality code, if present.

If there is only one laterality code at the end of all the string groups, this should be appended to each group.

# Example data:
string = c("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")

Desired output:
Group 1: "W779 Y767 W835 W848 Y189 Z846 Z941"
Group 2: "T625 Z843 Z941"

df <- data.frame(string = "W779 Y767 W835 W848 Y189 Z846 Z941")

What I have done: I have taken an inefficient approach that identifies strings with a shared laterality and those without. When the laterality is shared, I manually append it to the first group.

df <- data.frame(string = c("W779 Y767 W835 W848 Y189 Z846 Z941",
                            "Y189 Z846 Z941",
                            "W779 Y767 W835 W848"))

library(dplyr)    # mutate(), case_when(), %>%
library(stringr)  # str_count(), str_extract(), str_detect()

df %>%
  mutate(
    # counts of joint codes, laterality codes, and joint + laterality pairs
    joint_count = str_count(string, "(W84[1-9])|(Z84[1-9])"),
    laterality_count = str_count(string, "Z94[1-9]"),
    laterality = str_extract_all(string, "Z94[1-9]"),  # list column
    joint_laterality_count = str_count(string, "(W84[1-9]|Z84[1-9]) Z94[1-9]"),
    laterality_end = str_detect(string, "Z94[1-9]$"),
    # one trailing laterality code shared by several joints
    shared_laterality = case_when(joint_count > laterality_count & laterality_end ~ 1,
                                  .default = 0),
    single_joint_laterality = case_when(joint_count == 1 & joint_laterality_count == 1 & laterality_end ~ TRUE,
                                        .default = FALSE),
    op_group_1 = case_when(
      single_joint_laterality ~ str_extract(string, "^.*(W84[1-9]|Z84[1-9]) Z94[1-9]"),
      # `|` binds loosely: this pattern reads as "^.*W84x" OR "Z84x"
      shared_laterality == 1 ~ paste(str_extract(string, "^.*(W84[1-9])|(Z84[1-9])"), laterality)
    ),
    op_group_2 = case_when(
      shared_laterality == 1 ~ str_extract(string, "(?<=(W84[1-9])|(Z84[1-9])).*")
    )
  )

My data have hundreds of millions of rows, so I want the most efficient approach, and this probably is not it.

It seems we cannot rely on the first letter, is that correct? Your text says that joint codes start with W84 and laterality codes with Z94, but then we see Ws and Zs that may not be the right thing. (For instance, in df, it does not end on a Z94 laterality code.)
– r2evans, Commented Oct 3 at 11:57
In your first example W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941 is split into W779 Y767 W835 W848 Y189 Z846 Z941 and T625 Z843 Z941. Why not into W779 Y767 W835 W848 Z941 and Y189 Z846 Z941? I'd strsplit by W848 and append Z941 if it appears at the end.
– lailaps, Commented Oct 3 at 12:13
You cannot capture non-contiguous portions of the text into the same capture within a single regex call. You cannot get W779 Y767 W835 W848 Y189 Z846 Z941 as Group 1 since there is some text in between W779 Y767 W835 W848 Y189 Z846 and Z941. So, the answer is "it is impossible". Match parts and combine with a "post-process" action, or just use something else.
– Wiktor Stribiżew, Commented Oct 3 at 12:18
I think the takeaway from Wiktor's comment is that regex alone is not going to address this; you will need some handling logic. My first stab at this is to confirm that we can "know" whether a particular code is a laterality code based on either its exact value or the fact that it always follows or precedes (if not last) some other pattern. If that is true, then I would likely start with unlist(strsplit(df$string, " ")) and then some vector grouping (e.g., cumsum(.) on logical sequences). This might be hasty, especially if each row of df is logically grouped and/or helps with classification.
– r2evans, Commented Oct 3 at 12:22
Can you edit the question to provide enough examples to cover all the cases that can occur (showing input and expected output), and double-check the explanation to be sure that it is consistent with the examples?
– G. Grothendieck, Commented Oct 3 at 18:19
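
The split-and-group idea from the comments (tokenise, group with cumsum(), then append a trailing laterality code) can be sketched roughly as follows. This is only a sketch under stated assumptions: split_groups is a hypothetical helper, and it follows the reading in lailaps's comment, where every joint code ([WZ]84\d) closes a group. Under that reading the first example yields three groups, not the two shown in the question.

```r
# Hypothetical helper (not from the thread): split one procedure string
# into groups.
# Assumptions: every joint code ([WZ]84\d) closes a group, and a
# laterality code (Z94\d) stays with the joint code it directly follows.
split_groups <- function(x) {
  tokens   <- strsplit(x, " ", fixed = TRUE)[[1]]
  is_joint <- grepl("^[WZ]84\\d$", tokens)
  is_lat   <- grepl("^Z94\\d$", tokens)

  # A token starts a new group if the previous token was a laterality
  # code, or was a joint code that is not followed by a laterality code.
  prev_joint <- c(FALSE, head(is_joint, -1))
  prev_lat   <- c(FALSE, head(is_lat, -1))
  starts     <- prev_lat | (prev_joint & !is_lat)
  starts[1]  <- TRUE
  group_id   <- cumsum(starts)

  groups <- unname(vapply(split(tokens, group_id), paste,
                          character(1), collapse = " "))

  # Shared laterality: exactly one laterality code, sitting at the very
  # end, is appended to every group that does not already end in one.
  has_lat <- grepl("Z94\\d$", groups)
  if (sum(is_lat) == 1 && is_lat[length(tokens)] && any(!has_lat)) {
    groups[!has_lat] <- paste(groups[!has_lat], tokens[length(tokens)])
  }
  groups
}

split_groups("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
```

For the string above this returns "W779 Y767 W835 W848 Z941", "Y189 Z846 Z941" and "T625 Z843 Z941". It works on one string at a time, so for very large data it would still need to be vectorised or applied in chunks.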

sln · Accepted Answer · 2025-10-03 21:48:49Z

Based on the code portion of your question, it looks like you're trying to do something like this.
I can't put comments in the regex yet.
Your Group 1 corresponds to capture group 1 + capture group 2 here.
Your Group 2 corresponds to capture group 3 here.

See the substitution in the demo.

https://regex101.com/r/NhhEg0/1

^.*?((?:[A-Z]\d{3}[ ]*)*(?:[WZ]84\d)(?=.*?[WZ]84\d.*?(Z94\d)$)).*?((?:[A-Z]\d{3}[ ]*)*(?:[WZ]84\d).*?Z94\d)$

If your regex engine doesn't support non-capturing (cluster) groups (?:...), change them to capturing groups, making sure you then pick the right capture numbers for your two groups.
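
As a rough sketch of how the one-line pattern could be used from R: the example below uses base R's regexec() with perl = TRUE (the default engine does not support lookahead) and pastes captures 1 and 2 together as a post-processing step. The variable names (pattern, x, m, group_1, group_2) are illustrative and not part of the answer.

```r
# One-line pattern from above; backslashes are doubled for an R string.
pattern <- paste0(
  "^.*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d)(?=.*?[WZ]84\\d.*?(Z94\\d)$))",
  ".*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d).*?Z94\\d)$"
)

x <- "W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941"

# regmatches() + regexec() return the full match followed by captures 1-3.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

group_1 <- paste(m[2], m[3])  # capture 1 + shared laterality (capture 2)
group_2 <- m[4]               # capture 3 already ends in the laterality code

group_1
group_2
```

For this input, group_1 is "W779 Y767 W835 W848 Y189 Z846 Z941" and group_2 is "T625 Z843 Z941". A string that does not match gives an empty result from regmatches(), so real data would need a check for that case.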

Formatted

^ 
.*? 
(                             # (1 start)
   (?:
      [A-Z] \d{3} 
      [ ]*    
   )*
   (?: [WZ] 84 \d )
   (?=
      .*? 
      [WZ] 84 \d .*? 
      ( Z94 \d )                    # (2)
      $ 
   )
)                             # (1 end)
.*? 
(                             # (3 start)
   (?:
      [A-Z] \d{3} 
      [ ]*    
   )*
   (?: [WZ] 84 \d )
   .*? Z94 \d 
)                             # (3 end)
$
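
For a whole column, one possible vectorised use of the same pattern is sub() with backreferences, guarded by grepl() so that non-matching rows become NA. This is a sketch only: the column names op_group_1 and op_group_2 are borrowed from the question, and performance on hundreds of millions of rows has not been measured here.

```r
# Same one-line pattern, with backslashes doubled for an R string.
pattern <- paste0(
  "^.*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d)(?=.*?[WZ]84\\d.*?(Z94\\d)$))",
  ".*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d).*?Z94\\d)$"
)

df <- data.frame(string = c("W779 Y767 W835 W848 Y189 Z846 Z941",
                            "Y189 Z846 Z941",
                            "W779 Y767 W835 W848"))

# The pattern is anchored with ^ and $, so sub() rewrites the whole string.
hit <- grepl(pattern, df$string, perl = TRUE)
df$op_group_1 <- ifelse(hit,
                        sub(pattern, "\\1 \\2", df$string, perl = TRUE),
                        NA_character_)
df$op_group_2 <- ifelse(hit,
                        sub(pattern, "\\3", df$string, perl = TRUE),
                        NA_character_)
df
```

Only the first row matches, because the pattern requires two joint codes and a trailing laterality code. Rows with a single joint code or no laterality code are left as NA and would need separate handling.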
