
I am using hospital data. I want to write a single regular expression in R, and I am struggling to do this without string manipulation outside of that one regex.

The string I want to search is: "W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941".

The string represents procedures, which are described in groups.

The groups are written in the generic form: (procedure code, joint code, laterality code). Procedure codes match "[A-Z]\d{3}", joint codes match "W84\d", and laterality codes match "Z94\d".

This format can be repeated multiple times. In some circumstances the codes may instead be written: (procedure code 1, joint code 1), (procedure code 2, joint code 2), lateralityALL. This is done when the same laterality applies to every group.

I want to capture the codes up to and including the laterality code, if present.

If there is only one laterality code at the end of all the string groups, this should be appended to each group.

# Example data:
string = c("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")

Desired output:
Group 1: "W779 Y767 W835 W848 Y189 Z846 Z941"
Group 2: "T625 Z843 Z941"

df <- data.frame(string = "W779 Y767 W835 W848 Y189 Z846 Z941")

What I have done: I have taken an inefficient approach that identifies strings with a shared laterality code and those without. When the laterality is shared, I manually append it to the first group.

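library(dplyr)    # mutate(), case_when(); the .default argument needs dplyr >= 1.1.0
library(stringr)  # str_count(), str_extract(), str_extract_all(), str_detect()

df <- data.frame(string = c("W779 Y767 W835 W848 Y189 Z846 Z941",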
                            "Y189 Z846 Z941",
                            "W779 Y767 W835 W848"))

df %>% 
  mutate(joint_count = str_count(string, "(W84[1-9]{1})|(Z84[1-9]{1})"),
         laterality_count = str_count(string, "Z94[1-9]{1}"),
         laterality = str_extract_all(string, "Z94[1-9]{1}"),
         joint_laterality_count = str_count(string, "(W84[1-9]{1}|Z84[1-9]{1}) Z94[1-9]{1}"),
         laterality_end = str_detect(string, "Z94[1-9]{1}$"),
         shared_laterality = case_when(joint_count>laterality_count & laterality_end==T~1,
                                      .default = 0),
         single_joint_laterality = case_when(joint_count==1 & joint_laterality_count==1 & laterality_end==T~T, .default = F),
         op_group_1 = case_when(single_joint_laterality == T ~ str_extract(string, "^.*(W84[1-9]{1}|Z84[1-9]{1}) Z94[1-9]{1}"),
                                shared_laterality == T ~ paste(str_extract(string, "^.*(W84[1-9]{1})|(Z84[1-9]{1})"),laterality)
                                  ),
         op_group_2 = case_when(shared_laterality == T ~ str_extract(string, "(?<=(W84[1-9]{1})|(Z84[1-9]{1})).*")
                                  )
         )

My data have hundreds of millions of rows, so I want the most efficient approach, and this probably is not it.

  • It seems we cannot rely on the first letter, is that correct? Your text says that joint codes start with W94 and laterality codes with Z94, but then we see Ws and Zs that may not be the right thing. (For instance, in df, it does not end on a Z94 laterality code.) Commented Oct 3 at 11:57
  • In your first example, W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941 is split into W779 Y767 W835 W848 Y189 Z846 Z941 and T625 Z843 Z941. Why not into W779 Y767 W835 W848 Z941 and Y189 Z846 Z941? I'd strsplit by W848 and append Z941 if it appears at the end. Commented Oct 3 at 12:13
  • You cannot capture non-contiguous portions of the text into the same capture group within a single regex call. You cannot get W779 Y767 W835 W848 Y189 Z846 Z941 as Group 1, since there is some text between W779 Y767 W835 W848 Y189 Z846 and Z941. So the answer is "it is impossible". Match the parts and combine them with a "post-process" action, or just use something else. Commented Oct 3 at 12:18
  • I think the takeaway from Wiktor's comment is that regex alone is not going to address this; you will need some handling logic. My first stab at this is to confirm that we can "know" whether a particular code is a laterality code based on either its exact value or the fact that it always follows or precedes (if not last) some other pattern. If that is true, then I would likely start with unlist(strsplit(df$string, " ")) and then some vector grouping (e.g., cumsum(.) on logical sequences). This might be hasty, especially if each row of df is logically grouped and/or helps with classification. Commented Oct 3 at 12:22
  • Can you edit the question to provide enough examples to cover all the cases that can occur (showing input and expected output), and double-check the explanation to be sure that it is consistent with the examples? Commented Oct 3 at 18:19
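
The split-and-group idea from the comments above (strsplit, then cumsum on a logical vector) might look roughly like the base R sketch below. It is only an illustration under stated assumptions, not a confirmed solution: the function name split_groups and its arguments are hypothetical, and the default joint pattern "Z84\d" is a guess based on the desired output, where W848 stays inside Group 1 and each group ends on a Z84 code. It handles one string at a time and has not been tuned for hundreds of millions of rows.

```r
# Hypothetical helper: split one space-separated code string into groups.
# ASSUMPTION: a group is closed by a code matching joint_pattern, optionally
# followed by a laterality code. The default "^Z84\\d$" is inferred from the
# example output, not from the stated rules.
split_groups <- function(x, joint_pattern = "^Z84\\d$", lat_pattern = "^Z94\\d$") {
  tokens   <- strsplit(x, " ", fixed = TRUE)[[1]]
  is_joint <- grepl(joint_pattern, tokens)
  is_lat   <- grepl(lat_pattern, tokens)

  # A new group starts at the first token, or at any non-laterality token
  # that directly follows a joint or laterality code.
  prev_closes <- c(TRUE, head(is_joint | is_lat, -1))
  starts      <- !is_lat & prev_closes
  groups      <- split(tokens, cumsum(starts))

  # If the string ends in a laterality code, append that code to every
  # group that has no laterality code of its own.
  n <- length(tokens)
  if (n > 0 && is_lat[n]) {
    groups <- lapply(groups, function(g) {
      if (any(grepl(lat_pattern, g))) g else c(g, tokens[n])
    })
  }
  unname(vapply(groups, paste, character(1), collapse = " "))
}

res <- split_groups("W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941")
print(res)
```

On the example string this returns "W779 Y767 W835 W848 Y189 Z846 Z941" and "T625 Z843 Z941". Passing a different joint_pattern (for instance "^[WZ]84\\d$") changes where the groups break, which is exactly the ambiguity raised in the comments.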

1 Answer


Based on the code portion, it looks like you're trying to do something like this.
I can't put comments in the regex yet.
Your Group 1 corresponds to capture groups 1 + 2 here.
Your Group 2 corresponds to capture group 3 here.

See the substitution in the demo.

https://regex101.com/r/NhhEg0/1

^.*?((?:[A-Z]\d{3}[ ]*)*(?:[WZ]84\d)(?=.*?[WZ]84\d.*?(Z94\d)$)).*?((?:[A-Z]\d{3}[ ]*)*(?:[WZ]84\d).*?Z94\d)$

If your regex engine doesn't support non-capturing group notation (?:...), change those to capturing groups, and make sure you then pick the captures that correspond to your two groups.
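
As a rough illustration of applying the pattern above from R, here is a minimal base R sketch. It assumes the PCRE engine (perl = TRUE), which supports both (?:...) and the lookahead; the backslashes are doubled because the pattern sits inside an R string literal. The variable names (pattern, x, m, group_1, group_2) are only for illustration.

```r
# The regex from the answer, written as an R string (backslashes doubled).
pattern <- paste0(
  "^.*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d)(?=.*?[WZ]84\\d.*?(Z94\\d)$))",
  ".*?((?:[A-Z]\\d{3}[ ]*)*(?:[WZ]84\\d).*?Z94\\d)$"
)

x <- "W779 Y767 W835 W848 Y189 Z846 T625 Z843 Z941"

# regexec() + regmatches() return the full match followed by the captures:
# m[1] = whole match, m[2] = capture 1, m[3] = capture 2, m[4] = capture 3.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Capture 1 (codes up to the first group's joint) plus capture 2 (the
# shared laterality found by the lookahead) form the asker's Group 1.
group_1 <- paste(m[2], m[3])
group_2 <- m[4]

print(group_1)
print(group_2)
```

For this input, group_1 is "W779 Y767 W835 W848 Y189 Z846 Z941" and group_2 is "T625 Z843 Z941". The pattern only matches strings that contain at least two joint codes and end in a laterality code; for any other string, m is empty and both groups come out as NA, so those rows would need separate handling.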

Formatted

^ 
.*? 
(                             # (1 start)
   (?:
      [A-Z] \d{3} 
      [ ]*    
   )*
   (?: [WZ] 84 \d )
   (?=
      .*? 
      [WZ] 84 \d .*? 
      ( Z94 \d )                    # (2)
      $ 
   )
)                             # (1 end)
.*? 
(                             # (3 start)
   (?:
      [A-Z] \d{3} 
      [ ]*    
   )*
   (?: [WZ] 84 \d )
   .*? Z94 \d 
)                             # (3 end)
$