How to know if the data is missing when using XML find all function in R?

Question

I am trying to extract data from a SOAP file (XML format) that has many child nodes.

xml_find_all() is a great function for pulling data out of a complex structure. However, it only returns the nodes that exist, so it cannot tell me where a value is missing.

Here is a simple example:

Read the simple example file with two customers. One customer is missing the name.

library(xml2)
x <- read_xml("<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>")

Both IDs are found:

xml_find_all(x, ".//ID")

{xml_nodeset (2)}
[1] <ID> 01 </ID>
[2] <ID> 02 </ID>

Only one name is found:

xml_find_all(x, ".//Name")

{xml_nodeset (1)}
[1] <Name> Bla </Name>

How can I get an NA, or anything else that tells me which data is missing?

In the end, I want to build a data frame. Please keep in mind that this is just a simple example. The real data has 4,000 "customers" and 100 attributes.

E.Wiest · Accepted Answer · 2024-05-17 09:55:16Z

1

Another option, using the tidyverse:

### Packages
library(purrr)
library(rvest)
library(xml2)   # read_xml() comes from xml2

### Data
x <-
  read_xml(
    "<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>"
  )

### Select all Customers elements
### For each one get the ID and the name of the person
### Merge the result in a dataframe
x %>%
  html_elements(xpath="//Customer") %>%
  map_df(~c(id= .x %>% html_element(xpath = ".//ID") %>%  html_text2(),
         name= .x %>% html_element(xpath = ".//Name") %>%  html_text2()))

Output:

# A tibble: 2 × 2
  id    name 
  <chr> <chr>
1 01    Bla  
2 02    NA


Wimpel · Accepted Answer · 2024-05-17 09:27 (edited by Dave2e, 2024-05-17 11:00:38Z)

0

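Assuming every customer has an ID, you can start from the ID nodes and use xml_find_first() to look up the first following-sibling Name node of each one. Unlike xml_find_all(), xml_find_first() returns exactly one result per input node, so where there is no Name you get a missing node, which xml_text() reports as NA.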

# get all nodes with ID
ID.nodes <- xml_find_all(x, ".//ID")
# assuming Name comes after ID (if actually present)
# get following  sibling Name node
Name.nodes <- xml_find_first(ID.nodes, "./following-sibling::Name")

Results:

ID.nodes |> xml_text()
# [1] " 01 " " 02 "
Name.nodes |> xml_text()
# [1] " Bla " NA
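Since the end goal is a data frame with many attributes, the same xml_find_first() idea can be applied per Customer node and looped over a vector of field names. The following is only a minimal sketch: the `fields` vector and the names `customers` and `customer_df` are illustrative assumptions, and it assumes each field occurs at most once as a direct child of a Customer.

```r
library(xml2)

x <- read_xml("<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>")

# One node per customer; every lookup below is relative to these nodes
customers <- xml_find_all(x, ".//Customer")

# Hypothetical list of child elements to extract (extend as needed)
fields <- c("ID", "Name")

# xml_find_first() returns one result per customer, so each column
# has the same length and missing children become NA
cols <- lapply(fields, function(f) {
  trimws(xml_text(xml_find_first(customers, paste0("./", f))))
})

customer_df <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)
customer_df
```

Because every column is built from the same set of Customer nodes, the rows stay aligned even when a child element is absent.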

answered May 17, 2024 at 9:27 by Wimpel; edited May 17, 2024 at 11:00 by Dave2e

Parfait · Accepted Answer · 2024-05-18 19:57:24Z

Given that the end goal is a data frame, consider building a list of data frames from the XML: use xml_find_all() on the Customer nodes, then call xml_children(), xml_text(), and xml_name() on the child nodes of each one.

Then combine all the data frames with dplyr::bind_rows() (or plyr::rbind.fill(), or your own method), which fills in NA for missing columns. The code below should handle all child nodes without hard-coding ID and Name.

library(dplyr)
library(xml2)

xml_str <- paste0(
  "<Customers>",
  "  <Customer>",
  "    <ID> 01 </ID>",
  "    <Name> Bla </Name>",
  "  </Customer>",
  "  <Customer>",
  "    <ID> 02 </ID>",
  "  </Customer>",
  "</Customers>"
)

doc <- read_xml(xml_str)

# RETRIEVE ALL NODES
recs <- xml2::xml_find_all(doc, "//Customer")

# BIND EACH CHILD TEXT AND NAME
df_list <- lapply(recs, function(r) {
  vals <- xml2::xml_children(r)
  
  df <- setNames(
    c(xml2::xml_text(vals) |> trimws()), 
    c(xml2::xml_name(vals))
  ) |> rbind() |> data.frame()
})

# COMBINE ALL DFS
customer_df <- dplyr::bind_rows(df_list)

customer_df
#   ID Name
# 1 01  Bla
# 2 02 <NA>
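Once the data frame exists, base R can report which values are missing. The snippet below is a small sketch that rebuilds the two-row result by hand so it runs on its own. The names `missing_per_column` and `incomplete` are illustrative only.

```r
# Stand-in for the data frame produced above
customer_df <- data.frame(
  ID   = c("01", "02"),
  Name = c("Bla", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(customer_df))

# Rows (customers) with at least one missing value
incomplete <- customer_df[!complete.cases(customer_df), ]

missing_per_column
incomplete
```

With 100 attributes, `missing_per_column` gives a quick overview of which fields are sparsely populated, while `incomplete` lists the affected customers.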
