
I am trying to extract data from a SOAP file (XML format) that has many child nodes.

xml_find_all() is a great function for getting data out of a complex structure. However, it does not return missing values: a node that is absent is simply skipped, so the result is shorter than the number of records.

Here is a simple example:

Read the simple example file with two customers. One customer is missing the name.

library(xml2)

x <- read_xml("<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>")

This finds both IDs:

xml_find_all(x, ".//ID")

[1] 01 [2] 02

But this finds only one name:

xml_find_all(x, ".//Name")

[1] Bla

How can I get an NA, or something else that tells me which data is missing?

In the end, I want to build a data frame. Please keep in mind that this is just a simple example. The real data has 4,000 "customers" and 100 attributes.

3 Answers


Another option, using the tidyverse:

### Packages
library(purrr)
library(rvest)
library(xml2)  # read_xml() lives here; rvest >= 1.0.0 no longer attaches xml2

### Data
x <-
  read_xml(
    "<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>"
  )

### Select all Customer elements
### For each one, get the ID and the name (NA if the element is absent)
### Merge the results into a data frame
x %>%
  html_elements(xpath="//Customer") %>%
  map_df(~c(id= .x %>% html_element(xpath = ".//ID") %>%  html_text2(),
         name= .x %>% html_element(xpath = ".//Name") %>%  html_text2()))

Output:

# A tibble: 2 × 2
  id    name 
  <chr> <chr>
1 01    Bla  
2 02    NA   

Assuming every customer has an ID, you can start from the ID nodes and use xml_find_first() to find the first sibling Name node after each ID node. If there is none, the function returns a missing node, which xml_text() turns into NA.

# get all nodes with ID
ID.nodes <- xml_find_all(x, ".//ID")
# assuming Name comes after ID (if actually present)
# get following  sibling Name node
Name.nodes <- xml_find_first(ID.nodes, "./following-sibling::Name")

Results:

ID.nodes |> xml_text()
# [1] " 01 " " 02 "
Name.nodes |> xml_text()
# [1] " Bla " NA 
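To get from these vectors to the data frame the question asks for, something like the following should work. It is only a minimal sketch, and it is my own variation: it anchors on the Customer nodes instead of the ID nodes, so it does not depend on Name coming after ID. The names `customers` and `result` are made up for illustration.

```r
library(xml2)

x <- read_xml("<Customers> <Customer> <ID> 01 </ID> <Name> Bla </Name> </Customer> <Customer> <ID> 02 </ID> </Customer> </Customers>")

# One node per customer
customers <- xml_find_all(x, ".//Customer")

# xml_find_first() returns exactly one result per input node and uses a
# "missing" node where nothing matches; xml_text() turns that into NA
result <- data.frame(
  ID   = trimws(xml_text(xml_find_first(customers, "./ID"))),
  Name = trimws(xml_text(xml_find_first(customers, "./Name"))),
  stringsAsFactors = FALSE
)

result
```

Because both columns are derived from the same node set, they always have one entry per customer, so the rows stay aligned.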


Given that the end goal is a data frame, consider building a list of data frames from the XML: call xml_find_all on the Customer nodes, then for each record call xml_children, xml_text, and xml_name on its child nodes.

Then combine all data frames with dplyr::bind_rows (or plyr::rbind.fill, or a custom method), which fills in NA for missing columns. The code below should handle all child nodes without hard-coding ID and Name.

library(dplyr)
library(xml2)

xml_str <- paste0(
  "<Customers>",
  "  <Customer>",
  "    <ID> 01 </ID>",
  "    <Name> Bla </Name>",
  "  </Customer>",
  "  <Customer>",
  "    <ID> 02 </ID>",
  "  </Customer>",
  "</Customers>"
)

doc <- read_xml(xml_str)

# RETRIEVE ALL NODES
recs <- xml2::xml_find_all(doc, "//Customer")

# BIND EACH CHILD TEXT AND NAME
df_list <- lapply(recs, function(r) {
  vals <- xml2::xml_children(r)

  # one-row data frame: child names become columns, trimmed text the values
  setNames(
    trimws(xml2::xml_text(vals)),
    xml2::xml_name(vals)
  ) |> rbind() |> data.frame()
})

# COMBINE ALL DFS
customer_df <- dplyr::bind_rows(df_list)

customer_df
#   ID Name
# 1 01  Bla
# 2 02 <NA>
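With around 100 fields per record, a column-wise variant might be worth trying as an alternative to building one data frame per record. The sketch below is not part of the approach above. It assumes that child element names carry no namespace prefix and that each field occurs at most once per customer. `fields` and `cols` are illustrative names.

```r
library(xml2)

doc <- read_xml("<Customers><Customer><ID> 01 </ID><Name> Bla </Name></Customer><Customer><ID> 02 </ID></Customer></Customers>")

recs <- xml_find_all(doc, "//Customer")

# All distinct child element names that occur in any record
fields <- unique(xml_name(xml_find_all(recs, "./*")))

# One column per field; records lacking the field get NA
cols <- lapply(fields, function(f) {
  trimws(xml_text(xml_find_first(recs, paste0("./", f))))
})

customer_df <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)

customer_df
```

For real SOAP data with namespaced elements, the XPath would presumably need the matching prefix (see xml_ns()), or the namespaces could be dropped first with xml_ns_strip().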
