
To search for keywords using text-mining tools, I need to retrieve the abstract available at each of the URLs in the dat0 dataframe (the example URLs come from this website) and store the abstracts in a second column named "abstract".

The challenge is, using a loop or other mapping, to open each URL, retrieve the text between the first word "Abstract" and the closing words "Issue Section" (these delimiting words themselves should not be kept), and place it in the second column of the dataframe.

The loop/mapping is necessary because there are more than 700 URLs in the real dataframe, and therefore more than 700 abstracts to retrieve.

Thanks for your help.

Initial data:

dat0 <- structure(list(url = c("https://doi.org/10.1093/clinchem/hvae106.001", 
"https://doi.org/10.1093/clinchem/hvae106.002", "https://doi.org/10.1093/clinchem/hvae106.003"
)), class = "data.frame", row.names = c(NA, -3L))

Desired output:

dat1 <- structure(list(url = c("https://doi.org/10.1093/clinchem/hvae106.001", 
"https://doi.org/10.1093/clinchem/hvae106.002", "https://doi.org/10.1093/clinchem/hvae106.003"
), abstract = c("Background\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated levels of the gut-microbe-associated metabolite trimethylamine-N-oxide (TMAO) have been associated with increased risk for CVD mortality in many large independent studies. In fact, large laboratory corporations such as Labcorp and Quest Diagnostic now offer TMAO diagnostic tests for the assessment of CVD risk and as a marker for disease-associated dysbiosis, using samples obtained from fasting patients. Although the strong association between TMAO levels and CVD risk has been established in fasting blood samples, here we are investigating the potential for a single meal to dynamically alter TMAO-related metabolites based on sex, diverse dietary substrates, and microbial influences.\n\nMethods\n35 healthy participants were randomized to one of four study groups with approximately 8-9 subjects in each arm. Half of the participants were randomly assigned to receive a three-day broad spectrum antibiotic regimen (ciprofloxacin, metronidazole, and vancomycin) while the others received no antibiotics. These two groups were further subdivided into a group consuming a highly processed meal (originating from local fast food restaurants) or alternatively a whole food meal (containing a variety of fruits, vegetables, and healthy fiber). Blood samples were taken at baseline (after an overnight fast), and then postprandially at 15 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, and 6 hours after meal ingestion. Targeted plasma metabolites were quantified using a stable isotope dilution, liquid chromatography - tandem mass spectrometry (LC-MS/MS) method\n\nResults\nWhile plasma TMAO levels between food groups did not change significantly within the total cohort, there were clear individualized responses to either highly processed or whole food meals. 
Separation based on sex showed that TMAO levels were significantly reduced in females at 6 hours but remained steady in males for both food groups. Particularly, in the processed food group, TMAO levels were significantly lower in females than males at the 6-hour time point. Neither of these observations held true for TMAO precursors choline, carnitine, or betaine. However, plasma levels of a dietary precursor for TMAO, γ-butyrobetaine, showed clear diet-microbe-host interactions. For males in the processed food group, plasma γ-butyrobetaine levels were significantly increased in subjects on broad spectrum antibiotics.\n\nConclusions\nOur results show the postprandial levels of TMAO, and its nutrient precursors dynamically change in a diet-, microbe-, and sex-dependent manner. These findings provide new insights into the postprandial levels of TMAO-related metabolites and may inform precision nutritional approaches in those who could benefit from TMAO-lowering strategies.", 
"Background\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in the early diagnosis of Acute Coronary Syndrome (ACS). With technological advancements, the sensitivity of troponin assays has significantly improved. The International Federation of Clinical Chemistry (IFCC) criteria for hs-cTn assays include a total imprecision (Coefficient of Variation, CV) ≤10% at the gender-specific 99th percentile value and detectable levels in ≥50% of a healthy population. This study aims to evaluate the performance of Zybio's hs-cTnI assay against these high-sensitivity standards\n\nMethods\nThis study involved 1661 individuals undergoing routine health checks at Chongqing Medical University Third Hospital. Exclusion criteria included recent medication use, abnormal NT-proBNP levels, significant cardiac silhouette changes in chest X-rays, age <20, incomplete data, pregnancy, or history of severe chronic diseases. Participants were classified into apparent healthy and healthy groups based on blood pressure, lipid profiles, and electrocardiogram results. Laboratory examinations included hs-cTn and NT-ProBNP concentrations using Zybio's chemiluminescence instrument EXI2400 and associated reagent kits, with other parameters measured using standard biochemical and high-performance liquid chromatography methods.\n\nResults\nOf the 1247 participants (556 males, 691 females) included in the final analysis, 1157 showed detectable hs-cTnI levels above the limit of detection (LOD), yielding an overall detection rate of 92.78%. Detection rates were 93.71% in males and 92.04% in females. The total imprecision (CV) of hs-cTnI at gender-specific 99th percentile values over 20 days was below 10%, meeting the high-sensitivity criteria. This finding was consistent across lower and higher concentration ranges.\n\nConclusions\nThe Zybio hs-cTnI assay demonstrated a high detection rate in a healthy population, with 92.78% detectability overall, 93.71% in males, and 92.04% in females. 
The assay met the high-sensitivity criteria of IFCC, with a total imprecision (CV) of less than 10% at the gender-specific 99th percentile levels. These results validate the utility of Zybio's hs-cTnI assay for clinical application in the early diagnosis of ACS.", 
"Background\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficiency testing (PT) criteria will directly regulate Troponin I performance for the very first time. The new CLIA goal is 0.9 ng/mL or 30%, whichever is greater. CAP previously set a goal of 30% or 3 times the group standard deviation (SD), whichever is greater, a more permissive setting. Estimates of current instrument group performance from an international proficiency testing (PT) survey have shown none of the 5 major diagnostic instruments can achieve the biological minimum goal at a 6-Sigma level, while 4 of the 5 instruments perform at 3 Sigma or below. Performance of these platforms was assessed using the methodology introduced in 2006 by Westgard JO and Westgard SA. The DxI 9000 high sensitivity Troponin I assay was assessed to determine if it could achieve the new CLIA 2024 goals.\n\nMethods\nThe DxI 9000 high sensitivity Troponin I assay was assessed with three reagent lots, on both serum and Lithium Heparin (LiHep) samples, following Clinical Laboratory Standards Institute (CLSI) protocols EP05 and EP09 to estimate imprecision and bias. The new CLIA 2024 PT criteria supplied the allowable total error for the standard Sigma-metric calculation: Sigma-metric = (TEa - |bias|) / SD The Sigma-metric predicts not only future problems with PT, but also potential optimization of QC procedures, including fewer Westgard Rules, control levels, even reduced QC frequency which can lead to less cost, time, and materials.\n\nResults\nThe majority (91.7%) of data points across the analytical measuring range for DxI 9000’s high sensitivity Troponin I assay from both serum and LiHep samples achieved 6-Sigma performance. For serum, only 8.3% of samples achieved 5-sigma performance. For LiHep, only 4.2% of the performance was 4 Sigma. 
None of the performance was 3 Sigma or lower.\n\nConclusions\nThe superior precision observed on DxI 9000 high sensitivity Troponin I delivers overwhelming 6-sigma performance when assessed by CLIA’s 2024 goal. This assay is highly unlikely to face PT difficulties and can be optimized for reduced Westgard Rules, reduced control levels leading to a reduction in time, materials, and cost."
)), class = "data.frame", row.names = c(NA, -3L))
2 Comments

  • You have two clear questions here: (1) iterate over the URLs and download each of them; (2) extract text between two strings. The second is easy once you have the first done. The first is typically addressed with lapply or a for loop, but unfortunately these URLs use a captcha or something similar, so none of httr::GET, rvest::read_html, nor rvest::read_html_live works in simple tests. I suggest you edit your question to remove the "extract text" part and ask solely about the first part for now. Scraping may be possible, but a captcha usually means they do not allow scraping. Commented Jun 6 at 18:32
  • They use CloudFlare, this makes scraping very difficult. I don't want to say impossible since there may be ways, but I for one know I've not been able to scrape from them, even when the served site does not have a clear policy against scraping/harvesting data. (I could not see a fair-use or similar policy on doi.org, but I didn't look very long.) "DOI" has an email for getting more information (see doi.org/the-identifier/resources/faqs); if your use is fair, they may give you guidance such as an API to get what you need. Commented Jun 6 at 18:39

2 Answers


As r2evans notes, scraping a bunch of CloudFlare pages is going to be difficult and possibly against their terms of service, and an API is preferable. Fortunately, such APIs exist. For example, the Crossref REST API.

The result for your first DOI can be obtained at https://api.crossref.org/works/10.1093/clinchem/hvae106.001. As you can see, the result is JSON with a slightly unusual abstract field: it contains XML-style (JATS) tags such as <jats:title>Background</jats:title>. Fortunately, these tags helpfully mark where the abstract starts, and they can easily be removed with a function like this:

clean_abstracts <- function(abstracts) {
    abstracts |>
        # remove anything before the start
        sub("^.*?<jats:title>", "", x = _) |>
        # remove the <jats:title> etc. tags
        gsub("<.*?>", "", x = _) |>
        # replace multiple newline/space combos with one newline
        gsub("\n+\\s*\n*", "\n", x = _) |>
        trimws()
}
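As a quick sanity check, here is the same pipeline applied to a small invented snippet (the function is repeated so the example is self-contained; the input string is made up for illustration, not real Crossref output):

```r
# self-contained check of the cleaning pipeline on an invented snippet
clean_abstracts <- function(abstracts) {
    abstracts |>
        sub("^.*?<jats:title>", "", x = _) |>
        gsub("<.*?>", "", x = _) |>
        gsub("\n+\\s*\n*", "\n", x = _) |>
        trimws()
}

snippet <- "<jats:title>Abstract</jats:title>\n<jats:title>Background</jats:title>\n<jats:p>Some text.</jats:p>"
clean_abstracts(snippet)
# [1] "Abstract\nBackground\nSome text."
```

Note that the base-regex default (TRE) lets `.` match newlines, which is why the tag-stripping patterns work across the embedded `\n` characters.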

Using such an approach to clean the abstracts, we are in a position to write a function to obtain them by looping through the URLs (keeping in mind the 50 request/second rate limit):

get_abstracts <- function(dois, base_url = "https://api.crossref.org/works/") {
    urls <- paste0(base_url, sub("https://doi.org/", "", dois))
    results <- lapply(urls, \(url) {
        Sys.sleep(0.02) # stay under the rate limit of 50 requests per second
        httr::GET(url)
    })
    abstracts <- vapply(results, \(result) {
        if (result$status_code != 200) {
            return(NA_character_)
        }
        # return NA rather than NULL if there is no abstract
        # (`%||%` is in base R from 4.4.0; older versions can use rlang's)
        httr::content(result)$message$abstract %||% NA_character_
    }, character(1))
    clean_abstracts(abstracts)
}

You can then just apply this to your dataframe:

dat0 |>
    transform(abstract = get_abstracts(url))

Output:

  url                                          abstract                                                                                                                   
  <chr>                                        <chr>                                                                                                                      
1 https://doi.org/10.1093/clinchem/hvae106.001 "Abstract\nBackground\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated le…"
2 https://doi.org/10.1093/clinchem/hvae106.002 "Abstract\nBackground\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in the early diagnosis of Acute Cor…"
3 https://doi.org/10.1093/clinchem/hvae106.003 "Abstract\nBackground\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficiency testing (PT) criteria will…"

Or the first abstract in full:

"Abstract\nBackground\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated levels of the gut-microbe-associated metabolite trimethylamine-N-oxide (TMAO) have been associated with increased risk for CVD mortality in many large independent studies. In fact, large laboratory corporations such as Labcorp and Quest Diagnostic now offer TMAO diagnostic tests for the assessment of CVD risk and as a marker for disease-associated dysbiosis, using samples obtained from fasting patients. Although the strong association between TMAO levels and CVD risk has been established in fasting blood samples, here we are investigating the potential for a single meal to dynamically alter TMAO-related metabolites based on sex, diverse dietary substrates, and microbial influences.\nMethods\n35 healthy participants were randomized to one of four study groups with approximately 8-9 subjects in each arm. Half of the participants were randomly assigned to receive a three-day broad spectrum antibiotic regimen (ciprofloxacin, metronidazole, and vancomycin) while the others received no antibiotics. These two groups were further subdivided into a group consuming a highly processed meal (originating from local fast food restaurants) or alternatively a whole food meal (containing a variety of fruits, vegetables, and healthy fiber). Blood samples were taken at baseline (after an overnight fast), and then postprandially at 15 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, and 6 hours after meal ingestion. Targeted plasma metabolites were quantified using a stable isotope dilution, liquid chromatography - tandem mass spectrometry (LC-MS/MS) method\nResults\nWhile plasma TMAO levels between food groups did not change significantly within the total cohort, there were clear individualized responses to either highly processed or whole food meals. 
Separation based on sex showed that TMAO levels were significantly reduced in females at 6 hours but remained steady in males for both food groups. Particularly, in the processed food group, TMAO levels were significantly lower in females than males at the 6-hour time point. Neither of these observations held true for TMAO precursors choline, carnitine, or betaine. However, plasma levels of a dietary precursor for TMAO, γ-butyrobetaine, showed clear diet-microbe-host interactions. For males in the processed food group, plasma γ-butyrobetaine levels were significantly increased in subjects on broad spectrum antibiotics.\nConclusions\nOur results show the postprandial levels of TMAO, and its nutrient precursors dynamically change in a diet-, microbe-, and sex-dependent manner. These findings provide new insights into the postprandial levels of TMAO-related metabolites and may inform precision nutritional approaches in those who could benefit from TMAO-lowering strategies."

A note on DOIs

DOIs are quite complicated and there are many ways to register them. All sorts of resources can have a DOI - e.g. GitHub repos - which I wouldn't expect to be on Crossref, and which wouldn't generally contain an abstract even if they were. Crossref says it holds metadata for approximately 150 million scholarly artifacts. That covers the three papers in your sample data and a selection of others I tried, including some published in the last few weeks. However, I also found a paper it couldn't resolve (technically it's at the "accepted" rather than "published" stage, but there is a preprint with a DOI). I'm sure you'll find some papers this approach misses: some will be dead or incorrect DOIs, while others will simply be missing from Crossref. The answer from r2evans would be a good way to work out which is which, and to mop up the stragglers.
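To work out which DOIs Crossref is missing, one option is to check the HTTP status of each works URL; here is a sketch (the helper names crossref_url() and has_crossref() are mine, not from any package):

```r
# sketch: flag DOIs that the Crossref API cannot resolve (non-200 status),
# so they can be mopped up manually; helper names are invented
crossref_url <- function(dois) {
    paste0("https://api.crossref.org/works/", sub("https://doi.org/", "", dois))
}

has_crossref <- function(dois) {
    vapply(crossref_url(dois), \(u) {
        Sys.sleep(0.02) # same rate limit as above
        httr::GET(u)$status_code == 200
    }, logical(1), USE.NAMES = FALSE)
}
```

Then `dat0$url[!has_crossref(dat0$url)]` gives the URLs to handle with r2evans's manual approach.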


3 Comments

The API is what I was hoping somebody would find/identify! Awesome, thanks for making my answer nearly moot :-)
@r2evans CrossRef is large but not comprehensive so I actually think the answers complement each other - I've added a note to the end of my answer to this effect.
Thanks SamR, a little complex for me but it works perfectly! And thanks r2evans for the initial response. In less than 3 minutes, I was able to retrieve the approximately 700 abstracts, indeed thanks to APIs. These open access abstracts often contain valuable information, but unfortunately are not all referenced on PubMed.

Since doi.org uses CloudFlare, I'm going to skip (ignore) the notion of scraping it manually. I don't know that it's impossible, but it can be difficult. Your question really is three questions anyway:

  1. How to scrape one URL from doi.org;
  2. How to scrape many URLs in a loop; and
  3. How to extract the text you want.

I'll skip the first two and use a manual browser technique. While less convenient than I'm sure you want, it may shorten your development timeline significantly if you have a manageable number of URLs to scrape. If you can get a "how to scrape" working, then fn should still work for you.

fn <- function(txt = clipr::read_clip(), from = "Abstract", to = "Issue Section:") {
  if (missing(txt)) message("# copied ", length(txt), " lines from the clipboard")
  if (!all(c(from, to) %in% txt)) {
    stop("unable to find both 'from' and 'to' in the copied text")
  }
  out <- txt[ (which(txt %in% from)[1]+1):(which(txt %in% to)[1]-1) ]
  paste(out[nzchar(out)], collapse = "\n")
}

Now navigate to the first URL in your regular browser. Once the page renders, highlight everything, or at least the portion that includes both Abstract and Issue Section: (since both need to be found). Copy it to your clipboard (Ctrl-C, or Cmd-C on macOS). Then run fn():

res <- fn()
substring(res, 1, 80)
# # copied 157 lines from the clipboard
# [1] "Background\nCardiovascular disease (CVD) remains the leading cause of death in de"

POC complete. Now to automate that:

dat0$abstract <- NA_character_ # only once!
for (i in seq_len(nrow(dat0))) {
  if (is.na(dat0$url[i]) || !nzchar(dat0$url[i]) || !is.na(dat0$abstract[i])) next
  message("# navigate to ", sQuote(dat0$url[i], FALSE), ", and copy all text")
  clipr::write_clip(dat0$url[i])
  ign <- readline("press enter once copied to the clipboard")
  res <- try(fn(), silent = TRUE)
  if (inherits(res, "try-error")) {
    message("# error when extracting: ", conditionMessage(res))
  } else {
    dat0$abstract[i] <- res
  }
}

When you run that code block, it will iterate over each URL in dat0 and prompt you for what to do. When it says "navigate to", go to your browser's address bar and paste (in Firefox, with the browser focused, I hit Ctrl-L, then Ctrl-V to paste the URL from the clipboard, then Enter). Once the page loads, click anywhere in the page itself (not on a link), then hit Ctrl-A and Ctrl-C to copy all of the text into the clipboard. Back in R, hit Enter, and that row should be done.

Notes:

  • If using a Mac, you'll likely need to use Cmd-L, Cmd-V, Cmd-C;
  • If using a different browser, you may need different hotkeys. Frankly, Ctrl-L is just a nicety; you can always use your mouse to get to the address bar. For speed I prefer keyboard-only, but that's a personal preference.
  • If you get interrupted, as long as you do not overwrite dat0$abstract <- NA_character_ again, you can stop the process and start up again later, and it will only work on rows where abstract has not been filled.
  • This function does a little checking, but clearly changes in text structure or other things may cause errors. Have fun.
  • I chose to use strict equality for finding the start/end strings. I suspect doi.org will remain consistent since I think it's a template, but that's no guarantee. I tend to argue against regexes for things like this, but if you go that route make sure you guard against: 0 matches, 2+ matches, and incorrect matches. Good luck, we're all counting on you.
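If you do go the regex route, a guarded variant might look like this (a sketch only: fn_regex and its defaults are mine, and the patterns assume "Abstract" and "Issue Section" each render on their own line):

```r
# sketch: regex-based extraction that fails loudly on 0 or 2+ matches,
# guarding against the pitfalls mentioned above
fn_regex <- function(txt, from = "^Abstract$", to = "^Issue Section:?$") {
  i <- grep(from, txt)
  j <- grep(to, txt)
  if (length(i) != 1L || length(j) != 1L || j <= i) {
    stop("expected exactly one 'from' match before exactly one 'to' match")
  }
  out <- txt[(i + 1):(j - 1)]
  paste(out[nzchar(out)], collapse = "\n")
}

fn_regex(c("Abstract", "Background", "Some text.", "", "Issue Section:"))
# [1] "Background\nSome text."
```

The explicit length checks are what the strict-equality version above gets for free; without them, a pattern matching twice would silently extract the wrong span.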

After all of that, I have

str(dat0)
# 'data.frame': 3 obs. of  2 variables:
#  $ url     : chr  "https://doi.org/10.1093/clinchem/hvae106.001" "https://doi.org/10.1093/clinchem/hvae106.002" "https://doi.org/10.1093/clinchem/hvae106.003"
#  $ abstract: chr  "Background\nCardiovascular disease (CVD) remains the leading cause of death in developed countries. Elevated le"| __truncated__ "Background\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in the early diagnosis of Acute Cor"| __truncated__ "Background\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficiency testing (PT) criteria will"| __truncated__

substring(dat0$abstract, 1, 80)
# [1] "Background\nCardiovascular disease (CVD) remains the leading cause of death in de"
# [2] "Background\nHigh-sensitivity cardiac troponin (hs-cTn) plays a pivotal role in th"
# [3] "Background\nIn July 2024, the Clinical Laboratory Improvement Act (CLIA) proficie"
