Populating a column in R with results of a dynamic query looping through a dataframe

https://stackoverflow.com/questions/23300105

09-07-2023

Question

I have a dataframe, df:

  Chrom Position Gene.Sym Ref Variant   Lbase   Rbase
1  chr1   888639    NOC2L     T         C  888638  888640
2  chr1   889158    NOC2L     G         C  889157  889159
3  chr1   889159    NOC2L     A         C  889158  889160
4  chr1   982941     AGRN     T         C  982940  982942
5  chr1  1888193 KIAA1751     C         A 1888192 1888194
6  chr1  3319632   PRDM16     G         A 3319631 3319633

and I would like to populate a new column, df$triplet, with the sixth element of the readLines() result for a query. For example:

> readLines('http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr20:1888192,1888194')
[1] "<?xml version=\"1.0\" standalone=\"no\"?>"                                  
[2] "<!DOCTYPE DASDNA SYSTEM \"http://www.biodas.org/dtd/dasdna.dtd\">"          
[3] "<DASDNA>"                                                                   
[4] "<SEQUENCE id=\"chr20\" start=\"1888192\" stop=\"1888194\" version=\"1.00\">"
[5] "<DNA length=\"3\">"                                                         
[6] "cct"                                                                        
[7] "</DNA>"                                                                     
[8] "</SEQUENCE>"                                                                
[9] "</DASDNA>"

I want to put "cct" in df like so:

  Chrom Position Gene.Sym Ref.y Variant.y   Lbase   Rbase    triplet
1  chr1   888639    NOC2L     T         C  888638  888640    cct
2  chr1   889158    NOC2L     G         C  889157  889159
3  chr1   889159    NOC2L     A         C  889158  889160
4  chr1   982941     AGRN     T         C  982940  982942
5  chr1  1888193 KIAA1751     C         A 1888192 1888194
6  chr1  3319632   PRDM16     G         A 3319631 3319633

except that I would like to loop over the values in df$Chrom, df$Lbase, and df$Rbase in such a way as to fill the entire column. I know it would be something like the following, but I'm too noobly to figure it out exactly:

baseurl = 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment='
myurl = paste(baseurl, trip$Chrom, ":", trip$Lbase, ",", trip$Rbase, sep='')
x = readLines(myurl)

Solution

The idiomatic way is to parse the XML:

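library(XML)      # xmlInternalTreeParse(), xmlValue()
library(stringr)  # str_extract()

# Fetch the bases for row i of trip from the UCSC DAS server.
f <- function(i) {
  x       <- trip[i, ]
  segment <- paste0(x$Chrom, ":", x$Lbase, ",", x$Rbase)
  url     <- paste0("http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=", segment)
  doc     <- xmlInternalTreeParse(url)
  # The <DNA> node holds the sequence padded with newlines; keep only the letters.
  str_extract(xmlValue(doc["//DNA"][[1]]), "[a-z]+")
}
trip$triplet <- sapply(seq_len(nrow(trip)), f)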
trip
#   Chrom Position Gene.Sym Ref Variant   Lbase   Rbase triplet
# 1  chr1   888639    NOC2L   T       C  888638  888640     ctt
# 2  chr1   889158    NOC2L   G       C  889157  889159     cga
# 3  chr1   889159    NOC2L   A       C  889158  889160     gaa
# 4  chr1   982941     AGRN   T       C  982940  982942     ctc
# 5  chr1  1888193 KIAA1751   C       A 1888192 1888194     ccg
# 6  chr1  3319632   PRDM16   G       A 3319631 3319633     tgc

This makes one HTTP request per row, so if your data frame is large (many rows) it is likely to take a very long time, and you may get locked out of the server. It would be better to download multiple sections at once and then parse them in R, but I'm not familiar with the API.
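
Short of a real batch download, one way to reduce the load is to query each distinct segment only once, pause between requests, and record a failed request as NA rather than aborting the run. The following is only a sketch under assumptions: fetch_politely and its pause argument are hypothetical names, and the one-second default is a guess, not a documented server limit.

```r
# fetch_politely(): call `fetch` once per distinct URL, pausing between
# requests and turning failures into NA instead of stopping the loop.
# `fetch` is any function that maps one URL to one string.
# `pause` is in seconds; the default is an arbitrary guess.
fetch_politely <- function(urls, fetch, pause = 1) {
  unique_urls <- unique(urls)
  results <- vapply(seq_along(unique_urls), function(k) {
    if (k > 1) Sys.sleep(pause)            # wait before every request but the first
    tryCatch(fetch(unique_urls[k]),
             error = function(e) NA_character_)
  }, character(1))
  # Map the per-URL results back onto the original (possibly repeated) order.
  results[match(urls, unique_urls)]
}

# Usage against the live server (not run here), with urls built as in the question:
# trip$triplet <- fetch_politely(urls, function(u) readLines(u)[6])
```

Passing the fetch step in as a function keeps the throttling logic separate from how a single response is read, so either the XML parser or readLines can be plugged in.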

Other tips

You can use sapply to apply readLines to each URL in the vector you assembled in myurl, and add the output back into your data frame:

df$dna <- sapply(myurl, function(url) readLines(url)[6])
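
Indexing with [6] relies on the response always having the same number of lines. A slightly more defensive variant, sketched below, locates the line or lines between the <DNA ...> and </DNA> tags instead. The helper names build_das_urls and extract_dna are hypothetical, and the code assumes the response layout shown in the question.

```r
# Build one DAS query URL per row (base URL taken from the question).
# format() keeps large positions such as 100000 from printing as "1e+05".
build_das_urls <- function(chrom, lbase, rbase,
                           baseurl = "http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=") {
  paste0(baseurl, chrom, ":",
         format(lbase, scientific = FALSE, trim = TRUE), ",",
         format(rbase, scientific = FALSE, trim = TRUE))
}

# Pull the sequence out of a DAS response that was read with readLines().
# Assumes the bases sit on the lines between the <DNA ...> opening tag
# and the </DNA> closing tag; returns NA if that layout is not found.
extract_dna <- function(lines) {
  open_idx  <- grep("^<DNA ", lines)
  close_idx <- grep("^</DNA>", lines)
  if (length(open_idx) != 1 || length(close_idx) != 1 ||
      close_idx <= open_idx + 1) {
    return(NA_character_)
  }
  paste(trimws(lines[(open_idx + 1):(close_idx - 1)]), collapse = "")
}

# Usage against the live server (not run here):
# urls <- build_das_urls(df$Chrom, df$Lbase, df$Rbase)
# df$triplet <- vapply(urls, function(u) extract_dna(readLines(u)),
#                      character(1), USE.NAMES = FALSE)
```

Using vapply with character(1) makes the call fail loudly if any response yields something other than a single string, and USE.NAMES = FALSE keeps the URLs from being attached as names.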

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow