Question

I've got a fixed-width flat file with neither carriage returns nor line feeds (a dump from an AS400).

How do I load this file into an R data.frame?

I've tried different combinations of textConnection and read.fwf, to no avail.

The code below crashes RStudio, so I'm assuming I'm overloading the system.

len below is 24376400, which is tame compared with the files I usually load using read.table. The record length is 400.

Is there a RECLEN parameter I should set, similar to SAS? Is there an option to set EOL = "\n" or "\r\n"? Thank you.

fname <- "AS400FILE.TXT"
len <- file.info(fname)$size
conn <- file(fname, 'r')
contents <- readChar(conn, len)
close(conn)

df <- read.fwf( textConnection(contents) , widths=layout$length , sep="")

> dput(layout)
structure(list(start = c(1L, 41L, 81L, 121L, 161L, 201L, 224L, 
226L, 231L, 235L, 237L, 238L, 240L, 280L, 290L, 300L, 305L, 308L, 
309L, 330L, 335L, 337L, 349L, 350L, 351L, 355L, 365L), end = c(40L, 
80L, 120L, 160L, 200L, 223L, 225L, 230L, 234L, 236L, 237L, 239L, 
279L, 289L, 299L, 304L, 307L, 308L, 329L, 334L, 336L, 348L, 349L, 
350L, 354L, 364L, 400L), length = c(40L, 40L, 40L, 40L, 40L, 
23L, 2L, 5L, 4L, 2L, 1L, 2L, 40L, 10L, 10L, 5L, 3L, 1L, 21L, 
5L, 2L, 12L, 1L, 1L, 4L, 10L, 36L), label = c("TITLE", "SUFFIX", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE", 
"ZIP", "ZIP+4", "DELIVERY", "CHECKD", "FILLER", "NAME", 
"SOURCECODE", "ID", "FILLER", "BATCH", "FILLER", "FILLER", 
"GRID", "LOT", "FILLER", "CONTROL", 
"ZIPIND", "TROUTE", "SOURCEA", "FILLER")), .Names = c("start", 
"end", "length", "label"), class = "data.frame", row.names = c(NA, 
-27L))
> dim(layout)
[1] 27  4
> 

Solution

You could use readChar for this.

First, make up some sample data. As far as I can tell from the question, the format is a wall of text with a fixed width per column and no newlines anywhere in the file:

lengths <- c(2,3,4,2,3,4)
nFields <- length(lengths)
nRows   <- 10              # let's make 10 rows.
contents <- paste(letters[sample.int(26,size=sum(lengths)*nRows,replace=TRUE)],
                  collapse="")
#> contents
#[1] "lepajmcgcqooekmedjprkmmicm.......
cat(contents,file='test.txt')

I can think of three ways to do it, each with different trade-offs:

If you know the number of rows in advance you can do:

# If you know #rows in advance..
# open in binary mode: readChar warns when given a text-mode connection
conn <- file('test.txt','rb')
data <- readChar( conn, rep(lengths,nRows) )
close(conn)
# reshape data to dataframe, one record per row
df <- data.frame(matrix(data,ncol=nFields,byrow=TRUE))

Otherwise you can use a loop (why read in the file once to work out the number of rows and then again to parse?)

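# Otherwise use a loop
# open in binary mode: readChar warns when given a text-mode connection
conn <- file('test.txt','rb')
df <- data.frame(matrix(nrow=0,ncol=nFields)) # initialise 0-row data frame
# readChar returns a zero-length vector once the end of the file is reached
while ( length(data <- readChar(conn, lengths)) > 0 ) {
    df[nrow(df)+1,] <- data
}
close(conn)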

Or, since you already have all of contents in a string, you can just split the string using substring:

# have already read in contents so can calculate nRows
nRows <- floor(nchar(contents)/sum(lengths)) # 10 for my example
starts <- c(0,cumsum(lengths[-nFields]))
df3 <- data.frame(t(
                    vapply( seq(1,nRows*sum(lengths),sum(lengths)),
                    function(r) 
                        substring(contents,starts+r,starts+r+lengths-1),
                    rep("",nFields) )))
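The same substring idea can be driven directly by a layout table with start and end columns, like the one in the question. This is only a minimal sketch: the three-field layout, the ZIP3 label, and the two-record contents string below are made up for illustration, and it assumes every record is exactly max(layout$end) characters long.

```r
# Hypothetical miniature layout, same shape as the question's `layout`
layout <- data.frame(start = c(1L, 3L, 6L),
                     end   = c(2L, 5L, 9L),
                     label = c("STATE", "ZIP3", "BATCH"),
                     stringsAsFactors = FALSE)

# Two made-up 9-character records with no separator between them
contents <- "NY100B001CA902B002"

reclen  <- max(layout$end)                  # 9 here; 400 in the question
nRows   <- nchar(contents) %/% reclen       # number of complete records
offsets <- (seq_len(nRows) - 1L) * reclen   # 0-based offset of each record

# One substring() call per field, vectorised over all records
fields <- lapply(seq_len(nrow(layout)), function(i)
    substring(contents, offsets + layout$start[i], offsets + layout$end[i]))
names(fields) <- layout$label

df_named <- as.data.frame(fields, stringsAsFactors = FALSE)
```

With the question's real layout, labels that are repeated (FILLER) or not syntactic (ZIP+4) may be adjusted by R when the data frame is built, so it is probably worth making them unique first.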

If you want to do it in as few file reads as possible, I suggest the second or third method.

The third method "feels" most elegant to me, but requires you to read in the entire contents all at once, which, depending on file size, may not be viable.

If that's the case I'd go for the second, which only reads in one set of nFields fields at a time.
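A middle ground between the second and third methods is to read a batch of records per readChar call and bind the batches together at the end, which avoids growing the data frame one row at a time. The sketch below is an assumption-laden illustration, not a tested recipe for the real file: read_fwf_chunked and chunk_records are hypothetical names, and the temporary file just holds ten made-up 18-character records.

```r
# Hypothetical helper: read `chunk_records` records per readChar call
read_fwf_chunked <- function(path, lengths, chunk_records = 1000L) {
  nFields <- length(lengths)
  conn <- file(path, "rb")            # binary mode, as readChar expects
  on.exit(close(conn))
  chunks <- list()
  repeat {
    data  <- readChar(conn, rep(lengths, chunk_records))
    nFull <- length(data) %/% nFields        # complete records in this chunk
    if (nFull == 0L) break                   # nothing left to read
    data <- data[seq_len(nFull * nFields)]   # drop fields of an incomplete record
    chunks[[length(chunks) + 1L]] <- matrix(data, ncol = nFields, byrow = TRUE)
    if (nFull < chunk_records) break         # short chunk: end of file reached
  }
  data.frame(do.call(rbind, chunks), stringsAsFactors = FALSE)
}

# Usage on a made-up file: 10 records of 18 characters, no newlines
tmp <- tempfile(fileext = ".txt")
cat(paste(rep("abcdefghijklmnopqr", 10), collapse = ""), file = tmp)
df_chunked <- read_fwf_chunked(tmp, c(2, 3, 4, 2, 3, 4), chunk_records = 4L)
```

The chunk size only trades memory for the number of read calls; the result should be the same whatever value is used.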

I don't recommend the first unless you already know the number of rows; it was just my first attempt. Without that, you have to read the file once to determine the number of rows, close it, and then read it again, in which case you may as well use method 3. If you do know the row count by some other means, the first method works fine.
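One such "other means" is the file size: when every record has the same length and the file uses a single-byte encoding (an assumption about the AS400 dump, not something the question confirms), the row count is simply the size in bytes divided by the record length, with no need to read the file first. A small sketch on a made-up temporary file:

```r
lengths <- c(2, 3, 4, 2, 3, 4)
nFields <- length(lengths)
reclen  <- sum(lengths)                  # 18 here; 400 in the question

# Made-up test file: 10 records of 18 characters, no newlines
tmp <- tempfile(fileext = ".txt")
cat(paste(rep("abcdefghijklmnopqr", 10), collapse = ""), file = tmp)

# Row count from the file size alone (assumes one byte per character)
nRows <- file.info(tmp)$size %/% reclen

conn <- file(tmp, "rb")
data <- readChar(conn, rep(lengths, nRows))
close(conn)

df_sized <- data.frame(matrix(data, ncol = nFields, byrow = TRUE),
                       stringsAsFactors = FALSE)
```

For the file in the question this would give 24376400 / 400 = 60941 records.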

License: CC-BY-SA with attribution
Not affiliated with StackOverflow