1

I have dataset with the following format. I was trying to do that with reshape2 package in R, but it gives an inappropriate format(binary variables for all pages). Is there any method available that can reshape the data set in the required format as below.

Input format:
User    Pages
1   index.html
1   search.html
1   help.html
1   contact.html
2   help.html
2   contact.html
3   index.html
3   search.html
3   feedback.html

Output format:
User    page1       page2         page3         page4         page5
1       index.html  search.html   help.html     contact.html  NA
2       help.html   contact.html  NA            NA            NA
3       index.html  search.html   feedback.html NA            NA
3
  • Why do you need the data in a wide form like this? Commented Apr 27, 2013 at 10:24
  • @Ananda Mahto I am interesting in knowing the visited pages of every users for website from Google Analytics data. You may know Google analytics provides the data in vertical manner so need to restructure the data. Commented Apr 27, 2013 at 13:18
  • The data are in that format because it is usually easier to do analysis on "long" data rather than on "wide" data. That said, you may also be interested in exploring the aggregate function, with which you can condense your "Pages" variable into a list by user. Something like aggregate(as.character(Pages) ~ User, mydf, I, simplify = FALSE), where "mydf" is the name of your dataset. Commented Apr 27, 2013 at 15:51

2 Answers 2

9

Use the function dcast from the reshape2 package:

library(reshape2)

txt <- "User    Pages
1   index.html
1   search.html
1   help.html
1   contact.html
2   help.html
2   contact.html
3   index.html
3   search.html
3   feedback.html"

mydf <- read.table(text=txt, header=TRUE)

#creating a new column to count the page number:
mydf$page <- paste("Page", unlist((sapply(table(mydf$User), seq))))  

new.df <- dcast( mydf, User ~ page, value.var="Pages") #here the magic happens. 

> print(new.df)
   User     Page 1       Page 2        Page 3       Page 4
1    1 index.html  search.html     help.html contact.html
2    2  help.html contact.html          <NA>         <NA>
3    3 index.html  search.html feedback.html         <NA>
Sign up to request clarification or add additional context in comments.

2 Comments

Hey zelite, Thanks for code this. Its working on this sample dataset. But when I am applying this on large dataset it gives me the binary values(0 or 1) instead of the page information(like index.html, search.html). Is there something need to be updated while applying on another dataset with the same format (User, Pages).
Can you check the str of your original dataframe? is there something in the wrong type? an example of code where the output becomes 0 or 1 would be helpful to understand were the code fails.
2

incorporating @zelite's amazing unlist trick

x <- read.table( text = "User    Pages
1   index.html
1   search.html
1   help.html
1   contact.html
2   help.html
2   contact.html
3   index.html
3   search.html
3   feedback.html", h=T)

library(reshape2)

x$tv <- unlist((sapply(table(x$User), seq)))

reshape( x , idvar = 'User' , timevar = 'tv' , direction = 'wide' )

4 Comments

Another possibility using plyr's rbind.fill with unstack: > rbind.fill(lapply(unstack(df, Pages ~ User), function(x) as.data.frame(matrix(x, nrow=1))))
Hi Anthony, Thanks for optimized code. This is generating the structure but I need all the pages information to be in left side of the NA values (As you already have provided for previous code but it takes too much time to complete on my large dataset (21,000 rows). Can we have function for doing that?)
Provided that "Pages" is a character variable, a more direct way to create x$tv is: x$tv <- with(x, ave(Pages, User, FUN = seq_along)).
@AnandaMahto so long as you wrap Pages in as.numeric :)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.