R

The need for documenting functions

My current work usually requires me to work on a project until we can submit a research paper, and then move on to a new project. However, 3-6 months down the road, when the reviews for the paper return, it is quite common to have to do some new analyses or re-analyses of the data. At that time, I have to re-visit my code! One of the common problems I (and, I’m sure, many of us) have is a tendency to hack code and functions with only the end in mind, just getting the job done.

Kaplan-Meier plots using ggplot2 (updated)

About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from 0.8.9 to 0.9.3.1) and has changed syntactically. Since that post, I have also become comfortable with Git and Github. I have updated the code, edited it for a small error, and published it in a Gist. This gist has two functions, ggkm (basic Kaplan-Meier plot) and ggkmTable (enhanced Kaplan-Meier plot with table showing numbers at risk at various times).

Pocketbook costs of software

I have always been provided SAS as part of my job, so I never really realized how much it cost. I’ve bought Stata before, and of course R :). I recently found out how much a reasonable bundle of SAS modules along with base SAS costs per year per seat, at least under the GSA. I tried finding out how much IBM SPSS is for a comparable bundle, but their web page was “not available”.

An enhanced Kaplan-Meier plot, updated

RStudio 0.94.92 visited

I just updated my RStudio version to the latest, v.0.94.92 (will this asymptotically approach 1, or actually get to 1?). It was nice to see the number of improvements the development team has implemented, based, I’m sure, on community feedback. The team has, in my experience, been extraordinarily responsive to user feedback, which must have played a large part in the development path they have taken. First and foremost, I was happy to see most of my wants met in this version:

A word of warning about grep, which and the like

I’ve often selected columns or rows of a data frame using grep or which, based on some property. That is inherently sound, but the trouble comes when you wish to remove rows or columns based on that grep or which call, e.g., dat <- dat[,-grep('\\.1', names(dat))], which would remove columns with a .1 in the name. This is fine the first time around, but if you forget and re-run the code, grep('\\.
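
The excerpt above is cut off, so the following is only a sketch of the hazard it appears to be heading toward, not the author's own example. It assumes a small made-up data frame (dat, with a hypothetical duplicated column x.1) and shows what happens when the grep call finds nothing on a second run: negating an empty index yields an empty index, so every column is dropped.

```r
# Hypothetical data: x.1 is an unwanted duplicate of x
dat <- data.frame(id = 1:3, x = c(2.5, 3.1, 4.0), x.1 = c(2.5, 3.1, 4.0))

# First run: grep finds column 3, and negative indexing removes it
dat <- dat[, -grep('\\.1', names(dat))]
names(dat)

# Second run: nothing matches, so grep returns integer(0).
# -integer(0) is still integer(0), which selects NO columns.
idx <- grep('\\.1', names(dat))
broken <- dat[, -idx]
ncol(broken)

# A safer pattern: a logical mask from grepl keeps everything
# when nothing matches, so re-running it is harmless
safe <- dat[, !grepl('\\.1', names(dat)), drop = FALSE]
names(safe)
```

The logical mask has one element per column regardless of how many match, which is why it stays correct when the code is run twice. The same reasoning applies to which() used with a minus sign.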

SAS, R and categorical variables

One of the disappointments of SAS (which I need for PROC MIXED in some analyses) is how awkward it is to recode categorical variables to have a particular reference category. In R, my usual tool, this is rather easy both to set and to modify using the relevel function available in base R (in the stats package). My understanding is that this is actually easy in SAS for GLM, PHREG and some others, but not in PROC MIXED.
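
As a minimal sketch of the R side of this comparison, here is relevel applied to a made-up factor (the variable name and its levels are assumptions for illustration only). By default R orders factor levels alphabetically and treats the first level as the reference in model fitting; relevel moves the chosen level to the front and leaves the others in their existing order.

```r
# Hypothetical treatment variable
treatment <- factor(c("placebo", "low", "high", "placebo", "high"))

# Default ordering is alphabetical, so "high" would be the reference
levels(treatment)

# Make "placebo" the reference category instead
treatment <- relevel(treatment, ref = "placebo")
levels(treatment)
```

Changing the reference again later is just another call to relevel with a different ref value.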

An enhanced Kaplan-Meier plot

We often see, in publications, a Kaplan-Meier survival plot, with a table of the number of subjects at risk at different time points aligned below the figure. I needed this type of plot (or really, matrices of such plots) for an upcoming publication. Of course, my preferred toolbox was R and the ggplot2 package. There were other attempts to do this type of plot in ggplot2, mainly by Gary Collins and an anonymous author as seen on the ggplot2 mailing list.
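
To make the idea concrete, here is a rough sketch of the two ingredients such a figure needs: the step curve and the numbers at risk at chosen times. This is not the ggkm or ggkmTable code mentioned on this blog; it is an illustration that assumes the survival and ggplot2 packages are installed, uses the lung data set that ships with survival, and picks the time points arbitrarily. Aligning the table underneath the plot is left out.

```r
library(survival)
library(ggplot2)

# Kaplan-Meier estimate by sex, using the lung data from the survival package
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Flatten the fit into a data frame: fit$strata holds the number of
# time points in each stratum, in the same order as fit$time
km <- data.frame(
  time   = fit$time,
  surv   = fit$surv,
  strata = rep(names(fit$strata), times = fit$strata)
)

# Start each curve at time 0 with survival probability 1
km <- rbind(data.frame(time = 0, surv = 1, strata = names(fit$strata)), km)

# Step plot of the survival curves
p <- ggplot(km, aes(x = time, y = surv, colour = strata)) +
  geom_step() +
  ylim(0, 1) +
  labs(x = "Days", y = "Survival probability")

# Numbers at risk at a few (arbitrarily chosen) time points
at_risk <- summary(fit, times = c(0, 250, 500, 750))
risk <- data.frame(
  time   = at_risk$time,
  strata = at_risk$strata,
  n.risk = at_risk$n.risk
)
risk
```

Printing p draws the curves; the risk data frame holds the values that would go in the table beneath the figure.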

RStudio: a cut above

As most followers of R-bloggers.com and the Twitter #rstats know by now, RStudio is a new open-source IDE for R that was beta-released yesterday. I have started putting it through its paces within my R workflow, and my impressions are more than favorable. I also tried it out on my home Linux server in server mode. RStudio is obviously designed by people who actually use R and code in R for their data analyses.

The split-apply-combine paradigm in R

Last night at the DC R Users meetup, which was our largest meetup to date, I gave an introductory presentation on data munging, and spent a bit of time on the split-apply-combine paradigm that I use almost daily in my work. I talked mainly about the packages plyr and doBy, which I use a lot now. David Smith posted a link on the Revolution blog to this article by Steve Miller, talking about the virtues of the data.
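
For readers new to the idea, here is a small sketch of split-apply-combine in base R, using the built-in mtcars data. It is not taken from the presentation, and it deliberately avoids plyr and doBy so that it runs without extra packages; the summary columns (n, mean_mpg) are names chosen for illustration.

```r
# Split: one data frame per number of cylinders
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute a one-row summary for each piece
summaries <- lapply(pieces, function(d) {
  data.frame(cyl = d$cyl[1], n = nrow(d), mean_mpg = mean(d$mpg))
})

# Combine: stack the per-group summaries into a single data frame
result <- do.call(rbind, summaries)
result
```

Packages such as plyr wrap these three steps in a single call, for instance ddply(mtcars, "cyl", summarise, n = length(mpg), mean_mpg = mean(mpg)), which should give the same table.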