R

Creating missing values in factors

Background: I was looking at some breast cancer data recently and analyzing the ER (estrogen receptor) status variable. It turned out that there were three possible outcomes in the data: Positive, Negative and Indeterminate. I had imported this data as a factor, and wanted to convert the Indeterminate level to a missing value, i.e. NA. My usual method for numeric variables created a rather singular result: x <- as.
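One way to do this kind of conversion in base R is sketched below. The vector er and its values are made up for illustration and are not the post's data: assign NA to the matching elements, then drop the now-unused level.

```r
# Illustrative ER status vector (made-up values, not the original data)
er <- factor(c("Positive", "Negative", "Indeterminate", "Positive"))

# Set every "Indeterminate" element to missing
er[er == "Indeterminate"] <- NA

# The level itself still exists until it is dropped explicitly
er <- droplevels(er)

levels(er)
sum(is.na(er))
```

Without the droplevels() call, "Indeterminate" would remain as an empty level and still show up in tables and model output.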

Selecting columns based on type

The tidyverse and, in particular, dplyr, provides functions to select columns from a data frame. There are three scoped functions available: select_all, select_if and select_at. In this post, we’ll look at a particular application of select_if, i.e., capturing the names of numeric variables. A quick search using Google finds a few solutions to this problem. As an example data set, I’ll use the diamonds data set from the ggplot2 package.
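A minimal sketch of the idea, assuming dplyr and ggplot2 are installed (the object name numeric_vars is just for illustration): select_if keeps the columns for which the predicate returns TRUE, and names() pulls out their names.

```r
library(dplyr)

# Load the example data set without attaching all of ggplot2
data(diamonds, package = "ggplot2")

# Keep only the numeric columns, then capture their names
numeric_vars <- diamonds %>%
  select_if(is.numeric) %>%
  names()

numeric_vars
```

The factor columns (cut, color and clarity) are dropped because is.numeric() returns FALSE for factors.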

Templated output in R

Earo Wang, who is the curator for the We are R-Ladies Twitter feed this week (last week of April, 2019), had a really nice tweet about using the whisker package to create a template incorporating text and data in R. Her example created a list of tidyverse packages with descriptions. I really liked the example, but thought that the glue package might be able to do the same thing. I used Earo’s code to generate a tibble with the package names and descriptions, and glue_data to create the templated list.
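Here is a rough sketch of the glue_data approach, assuming the glue package is installed. The data frame pkgs, its column names and the description strings are placeholders of mine, not the output of Earo's code, and I use a plain data frame instead of a tibble to keep the example self-contained.

```r
library(glue)

# Placeholder package table; the descriptions are illustrative only
pkgs <- data.frame(
  package = c("dplyr", "ggplot2"),
  description = c("A grammar of data manipulation",
                  "Graphics built on the grammar of graphics"),
  stringsAsFactors = FALSE
)

# glue_data() looks up the names in braces as columns of pkgs,
# producing one templated string per row
items <- glue_data(pkgs, "- {package}: {description}")

items
```

Because the template is applied row by row, adding a package to the table adds a line to the list without touching the template.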

Interchanging RMarkdown and "spinnable" R

Dean Attali wrote this nice post a few years ago describing knitr’s spin function. This function allows a regular R file, with comments written using the roxygen2-style comment tag #', to be rendered as an HTML document with the comments rendered as text and the results of the R code rendered in place, much as an RMarkdown document would. The basic rules for this are (from Dean’s post): any line beginning with #' is treated as a markdown directive (#' # title will be a header, #' some **bold** text renders the word "bold" in bold); any line beginning with #+ is parsed as code chunk options; and you run knitr::spin on the file. In effect, this “spinnable” R script is the complement of an RMarkdown document with respect to format, since the RMarkdown document is primarily a text (Markdown) document with code chunks, and the R script is primarily a code document with text chunks.
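To make those rules concrete, here is a small made-up script in the spinnable format. The chunk labels, the variable names and the file name mentioned afterwards are all illustrative.

```r
#' # Summary of a small sample
#'
#' The data below are **simulated**, so the numbers mean nothing.

#+ simulate, echo=TRUE
set.seed(1)
x <- rnorm(10)

#' The next chunk reports the sample mean.

#+ report
mean_x <- mean(x)
mean_x
```

The file still runs as ordinary R, since the #' and #+ lines are just comments. Saved as, say, analysis.R (a hypothetical name), it could be rendered with knitr::spin("analysis.R").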

Modifying Excel Files using openxlsx

I’ve been creating several output tables for a paper, which I usually store as sheets in an Excel file, since my collaborators are entirely in the Microsoft Office ecosystem. One issue I often run into is having to modify a single sheet in that file with updated data, while keeping the rest of the file intact. This is necessary since I’ve perhaps done some custom formatting in Excel on some of the tables, and I don’t want to re-format them every time I modify a single sheet.
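A sketch of one way to do this with openxlsx, assuming the package is installed. The workbook here is a temporary stand-in built from built-in data sets, and the sheet names Table1 and Table2 are my own. The idea is to load the whole workbook, overwrite the data on one sheet, and save it back.

```r
library(openxlsx)

# Stand-in for an existing results file with two sheets
path <- tempfile(fileext = ".xlsx")
write.xlsx(list(Table1 = head(mtcars), Table2 = head(iris)), path)

# Updated data for one sheet (same shape as the data it replaces)
updated <- tail(iris)

# Load the full workbook, write over a single sheet, and save in place
wb <- loadWorkbook(path)
writeData(wb, sheet = "Table2", x = updated)
saveWorkbook(wb, path, overwrite = TRUE)

getSheetNames(path)
```

Note that writeData only overwrites the cells it writes to, so if the updated table is smaller than the old one, stale cells from the old table would be left behind.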

Bootstrapping clustered data

When evaluating the sampling variability of different statistics, I’ll often use the bootstrap procedure to resample my data, compute the statistic on each sample, and look at the distribution of the statistic over several bootstrap samples. In principle, the bootstrap is straightforward to do. However, if you have correlated data (like repeated measures or longitudinal data or circular data), the unit of sampling no longer is the particular data point but the second-level unit within which the data are correlated; otherwise you break the correlation structure of the data by doing a naive bootstrap and distort the resultant distributions.
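As a minimal base R sketch of resampling at the cluster level: the data, the column names and the helper cluster_boot are all hypothetical. The function draws subject IDs with replacement and keeps every row belonging to each drawn subject.

```r
set.seed(42)

# Hypothetical long-format data: 20 subjects, 5 repeated measures each
dat <- data.frame(id = rep(1:20, each = 5), y = rnorm(100))

# Resample whole clusters (subjects), not individual rows
cluster_boot <- function(data, id_col) {
  ids <- unique(data[[id_col]])
  sampled <- sample(ids, length(ids), replace = TRUE)
  pieces <- lapply(seq_along(sampled), function(i) {
    rows <- data[data[[id_col]] == sampled[i], , drop = FALSE]
    rows$boot_id <- i  # relabel so a cluster drawn twice stays distinct
    rows
  })
  do.call(rbind, pieces)
}

# Bootstrap distribution of the overall mean
boot_means <- replicate(200, mean(cluster_boot(dat, "id")$y))
quantile(boot_means, c(0.025, 0.975))
```

Each bootstrap sample keeps the within-subject rows together, so the correlation structure inside a subject is preserved.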

Joint Statistical Meetings Talk

The Joint Statistical Meetings are being held in the beautiful city of Vancouver, British Columbia this year. I gave a talk on data visualization, which is a new topic for me, but an area I’m quite excited about. I’ve been looking into the newer toolsets using JavaScript graphics for a while now for different projects, and this talk gave me a chance to coalesce my thoughts about data visualization principles and usage, and how the new toolsets can help.

Cleaning up tables

Context: One of the things I have to do quite often is create tables for papers and presentations. Often the “Table 1” of a paper has descriptives about the study, broken down by subgroups. For presentation purposes, it doesn’t look good (to me, at least) when the name of each subgroup is repeated down one column of the table. One way to deal with this is, of course, by hand. Save the table as a CSV or Excel file, open it up in your favorite spreadsheet program, and prettify things.
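The same cleanup can also be done in code. This is a base R sketch on a made-up table (the column names and values are purely illustrative): blank out a subgroup name whenever it matches the row above it.

```r
# Made-up "Table 1"-style summary; the values are illustrative only
tab <- data.frame(
  group = c("Male", "Male", "Female", "Female"),
  variable = c("Age", "BMI", "Age", "BMI"),
  value = c(52.1, 27.3, 49.8, 26.1),
  stringsAsFactors = FALSE
)

# TRUE where a row's group equals the group in the row directly above
repeated <- c(FALSE, tab$group[-1] == tab$group[-nrow(tab)])

clean <- tab
clean$group[repeated] <- ""

clean
```

Comparing each row with the one above it, instead of using duplicated(), keeps a label visible if the same subgroup name shows up again further down the table.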

Tidying messy Excel data (tidyxl)

Well, here’s what I was dealing with: (You can download this dataset for your playtime here) Notice that we have 3 header rows: the first with patient IDs, the second with spine region, and the third with variable names (A and B, to protect the innocent). Goal: a dataset that, for each patient and each angle, gives us the corresponding values of A and B. So this would be a four-column data set with ID, angle, A and B.