Replacing plyr’s ddply with data.table to improve an R script’s performance

Since quite a while, I am using R as a scripting language to generate graphs to the benchmarking results of various experiments. Since we have a new 64 core machine at the lab, I happen to run into performance issues with scripts that process the benchmark results, because the number of measurements increased significantly and the plyr library I use was designed with ease of use in mind, instead of performance. Concretely, I was experiencing script runtimes of up to 25min for a data set with roughly 50.000 measurements.

Fortunately, it is a know issue and a quick search turned up various options to improve performance. Since my initial choice for plyr was based on its intuitive DSL, I was looking for something that is similarly nice, but a lot faster. This blog post is meant to briefly document my findings for myself, and remind me how to use the data.table package, I chose.

Plyr’s ddply can be extremely slow

In an earlier post, I already outlined briefly how I use R to process benchmarking results. So far, I used ddply mostly with the transform and summarize functions to normalize measurements and to calculate averages, standard deviation, or speedup factors.

The problematic examples are for instance the following normalization and aggregation operations:

process_data <- function(data, baseline) {
  norm <- ddply(data, ~ Benchmark + Cores,
                function(x) transform(x,
                                      RuntimeRatio = Time / mean(Time[VirtualMachine == baseline]),
                                      SpeedupRatio = mean(Time[VirtualMachine == baseline]) / Time))

  # Calculate basic aggregates
  stats <- ddply(norm, ~ Cores + Benchmark + VirtualMachine,
                 summarise,
                 Time.mean                 = mean(Time),
                 Time.stddev               =   sd(Time),

                 RuntimeRatio.mean         = mean(RuntimeRatio),
                 RuntimeRatio.geomean      = geometric.mean(RuntimeRatio),
                 SpeedupRatio.mean         = mean(SpeedupRatio),
                 SpeedupRatio.geomean      = geometric.mean(SpeedupRatio),

                 NumberOfMeasurements      = length(Time),

                 RuntimeRatio.stddev       = sd(RuntimeRatio),
                 RuntimeRatio.scaled_sd    = RuntimeRatio.stddev/RuntimeRatio.geomean,
                 RuntimeRatio.variance     = var(RuntimeRatio),
                 RuntimeRatio.median       = median(RuntimeRatio),
                 # RuntimeRatio.mean095Error = confInterval095Error(RuntimeRatio),
                 # RuntimeRatio.cnfIntHigh   = RuntimeRatio.mean + RuntimeRatio.mean095Error,
                 # RuntimeRatio.cnfIntLow    = RuntimeRatio.mean - RuntimeRatio.mean095Error,

                 SpeedupRatio.stddev       = sd(SpeedupRatio),
                 SpeedupRatio.scaled_sd    = SpeedupRatio.stddev / SpeedupRatio.geomean,
                 SpeedupRatio.variance     = var(SpeedupRatio),
                 SpeedupRatio.median       = median(SpeedupRatio),
                 # SpeedupRatio.mean095Error = confInterval095Error(SpeedupRatio),
                 # SpeedupRatio.cnfIntHigh   = SpeedupRatio.mean + SpeedupRatio.mean095Error,
                 # SpeedupRatio.cnfIntLow    = SpeedupRatio.mean - SpeedupRatio.mean095Error,

                 Time.median               = median(Time)
                 # Time.mean095Error         = confInterval095Error(Time),
                 # Time.cnfIntHigh           = Time.mean + Time.mean095Error,
                 # Time.cnfIntLow            = Time.mean - Time.mean095Error
  )

  list(norm = norm, stats = stats)
}

On the object-table data set form earlier experiments, the process_data(...) function takes roughly 2.1sec to execute. Thus, is as not particular fast, but still useable. However, on the recent data sets I was using, the runtime got a severe bottleneck.

Thus, I was looking into data.table to find an alternative. The result is a rewrite of the function to the process_data_dt(...) function below. The result is a roughly 20x speedup. On the sample data mentioned, the runtime goes from 2.1sec down to 0.2sec. In my particular case, the runtime went from 25min to 1.3min. That’s still slow, but much more usable.

process_data_dt <- function(data, baseline) {
  norm <- data.table(data)
  norm[, RuntimeRatio := Time / mean(Time[VirtualMachine == baseline]), by=list(Benchmark, Cores)]
  norm[, SpeedupRatio := mean(Time[VirtualMachine == baseline]) / Time, by=list(Benchmark, Cores)]

  # Calculate basic aggregates
  stats   <- norm[, list(Time.mean                 = mean(Time),
                         Time.stddev               =   sd(Time),
                         
                         RuntimeRatio.mean         = mean(RuntimeRatio),
                         RuntimeRatio.geomean      = geometric.mean(RuntimeRatio),
                         SpeedupRatio.mean         = mean(SpeedupRatio),
                         SpeedupRatio.geomean      = geometric.mean(SpeedupRatio),
                         
                         NumberOfMeasurements      = length(Time),
                         
                         RuntimeRatio.stddev       = sd(RuntimeRatio),
                         RuntimeRatio.variance     = var(RuntimeRatio),
                         # RuntimeRatio.mean095Error = confInterval095Error(RuntimeRatio),
                         
                         
                         RuntimeRatio.median       = median(RuntimeRatio),
                         
                         SpeedupRatio.stddev       = sd(SpeedupRatio),
                         SpeedupRatio.variance     = var(SpeedupRatio),
                         SpeedupRatio.median       = median(SpeedupRatio),
                         # SpeedupRatio.mean095Error = confInterval095Error(SpeedupRatio),
                         
                         Time.median               = median(Time)
                         # Time.mean095Error         = confInterval095Error(Time)
                      ),
                  by = list(Cores, Benchmark, VirtualMachine)]
  
  # stats[, RuntimeRatio.cnfIntHigh := RuntimeRatio.mean + RuntimeRatio.mean095Error]
  # stats[, RuntimeRatio.cnfIntLow  := RuntimeRatio.mean - RuntimeRatio.mean095Error]
  stats[, RuntimeRatio.scaled_sd  := RuntimeRatio.stddev / RuntimeRatio.geomean]
  stats[, SpeedupRatio.scaled_sd  := SpeedupRatio.stddev / SpeedupRatio.geomean]
  # stats[, SpeedupRatio.cnfIntHigh := SpeedupRatio.mean + SpeedupRatio.mean095Error]
  # stats[, SpeedupRatio.cnfIntLow  := SpeedupRatio.mean - SpeedupRatio.mean095Error]
  # stats[, Time.cnfIntHigh         := Time.mean + Time.mean095Error]
  # stats[, Time.cnfIntLow          := Time.mean - Time.mean095Error]
  
  list(norm=norm, stats=stats)
}

To conclude, data.table is a nice R package that is much faster than plyr’s ddply and as you can see in the example, the provided interface is almost as nice as the one provided by ddply.

Comments are closed.