Installation of R Packages That Require Compilation
1 Preparation
To compile R packages from source, you need the following tools installed on your system:
- Xcode Command Line Tools: These include the compilers and build tools needed to compile R packages from source. You can install them by running the following command in your terminal. If the tools are not installed yet, the command prompts you to install them; if they are already present (for example, because you have Xcode installed), it reports that instead and you can skip this step.
xcode-select --install
- Homebrew: This is a package manager for macOS that makes it easy to install and manage software packages. If you don’t have Homebrew installed, you can install it by running the following command in your terminal. You can also check the official Homebrew installation guide for more details.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
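Before moving on, it can help to confirm that the tools are actually visible on your PATH. The following is a minimal sketch, not an official check: the helper name check_tool is hypothetical, and the list of tools is an assumption about what a typical build needs.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of Xcode or Homebrew.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Compiler and build tool from the Command Line Tools, plus Homebrew.
check_tool clang
check_tool make
check_tool brew
```

Any line that ends in "missing" points at the preparation step that still needs to be done.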
2 data.table
Sometimes a macOS upgrade updates the default compiler (clang) shipped with the Xcode Command Line Tools, which can cause compilation issues for certain R packages such as data.table. For example, an issue arose with the macOS Sequoia beta, where the updated clang version caused compilation failures (see Rdatatable/data.table#6622).
If you encounter such compilation problems after a macOS update, a potential workaround is to install an older version of the compiler using Homebrew, such as LLVM 16:
brew install llvm@16
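You can then set the CC and CXX environment variables to point to the older compiler before installing the package. The paths below assume Homebrew's Intel prefix (/usr/local); on Apple Silicon the default prefix is /opt/homebrew, and running brew --prefix llvm@16 prints the actual location on your system:
export CC=/usr/local/opt/llvm@16/bin/clang
export CXX=/usr/local/opt/llvm@16/bin/clang++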
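Rather than hard-coding the path, you can ask Homebrew for it. This is a sketch under assumptions: the variable name LLVM_PREFIX is made up for illustration, and the fallback path is only correct on machines where Homebrew lives under /usr/local.

```shell
#!/bin/sh
# Ask Homebrew where llvm@16 lives. If brew is unavailable, fall back to the
# Intel default path (assumption: that fallback matches your machine).
LLVM_PREFIX="$(brew --prefix llvm@16 2>/dev/null || echo /usr/local/opt/llvm@16)"

export CC="$LLVM_PREFIX/bin/clang"
export CXX="$LLVM_PREFIX/bin/clang++"

# Warn instead of failing if the compiler is not actually there.
if [ -x "$CC" ]; then
    "$CC" --version | head -n 1
else
    echo "warning: $CC not found; is llvm@16 installed?" >&2
fi
```

Printing the compiler version makes it easy to see whether the older clang is the one being picked up.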
2.1 Introduction to data.table
data.table is a high-performance extension of R’s data.frame that provides a concise, consistent, and efficient syntax for data manipulation. It is particularly optimized for large datasets and can offer significant performance improvements over base R and the tidyverse’s dplyr.
2.2 Why data.table is superior to dplyr
- Speed: data.table is typically faster than dplyr, especially on large datasets, thanks to its C implementation and sophisticated optimization techniques.
- Memory efficiency: data.table operations are typically performed in place, which reduces memory overhead compared to dplyr’s copy-on-modify approach.
- Concise syntax: Complex operations can be expressed in a single line of code using data.table’s [i, j, by] syntax, which is more compact than dplyr’s pipe-based approach.
- Advanced features: data.table offers powerful features such as rolling joins, non-equi joins, and specialized grouped operations that are not as easily accessible in dplyr.
2.3 Installation notes
If you install the data.table package directly from CRAN on macOS, you get a binary package: it is pre-compiled and does not need to be compiled on your machine. However, this does not always give the best performance. The biggest disadvantage of the binary version is that on macOS it is usually built without OpenMP, a parallel programming model that can significantly speed up computations, so you lose the benefit of using multiple CPU cores.
To install the data.table package from source, follow these steps:
- Preparation: Follow the preparation steps in Section 1 to install the Xcode Command Line Tools and Homebrew.
- Install OpenMP: Install the libomp package using Homebrew, which provides OpenMP support:
brew install libomp
- Customize Makevars: Create or edit the ~/.R/Makevars file so that it includes the following lines:
# ~/.R/Makevars
CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp
  - CPPFLAGS: Additional flags for the C preprocessor, which R passes when compiling both C and C++ sources. The -Xclang -fopenmp flags tell clang to enable OpenMP support.
  - LDFLAGS: Additional flags for the linker. The -lomp flag tells the linker to link against the OpenMP runtime library.
  - If you don’t have a ~/.R/Makevars file yet, you can create it with the following command:
mkdir -p ~/.R && touch ~/.R/Makevars
- Install data.table: Once the steps above are done, install the data.table package from source by running the following command in R:
install.packages("data.table", type = "source")
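The Makevars step above can also be scripted so that running it twice does not duplicate lines. This is a minimal sketch: append_once and the MAKEVARS variable are hypothetical names introduced here, and by default the script writes to a scratch file rather than to your real ~/.R/Makevars.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# append_once is a hypothetical helper introduced for this example.
append_once() {
    file="$1"
    line="$2"
    mkdir -p "$(dirname "$file")"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Default to a scratch file so nothing is changed by accident.
# Set MAKEVARS="$HOME/.R/Makevars" to edit the real file.
MAKEVARS="${MAKEVARS:-${TMPDIR:-/tmp}/Makevars.example}"

append_once "$MAKEVARS" "CPPFLAGS += -Xclang -fopenmp"
append_once "$MAKEVARS" "LDFLAGS += -lomp"
```

If compilation fails because omp.h or the OpenMP library cannot be found, one possible cause is that Homebrew did not link libomp into its default include and library directories. In that case, adding -I and -L flags that point to the include and lib directories under the path printed by brew --prefix libomp may help. After installation, calling data.table::getDTthreads(verbose = TRUE) in R reports whether OpenMP is in use.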
3 qs2
3.1 Introduction to qs2
qs2 is a package for fast serialization and deserialization of R objects. It is particularly useful for saving and loading large datasets quickly, which makes it a good choice for data-intensive applications.
I used to use the fst package for serialization and deserialization, but I found qs2 superior: fst can only save data frames, while qs2 can save any R object.
3.2 Installation notes
To install the qs2 package from source, follow these steps:
- Preparation: Follow the preparation steps in Section 1 to install the Xcode Command Line Tools and Homebrew.
- Install TBB: Install the tbb package using Homebrew. It provides Intel’s Threading Building Blocks (TBB), a library for parallel programming that qs2 requires to enable parallel serialization and deserialization.
brew install tbb
- Add the TBB environment variables: Add the following lines to your ~/.zshrc file so that qs2 can find the TBB library during installation. The path /usr/local/opt/tbb is only an example; replace it with the actual TBB location on your system, which you can find by running brew --prefix tbb. Open a new terminal (or run source ~/.zshrc) for the change to take effect.
export TBB="/usr/local/opt/tbb"
export TBB_INC="$TBB/include"
export TBB_LIB="$TBB/lib"
- Install qs2: Once the steps above are done, install the qs2 package from source by running the following command in R. See the official qs2 installation guide for details.
install.packages("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")
  - --with-TBB: Tells qs2 to use the TBB library for parallel serialization and deserialization.
  - --with-simd=AVX2: Tells qs2 to use AVX2 SIMD (Single Instruction, Multiple Data) instructions for further performance optimization. AVX2 is an x86 instruction set extension that lets a single instruction operate on multiple data elements at once, which can significantly speed up computations. It is not available on Apple Silicon (ARM) Macs.
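Because AVX2 exists only on x86 processors, the configure arguments may need to differ between Intel and Apple Silicon Macs. The sketch below chooses them from the architecture reported by uname -m. The function name qs2_configure_args is hypothetical, and dropping the SIMD flag on non-x86 machines is an assumption rather than something taken from the qs2 documentation; note also that not every x86_64 CPU supports AVX2.

```shell
#!/bin/sh
# Pick configure arguments for qs2 from a CPU architecture string.
# Assumption: AVX2 is requested only on x86_64, since it is an x86 extension.
qs2_configure_args() {
    case "$1" in
        x86_64) echo "--with-TBB --with-simd=AVX2" ;;
        *)      echo "--with-TBB" ;;
    esac
}

ARGS="$(qs2_configure_args "$(uname -m)")"

# Print the R command to run; nothing is installed by this script.
echo "install.packages(\"qs2\", type = \"source\", configure.args = \"$ARGS\")"
```

The script only prints the install.packages call, so you can review it before pasting it into R.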