LoomR

Posted on 2019-10-12 Edited on 2023-12-22 In 单细胞

loom 文件逐渐被广泛用于单细胞数据存储，Seurat Team 研发了 LoomR 包用于处理 loom 文件，并与 Seurat 包无缝接合。本文翻译了 Seurat 网站上对于 LoomR 包的介绍。原文发表于 2018-04-22。

关于 loom 文件

单细胞数据量越来越大，计算需求也以指数形式增长。Seurat Team 发现，尽管使用 sparse matrices，对于分析 > 100,000 数目的细胞，也是一个严峻挑战，主要的困难是在内存中加载存储单细胞数据。HDF5数据格式可以解决这个问题，它不是在内存中存储数据，而是在硬盘上进行有效的数据存储，因而也适用于百万级的细胞数目。

Linnarson lab 开发了loom数据格式，它基于 HDF5 数据格式，可轻松存储单细胞基因组数据和元数据。他们还发布了python API - Lommpy，来处理loom文件。

Seurat Team 开发了与Loompy互补的一个 R 包 - LoomR。LoomR 允许在 R 环境中，加载和处理loom文件。本教程会简单介绍 LoomR 的安装、与 LoomR 对象的交互和如何充分利用 LoomR 中的内置的chunking mechanisms。最后，介绍如何在 Seurat workflow 中进行初始化，使得 Seurat 可以兼容 loom 文件, 并在将来希望可以让 Seurat 与HDF5完全兼容。

loomR

Github release: https://github.com/mojaveazure/loomR

CRAN release: forthcoming

安装

前提条件：

C++ HDF5 library 和 devtools

C++ HDF5 library

操作系统	命令
macOS	brew install hdf5
Debian and Debian-based OSes	sudo apt install libhdf5-dev
Red Hat-based OSes	sudo dnf install hdf5-devel or sudo yum install hdf5-devel
Windows	Download precombiled binaries from Mario Annau here

关于loom的类字段和方法： ?loomR::loom

# Install devtools from CRAN
install.packages("devtools")
# Use devtools to install hdf5r and loomR from GitHub
devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")

1 2	# Load loomR library(loomR)

关于 loom objects

loomR 通过基于R6的 loom类，实现loom API。与 R6 objects 的交互与S4 Object（像 Seurat Object）略微不同。简单来说，R6 fields可使用$访问，而不是@，比如my.object$field.name。R6 methods可被直接调用（并修改Object）,比如my.object$method()。

与 loom objects 建立关联

与标准的 R objects 不同，loom objects 仅仅是与硬盘上的文件建立联系，而不需要把所需要的数据加载到内存中，这样就会有很小的内存占用。可以用connect关联 loom 文件（例子:pbmc）、用loomR::create创建表达矩阵、或用Convert将已有的Seurat Object转换为 loom 文件。

1
2
3

# Connect to the loom file in read/write mode
lfile <- connect(filename = "pbmc.loom", mode = "r+")
lfile

## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET 2700 x 13714          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

loom Object 容器包括 6 个亚 Object：1 个数据（matrix）和 5 个组（layers, row_attrs, col_attrs, row_graphs, and col_graphs）更多详情

loom 文件结构

loom 文件，无非是一个强加了严格结构的 HDF5 文件。这个结构有助于保持一个无序二进制文件的一致性，并为哪个数据是是什么提供安全性。下面是此文件结构和调用每个数据规则的一个总结。

matrix : loom 文件结构的“根”,二维（n 基因和 m 细胞）

layers ：matrix数据的可选代表（映射），与matrix有相同的维度

row_attrs 和 col_attrs ：分别是行 (基因) 和列 (细胞)的元数据，肯定是一维或二维的数据，row_attrs第一个维度是 n，col_attrs的第一个维度是m。

row_graphs 和 col_graphs：坐标轴格式的稀疏散点图，每张图都是一组三个等长的数据：a (行索引)、b （列索引）和 w（值）

与 loom objects 的交互

两种方式：

使用double subset [[ operator ]] 或 $ 符号。

1
2
3

# Viewing the `matrix` dataset with the double subset [[ operator You can
# also use the $ sigil, i.e. lfile$matrix
lfile[["matrix"]]

## Class: H5D
## Dataset: /matrix
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700 x 13714     Maxdims=Inf x Inf
## Chunk: 32 x 32

查看组内数据，使用[[ operator ]]或 $ 符号。

1
2
3

# Viewing a dataset in the 'col_attrs' group with the double subset [[
# operator and full UNIX-style path
lfile[["col_attrs/cell_names"]]

## Class: H5D
## Dataset: /col_attrs/cell_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024

1 2	# Viewing a dataset in the 'row_attrs' group with S3 $ chaining lfile$row.attrs$gene_names

## Class: H5D
## Dataset: /row_attrs/gene_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=13714     Maxdims=Inf
## Chunk: 1024

注意，以上命令仅返回存储数据的描述。获取数据本身，须在描述后面添加[ operator ]。

获取数据需要整数索引，或者空的[]来加载全部数据（[,]二维数据），此时，数据才被加载到内存中。这允许仅加载所需数据到内存中。

# Access the upper left corner of the data matrix
lfile[["matrix"]][1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

# Access the full data matrix (here, using the $ instead of the [[ operator
# to access the matrix)
full.matrix <- lfile$matrix[, ]
dim(x = full.matrix)
## [1]  2700 13714

# Access all gene names
gene.names <- lfile[["row_attrs/gene_names"]][]
head(x = gene.names)
## [1] "AL627309.1"    "AP006222.2"    "RP11-206L10.2" "RP11-206L10.9"
## [5] "LINC00115"     "NOC2L"

get.attribute.df

由于元数据存储在loom文件中的多个数据集中，每个 loom Object 会有一个get.attribute.df方法:把多种元数据集组织成一个 data frame，以便使用。

# Pull three bits of metadata from the column attributes
attrs <- c("nUMI", "nGene", "orig.ident")
attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
head(x = attr.df)
##                nUMI nGene    orig.ident
## AAACATACAACCAC 2419   779 SeuratProject
## AAACATTGAGCTAC 4903  1352 SeuratProject
## AAACATTGATCAGC 3147  1129 SeuratProject
## AAACCGTGCTTCCG 2639   960 SeuratProject
## AAACCGTGTATGCG  980   521 SeuratProject
## AAACGCACTGGTAC 2163   781 SeuratProject

loomR 中的 matrices

Matrix Transposition in loomR

为了保证最大的I/O速度， HDF5 library会转置 matrix 数据。这意味着，列是基因，行是细胞。提升的速度值得去这样做，并且 Seuat team 已经采取了防范措施，会对使用者进行正确的转置。但是，当直接对 data matrices 进行交互时，仍需注意！为了提升loomR 的兼容性，row.attrs仍指基因，col.attrs为细胞。

# Print the number of genes
lfile[["row_attrs/gene_names"]]$dims
## [1] 13714

# Is the number of genes the same as the second dimension (typically
# columns) for the matrix?
lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]
## [1] TRUE

# For the sake of consistency within the single-cell community, we've
# reversed the dimensions for the `shape` field.  As such, the number of
# genes is stored in `lfile$shape[1]`; the number of cells is stored in the
# second field
lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]
## [1] TRUE`

获取部分基因（或细胞）的数据：

# Pull gene expression data for all genes, for the first 5 cells Note that
# we're using the row position for cells
data.subset <- lfile[["matrix"]][1:5, ]
dim(x = data.subset)
## [1]     5 13714

# You can transpose this matrix if you wish to restore the standard
# orientation
data.subset <- t(x = data.subset)
dim(x = data.subset)
## [1] 13714     5

# Pull gene expression data for the gene MS4A1 Note that we're using the
# column position for genes
data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
head(x = data.gene)
## [1] 0 6 0 0 0 0

loom objects 添加数据

添加层数据、基因元数据和细胞元数据到 loom object。对应的方法分别是add.layer, add.row.attribute, 和add.col.attribute，数据为矩阵或向量。比如，添加 ENSEMBL IDs 到基因元数据中：

# Generate random ENSEMBL IDs for demonstration purposes
ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
# Use add.row.attribute to add the IDs Note that if you want to overwrite an
# existing value, set overwrite = TRUE
lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
lfile[["row_attrs"]]
## Class: H5Group
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Group: /row_attrs
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##  ensembl.id H5I_DATASET        13714         H5T_STRING
##  gene_names H5I_DATASET        13714         H5T_STRING

# Find the ENSEMBL ID for TCL1A
lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]
## [1] "ENSG00009584"

基于 Chunk 的迭代

为 loom pbject 内置了chunking方法，对于较大的数据，可以以小块形式加载数据。为了优化内存使用，loomR 提供 map 和 apply 方法，可当数据分批次读入时进行计算。在后台，使用会使用batch.scan和batch.next方法，进行低水平的小块数据计算。每一种方法都会处理 loom 文件的全部数据，包括row_attrs 和 col_attr中的一维数据。

map 方法可以将全部数据分批次加载到内存中，从而实现对所有的基因或细胞进行运算，最终的分析结果会放到内存中。比如计算每个细胞总的 UMI 数目：

# Map rowSums to `matrix`, using 500 cells at a time, returning a vector
nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix", 
    display.progress = FALSE)
# How long is nUMI_map? It should be equal to the number of cells in
# `matrix`
length(x = nUMI_map) == lfile$matrix$dims[1]
## [1] TRUE

MARGINs

loom object 的许多方法都有MARGIN参数：告诉 loom 文件，对哪个维度进行反复迭代、添加或者抓取数据。为了与 scRNAseq 的 R 分析包保持一致，MARGIN=1表示行或基因，MARGIN=2表示列或细胞。这同样适用于 loom object 的shape字段，index 1 表示基因数目， index 2 表示细胞数目。

apply 同map方法类似，除了apply会把结果保存到 loom 文件中，而不是返回给用户。因此，它会使用更少的内存，仅仅占用当前迭代的内存。要把结果保存到硬盘上，需要提供名字，命名必须是完整 UNIX - 风格的路径名。

# Apply rowSums to `matrix`, using 500 cells at a time, storing in
# `col_attrs/umi_apply`
lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500, 
    dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
lfile$col.attrs$umi_apply
## Class: H5D
## Dataset: /col_attrs/umi_apply
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024

# Ensure that all values are the same as doing a non-chunked calculation
all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))
## [1] TRUE

Selective-chunking

map 和apply 方法均支持处理特定值（不是特定数据集中的全部值），比如使用apply 去scale特定基因的表达值（而不是全部基因）。实现方法：传递一个整数向量给index.use参数。对于map，返回一个与index.use等长的向量或矩阵，对于apply，由于所有 loom object 的数据必须要有相等的大小，被跳过的index会返回NAs。

关闭 loom object

由于闭 loom object 是对于文件的一个链接，必须要关掉，才能保证数据被正确的写入到硬盘上。使用close_all方法。

1	lfile$close_all()

loomR 和 Seurat

loom 文件正在逐渐被 seurat 所兼容。

1	library(Seurat)

seurat 与 loom 文件的转换

Seurat 提供转换工具。seurat object to loom file：

# Load the pbmc_small dataset included in Seurat
data("pbmc_small")
pbmc_small
## An object of class seurat in project SeuratProject 
##  230 genes across 80 samples.

# Convert from Seurat to loom Convert takes and object in 'from', a name of
# a class in 'to', and, for conversions to loom, a filename
pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom", 
    display.progress = FALSE)
pfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc_small.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET     80 x 230          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

Seurat standard workflow

使用 seurat object 找到 variable genes：

# Normalize data, and find variable genes using the typical Seurat workflow
pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE, 
    do.plot = FALSE)
head(x = pbmc_small@hvg.info)

##         gene.mean gene.dispersion gene.dispersion.scaled
## PCMT1    3.942220        7.751848               2.808417
## PPBP     5.555949        7.652876               1.216898
## LYAR     4.231004        7.577377               1.528749
## VDAC3    4.128322        7.383980               1.296982
## KHDRBS1  3.562833        7.367928               2.476809
## IGLL5    3.758330        7.319567               2.018088

然后使用 loom object：

# Run the same workflow using the loom object
NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
# Normalized data goes into the 'norm_data' layer, variable gene information
# goes into 'row_attrs' Are the results equal?
par(mfrow = c(1, 2))
plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data", 
    xlab = "loom", ylab = "Seurat")
plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[], 
    "gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")

最后切记：

1	pfile$close_all()