Apache Arrow 为扁平与层级嵌套数据定义了一种语言无关的列式内存格式，能够进行有效的分析型操作。 Arrow 能够进行零拷贝数据分享与 RPC 数据传输；是一种统一的数据格式；并且支持内存分析和查询处理，是向量化计算友好的

基本术语

Array：一组字段值组成的序列，长度已知，并且所有类型都相同
Slot：Array 中的一个字段值
Buffer：具有给定长度的连续虚拟地址空间。buffer 中的任何字节可以通过 pointer+offset 访问到
Physical Layout：不考虑任何语义的 Array 内存布局
Primitive type：基础类型，不包含任何子类型。包含固定长度、变长度以及 null 类型
Nested type：嵌套类型，包含一个或者多个子类型
Logical type：考虑语义的类型，使用某种 physical layout 实现。如 Decimal 类型使用 16 字节的 fixed-size binary 存储

物理内存格式

Buffer

所有 data 部分的数据都是分配在对齐长度的字节缓冲区，使用对齐地址和填充使得分配的内存是 8 或者 64 字节的倍数，具有以下优点

保证数组元素随机访问能力
对齐长度方便加载到 SIMD 寄存器，便于在没有额外条件检查的情况下使用 SIMD 指令

Validity Bitmap

所有的 Array (包括嵌套 Array，除了 Struct)使用 bitmap 去标识每个 Slot 是否为空值。这也属于 data 的一部分，使用对齐长度的字节缓冲区分配空间，所以访问为

// 判断 Array 的第 j 个 slot 是否为空
is_valid[j] -> bitmap[j / 8] & (1 << (j % 8))

Validity Bitmap 从右往左读取：

          0  1    2   3   4    5
values = [0, 1, null, 2, null, 3]

bitmap
j mod 8   7  6  5  4  3  2  1  0
          0  0  1  0  1  0  1  1

不同类型的内存物理布局

Layout Type	Buffer 0	Buffer 1	Buffer 2
Primitive	validity	data
Variable Binary	validity	offsets	data
List	validity	offsets
Fixed-size List	validity
Struct	validity
Sparse Union	type ids
Dense Union	type ids	offsets
Null
Dictionary-encoded	validity	data (indices)

Fixed-size Layout

字段的比特宽度是固定的，即对应着基本类型，在做列式内存存储时，每个 Array 的 Slot 的字节长度都是相同的。Array 将数据存储在连续的 Value Buffer 上，长度为 Slot 宽度 * Array 长度并且加上填充。与之对应的还包括一个 Validity Bitmap，分配给 Validity Bitmap Buffer。

下面举个例子，对于 Int32 类型的一组数据 [1, null, 2, 4 ,8] 的内存物理布局：

* Length: 5, Null count: 1
* Validity bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  | 00011101                | 0 (padding)           |

* Value Buffer:

  |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
  |------------|-------------|-------------|-------------|-------------|-------------|
  | 1          | unspecified | 2           | 4           | 8           | unspecified |

Variable-size Binary Layout

字段的比特宽度是变长的，在做列式内存存储时，每个 Array 的 Slot 的字节长度可能不同。Array 将数据分配在 Offset Buffer 和 Data Buffer 上。Offset Buffer 包含 Array 长度+1 个有符号整数，表示 Data Buffer 中每个 slot 的起始位置。所以 Offset Buffer 中第一个元素是 0 而最后一个元素是 Array 的总字节长度。通过以下方式计算第 j 个 slot 的起始位置和长度

slot_position = offsets[j]
slot_length = offsets[j + 1] - offsets[j]  // (for 0 <= j < length)

Variable-size List Layout

一种嵌套字段，每个字段有数量不同且类型相同的子字段。语义上与 Variable-size Binary Layout 相同，包含 Offset Buffer 和 Data Buffer。

下面举个例子，对于 List<List<Int8>> 类型的一组数据 [[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]] 的内存物理布局：

* Length 3
* Nulls count: 0
* Validity bitmap buffer: Not required
* Offsets buffer (int32)

  | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
  |------------|------------|------------|-------------|-------------|
  | 0          |  2         |  5         |  6          | unspecified |

* Values array (`List<Int8>`)
  * Length: 6, Null count: 1
  * Validity bitmap buffer:

    | Byte 0 (validity bitmap) | Bytes 1-63  |
    |--------------------------|-------------|
    | 00110111                 | 0 (padding) |

  * Offsets buffer (int32)

    | Bytes 0-27           | Bytes 28-63 |
    |----------------------|-------------|
    | 0, 2, 4, 7, 7, 8, 10 | unspecified |

  * Values array (Int8):
    * Length: 10, Null count: 0
    * Validity bitmap buffer: Not required

      | Bytes 0-9                     | Bytes 10-63 |
      |-------------------------------|-------------|
      | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified |

如果我们需要 List[1] : List<List<Int8>>，那么查询步骤将是这样的

Offsets buffer 表示 List<List<Int8>> 元素在 Values Array(List<Int8>) 中的偏移量。可以查找到 List[1] 的起始偏移为 2，长度为 5-2=3
Values Array(List<Int8>) 的 Offset Buffer 表示 List<Int8> 元素在 Values array (Int8) 中的偏移量。由 List[1] 的起始偏移为 2，长度为 3，可以查到 List[1] 包含了三个元素，分别为 4~7、7~7、7~8
从 Values Array(Int8) 的 Data Buffer 中可以提取到 [[5, 6, 7], [], [8]]

Fixed-Size List Layout

一种嵌套字段，每个字段有相同数量且类型相同的子字段。语义上与 Fixed-size Layout 相同

下面举个例子，对于 FixedSizeList<byte>[4] 类型的一组数据 [[192, 168, 0, 12], null, [192, 168, 0, 25], [192, 168, 0, 1]] 的内存物理布局：

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  |--------------------------|-----------------------|
  | 00001101                 | 0 (padding)           |

* Values array (byte array):
  * Length: 16,  Null count: 0
  * validity bitmap buffer: Not required

    | Bytes 0-3       | Bytes 4-7   | Bytes 8-15                      |
    |-----------------|-------------|---------------------------------|
    | 192, 168, 0, 12 | unspecified | 192, 168, 0, 25, 192, 168, 0, 1 |

Struct Layout

一种嵌套布局，每个字段由一组命名的相同数量且类型不同的子字段组成。内存布局由 Validity Bitmap 和一组子字段 Array 组成，子字段 Array 的物理布局都是相互独立的。

下面举个例子，对于 Struct<VarBinary, Int32> 类型的一组数据 [{'joe', 1}, {null, 2}, null, {'mark', 4}] 的内存物理布局：

* Length: 4, Null count: 1
* Validity bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  | 00001011                | 0 (padding)           |

* Children arrays:
  * field-0 array (`VarBinary`):
    * Length: 4, Null count: 2
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001001                 | 0 (padding)           |

    * Offsets buffer:

      | Bytes 0-19     |
      |----------------|
      | 0, 3, 3, 3, 7  |

     * Values array:
        * Length: 7, Null count: 0
        * Validity bitmap buffer: Not required

        * Value buffer:

          | Bytes 0-6      |
          |----------------|
          | joemark        |

  * field-1 array (int32 array):
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001011                 | 0 (padding)           |

    * Value Buffer:

      |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63 |
      |------------|-------------|-------------|-------------|-------------|
      | 1          | 2           | unspecified | 4           | unspecified |

Union Layout

一种嵌套布局，每个字段由一个类型不同的子字段组成，每个 Slot 所包含的类型在预先定义的类型序列中选择。Union 包含 dense 和 sparse 两种类型。

Dense Union

其物理布局如下：

每个类型都有一个子 Array
Type buffer：一组 type id 的缓冲区，每个 type id 用一个 8-bit 有符号整数表示
Offsets buffer：一组 int32 值的缓冲区，表示在子 Array 中的偏移量

下面举个例子，对于Union<f: float, i: int32> 类型的一组数据 [{f=1.2}, null, {f=3.4}, {i=5}] 的物理布局如下：

* Length: 4, Null count: 0
* Types buffer:

  |Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63  |
  |---------|-------------|----------|----------|-------------|
  | 0       | 0           | 0        | 1        | unspecified |

* Offset buffer:

  |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
  |----------|-------------|------------|-------------|-------------|
  | 0        | 1           | 2          | 0           | unspecified |

* Children arrays:
  * Field-0 array (f: float):
    * Length: 2, Null count: 1
    * Validity bitmap buffer: 00000101

    * Value Buffer:

      | Bytes 0-11     | Bytes 12-63  |
      |----------------|-------------|
      | 1.2, null, 3.4 | unspecified |


  * Field-1 array (i: int32):
    * Length: 1, Null count: 0
    * Validity bitmap buffer: Not required

    * Value Buffer:

      | Bytes 0-3 | Bytes 4-63  |
      |-----------|-------------|
      | 5         | unspecified |

如上所示，每个子 Array 都是密集排列的，并且长度都不一定一致。

Sparse Union

与 Dense Union 物理布局相同，但是所有的子 Array 长度都相同，所以不要使用 Offset Buffer。很明显子 Array 长度相同需要浪费更多的内存，但是相比于 Dense Union 在某些场景下更向量化计算友好。

下面举个例子，对于Union<f: float, i: int32> 类型的一组数据 [{f=1.2}, null, {f=3.4}, {i=5}] 的物理布局如下：

* Length: 4, Null count: 0
* Types buffer:

  |Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63  |
  |---------|-------------|----------|----------|-------------|
  | 0       | 0           | 0        | 1        | unspecified |

* Offset buffer:

  |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
  |----------|-------------|------------|-------------|-------------|
  | 0        | 1           | 2          | 0           | unspecified |

* Children arrays:
  * Field-0 array (f: float):
    * Length: 2, Null count: 1
    * Validity bitmap buffer: 00000101

    * Value Buffer:

      | Bytes 0-11     | Bytes 12-63  |
      |----------------|-------------|
      | 1.2, null, 3.4 | unspecified |


  * Field-1 array (i: int32):
    * Length: 1, Null count: 0
    * Validity bitmap buffer: Not required

    * Value Buffer:

      | Bytes 0-3 | Bytes 4-63  |
      |-----------|-------------|
      | 5         | unspecified |

如上所示，每个子 Array 都是稀疏排列的，长度与行数相同。

Null Layout

不会分配内存

Dictionary-encoded Layout

字典编码是一种数据表示技术，通过使用字典保存唯一的数值，来对数据进行压缩，当重复值很多时很有效。其内存布局与整型物理布局类似，字典通过列式数组存储。

下面举个例子，对于 VarBinary 类型的一组数据 ['foo', 'bar', 'foo', 'bar', null, 'baz']，使用字典编码后，效果如下

data VarBinary (dictionary-encoded)
   index_type: Int32
   values: [0, 1, 0, 1, null, 2]

dictionary
   type: VarBinary
   values: ['foo', 'bar', 'baz']

序列化与 IPC

序列化数据的基本单元为 recordBatch。从语义上说，一个 recordBatch 代表了有序 Array 的集合，一个 Array 代表一列的数据。Arrow 的序列化和反序列化可以做到不使用额外的内存拷贝。

列式 IPC 协议有以下三种类型，具体格式定义在Message.fbs中

Schema
ReocrdBatch
DictionaryBatch

Schema

表示 RecordBatch 的 Schema，使用 Flatbuffers 进行序列化，定义在 Schema.fbs中

RecordBatch

包含实际数据的 Data Buffers，定义在Message.fbs中，具体格式如下

length：行数
FieldNode：每一个字段的元信息
- length
- null_count
Buffer：每一个字段的 Buffer 信息
- offset
- length

DictionaryBatch

包含用于编码的字典，定义在Message.fbs中，具体格式如下

Id：字典编号
RecordBatch：记录字典值
isDelta：对于相同 id 的字典，使用 append 模式或者 replace 模式

REFERENCE

Arrow 文档

文档信息

本文作者：wzx
本文链接：https://masterwangzx.com/2021/12/08/arrow/
版权声明：自由转载-非商用-非衍生-保持署名（创意共享3.0许可证）

WZX's Blog

Apache Arrow overview