栈buffer对比堆buffer

2024-07-29 约 3994 字预计阅读 8 分钟

摘要: 作者通过对比栈和堆内存作为缓冲区的性能，采用不同访问方式（数组、指针）和预分配策略，结合benchmark测试和汇编分析，发现内存大小和分配频率对性能影响显著，栈内存不一定比堆内存快，频繁分配会导致堆性能下降，而预分配能改善表现。指针++访问未必比[]访问慢，性能差异受内存访问方式和边界检查影响。作者强调实验验证的重要性，指出实际环境中性能表现可能不同，建议根据具体场景选择内存管理策略，避免盲目信赖书本知识。 (评价: A)

我印象中一直有这样一个观点：栈内存比堆内存更快。但是自从尝试《十亿行挑战》后，便对其怀疑。本文将通过实验调查堆内存和栈内存用作buffer时表现的差异及其原因。

注意

具体问题需要具体分析，需要根据代码情况验证，不能一概而论。本文会通过benchmark和汇编代码分析来验证“栈内存不一定比堆内存快”这一观点。

验证case

栈buffer使用std::array，堆buffer使用std::unique_ptr。buffer定义如下：

1
2
3


using ele_t = uint32_t;
constexpr uint32_t max_stack_size = 1024 * 1024 * 4;
constexpr uint32_t buffer_size = max_stack_size / sizeof(ele_t);

最多4MB的buffer，这是考虑了栈的大小限制，一般是8MB。为什么不直接使用数组或raw指针？主要是考虑C++中我们一般会使用std::array和std::unique_ptr（或其他智能指针）来使用栈和堆内存，所以这里也使用这两种方式。另外，下面的对比中也可以分析出纯数组和raw指针的性能。

栈buffer通过[]访问，也对比一下分配是否耗时（分析：栈空间是程序启动时就分配好的，所以最多占用构造函数时间）：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


static void bench_stack_memory_normal_access(benchmark::State &state) {
  std::array<ele_t, buffer_size> arr;
  for (auto _ : state) {
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_stack_memory_normal_access);

static void
bench_stack_memory_normal_access_with_alloc(benchmark::State &state) {
  for (auto _ : state) {
    std::array<ele_t, buffer_size> arr;
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_stack_memory_normal_access_with_alloc);

栈buffer通过指针访问

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


static void bench_stack_memory_pointer_access(benchmark::State &state) {
  std::array<ele_t, buffer_size> arr;
  for (auto _ : state) {
    ele_t *ptr = &(arr[0]);
    for (uint32_t idx = 0; idx < buffer_size; idx++, ptr++) {
      *ptr = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_stack_memory_pointer_access);

栈buffer通过指针的[]访问

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


static void bench_stack_memory_pointer_normal_access(benchmark::State &state) {
  std::array<ele_t, buffer_size> arr;
  for (auto _ : state) {
    ele_t *ptr = &(arr[0]);
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      ptr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_stack_memory_pointer_normal_access);

堆buffer通过[]访问，也对比一下频繁分配是否耗时

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


static void bench_heap_memory_normal_access(benchmark::State &state) {
  auto arr = std::make_unique<ele_t[]>(buffer_size);
  for (auto _ : state) {
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_heap_memory_normal_access);

static void
bench_heap_memory_normal_access_with_alloc(benchmark::State &state) {
  for (auto _ : state) {
    auto arr = std::make_unique<ele_t[]>(buffer_size);
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_heap_memory_normal_access_with_alloc);

堆buffer通过指针访问

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


static void bench_heap_memory_pointer_access(benchmark::State &state) {
  auto arr = std::make_unique<ele_t[]>(buffer_size);
  for (auto _ : state) {
    ele_t *ptr = arr.get();
    for (uint32_t idx = 0; idx < buffer_size; idx++, ptr++) {
      *ptr = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_heap_memory_pointer_access);

堆buffer通过指针的[]访问：智能指针的[]操作符会检查边界，然后通过get()获取指针，再通过指针访问；这里不考虑边界的影响，直接通过get()获取指针，然后通过指针访问，以对比差异。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


static void bench_heap_memory_pointer_normal_access(benchmark::State &state) {
  auto arr = std::make_unique<ele_t[]>(buffer_size);
  for (auto _ : state) {
    ele_t *ptr = arr.get();
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      ptr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_heap_memory_pointer_normal_access);

static void bench_heap_memory_get_normal_access(benchmark::State &state) {
  auto arr = std::make_unique<ele_t[]>(buffer_size);
  for (auto _ : state) {
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr.get()[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_heap_memory_get_normal_access);

预分配堆buffer通过[]或指针访问

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


static auto preload_heap_buffer = std::make_unique<ele_t[]>(buffer_size);
static void bench_preload_heap_memory_normal_access(benchmark::State &state) {
  for (auto _ : state) {
    auto &arr = preload_heap_buffer;
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      arr[idx] = idx & 0x1;
      benchmark::DoNotOptimize(arr);
    }
  }
}
BENCHMARK(bench_preload_heap_memory_normal_access);

static void bench_preload_heap_memory_pointer_access(benchmark::State &state) {
  for (auto _ : state) {
    ele_t *ptr = preload_heap_buffer.get();
    for (uint32_t idx = 0; idx < buffer_size; idx++, ptr++) {
      *ptr = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_preload_heap_memory_pointer_access);

benchmark结果

命名规则，bench_[preload_]{stack|heap}_memory_{normal|pointer|pointer_normal}_access[_with_alloc]，其中preload表示预分配，stack表示栈内存，heap表示堆内存，normal表示通过[]访问，pointer表示通过指针访问，pointer_normal表示通过指针的[]访问，_with_alloc表示是否频繁分配。

编译参数为-std=c++17 -O3，需要开启优化，然后调用顺序随机，通过名称排序输出结果：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


bench_heap_memory_get_normal_access             248895 ns       248892 ns         2801
bench_heap_memory_normal_access                 241877 ns       241869 ns         2850
bench_heap_memory_normal_access_with_alloc      331070 ns       331075 ns         2033
bench_heap_memory_pointer_access                270874 ns       270867 ns         2443
bench_heap_memory_pointer_normal_access         218123 ns       218117 ns         3254
bench_preload_heap_memory_normal_access         241593 ns       241598 ns         2839
bench_preload_heap_memory_pointer_access        275220 ns       275222 ns         2532
bench_stack_memory_normal_access                215381 ns       215376 ns         3252
bench_stack_memory_normal_access_with_alloc     216673 ns       216680 ns         3275
bench_stack_memory_pointer_access               274659 ns       274668 ns         2595
bench_stack_memory_pointer_normal_access        213314 ns       213304 ns         3248

一般把5%差异以内（高中物理上我们一般都这样）认为无明显差异，将上述结果按照速度从快到慢分成几个梯度：

T0:

1
2
3
4


bench_stack_memory_normal_access_with_alloc     216673 ns       216680 ns         3275
bench_heap_memory_pointer_normal_access         218123 ns       218117 ns         3254
bench_stack_memory_normal_access                215381 ns       215376 ns         3252
bench_stack_memory_pointer_normal_access        213314 ns       213304 ns         3248

T1:

1
2
3


bench_heap_memory_normal_access                 241877 ns       241869 ns         2850
bench_preload_heap_memory_normal_access         241593 ns       241598 ns         2839
bench_heap_memory_get_normal_access             248895 ns       248892 ns         2801

T2:

1
2
3


bench_stack_memory_pointer_access               274659 ns       274668 ns         2595
bench_preload_heap_memory_pointer_access        275220 ns       275222 ns         2532
bench_heap_memory_pointer_access                270874 ns       270867 ns         2443

T3:

1

bench_heap_memory_normal_access_with_alloc      331070 ns       331075 ns         2033

多次验证也满足这样的分类结果，此处就不再展示多次统计的数据结果了，仅以上结果来分析。

先看最慢的T3，只有一项，其对比项是bench_stack_memory_pointer_access，分类在T1，与之差异就是会频繁分配和构造内存。在《glibc-malloc源码阅读》系列](/202203/glibc-malloc)中已经提到，较大的内存会通过mmap分配。他会被真正的频繁分配和释放+构造所影响，所以速度慢。

类似的，对比栈上的bench_stack_memory_normal_access_with_alloc，尽管看起来也会频繁分配，但是栈空间是程序启动时就分配好的，所以不会有频繁分配的问题，所以速度比较快。

T2的三项都揭示，通过指针++访问会比通过[]访问慢，如何理解这个现象？一般分析来说，指针++和下标访问都需要一次“位置计算”的操作，差异会很大吗？具体原因需要通过下一节的汇编分析来解释。

T1的三项都是通过[]访问，且都是堆内存。T1、T2也说明对于不频繁分配内存的场景，预分配和使用时分配的差异并不大。不过也要针对实际情况，如果一次性需要几百MB/GB的内存，那么还是会倾向于预分配。

T0的几项，堆内存是通过raw指针的[]访问，栈内存除了raw指针的[]访问，还有直接[]访问。上面也提到栈内存频繁分配对速度影响并不大。这里来看，std::array的直接下标访问也比智能指针的下标访问快。这部分也可以直接通过这两个类的实现找到原因：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22


// std::array的[]访问
// Element access.
_GLIBCXX17_CONSTEXPR reference
operator[](size_type __n) noexcept
{ return _AT_Type::_S_ref(_M_elems, __n); }

constexpr const_reference
operator[](size_type __n) const noexcept
{ return _AT_Type::_S_ref(_M_elems, __n); }

static constexpr _Tp&
_S_ref(const _Type& __t, std::size_t __n) noexcept
{ return const_cast<_Tp&>(__t[__n]); }

// std::unique_ptr的[]访问
/// Access an element of owned array.
  typename std::add_lvalue_reference<element_type>::type
  operator[](size_t __i) const
  {
__glibcxx_assert(get() != pointer());
return get()[__i];
  }

可以看到，std::array的[]访问相当于直接访问的raw指针，而std::unique_ptr的[]访问会先做检查然后再访问。这样来看，频繁的__glibcxx_assert(get() != pointer());检查，对速度影响还是比较大的。

汇编差异

这一节需要分析的问题是，为什么指针++访问会比[]访问慢。以bench_heap_memory_pointer_normal_access和bench_heap_memory_pointer_access为例。

bench_heap_memory_pointer_normal_access内存访问部分：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


.L5:
        mov     ecx, eax
        and     ecx, 1
        mov     DWORD PTR [rdx+rax*4], ecx
        add     rax, 1
        cmp     rax, 1048576
        jne     .L5
        test    rsi, rsi
        jle     .L18
        sub     rsi, 1
        jne     .L7
        jmp     .L4

mov ecx, eax，idx赋值给ecx
and ecx, 1，idx与1做与操作，即idx & 0x1
mov DWORD PTR [rdx+rax*4], ecx，将idx & 0x1赋值给ptr[idx]
add rax, 1，idx++

bench_heap_memory_pointer_access内存访问部分：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


.L23:
        mov     ecx, edx
        and     ecx, 1
        mov     DWORD PTR [rax], ecx
        add     edx, 1
        add     rax, 4
        cmp     edx, 1048576
        jne     .L23
        test    rsi, rsi
        jle     .L35
        sub     rsi, 1
        jne     .L25
        jmp     .L22

mov ecx, edx，idx赋值给ecx
and ecx, 1，idx与1做与操作，即idx & 0x1
mov DWORD PTR [rax], ecx，将idx & 0x1赋值给*ptr
add edx, 1，idx++
add rax, 4，ptr++

最后多了一步add rax, 4，但是并没有说服力；我做了一下改造，把bench_heap_memory_pointer_access变得简单一些，即使下面的实现是错误的，但是它和bench_heap_memory_pointer_normal_access的差异仅有一处，即解引用还是下标访问，这样对比起来简单一些：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


static void bench_heap_memory_pointer_access(benchmark::State &state) {
  auto arr = std::make_unique<ele_t[]>(buffer_size);
  for (auto _ : state) {
    ele_t *ptr = arr.get();
    for (uint32_t idx = 0; idx < buffer_size; idx++) {
      *ptr = idx & 0x1;
      benchmark::DoNotOptimize(ptr);
    }
  }
}
BENCHMARK(bench_heap_memory_pointer_access);

得到的结果依然是直接[]访问比指针++访问快，此外，还尝试将内存范围部分*ptr = idx & 0x1;和ptr[idx] = idx & 0x1;注释，得到的结果则是没有明显差别，这说明确实是内存访问的问题。上文说的改造后的汇编部分是怎样的呢？

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


.L23:
        mov     ecx, edx
        and     ecx, 1
        mov     DWORD PTR [rax], ecx
        add     edx, 1
        cmp     edx, 1048576
        jne     .L23
        test    rsi, rsi
        jle     .L35
        sub     rsi, 1
        jne     .L25
        jmp     .L22

和bench_heap_memory_pointer_normal_access没有差别，甚至还不需要rdx+rax*4这样的计算，除了寄存器名字不同。总之，在这一小节中我还没有找到能够支撑“指针++访问会比[]访问慢”的证据。所以目前也只能把实验结论当作知识，先有个印象了。

总结

本想通过实验看看又无什么确定性的结论，但是并没有得到。不过也获得了一些参考。

根据以上结果和经验，可以得出以下结论：

对于内存不大且不频繁分配的场景，栈内存和堆内存的性能差异不大，可以根据实际情况选择。
对于频繁分配的场景，堆内存的性能会受到影响，所以需要考虑预分配或者自建内存池等方式；栈内存则不会受到影响。
如果使用智能指针管理内存，尽量避免频繁的且直接的[]访问，可以通过raw指针访问来提高性能；对栈内存则没有这个问题。
不要为了高性能而一味的使用栈buffer，我曾经遇到过，在某个项目的开发中因为频繁使用栈buffer，待调用栈比较深后，再申请栈buffer就会导致栈溢出的问题。
指针++访问会比[]访问慢，需要有这个意识；如果生产环境中关心这个影响，则需要根据使用的编译器和编译选项来验证。

如果用两个字总结的话，我会用“实验”。

上面输出的对比结论仅是在benchmark这个环境中得到的，如果在实际生产或者稍微复杂一些的环境，没准又会有不同的结果。所以对于一些性能场景，除非是可以被理论证明正确的结论，否则我认为很多经验“不靠谱”，但凡有一点怀疑，都应该通过实验来验证。

另外一个想法是不要迷信书本，技术发展很快，但是很多书本知识比较老旧。类似的，如果是查看某个技术帖子，我也得看看他的成文时间，如果太久了，一般就只能说仅供参考。