Measuring the Memory Overhead of a Postgres Connection
One fairly common complaint about postgres is that each connection uses too much memory. It is often made when comparing postgres' connection model to one where each connection is assigned a dedicated thread, instead of the current model where each connection has a dedicated process.
To be clear: this is a worthwhile discussion to have, and there are several important improvements we could make to reduce memory usage.
That said, I think one common cause of these concerns is that the easy ways to measure the memory usage of a postgres backend, like top and ps, are quite misleading.
It is surprisingly hard to accurately measure the increase in memory usage by an additional connection.
In this post I’m mostly going to talk about running Postgres on Linux, since that is what I have the most experience with.
My bold claim is that, when measuring accurately, a connection only has an overhead of less than 2MiB (see conclusion).
A first look
Just using the common operating system tools makes the overhead look much bigger than it actually is. Especially when not utilizing huge_pages (which is not recommended), the memory usage of each process will look high.
Let’s first look at a freshly established connection, in a freshly started postgres cluster:
andres@awork3:~$ psql
postgres[2003213][1]=# SELECT pg_backend_pid();
┌────────────────┐
│ pg_backend_pid │
├────────────────┤
│ 2003213 │
└────────────────┘
(1 row)
andres@awork3:~/src/postgresql$ ps -q 2003213 -eo pid,rss
PID RSS
2003213 16944
About 16MiB.
Massive leaks!?! Luckily not.
What’s worse, the memory usage will appear to grow over time. To show this problem, I’ll use the pg_prewarm extension to load all table pages into postgres' buffer pool:
postgres[2003213][1]=# SHOW shared_buffers ;
┌────────────────┐
│ shared_buffers │
├────────────────┤
│ 16GB │
└────────────────┘
(1 row)
postgres[2003213][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
┌────────┐
│ sum │
├────────┤
│ 383341 │
└────────┘
andres@awork3:~$ ps -q 2003213 -eo pid,rss
PID RSS
2003213 3169144
Now postgres' memory usage appears to be around 3GB, even though the individual connection did not actually allocate much additional memory. The added memory usage is proportional to the amount of shared buffers touched:
postgres[2003213][1]=# SELECT pg_size_pretty(SUM(pg_relation_size(oid))) FROM pg_class WHERE relfilenode <> 0;
┌────────────────┐
│ pg_size_pretty │
├────────────────┤
│ 2995 MB │
└────────────────┘
(1 row)
And even worse than that, if yet another connection also uses those pages, it too will be shown as having a huge memory usage:
postgres[3244960][1]=# SELECT sum(abalance) FROM pgbench_accounts ;
┌─────┐
│ sum │
├─────┤
│ 0 │
└─────┘
(1 row)
andres@awork3:~/src/postgresql$ ps -q 3244960 -eo pid,rss
PID RSS
3244960 2700372
Of course postgres does not actually use 3+2.7 GiB of memory in this case. Instead, what is happening is that, with huge_pages=off, ps attributes to each connection all of the shared memory, including the buffer pool, that the connection has touched. Obviously this leads to vastly over-estimating memory usage.
Huge pages accidentally kind of save the day
Many CPU micro-architectures normally use a page size of 4KiB, but can optionally also use larger page sizes, most commonly 2MiB.
Depending on the operating system, configuration, and the type of applications used, such larger pages can be utilized transparently by the operating system, or explicitly by applications. See e.g. the Debian wiki page about huge pages for some details.
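How to check for and explicitly enable huge pages varies between distributions; a minimal sketch for Linux could look like the following (the number of pages to reserve is just an example value and needs to cover shared_buffers plus the rest of postgres' shared memory):
grep -i huge /proc/meminfo             # current huge page size and number of reserved huge pages
sudo sysctl -w vm.nr_hugepages=8500    # reserve 2MiB huge pages (example value)
Afterwards, setting huge_pages = on in postgresql.conf makes postgres refuse to start if the reservation is insufficient, rather than silently falling back to normal pages.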
Repeating the previous experiments with huge_pages=on makes them look a lot better. First, looking at a “new connection”:
andres@awork3:~$ ps -q 3245907 -eo pid,rss
PID RSS
3245907 7612
So, a new connection now appears to use only about 7MiB. This reduction in memory usage is caused by the page table needing less memory, as it now only needs to contain 1/512th of the previous entries, due to the larger page size.
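The factor follows directly from the two page sizes; a quick sanity check of the arithmetic (assuming the common x86-64 sizes):
echo $(( (2 * 1024 * 1024) / (4 * 1024) ))    # 2MiB / 4KiB = 512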
And more importantly, the test where a lot of the memory is accessed:
postgres[3245843][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
…
postgres[3245851][1]=# SELECT sum(abalance) FROM pgbench_accounts ;
…
andres@awork3:~$ ps -q 3245907,3245974 -eo pid,rss
PID RSS
3245907 12260
3245974 8936
In contrast to above, these connections now appear to only use 12MiB and 9MiB respectively, where previously they used 3GiB and 2.7GiB. Quite the apparent change ;)
This is due to the way the use of larger pages is implemented in Linux, not because we used orders of magnitude less memory: the huge pages used just aren’t shown as part of the RSS column of ps/top.
Getting less unreal
Since Linux 4.5 the /proc/$pid/status file shows the memory usage split into finer sub-categories:
VmRSS: size of memory portions. It contains the three following parts (VmRSS = RssAnon + RssFile + RssShmem)
RssAnon: size of resident anonymous memory
RssFile: size of resident file mappings
RssShmem: size of resident shmem memory (includes SysV shm, mapping of tmpfs and shared anonymous mappings)
Looking at these stats with huge_pages=off:
andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3247901/status
RssAnon: 2476 kB
RssFile: 5072 kB
RssShmem: 8520 kB
HugetlbPages: 0 kB
postgres[3247901][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
andres@awork3:~$ ps -q 3247901 -eo pid,rss
PID RSS
3247901 3167164
andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3247901/status
RssAnon: 3148 kB
RssFile: 9212 kB
RssShmem: 3154804 kB
HugetlbPages: 0 kB
RssAnon is the amount of “anonymous” memory, i.e. memory allocations. RssFile covers memory-mapped files, including the postgres binary itself. And lastly, RssShmem shows the accessed non-huge_pages shared memory.
This nicely shows that most of the high memory usage that ps etc. report is due to accesses to shared memory.
And with huge_pages=on:
andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3248101/status
RssAnon: 2476 kB
RssFile: 4664 kB
RssShmem: 0 kB
HugetlbPages: 778240 kB
postgres[3248101][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
andres@awork3:~$ grep -E '^(Rss|HugetlbPages)' /proc/3248101/status
RssAnon: 3136 kB
RssFile: 8756 kB
RssShmem: 0 kB
HugetlbPages: 3846144 kB
Approximating Accuracy
Just adding up the memory usage of the non-shmem values still over-estimates memory usage. There are two main reasons:
First, it doesn’t actually make sense to include RssFile when measuring a postgres backend’s memory usage: for postgres that is overwhelmingly just the postgres binary and the shared libraries it uses (postgres does not mmap() files). As nearly all of that is shared between all processes in the system, it’s not a per-connection overhead.
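To see what RssFile actually covers, one can look at a backend’s file-backed mappings; a small sketch (the PID is a placeholder for the backend of interest):
PID=2003213
awk '$6 ~ /^\// {print $6}' /proc/$PID/maps | sort | uniq -c | sort -rn | head
The output is dominated by the postgres executable and shared libraries like libc, which are also mapped by every other postgres process.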
Secondly, even just looking at RssAnon over-estimates memory usage. The reason is that ps measures the entire memory of the process, even though the majority of a new connection’s overhead is shared between the user connection and the supervisor process. This is because Linux does not copy all memory when fork()ing a new process; instead it uses copy-on-write, copying pages only when they are modified.
There is no good way to accurately measure the memory usage of an individual forked process, but since version 4.14 the Linux kernel at least provides an approximation (commit with description) in a process’ /proc/[pid]/smaps_rollup file. Pss shows “the process’ proportional share of this mapping” across all of a process’ mappings (search Linux’s documentation page for smaps_rollup and Pss, which unfortunately does not have direct links). For memory shared between processes it divides the memory usage by the number of processes using a mapping.
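The effect of that copy-on-write sharing can be seen by comparing a fresh backend’s full RssAnon with its proportional Pss_Anon; a small sketch (the PID is again a placeholder):
grep '^RssAnon' /proc/2003213/status         # counts shared copy-on-write pages in full
grep '^Pss_Anon' /proc/2003213/smaps_rollup  # divides shared pages across the processes sharing them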
postgres[2004042][1]=# SELECT SUM(pg_prewarm(oid, 'buffer')) FROM pg_class WHERE relfilenode <> 0;
┌────────┐
│ sum │
├────────┤
│ 383341 │
└────────┘
(1 row)
postgres[2004042][1]=# SHOW huge_pages ;
┌────────────┐
│ huge_pages │
├────────────┤
│ off │
└────────────┘
(1 row)
andres@awork3:~$ grep ^Pss /proc/2004042/smaps_rollup
Pss: 3113967 kB
Pss_Anon: 2153 kB
Pss_File: 3128 kB
Pss_Shmem: 3108684 kB
Pss_Anon contains memory allocated by the process, Pss_File includes shared libraries etc. linked into the process, and Pss_Shmem (if not using huge_pages) the shared memory usage divided across all the processes having touched the corresponding pages.
What makes the proportional values not perfect is that the divisor depends on the number of connections to the server. Here I use pgbench (scale 1000, -S, -M prepared, -c 1024) to start a large number of connections:
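The invocation is approximately the following; the thread count and run time are illustrative assumptions, not values taken from the measurements here:
pgbench -i -s 1000                          # initialize a scale-1000 dataset
pgbench -S -M prepared -c 1024 -j 8 -T 600  # 1024 clients, select-only, prepared statements
With those connections established: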
postgres[2004042][1]=# SELECT count(*) FROM pg_stat_activity ;
┌───────┐
│ count │
├───────┤
│ 1030 │
└───────┘
(1 row)
postgres[2004042][1]=# SELECT pid FROM pg_stat_activity WHERE application_name = 'pgbench' ORDER BY random() LIMIT 1;
┌─────────┐
│ pid │
├─────────┤
│ 3249913 │
└─────────┘
(1 row)
andres@awork3:~$ grep ^Pss /proc/3249913/smaps_rollup
Pss: 4055 kB
Pss_Anon: 1185 kB
Pss_File: 6 kB
Pss_Shmem: 2863 kB
And with huge_pages=on
:
andres@awork3:~$ grep ^Pss /proc/2007379/smaps_rollup
Pss: 1179 kB
Pss_Anon: 1173 kB
Pss_File: 6 kB
Pss_Shmem: 0 kB
The Pss values unfortunately do not account for resources that are not visible to the application. E.g. the size of the page table is not included. The page table size is also visible in the aforementioned /proc/$pid/status file.
To my knowledge - but I am not certain - VmPTE (the page table size) is completely private to each process, but most other Vm* values, including the stack VmStk, are shared in a copy-on-write manner.
Using the above, the overhead due to the page table is, with huge_pages=off:
andres@awork3:~$ grep ^VmPTE /proc/2004042/status
VmPTE: 6480 kB
and with huge_pages=on:
VmPTE: 132 kB
Of course there are some additional small overheads in the kernel for each process, but they are small enough not to be worth representing in the status file.
Conclusion
Based on the measurements above we can approximate that a connection running a very simplistic read-only OLTP workload has an overhead of about 7.6MiB with huge_pages=off and of about 1.3MiB with huge_pages=on, computed by adding Pss_Anon to VmPTE.
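To make that calculation repeatable, here is a small sketch that reads the two values for a given backend (the PID is a placeholder):
PID=3249913
anon_kb=$(awk '/^Pss_Anon/ {print $2}' /proc/$PID/smaps_rollup)
pte_kb=$(awk '/^VmPTE/ {print $2}' /proc/$PID/status)
awk -v a="$anon_kb" -v p="$pte_kb" 'BEGIN { printf "%.1f MiB\n", (a + p) / 1024 }'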
Even when leaving room for some “invisible” overhead, a larger amount of data in Postgres' buffer pool, etc, I think this backs my earlier claim that the overhead of a connection is less than 2MiB.