Debugging an AWS RDS Postgres instance

Postgres

I was recently handed a Postgres instance, running on AWS RDS and in due for some maintenance. As the total load kept increasing, the appearance of query timeouts also started to visibly increase. Coupled with how services using the database didn’t handle these timeouts robustly, the fallout of these timeouts started to spill over and customers started to notice. As yet-another vertical upscaling was monetarily impractical, I dove deep.

During my investigations I found the usual suspects of under- and over-indexing, missing guardrails in form of specific _timeout configurations, etc. Some changes were also made on database’s clients; the most effective being to not query as often.

RDS-specific aspects: ‘Parameter Groups’

Pg 10 to 11: got confirmation that bitmap scan were indeed ‘lossy’ (via explain (buffers))

enable custom Parameter Groups enable pgstatstatements and (or?) auto_explain

Replicate staging env and load it with prod-like usage. Then find slow queries

SELECT query,
              round(total_time::numeric, 2) AS total_time,
              calls,
              round(mean_time::numeric, 2) AS mean,
              round((100 * total_time /
                sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM    pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;

For Pg 11+ (12+?) replace total_time and mean_time with total_exec_time and mean_exec_time.

Reset stats with select pg_stat_statements_reset(); and(?) select pg_stat_reset(); before another load experiment in staging.

discard all;

FILLFACTOR

What

The default is 100: a page gets fully filled in with tuples before a new page is used. Any non-default table setting can be seen in psql with \d+ atable. Non-default settings can also be shown for multiple relations with such query:

select t.relname as table_name,
       t.reloptions
from pg_class t
  join pg_namespace n on n.oid = t.relnamespace
where t.relname in ('table_a', 'etc')
  and n.nspname = 'public';
alter table table_a (set fillfactor = 80);
vacuum full; --

Indexes

When to use other types than B-tree

covering indexes

CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);

partial index

... WHERE

How even with HOT, there’s a tradeoff with having too many indexes

Unique indexes are auto-created, whereas foreign keys’ ones aren’t.

Refs

At least this isn't a full screen popup

That'd be more annoying. Anyways, subscribe to my newsletter to get new posts by email! I write about AWS, Elixir, and more.


This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.