Here are some scripts that you can use to check the health of your PostgreSQL DB servers.
Check cache hit ratio
This shows how often your data is served from memory rather than read from disk. A cache hit ratio of around 99% is a good sign for performance.
SELECT
  sum(heap_blks_read) AS heap_read,
  sum(heap_blks_hit) AS heap_hit,
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM
  pg_statio_user_tables;
SELECT
  datname,
  (blks_hit * 100 / nullif(blks_hit + blks_read, 0))::numeric AS hit_ratio
FROM
  pg_stat_database
WHERE
  datname NOT IN ('postgres', 'template0', 'template1');
If the hit ratio is below 90%, shared buffers may be allocated too little memory or queries may be doing large table scans; the ratio should ideally be close to 100%. Increase shared_buffers or tune the queries that do the most I/O.
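A minimal sketch of checking and raising shared_buffers (the value below is only illustrative; changing shared_buffers requires a server restart):

SHOW shared_buffers;
-- A common starting point is roughly 25% of the machine's RAM.
-- ALTER SYSTEM writes the setting to postgresql.auto.conf; a restart is
-- required before the new value takes effect.
ALTER SYSTEM SET shared_buffers = '4GB';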
Check unvacuumed dead tuples – Get table bloat information
Bloat can slow down reads and writes and cause other issues. A quick dead-tuple check is sketched below, followed by two bloat-estimation scripts.
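Before running the bloat estimates, a quick look at the dead-tuple counters in pg_stat_user_tables often shows whether autovacuum is keeping up. A minimal sketch; the 10,000-row threshold is only an illustrative cutoff:

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_autovacuum
FROM
  pg_stat_user_tables
WHERE
  n_dead_tup > 10000   -- illustrative threshold, adjust as needed
ORDER BY
  n_dead_tup DESC
LIMIT 20;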
First script:
WITH constants AS (
SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
SELECT
ma,bs,schemaname,tablename,
(datawidth+(hdr+ma-(case when hdr%ma=0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
(maxfracsum*(nullhdr+ma-(case when nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM (
SELECT
schemaname, tablename, hdr, ma, bs,
SUM((1-null_frac)*avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr+(
SELECT 1+count(*)/8
FROM pg_stats s2
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
) AS nullhdr
FROM pg_stats s, constants
GROUP BY 1,2,3,4,5
) AS foo
), table_bloat AS (
SELECT
schemaname, tablename, cc.relpages, bs,
CEIL((cc.reltuples*((datahdr+ma-
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::float)) AS otta
FROM bloat_info
JOIN pg_class cc ON cc.relname = bloat_info.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
), index_bloat AS (
SELECT
schemaname, tablename, bs,
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::float)),0) AS iotta -- very rough approximation, assumes all cols
FROM bloat_info
JOIN pg_class cc ON cc.relname = bloat_info.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = bloat_info.schemaname AND nn.nspname <> 'information_schema'
JOIN pg_index i ON indrelid = cc.oid
JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) as waste
FROM
(SELECT
'table' as type,
schemaname,
tablename as object_name,
ROUND(CASE WHEN otta=0 THEN 0.0 ELSE table_bloat.relpages/otta::numeric END,1) AS bloat,
CASE WHEN relpages < otta THEN '0' ELSE (bs*(table_bloat.relpages-otta)::bigint)::bigint END AS raw_waste
FROM
table_bloat
UNION
SELECT
'index' as type,
schemaname,
tablename || '::' || iname as object_name,
ROUND(CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages/iotta::numeric END,1) AS bloat,
CASE WHEN ipages < iotta THEN '0' ELSE (bs*(ipages-iotta))::bigint END AS raw_waste
FROM
index_bloat) bloat_summary
ORDER BY raw_waste DESC, bloat DESC;
Second script:
WITH constants AS
(
-- define some constants for sizes of things
-- for reference down the query and easy maintenance
SELECT
current_setting('block_size')::numeric AS bs,
23 AS hdr,
8 AS ma
)
,
no_stats AS
(
-- screen out tables that have attributes
-- without statistics, such as JSON columns
SELECT
table_schema,
table_name,
n_live_tup::numeric as est_rows,
pg_table_size(relid)::numeric as table_size
FROM
information_schema.columns
JOIN
pg_stat_user_tables as psut
ON table_schema = psut.schemaname
AND table_name = psut.relname
LEFT OUTER JOIN
pg_stats
ON table_schema = pg_stats.schemaname
AND table_name = pg_stats.tablename
AND column_name = attname
WHERE
attname IS NULL
AND table_schema NOT IN
(
'pg_catalog',
'information_schema'
)
GROUP BY
table_schema,
table_name,
relid,
n_live_tup
)
,
null_headers AS
(
-- calculate null header sizes
-- omitting tables which don't have complete stats
-- and attributes which aren't visible
SELECT
hdr + 1 + (sum(
case
when
null_frac <> 0
THEN
1
else
0
END
) / 8) as nullhdr, SUM((1 - null_frac)*avg_width) as datawidth, MAX(null_frac) as maxfracsum, schemaname, tablename, hdr, ma, bs
FROM
pg_stats
CROSS JOIN
constants
LEFT OUTER JOIN
no_stats
ON schemaname = no_stats.table_schema
AND tablename = no_stats.table_name
WHERE
schemaname NOT IN
(
'pg_catalog', 'information_schema'
)
AND no_stats.table_name IS NULL
AND EXISTS
(
SELECT
1
FROM
information_schema.columns
WHERE
schemaname = columns.table_schema
AND tablename = columns.table_name
)
GROUP BY
schemaname,
tablename,
hdr,
ma,
bs
)
,
data_headers AS
(
-- estimate header and row size
SELECT
ma,
bs,
hdr,
schemaname,
tablename,
(datawidth + (hdr + ma - (
case
when
hdr % ma = 0
THEN
ma
ELSE
hdr % ma
END
)))::numeric AS datahdr, (maxfracsum*(nullhdr + ma - (
case
when
nullhdr % ma = 0
THEN
ma
ELSE
nullhdr % ma
END
))) AS nullhdr2
FROM
null_headers
)
, table_estimates AS
(
-- make estimates of how large the table should be
-- based on row and page size
SELECT
schemaname,
tablename,
bs,
reltuples::numeric as est_rows,
relpages * bs as table_bytes,
CEIL((reltuples* (datahdr + nullhdr2 + 4 + ma - (
CASE
WHEN
datahdr % ma = 0
THEN
ma
ELSE
datahdr % ma
END
) ) / (bs - 20))) * bs AS expected_bytes, reltoastrelid
FROM
data_headers
JOIN
pg_class
ON tablename = relname
JOIN
pg_namespace
ON relnamespace = pg_namespace.oid
AND schemaname = nspname
WHERE
pg_class.relkind = 'r'
)
, estimates_with_toast AS
(
-- add in estimated TOAST table sizes
-- estimate based on 4 toast tuples per page because we don't have
-- anything better. also append the no_stats tables
SELECT
schemaname,
tablename,
TRUE as can_estimate,
est_rows,
table_bytes + ( coalesce(toast.relpages, 0) * bs ) as table_bytes,
expected_bytes + ( ceil( coalesce(toast.reltuples, 0) / 4 ) * bs ) as expected_bytes
FROM
table_estimates
LEFT OUTER JOIN
pg_class as toast
ON table_estimates.reltoastrelid = toast.oid
AND toast.relkind = 't'
)
,
table_estimates_plus AS
(
-- add some extra metadata to the table data
-- and calculations to be reused
-- including whether we can estimate it at all
-- or whether we think it might be compressed
SELECT
current_database() as databasename,
schemaname,
tablename,
can_estimate,
est_rows,
CASE
WHEN
table_bytes > 0
THEN
table_bytes::NUMERIC
ELSE
NULL::NUMERIC
END
AS table_bytes,
CASE
WHEN
expected_bytes > 0
THEN
expected_bytes::NUMERIC
ELSE
NULL::NUMERIC
END
AS expected_bytes,
CASE
WHEN
expected_bytes > 0
AND table_bytes > 0
AND expected_bytes <= table_bytes
THEN
(table_bytes - expected_bytes)::NUMERIC
ELSE
0::NUMERIC
END
AS bloat_bytes
FROM
estimates_with_toast
UNION ALL
SELECT
current_database() as databasename,
table_schema,
table_name,
FALSE,
est_rows,
table_size,
NULL::NUMERIC,
NULL::NUMERIC FROM no_stats
)
,
bloat_data AS
(
-- do final math calculations and formatting
select
current_database() as databasename,
schemaname,
tablename,
can_estimate,
table_bytes,
round(table_bytes / (1024 ^ 2)::NUMERIC, 3) as table_mb,
expected_bytes,
round(expected_bytes / (1024 ^ 2)::NUMERIC, 3) as expected_mb,
round(bloat_bytes*100 / table_bytes) as pct_bloat,
round(bloat_bytes / (1024::NUMERIC ^ 2), 2) as mb_bloat,
est_rows
FROM
table_estimates_plus
)
-- filter output for bloated tables
SELECT
databasename,
schemaname,
tablename,
can_estimate,
est_rows,
pct_bloat,
mb_bloat,
table_mb
FROM
bloat_data -- this where clause defines which tables actually appear
-- in the bloat chart
-- example below filters for tables which are either 50%
-- bloated and more than 20mb in size, or more than 25%
-- bloated and more than 1GB in size
WHERE
(
pct_bloat >= 50
AND mb_bloat >= 20
)
OR
(
pct_bloat >= 25
AND mb_bloat >= 1000
)
ORDER BY
pct_bloat DESC;
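Once bloated tables are identified, a plain VACUUM only marks the dead space as reusable inside the table; returning it to the operating system requires VACUUM FULL (which takes an exclusive lock) or an online rewrite tool such as pg_repack. A hedged sketch using a hypothetical table name:

-- Mark dead space reusable and refresh planner statistics (non-blocking):
VACUUM (VERBOSE, ANALYZE) myschema.mytable;   -- hypothetical table

-- Rewrite the table and give the space back to the OS
-- (takes an ACCESS EXCLUSIVE lock, so schedule a maintenance window):
VACUUM FULL myschema.mytable;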
Finding Unused Indexes
The following query returns indexes that have never been scanned and are not used to enforce any constraint.
SELECT s.schemaname,
s.relname AS tablename,
s.indexrelname AS indexname,
pg_relation_size(s.indexrelid) AS index_size
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0
AND 0 <>ALL (i.indkey)
AND NOT i.indisunique
AND NOT EXISTS
(SELECT 1 FROM pg_catalog.pg_constraint c
WHERE c.conindid = s.indexrelid)
AND NOT EXISTS
(SELECT 1 FROM pg_catalog.pg_inherits AS inh
WHERE inh.inhrelid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;
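If an index shows up here, keep in mind that idx_scan counts are cumulative since the last statistics reset and that each replica keeps its own counters, so check every node before acting. A minimal sketch with a hypothetical index name:

-- CONCURRENTLY avoids blocking writes on the table, but cannot run
-- inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS myschema.my_unused_idx;   -- hypothetical name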
Check query performance
SELECT query,
calls,
total_time,
total_time / calls as time_per,
stddev_time,
rows,
rows / calls as rows_per,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
WHERE query not similar to '%pg_%'
and calls > 500
ORDER BY time_per DESC
-- other useful orderings: calls, total_time, rows_per
LIMIT 20;
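This query relies on the pg_stat_statements extension, which must be preloaded before it can be created. Note also that on PostgreSQL 13 and later the timing columns were renamed, so use total_exec_time and stddev_exec_time in place of total_time and stddev_time above. A sketch for enabling the extension:

-- postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optional: clear the accumulated statistics and start fresh.
SELECT pg_stat_statements_reset();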
Check commit ratio of database
If the commit ratio is below 95%, work with the application team to find out why so many transactions are being rolled back. Keep in mind that rolled-back transactions containing DML still leave dead tuples behind, which adds to table bloat.
SELECT
  datname,
  round(
    (xact_commit::float * 100 / nullif(xact_commit + xact_rollback, 0))::numeric,
    2
  ) AS successful_xact_ratio
FROM
  pg_stat_database
WHERE
  datname NOT IN ('postgres', 'template0', 'template1');
Get the temp file usage per database
If temp_files and temp_bytes are high, enable the log_temp_files parameter to log the queries that spill to temporary files, then tune those queries (or raise work_mem where appropriate).
SELECT
  datname,
  temp_files,
  round(temp_bytes::numeric / 1024 / 1024, 2) AS temp_filesize_mb
FROM
  pg_stat_database
WHERE
  datname NOT IN ('postgres', 'template0', 'template1')
  AND temp_files > 0;
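A minimal sketch for turning on temporary-file logging; the value is a size threshold in kilobytes, 0 logs every temporary file, and the setting can be applied with a reload rather than a restart:

ALTER SYSTEM SET log_temp_files = 0;   -- log every temp file; use a KB threshold in production
SELECT pg_reload_conf();               -- apply without a restart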
Frequency of Checkpoints
pg_stat_bgwriter has two important columns: checkpoints_timed (checkpoints triggered by checkpoint_timeout) and checkpoints_req (checkpoints requested early, typically because enough WAL was written to reach max_wal_size). On recent PostgreSQL versions these counters have moved to the pg_stat_checkpointer view.
If checkpoints_req is higher than checkpoints_timed, PostgreSQL is checkpointing because of high WAL generation, which is a bad sign: frequent checkpoints put extra I/O load on the machine, so consider increasing max_wal_size.
Use the query below to find the average time between checkpoints.
WITH sub as (
SELECT
EXTRACT(
EPOCH
FROM
(now() - stats_reset)
) AS seconds_since_start,
(
checkpoints_timed + checkpoints_req
) AS total_checkpoints
FROM
pg_stat_bgwriter
)
SELECT
total_checkpoints,
seconds_since_start / total_checkpoints / 60 AS minutes_between_checkpoints
FROM
sub;
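If most checkpoints are requested rather than timed, raising max_wal_size spreads them out. A hedged sketch; the value below is only illustrative, and the parameter can be applied with a reload:

SHOW max_wal_size;                       -- 1GB by default
ALTER SYSTEM SET max_wal_size = '4GB';   -- illustrative value
SELECT pg_reload_conf();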
Get the tables with the highest sequential scans
SELECT
  schemaname,
  relname,
  seq_scan,
  seq_tup_read,
  seq_tup_read / seq_scan AS avg_seq_tup_read
FROM
  pg_stat_all_tables
WHERE
  seq_scan > 0
  AND pg_total_relation_size(relid) > 104857600   -- larger than 100 MB
  AND schemaname IN ('gameserver')                -- replace with your schema(s)
ORDER BY
  5 DESC
LIMIT
  20;
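A large avg_seq_tup_read on a big table usually points at a missing index. A minimal sketch with hypothetical table and column names; CONCURRENTLY avoids blocking writes while the index is built, but cannot run inside a transaction block:

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id   -- hypothetical name
  ON gameserver.orders (customer_id);                            -- hypothetical table/column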
Cheers!