Database Troubleshooting for Developers at Brighton Ruby
This post contains the slides from the talk that I gave at Brighton Ruby
last week, along with the transcript and other information.
You can watch the video here
(a massive thanks to Andy Croll for the awesome souvenir video!).
Before I get into my slides and talk, I just want to say what a fantastic experience the event was.
The talks were excellent: engaging, interesting, educational and inspiring.
The atmosphere, the venue, and the company (not to mention the ice cream) all contributed to a very enjoyable and useful day.
I'm very grateful for the opportunity, and I highly recommend going along next year if you have the chance.
Talk Abstract and Summary
As a developer, what do you do if something goes wrong with your database?
Maybe your queries are running slowly, you're getting a weird error, or an important table seems to have disappeared.
If you don't have the luxury of a dedicated DBA, you'll probably need to figure out what's going wrong and fix it yourself.
Let's look at some common database issues and the tools you need to investigate and fix them.
The talk walks through some of the most common database issues that developers are likely to encounter.
For example: common error messages, slow-running queries, or the WAL directory filling up.
For each issue, I explain how to investigate the issue (which logs to look in, which queries or commands to run) and how to resolve it.
I take care to explain why the problem occurs and, perhaps most importantly, how to avoid it happening again.
The details and examples in the talk are specific to PostgreSQL, but the concepts are relevant to most database systems.
Introduction
A few facts about me, just for fun (you can skip this bit if you've already seen one of my talks or read my previous posts):
This is me in my happy place, splattered in mud!
I live in a small village in the French Alps and I love getting out on one of my bikes - road, gravel or mtb depending on how I’m feeling.
As well as being a Senior Solutions Architect at Crunchy Data, I’m on the PostgreSQL Europe Board of Directors
and I give talks at Postgres conferences, and at developer events like this one,
sharing database knowledge and spreading the love for Postgres.
I always feel the need to get this confession out of the way before I get too far into the talk - I’m not a developer (sorry)!
But as you can see in this diagram of my career so far, I at least know about databases.
I was a DBA for 20 years before becoming a database consultant and I'm now a senior solutions architect.
I might not have the word "database" in my job title any more, but I still only work with databases, specifically PostgreSQL.
What’s the plan for this session?
There’ll be an intro, because I always like to talk a bit about what we’re going to do and why,
and then a section about where and how to find the information you need.
Most of the rest is the “help!” section - “What do I do if…”
And then a bit at the end about how to practice to make sure you reinforce the things you’ve learnt.
As I said in the abstract, the examples and details in the talk are specific to Postgres,
because it’s the database system that I know and love best,
but the general concepts and methods used for troubleshooting are useful for other database systems.
As a developer, you’re often expected to look after your own databases,
but you haven’t necessarily been taught about database administration,
so you might find it difficult to know how to resolve any database issues.
I'm not going to turn you into a database troubleshooting expert in 30 minutes,
but I'll try to give you some starting points, and the confidence that you need
to deal with some of the most common database problems that you might encounter.
Where to Find Information
None of us, even so-called “experts”, can know everything. The most important thing is to know where to find the information you need.
Some of the places you can look:
- PostgreSQL documentation (replace “current” in the URL with the version number you’re using). I include a link to the relevant part of the docs for most of the things I cover in the talk so you can go and learn more.
- Postgres Mailing Lists
- Help in the psql command line tool
- Logs for your Postgres cluster
- PostgreSQL Slack
- And more!
I’m going to use the psql
command line tool for the examples in this talk,
mainly because I’m an old-school DBA and I love command line tools.
psql is a really useful, powerful little tool that’s always available to you on your database server.
It’s the only way I ever really interact with a database if I need to do debugging
(or anything really).
You can either type in queries and get results interactively, or you can provide a script as input.
There are also lots of meta-commands that give you information or perform tasks for you – we’ll see some as we go along.
Once you’re at the psql command line,
you can get help via \?
for psql meta-commands
and via \help
for SQL commands.
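For example, to see the syntax and options of the ALTER ROLE command (which we’ll need later on):
postgres=# \help alter role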
Your Postgres logs
contain a lot of useful information.
You really want to find them and make sure they contain the info you want before you actually need them;
you don’t want to be searching for them in a panic when something’s gone wrong.
First, make sure that logging_collector is set to on.
Then check the log_directory
parameter to find out where the logs are.
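You can check both of those settings from psql, for example:
postgres=# show logging_collector;
 on
postgres=# show log_directory;
 log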
Mine are in a directory called log. Great. Where’s that?
If log_directory
isn’t an absolute path, it’s relative to your data directory.
But where’s your data directory?
That depends on your distribution of Postgres, but you can find it with:
postgres=# show data_directory;
/var/lib/pgsql/16/data
is one example; the exact path depends on your distribution and how Postgres was installed.
Top tip: set the PGDATA
environment variable so you don't forget where the data directory is,
and to save a few keystrokes later on.
The log_filename
parameter tells you how the log files are named.
The default depends on your distribution, but you can change it to whatever works best for you.
You might have log names that contain the day of the week: log_filename=postgresql-%a.log
You might have log names that include a timestamp:
log_filename=postgresql-%Y-%m-%d_%H%M%S.log
There are other parameters that deal with log rotation that I don’t have time to get into now,
but you can find info about them in the docs.
You want the logs to be as useful as possible.
One of the ways to do that is to set log_line_prefix, which is a string that's included at the beginning of each log line.
The default value '%m [%p] '
just contains a timestamp and process ID,
which isn't necessarily enough to be able to debug any issues,
so change it to include more information.
The example '%t:%r:%u@%d:[%p]: '
includes:
- %t timestamp (without milliseconds)
- %r remote host name/address and port
- %u PostgreSQL username
- %d database name
- %p process ID
- a space at the end
(to separate the prefix from the actual log message)
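If you want to try that prefix out, one way to set it (assuming you’re connected as a superuser) is with alter system, followed by a reload:
postgres=# alter system set log_line_prefix = '%t:%r:%u@%d:[%p]: ';
postgres=# select pg_reload_conf();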
Help! I Can't Connect
Now that you know where to find the logs, and you’ve got the information you need in them,
let’s move on to the “Help!” section.
What do you do if you can’t connect to Postgres?
If you call psql without any parameters, it will use default values to establish a connection.
In this case it will try to connect:
- as a user called “karen” (a Postgres user with the same name as the currently connected OS user)
- to a database with the same name
- to the cluster on the local server (via localhost)
- on port 5432 (the default Postgres port)
If you’re using the pg gem to interact with Postgres,
it’s the same as calling PG.connect()
without any parameters.
This script is pretty much the extent of my Ruby expertise!
If I try that, psql returns an error.
The output from my Ruby script contains exactly the same error.
Fortunately, most Postgres errors give fairly useful information.
This one tells us that:
database “karen” does not exist
which makes sense, because I haven’t created a database called karen.
On the other hand, I have created a database called brighton_ruby. So we can try to connect again, this time specifying the brighton_ruby database.
This time, I can connect!
The psql prompt shows which database you're connected to.
The command:
brighton_ruby=# select current_user;
will tell you which user you’re currently connected as.
Let’s try to connect to the brighton_ruby database as a user called brighton_app_user, either via:
psql brighton_ruby -U brighton_app_user
or by adding user: 'brighton_app_user'
to the PG.connect()
call.
We can’t connect!
But again, at least the error message contains useful information.
The user (or role as it says in the message - the two are synonyms in most cases) doesn’t exist.
FATAL: role "brighton_app_user" does not exist
If we connect again as the karen user, we can use the \du psql meta-command to list the roles that exist in the cluster.
There's only karen at the moment, which is a superuser,
and was created automatically when I installed Postgres on my laptop with homebrew.
Most often, you’ll see a superuser called postgres.
Let’s fix our connection issue by creating the brighton_app_user role.
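A minimal version of that command (the slide may have included more options, such as a password) is:
brighton_ruby=# create role brighton_app_user;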
And this time, when we connect with brighton_app_user...
We still get an error!
role "brighton_app_user" is not permitted to log in
Create role is one of the rare cases where role and user aren’t synonyms in PostgreSQL.
When you do a create role, it’s with nologin by default, whereas create user is with login by default.
If I do:
brighton_ruby=# alter user brighton_app_user with login;
brighton_app_user can finally connect!
Up to this point, we’ve connected to a local Postgres cluster,
but often you’ll be trying to connect to a database on a different server.
In that case, you’ll give the hostname or address of the database server with -h for psql, or by adding host to the PG.connect() call.
If you see this error:
Permission denied
Is the server running on that host
and accepting connections on that socket?
It probably means that PostgreSQL is only listening for local connections.
To fix that, you’ll need to change the listen_addresses parameter,
that specifies which IP addresses Postgres will listen on for connections.
By default, it’s 'localhost', so Postgres will only listen for local, loopback connections.
To allow remote connections, you can set it to '*'
to listen on all available IP interfaces.
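For example (note that, unlike most of the parameters in this talk, listen_addresses only takes effect after a restart, not a reload):
brighton_ruby=# alter system set listen_addresses = '*';
-- then restart Postgres, e.g. with pg_ctl restart or systemctl restart <servicename>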
Unfortunately, even after changing listen_addresses to '*'
we still can’t connect.
Now what’s going on?
This time, the error message is talking about a missing
pg_hba.conf
entry.
The pg_hba.conf file (in the data directory) controls who can and can’t connect, where from, with which authentication method etc. The default pg_hba.conf only allows local connections, but it also contains examples and explanations about how to create the entries that you need.
To fix our issue and allow brighton_app_user to connect, here’s the line that I added
to the pg_hba.conf
file:
# TYPE DATABASE USER ADDRESS METHOD
host brighton_ruby brighton_app_user 192.168.1.3/32 scram-sha-256
But how did I know which values to use?
There were clues in the error message.
The options for connection type are local or host.
I know this isn't a local connection, so I chose host.
host can be with or without ssl or gssapi encryption.
The message says "no encryption" so I didn’t specify either of those.
The database is brighton_ruby.
The user is brighton_app_user.
And I'm trying to connect from 192.168.1.3 so I specify that as the ADDRESS.
Finally, I chose scram-sha-256
as the authentication method,
which is actually now the Postgres default.
You need to reload the parameters for the pg_hba.conf changes to take effect.
You can use the pg_reload_conf()
function to do that.
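For example:
brighton_ruby=# select pg_reload_conf();
It returns t if the signal was sent to the server successfully.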
And now when we try to connect, we’re prompted for the password and it finally works.
Help! PostgreSQL Isn't Running
What if Postgres isn’t even running?
If you get this error when you try to connect:
No such file or directory
Is the server running on that host
and accepting connections on that socket?
It suggests that Postgres isn’t even running.
You’ll need to check the status of the PostgreSQL service:
- either with pg_ctl status
- or, if Postgres was started with systemd, with systemctl status. In that case you might need to find the name of your PostgreSQL service with systemctl list-units; look for a name containing "postgres".
Before you try starting it back up, it’s probably a good idea to look in the postgres logs (that you already found earlier)
to figure out why it stopped, and if you need to do something to fix it.
There’ll be a message:
FATAL: the database system is shutting down
that tells you that Postgres stopped. Look at the messages just before that to find out why.
If there’s no problem (maybe someone just accidentally stopped Postgres), you can restart it with:
- pg_ctl start
- systemctl start <servicename>
- brew services start <servicename> (in the case of my local test cluster)
Help! I've Run Out of Space
What if your database has run out of space?
If you try to insert data or create a table and you see this type of message, telling you there’s no space left on the device:
ERROR: could not extend file "base/20429/2187":
No space left on device
You get a hint telling you to check free disk space.
But which disk?
The file name is given in the message, and it’s relative to the data directory that we saw earlier
(and that you’ve set as PGDATA
so you don’t have to keep looking for it).
Once you’ve figured out which disk, you can try to make some space.
The safest thing to do to fix the immediate problem is to add some space.
If you can’t do that, or if you still need more space,
maybe the disk that contains your data directory also contains files
that don’t relate to Postgres that you can delete or move.
And if that’s still not enough, you might need to look inside the database to find things that you can delete.
Are there temporary tables you can get rid of? Do you have unused indices?
This option is a bit scary – definitely don’t delete anything unless you’re absolutely sure it’s not needed!
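For example, a query along these lines (a sketch using the pg_stat_user_indexes statistics view) lists indexes that have never been scanned since the stats were last reset, biggest first. Investigate carefully before dropping anything; some indexes are there to enforce constraints rather than to speed up queries:
select schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where idx_scan = 0
order by pg_relation_size(indexrelid) desc;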
If it’s the disk containing the write ahead log (WAL) files that’s full, that’s more serious.
Postgres will do a PANIC shutdown to protect the database.
In that case, you don’t have a choice but to immediately make some space.
In the short term, adding some space to the disk will buy you some time.
And if you’re thinking
“it’s fine, I’ll just delete some of the older WAL files, I don’t need them any more.”
Please don’t do that! Postgres needs the WAL files.
If you delete them, you risk not being able to restart your database and having to restore from a backup.
Often, the problem is that the WAL archiving has stopped working.
WAL archiving is the process of sending a copy of each WAL file to your backup repository
so they're available in case you need to do a point in time recovery.
The archive_command parameter contains the command that’s used to perform this copy.
Postgres can’t delete WAL files that haven’t yet been archived,
so if archiving is broken, the WAL files will continue to build up and take up more and more space in the WAL directory.
Again, if you look in the Postgres logs, you should be able to find what the problem is.
In the example, I deliberately used the badly formulated archive command
faulty_archive_command
to generate the error.
Alternatively, it could be that the archive directory is full
or doesn't have the correct permissions.
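As well as the logs, the pg_stat_archiver view shows whether archiving is working, and when and on which WAL file it last failed:
select archived_count, last_archived_wal,
       failed_count, last_failed_wal, last_failed_time
from pg_stat_archiver;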
Once you’ve fixed the issue, the archiving should restart,
the archived WALs will be automatically deleted from the WAL directory,
and Postgres can be restarted.
Help! My Queries are Too Slow
There can be lots of different reasons for slow queries
and usually there’s no error to let us know what’s going on.
We can’t look at all the different cases in the next few minutes,
but we can learn to identify the cause and start to figure out the solution.
There are a few general things that can help.
The log_min_duration_statement parameter can help to identify queries that need to be optimised.
It’s deactivated by default.
If you set it to 1 second, it will log every completed SQL statement that ran for more than 1 second.
Obviously if all of your queries run for more than a second, that’s not going to help you much.
Set it to whatever will just show you outliers for your application to avoid too much noise in your logs.
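For example, to log anything that runs for more than a second:
postgres=# alter system set log_min_duration_statement = '1s';
postgres=# select pg_reload_conf();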
A useful tool for checking what’s going on in your database is the pg_stat_statements extension. If it's not already enabled, you need to:
- add it to the shared_preload_libraries parameter and restart Postgres
- issue create extension pg_stat_statements in each database where you want to use it
and then you can query the pg_stat_statements view for information about long-running or frequently executed queries.
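For example, a query like this (a sketch; the column names are from PostgreSQL 13 and later, older versions use total_time and mean_time) shows the ten statements that have taken the most execution time in total:
select calls,
       round(total_exec_time) as total_ms,
       round(mean_exec_time) as mean_ms,
       left(query, 60) as query
from pg_stat_statements
order by total_exec_time desc
limit 10;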
To investigate, start by asking yourself some questions. For example:
- Are all queries slow, or is it just a particular query?
- Is it always slow, or just at certain times?
- Has it recently started to be slow, or was it always slow?
- Does it affect everyone, or just certain users, certain applications, certain locations etc.?
If all queries are impacted, you may need to look outside of Postgres to see what’s going on, potentially checking resources on the database server, looking for IO latency, network issues etc.
On the database side, look for clues in the logs; maybe there are messages there.
Check if there are too many connections, and particularly too many active connections.
Postgres works best with a smallish number of concurrent connections, preferably in the low hundreds.
You can look in pg_stat_activity
to get the number of sessions and their current state.
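For example, to get a quick count of sessions by state:
select state, count(*)
from pg_stat_activity
group by state;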
If there are too many sessions, look at:
- Reconfiguring your application to create fewer sessions.
- Making sure sessions are closed when they’re not needed.
- Putting connection pooling in place.
If it’s just one query that’s slow, you’re going to need to tune that query.
For that, you’ll need to see the execution plan for the query,
which tells you how the PostgreSQL optimiser is going to execute it.
It includes information such as:
- the way the tables will be accessed; sequential scan, index scan etc.
- The join methods that will be used if there’s more than one table involved.
- An estimate of the number of rows in each step.
- An estimate of the cost of execution.
To display the execution plan, take your slow-running query and prefix it with explain or explain analyze.
The actual query here doesn’t matter,
I’m just selecting some data from a single table and I’m grouping and sorting it.
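Something along these lines (the arrondissement column is invented for illustration, but the shape matches: one table, a group by and a sort):
explain
select arrondissement, count(*)
from drinking_fountains
group by arrondissement
order by count(*) desc;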
There are various different options that you can read about
in the PostgreSQL documentation.
explain analyze will actually execute the query, so it gives more information than just explain.
Just be aware that if your query makes changes, and you don’t want to keep those changes,
you will need to put it in a transaction so you can rollback afterwards.
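For example (again with an invented column), to see the plan for a delete without keeping the changes:
begin;
explain analyze
delete from drinking_fountains where arrondissement = '75004';
rollback;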
The explain returns something that looks like this, which can be confusing to start with.
You read an execution plan from the most indented to the least indented.
Here, we see there’s a full table scan (Seq Scan) on our drinking_fountains table.
The results of the table scan are fed into the HashAggregate, which does our group by.
And those results are fed into the Sort, which is the last step.
I know some people prefer a visual approach. I didn't have time to talk about this,
but there are various tools to help with visualising and understanding the EXPLAIN output, such as Depesz EXPLAIN and DBeaver (see the resources list at the end).
There's also a PostgreSQL wiki page with information about how to use and understand EXPLAIN.
The PostgreSQL planner/optimizer
needs up to date statistics to be able to come up with the right execution plans.
If the plan doesn’t seem right, and especially if there have recently been lots of changes to the table data,
maybe your stats need to be updated.
You can recalculate stats for a single table with the analyze <tablename> command.
If you don’t say which table (or list of tables) it will analyze all of the tables in your database.
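For example:
analyze drinking_fountains;   -- just this table
analyze;                      -- every table in the current database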
If your query is only slow at certain times, what else is going on at that time that could be impacting it?
- Backup?
- Analytics queries?
- Maintenance tasks? (index rebuilds, stats collection, vacuum…)
If performance has changed recently, what’s changed?
- Is there more data?
- More connections than before?
- A new application using lots of resources?
If it only affects certain users, certain applications or certain locations:
- Who does it affect?
- Where are they connecting from?
- Which applications?
Maybe it’s a network issue, there's an issue with a specific application server, or resources have been limited for particular users.
Help! I've Lost Some Data
What if you’ve lost an object, or some data seems to have disappeared?
Luckily, often when you think an object has disappeared, it’s actually still there, you just need to know where to look for it.
- Are you connected to the right database? (A Postgres cluster can be made up of multiple databases.)
- Are you looking in the right schema? (There might be several schemas, effectively namespaces, in a database.)
- Do you have permissions on the object? (Maybe you’re looking in the right place, but you don’t have the right permissions.)
Connected to the brighton_ruby database as karen, if I try to select from a table that I created called drinking_fountains (from an open data set that shows where you can find drinking fountains in Paris)...
...I get an error telling me that the relation (table) doesn’t exist.
But I know the table exists because it’s me that created it.
What I’d forgotten is that I’d created it in a schema called brighton_data.
If I qualify the table name with the schema name...
...it finds the table and I get the results.
The reason it didn’t find it to start with is because the search_path parameter, which tells Postgres which schemas to look in, was left at its default value: "$user", public.
This translates to a schema with the same name as the user that you’re connected with – currently karen,
and the public schema.
Now, I’m going to log in as my application user brighton_app_user.
Let’s assume that all of the application tables are in the brighton_data schema.
I can set search_path = brighton_data
so Postgres always looks there.
Now I don’t need to specify the schema name for the tables in my queries.
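Either of these works (a sketch; the first lasts only for the current session, the second persists for future sessions of that user):
set search_path = brighton_data;
alter role brighton_app_user set search_path = brighton_data;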
But now I’ve no idea what’s going on!
I know the table's there, I’ve set search path, and it’s still telling me the table doesn’t exist.
This time it’s because I don’t have permissions on the drinking_fountains table.
I need to ask brighton_app_owner nicely to grant
the right to select from the table.
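The grant looks something like this (run by brighton_app_owner; usage on the schema is needed too if it hasn’t already been granted):
grant usage on schema brighton_data to brighton_app_user;
grant select on brighton_data.drinking_fountains to brighton_app_user;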
Once that's done, fortunately, brighton_app_user can access the table and get the results.
If the table really doesn’t exist, or the data has been deleted, it’s going to be more complicated.
Ideally, you’ve already put in place backups of your database
and that you’ve practiced doing restores.
pgBackRest is a reliable, open source Postgres backup and recovery tool.
If you’re not already using it, go and install and configure it straight after this talk!
Be aware that a restore is for an entire Postgres cluster.
If you want to create an extract or dump of just certain objects from a cluster, there’s the pg_dump tool. And you can use pg_restore to import data from the extract into your database.
How can I practice?
How can you practice and make sure you reinforce the things you've learnt?
pg_kaboom is an extension that will crash your postgres cluster in "multiple varied and destructive ways".
So destructive that you have to fill in a disclaimer each time you use it!
SET pg_kaboom.disclaimer = 'I can afford to lose this data and server';
In the example in the slide, I’ve chosen the option to fill the WAL directory.
Is it safe?
Absolutely not!
Do not, under any circumstances, use this in a production cluster.
In conclusion: in case of database problems, don’t panic!
- Look in the logs (find the logs before you need them).
- Read the error messages.
- Look in the PostgreSQL documentation.
- Enable pg_stat_statements before you have issues.
- Ask a question in the PostgreSQL Slack
or on the PostgreSQL mailing lists.
I’ve grouped together some of the resources that I’ve used in the presentation,
plus others that could be useful for practicing or for troubleshooting.
- For log analysis: pgBadger
- Database GUI: DBeaver, pgAdmin
- Managed PostgreSQL service: Crunchy Bridge*
- To generate database activity: pgbench, pgSimload
- View database activity: pg_stat_statements
- Practice troubleshooting: pg_kaboom
- Understand execution plans: Depesz EXPLAIN, DBeaver, Using EXPLAIN
* Other managed PostgreSQL services are available!
The QR code on the final slide takes you to a page with links to recordings and PDF versions of slides from
talks I've given in 2024.
I've also included some contact details so you can get in touch for more database chat:
Mastodon @karenhjex@mastodon.online
X/Twitter @karenhjex
LinkedIn https://www.linkedin.com/in/karenhjex/