Database Troubleshooting for Developers at Brighton Ruby
This post contains the slides from the talk that I gave at Brighton Ruby
last week, along with the transcript and other information.
You can watch the video here
(a massive thanks to Andy Croll for the awesome souvenir video!).
Before I get into my slides and talk, I just want to say what a fantastic experience the event was.
The talks were excellent: engaging, interesting, educational and inspiring.
The atmosphere, the venue, and the company (not to mention the ice cream) all contributed to a very enjoyable and useful day.
I'm very grateful for the opportunity, and I highly recommend going along next year if you have the chance.
Talk Abstract and Summary
As a developer, what do you do if something goes wrong with your database?
Maybe your queries are running slowly, you're getting a weird error, or an important table seems to have disappeared.
If you don't have the luxury of a dedicated DBA, you'll probably need to figure out what's going wrong and fix it yourself.
Let's look at some common database issues and the tools you need to investigate and fix them.
The talk walks through some of the most common database issues that developers are likely to encounter.
For example: common error messages, slow-running queries, or the WAL directory filling up.
For each issue, I explain how to investigate the issue (which logs to look in, which queries or commands to run) and how to resolve it.
I take care to explain why the problem occurs and, perhaps most importantly, how to avoid it happening again.
The details and examples in the talk are specific to PostgreSQL, but the concepts are relevant to most database systems.
Introduction
A few facts about me, just for fun (you can skip this bit if you've already seen one of my talks or read my previous posts):
This is me in my happy place, splattered in mud!
I live in a small village in the French Alps and I love getting out on one of my bikes - road, gravel or mtb depending on how I’m feeling.
As well as being a Senior Solutions Architect at Crunchy Data, I’m on the PostgreSQL Europe Board of Directors
and I give talks at Postgres conferences, and at developer events like this one,
sharing database knowledge and spreading the love for Postgres.
I always feel the need to get this confession out of the way before I get too far into the talk - I’m not a developer (sorry)!
But as you can see in this diagram of my career so far, I at least know about databases.
I was a DBA for 20 years before becoming a database consultant and I'm now a senior solutions architect.
I might not have the word "database" in my job title any more, but I still only work with databases, specifically PostgreSQL.
What’s the plan for this session?
There’ll be an intro, because I always like to talk a bit about what we’re going to do and why,
and then a section about where and how to find the information you need.
Most of the rest is the “help!” section - “What do I do if…”
And then a bit at the end about how to practice to make sure you reinforce the things you’ve learnt.
As I said in the abstract, the examples and details in the talk are specific to Postgres,
because it’s the database system that I know and love best,
but the general concepts and methods used for troubleshooting are useful for other database systems.
As a developer, you’re often expected to look after your own databases,
but you haven’t necessarily been taught about database administration,
so you might find it difficult to know how to resolve any database issues.
I'm not going to turn you into a database troubleshooting expert in 30 minutes,
but I'll try to give you some starting points, and the confidence that you need
to deal with some of the most common database problems that you might encounter.
Where to Find Information
None of us, even so-called “experts”, can know everything. The most important thing is to know where to find the information you need.
Some of the places you can look:
- PostgreSQL documentation (replace “current” in the URL with the version number you’re using). I include a link to the relevant part of the docs for most of the things I cover in the talk so you can go and learn more.
- Postgres Mailing Lists
- Help in the psql command line tool
- Logs for your Postgres cluster
- PostgreSQL Slack
- And more!
I’m going to use the psql
command line tool for the examples in this talk,
mainly because I’m an old-school DBA and I love command line tools.
psql is a really useful, powerful little tool that’s always available to you on your database server.
It’s the only way I ever really interact with a database if I need to do debugging
(or anything really).
You can either type in queries and get results interactively, or you can provide a script as input.
There are also lots of meta-commands that give you information or perform tasks for you – we’ll see some as we go along.
Once you’re at the psql command line,
you can get help via \?
for psql meta-commands
and via \help
for SQL commands.
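For example, to see the syntax and options of the ALTER ROLE command (which we’ll need later on):
postgres=# \help alter role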
Your Postgres logs
contain a lot of useful information.
You really want to find them and make sure they contain the info you want before you actually need them;
you don’t want to be searching for them in a panic when something’s gone wrong.
First, make sure that logging_collector is set to on.
Then check the log_directory
parameter to find out where the logs are.
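You can check both of those settings from psql, for example:
postgres=# show logging_collector;
 on
postgres=# show log_directory;
 log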
Mine are in a directory called log. Great. Where’s that?
If log_directory
isn’t an absolute path, it’s relative to your data directory.
But where’s your data directory?
That depends on your distribution of Postgres, but you can find it with:
postgres=# show data_directory;
/var/lib/pgsql/16/data
is one example; the exact path depends on your distribution and how Postgres was installed.
Top tip: set the PGDATA
environment variable so you don't forget where the data directory is,
and to save a few keystrokes later on.
The log_filename
parameter tells you how the log files are named.
The default depends on your distribution, but you can change it to whatever works best for you.
You might have log names that contain the day of the week: log_filename=postgresql-%a.log
You might have log names that include a timestamp:
log_filename=postgresql-%Y-%m-%d_%H%M%S.log
There are other parameters that deal with log rotation that I don’t have time to get into now,
but you can find info about them in the docs.
You want the logs to be as useful as possible.
One of the ways to do that is to set log_line_prefix, which is a string that's included at the beginning of each log line.
The default value '%m [%p] '
just contains a timestamp and process ID,
which isn't necessarily enough to be able to debug any issues,
so change it to include more information.
The example '%t:%r:%u@%d:[%p]: '
includes:
- %t timestamp (without milliseconds)
- %r remote host name/address and port
- %u PostgreSQL username
- %d database name
- %p process ID
- a space at the end
(to separate the prefix from the actual log message)
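If you want to try that prefix out, one way to set it (assuming you’re connected as a superuser) is with alter system, followed by a reload:
postgres=# alter system set log_line_prefix = '%t:%r:%u@%d:[%p]: ';
postgres=# select pg_reload_conf();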
Help! I Can't Connect
Now that you know where to find the logs, and you’ve got the information you need in them,
let’s move on to the “Help!” section.
What do you do if you can’t connect to Postgres?
If you call psql without any parameters, it will use default values to establish a connection.
In this case it will try to connect:
- as a user called “karen” (a Postgres user with the same name as the currently connected OS user)
- to a database with the same name
- to the cluster on the local server (via localhost)
- on port 5432 (the default Postgres port)
If you’re using the pg gem to interact with Postgres,
it’s the same as calling PG.connect()
without any parameters.
This script is pretty much the extent of my Ruby expertise!
If I try that, psql returns an error.
The output from my Ruby script contains exactly the same error.
Fortunately, most Postgres errors give fairly useful information.
This one tells us that:
database “karen” does not exist
which makes sense, because I haven’t created a database called karen.
On the other hand, I have created a database called brighton_ruby. So we can try to connect again, this time specifying the brighton_ruby database.
This time, I can connect!
The psql prompt shows which database you're connected to.
The command:
brighton_ruby=# select current_user;
will tell you which user you’re currently connected as.
Let’s try to connect to the brighton_ruby database as a user called brighton_app_user, either via:
psql brighton_ruby -U brighton_app_user
or by adding user: 'brighton_app_user'
to the PG.connect()
call.
We can’t connect!
But again, at least the error message contains useful information.
The user (or role as it says in the message - the two are synonyms in most cases) doesn’t exist.
FATAL: role "brighton_app_user" does not exist
If we connect again as the karen user, we can use the \du psql meta-command to list the roles that exist in the cluster.
There's only karen at the moment, which is a superuser,
and was created automatically when I installed Postgres on my laptop with homebrew.
Most often, you’ll see a superuser called postgres.
Let’s fix our connection issue by creating the brighton_app_user role.
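A minimal version of that command (the slide may have included more options, such as a password) is:
brighton_ruby=# create role brighton_app_user;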
And this time, when we connect with brighton_app_user...
We still get an error!
role "brighton_app_user" is not permitted to log in
Create role is one of the rare cases where role and user aren’t synonyms in PostgreSQL.
When you do a create role, it’s with nologin by default, whereas create user is with login by default.
If I do:
brighton_ruby=# alter user brighton_app_user with login;
brighton_app_user can finally connect!
Up to this point, we’ve connected to a local Postgres cluster,
but often you’ll be trying to connect to a database on a different server.
In that case, you’ll give the hostname or address of the database server with -h for psql, or by adding host to the PG.connect() call.
If you see this error:
Permission denied
Is the server running on that host
and accepting connections on that socket?
It probably means that PostgreSQL is only listening for local connections.
To fix that, you’ll need to change the listen_addresses parameter,
that specifies which IP addresses Postgres will listen on for connections.
By default, it’s 'localhost', so Postgres will only listen for local, loopback connections.
To allow remote connections, you can set it to '*'
to listen on all available IP interfaces.
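For example (note that, unlike most of the parameters in this talk, listen_addresses only takes effect after a restart, not a reload):
brighton_ruby=# alter system set listen_addresses = '*';
-- then restart Postgres, e.g. with pg_ctl restart or systemctl restart <servicename>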
Unfortunately, even after changing listen_addresses to '*'
we still can’t connect.
Now what’s going on?
This time, the error message is talking about a missing
pg_hba.conf
entry.
The pg_hba.conf file (in the data directory) controls who can and can’t connect, where from, with which authentication method etc. The default pg_hba.conf only allows local connections, but it also contains examples and explanations about how to create the entries that you need.
To fix our issue and allow brighton_app_user to connect, here’s the line that I added
to the pg_hba.conf
file:
# TYPE DATABASE USER ADDRESS METHOD
host brighton_ruby brighton_app_user 192.168.1.3/32 scram-sha-256
But how did I know which values to use?
There were clues in the error message.
The options for connection type are local or host.
I know this isn't a local connection, so I chose host.
host can be with or without ssl or gssapi encryption.
The message says "no encryption" so I didn’t specify either of those.
The database is brighton_ruby.
The user is brighton_app_user.
And I'm trying to connect from 192.168.1.3 so I specify that as the ADDRESS.
Finally, I chose scram-sha-256
as the authentication method,
which is actually now the Postgres default.
You need to reload the parameters for the pg_hba.conf changes to take effect.
You can use the pg_reload_conf()
function to do that.
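For example:
brighton_ruby=# select pg_reload_conf();
It returns t if the signal was sent to the server successfully.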
And now when we try to connect, we’re prompted for the password and it finally works.
Help! PostgreSQL Isn't Running
What if Postgres isn’t even running?
If you get this error when you try to connect:
No such file or directory
Is the server running on that host
and accepting connections on that socket?
It suggests that Postgres isn’t even running.
You’ll need to check the status of the PostgreSQL service:
- either with pg_ctl status
- or, if Postgres was started with systemd, with systemctl status. In that case you might need to find the name of your PostgreSQL service with systemctl list-units; look for a name containing "postgres".
Before you try starting it back up, it’s probably a good idea to look in the postgres logs (that you already found earlier)
to figure out why it stopped, and if you need to do something to fix it.
There’ll be a message:
FATAL: the database system is shutting down
that tells you that Postgres stopped. Look at the messages just before that to find out why.
If there’s no problem (maybe someone just accidentally stopped Postgres), you can restart it with:
- pg_ctl start
- systemctl start <servicename>
- brew services start <servicename> (in the case of my local test cluster)
Help! I've Run Out of Space
What if your database has run out of space?
If you try to insert data or create a table and you see this type of message, telling you there’s no space left on the device:
ERROR: could not extend file "base/20429/2187":
No space left on device
You get a hint telling you to check free disk space.
But which disk?
The file name is given in the message, and it’s relative to the data directory that we saw earlier
(and that you’ve set as PGDATA
so you don’t have to keep looking for it).
Once you’ve figured out which disk, you can try to make some space.
The safest thing to do to fix the immediate problem is to add some space.
If you can’t do that, or if you still need more space,
maybe the disk that contains your data directory also contains files
that don’t relate to Postgres that you can delete or move.
And if that’s still not enough, you might need to look inside the database to find things that you can delete.
Are there temporary tables you can get rid of? Do you have unused indices?
This option is a bit scary – definitely don’t delete anything unless you’re absolutely sure it’s not needed!
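For example, a query along these lines (a sketch using the pg_stat_user_indexes statistics view) lists indexes that have never been scanned since the stats were last reset, biggest first. Investigate carefully before dropping anything; some indexes are there to enforce constraints rather than to speed up queries:
select schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where idx_scan = 0
order by pg_relation_size(indexrelid) desc;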
If it’s the disk containing the write ahead log (WAL) files that’s full, that’s more serious.
Postgres will do a PANIC shutdown to protect the database.
In that case, you don’t have a choice but to immediately make some space.
In the short term, adding some space to the disk will buy you some time.
And if you’re thinking
“it’s fine, I’ll just delete some of the older WAL files, I don’t need them any more.”
Please don’t do that! Postgres needs the WAL files.
If you delete them, you risk not being able to restart your database and having to restore from a backup.
Often, the problem is that the WAL archiving has stopped working.
WAL archiving is the process of sending a copy of each WAL file to your backup repository
so they're available in case you need to do a point in time recovery.
The archive_command parameter contains the command that’s used to perform this copy.
Postgres can’t delete WAL files that haven’t yet been archived,
so if archiving is broken, the WAL files will continue to build up and take up more and more space in the WAL directory.
Again, if you look in the Postgres logs, you should be able to find what the problem is.
In the example, I deliberately used the badly formulated archive command
faulty_archive_command
to generate the error.
Alternatively, it could be that the archive directory is full
or doesn't have the correct permissions.
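As well as the logs, the pg_stat_archiver view shows whether archiving is working, and when and on which WAL file it last failed:
select archived_count, last_archived_wal,
       failed_count, last_failed_wal, last_failed_time
from pg_stat_archiver;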
Once you’ve fixed the issue, the archiving should restart,
the archived WALs will be automatically deleted from the WAL directory,
and Postgres can be restarted.
Help! My Queries are Too Slow
There can be lots of different reasons for slow queries
and usually there’s no error to let us know what’s going on.
We can’t look at all the different cases in the next few minutes,
but we can learn to identify the cause and start to figure out the solution.
There are a few general things that can help.
The log_min_duration_statement parameter can help to identify queries that need to be optimised.
It’s deactivated by default.
If you set it to 1 second, it will log every completed SQL statement that ran for more than 1 second.
Obviously if all of your queries run for more than a second, that’s not going to help you much.
Set it to whatever will just show you outliers for your application to avoid too much noise in your logs.
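For example, to log anything that runs for more than a second:
postgres=# alter system set log_min_duration_statement = '1s';
postgres=# select pg_reload_conf();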
A useful tool for checking what’s going on in your database is the pg_stat_statements extension. If it's not already enabled, you need to:
- add it to the shared_preload_libraries parameter and restart Postgres
- issue create extension pg_stat_statements in each database where you want to use it
and then you can query the pg_stat_statements view for information about long-running or frequently executed queries.
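For example, a query like this (a sketch; the column names are from PostgreSQL 13 and later, older versions use total_time and mean_time) shows the ten statements that have taken the most execution time in total:
select calls,
       round(total_exec_time) as total_ms,
       round(mean_exec_time) as mean_ms,
       left(query, 60) as query
from pg_stat_statements
order by total_exec_time desc
limit 10;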
To investigate, start by asking yourself some questions. For example:
- Are all queries slow, or is it just a particular query?
- Is it always slow, or just at certain times?
- Has it recently started to be slow, or was it always slow?
- Does it affect everyone, or just certain users, certain applications, certain locations etc.?
If all queries are impacted, you may need to look outside of Postgres to see what’s going on, potentially checking resources on the database server, looking for IO latency, network issues etc.
On the database side, look for clues in the logs; maybe there are messages there.
Check if there are too many connections, and particularly too many active connections.
Postgres works best with a smallish number of concurrent connections, preferably in the low hundreds.
You can look in pg_stat_activity
to get the number of sessions and their current state.
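For example, to get a quick count of sessions by state:
select state, count(*)
from pg_stat_activity
group by state;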
If there are too many sessions, look at:
- Reconfiguring your application to create fewer sessions.
- Making sure sessions are closed when they’re not needed.
- Putting connection pooling in place.
If it’s just one query that’s slow, you’re going to need to tune that query.
For that, you’ll need to see the execution plan for the query,
which tells you how the PostgreSQL optimiser is going to execute it.
It includes information such as:
- the way the tables will be accessed; sequential scan, index scan etc.
- The join methods that will be used if there’s more than one table involved.
- An estimate of the number of rows in each step.
- An estimate of the cost of execution.
To display the execution plan, take your slow-running query and prefix it with explain or explain analyze.
The actual query here doesn’t matter,
I’m just selecting some data from a single table and I’m grouping and sorting it.
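Something along these lines (the arrondissement column is invented for illustration, but the shape matches: one table, a group by and a sort):
explain
select arrondissement, count(*)
from drinking_fountains
group by arrondissement
order by count(*) desc;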
There are various different options that you can read about
in the PostgreSQL documentation.
explain analyze will actually execute the query, so it gives more information than just explain.
Just be aware that if your query makes changes, and you don’t want to keep those changes,
you will need to put it in a transaction so you can rollback afterwards.
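For example (again with an invented column), to see the plan for a delete without keeping the changes:
begin;
explain analyze
delete from drinking_fountains where arrondissement = '75004';
rollback;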
The explain returns something that looks like this, which can be confusing to start with.
You read an execution plan from the most indented to the least indented.
Here, we see there’s a full table scan (Seq Scan) on our drinking_fountains table.
The results of the table scan are fed into the HashAggregate, which does our group by.
And those results are fed into the Sort, which is the last step.
I know some people prefer a visual approach. I didn't have time to talk about this,
but there are various tools to help with visualising and understanding the EXPLAIN output, such as Depesz EXPLAIN and DBeaver (see the resources list at the end).
There's also a PostgreSQL wiki page with information about how to use and understand EXPLAIN.
The PostgreSQL planner/optimizer
needs up to date statistics to be able to come up with the right execution plans.
If the plan doesn’t seem right, and especially if there have recently been lots of changes to the table data,
maybe your stats need to be updated.
You can recalculate stats for a single table with the analyze <tablename> command.
If you don’t say which table (or list of tables) it will analyze all of the tables in your database.
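For example:
analyze drinking_fountains;   -- just this table
analyze;                      -- every table in the current database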
If your query is only slow at certain times, what else is going on at that time that could be impacting it?
- Backup?
- Analytics queries?
- Maintenance tasks? (index rebuilds, stats collection, vacuum…)
If performance has changed recently, what’s changed?
- Is there more data?
- More connections than before?
- A new application using lots of resources?
If it only affects certain users, certain applications or certain locations:
- Who does it affect?
- Where are they connecting from?
- Which applications?
Maybe it’s a network issue, there's an issue with a specific application server, or resources have been limited for particular users.
Help! I've Lost Some Data
What if you’ve lost an object, or some data seems to have disappeared?
Luckily, often when you think an object has disappeared, it’s actually still there, you just need to know where to look for it.
- Are you connected to the right database? (A Postgres cluster can be made up of multiple databases.)
- Are you looking in the right schema? (There might be several schemas, effectively namespaces, in a database.)
- Do you have permissions on the object? (Maybe you’re looking in the right place, but you don’t have the right permissions.)
Connected to the brighton_ruby database as karen, if I try to select from a table that I created called drinking_fountains (from an open data set that shows where you can find drinking fountains in Paris)...
...I get an error telling me that the relation (table) doesn’t exist.
But I know the table exists because it’s me that created it.
What I’d forgotten is that I’d created it in a schema called brighton_data.
If I qualify the table name with the schema name...
...it finds the table and I get the results.
The reason it didn’t find it to start with is because the search_path parameter, which tells Postgres which schemas to look in, was left at its default value: "$user", public.
This translates to a schema with the same name as the user that you’re connected with – currently karen,
and the public schema.
Now, I’m going to log in as my application user brighton_app_user.
Let’s assume that all of the application tables are in the brighton_data schema.
I can set search_path = brighton_data
so Postgres always looks there.
Now I don’t need to specify the schema name for the tables in my queries.
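Either of these works (a sketch; the first lasts only for the current session, the second persists for future sessions of that user):
set search_path = brighton_data;
alter role brighton_app_user set search_path = brighton_data;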
But now I’ve no idea what’s going on!
I know the table's there, I’ve set search path, and it’s still telling me the table doesn’t exist.
This time it’s because I don’t have permissions on the drinking_fountains table.
I need to ask brighton_app_owner nicely to grant
the right to select from the table.
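The grant looks something like this (run by brighton_app_owner; usage on the schema is needed too if it hasn’t already been granted):
grant usage on schema brighton_data to brighton_app_user;
grant select on brighton_data.drinking_fountains to brighton_app_user;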
Once that's done, fortunately, brighton_app_user can access the table and get the results.
If the table really doesn’t exist, or the data has been deleted, it’s going to be more complicated.
Ideally, you’ve already put in place backups of your database
and that you’ve practiced doing restores.
pgBackRest is a reliable, open source Postgres backup and recovery tool.
If you’re not already using it, go and install and configure it straight after this talk!
Be aware that a restore is for an entire Postgres cluster.
If you want to create an extract or dump of just certain objects from a cluster, there’s the pg_dump tool. And you can use pg_restore to import data from the extract into your database.
How can I practice?
How can you practice and make sure you reinforce the things you've learnt?
pg_kaboom is an extension that will crash your postgres cluster in "multiple varied and destructive ways".
So destructive that you have to fill in a disclaimer each time you use it!
SET pg_kaboom.disclaimer = 'I can afford to lose this data and server';
In the example in the slide, I’ve chosen the option to fill the WAL directory.
Is it safe?
Absolutely not!
Do not, under any circumstances, use this in a production cluster.
In conclusion: in case of database problems, don’t panic!
- Look in the logs (find the logs before you need them).
- Read the error messages.
- Look in the PostgreSQL documentation.
- Enable pg_stat_statements before you have issues.
- Ask a question in the PostgreSQL Slack
or on the PostgreSQL mailing lists.
I’ve grouped together some of the resources that I’ve used in the presentation,
plus others that could be useful for practicing or for troubleshooting.
- For log analysis: pgBadger
- Database GUI: DBeaver, pgAdmin
- Managed PostgreSQL service: Crunchy Bridge*
- To generate database activity: pgbench, pgSimload
- View database activity: pg_stat_statements
- Practice troubleshooting: pg_kaboom
- Understand execution plans: Depesz EXPLAIN, DBeaver, Using EXPLAIN
* Other managed PostgreSQL services are available!
The QR code on the final slide takes you to a page with links to recordings and PDF versions of slides from
talks I've given in 2024.
I've also included some contact details so you can get in touch for more database chat:
Mastodon @karenhjex@mastodon.online
X/Twitter @karenhjex
LinkedIn https://www.linkedin.com/in/karenhjex/