How to optimize PostgreSQL database size
PostgreSQL is a powerful, open source relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness.
One of the PostgreSQL's most sophisticated features is so called Multi-Version Concurrency Control (MVCC), a standard technique for avoiding conflicts between reads and writes of the same object in database. MVCC guarantees that each transaction sees a consistent view of the database by reading non-current data for objects modified by concurrent transactions. Thanks to MVCC, PostgreSQL has great scalability, a robust hot backup tool and many other nice features comparable to the most advanced commercial databases.
Unfortunately, there is one downside to MVCC, the databases tend to grow over time and sometimes it can be a problem. In recent versions of PostgreSQL there is a separate server process called the autovacuum daemon (pg_autovacuum), whose purpose is to keep the database size reasonable. It does that by trying to recover reusable chunks of the database files. Still, there are many scenarios that will force the database to grow, even if the amount of the useful data in it doesn't really change. That happens typically if you have lots of UPDATE and/or DELETE statements in the applications that are using the database.
Mechanisms for recovering used space are well documented, and I decided to do a little test to see how much space can actually be recovered. Unfortunately, all of the procedures you'll find in this article demand that database is taken offline during their run. But, that's the price you'll have to pay to get your disk space back.
For this interesting test I have used a real database that was taken from a great commercial antispam/antivirus product. Original size of the database was 5876 MB. The database was running for few weeks before I decided to take its database for "academic" purposes. The tests were run on a quite old (dual PIII), but still capable machine (mostly thanks to a 6 SCSI 10krpm disk RAID6 array). The PostgreSQL database under test was a slightly older one, version 7.4.8, because that is what the product is shipping currently. The pg_autovacuum daemon was running all the time while the database was in production (keeping metadata information about lots of quarantined spam and/or virus email messages, you probably wanted to know :)).
Standard procedure for getting back the space is the VACUUM FULL command. While plain VACUUM (without FULL) simply reclaims space and makes it available for re-use, VACUUM FULL does more extensive processing, including compacting the database files to the minimum number of disk blocks. The documentation clearly says that this form is much slower and requires an exclusive lock on each table while it is being processed.
So, running it in the form of "vacuumdb -afvz" command, which covered the whole database and also generated some useful statistics, revealed that it indeed is a slow process. It took full 3 hours for the command to finish. The resulting database was occupying 4278 MB database (a 27% reduction in database size). During the run, vacuumdb was heavy on both CPU and disk.
Next recommended procedure is reindexing the whole database, because indexes can be even tougher when it comes to reusing the old blocks. I was scared that reindexing would took more time than vacuuming, but it finished in only 9 minutes. What was even more surprising is that reindexing the database (using reindexdb command) saved another 39% of the initial database size!
It looks like reindexdb is much faster and more efficient about getting the disk space back, at least on this test database. The final database size was 1990 MB, only 34% of what we started with. I hope I don't need to remind that the database still holds the same data, it's just physical disk usage that has dropped three times after slightly more than 3 hours of vacuuming and reindexing. Reindexing was mostly CPU bound, so it would be even faster on a modern machine with faster processors.
Take 2: dump and restore
Then I decided to do a completely different test. I reverted the test database to the initial state, full 5876 MB, and then proceeded to dump it, with the idea of clean restore after that, of course. That is a standard procedure when you're upgrading a PostgreSQL database from one major version to another (for example: 7.4 to 8.0). Dumping the database (with a standard pg_dump command) took only 3 and half minutes, which came as a pleasant surprise. After that I dropped the database (with dropdb) and recreated it (createdb) with the same encoding and owner and used the standard psql utility to import the data back which took another 16 minutes.
If you compare this times to the former ones, you'll see that this procedure took only around 20 minutes which is almost 10 times faster! But, best part is yet to come. The final database size was only 1685 MB which is less than what we achieved with the vacuumdb/reindexdb procedure. Once again, the useful data in the database is identical to the data we initially had.
If you have read all the details above closely, you now know that standard PostgreSQL recipe of running VACUUM FULL in regular intervals (typically overnight or on weekends) might not be the best choice. If reloading database (dump/restore procedure) does the job 10 faster and with a better final result then maybe that is what you should do, no matter how ugly it looks. But, of course, before you make final decision, you should run tests on your own database to see what's best for you.
Also it can be concluded that the PostgreSQL database could use some improvements in this specific area of disk space management. I know that disk space is quite cheap nowadays, but I'm also pretty sure that oversized database is not only taking more space, but is also somewhat slower because it has to keep track about more data on disk (think disk caches, RAM...).
In the end, it would be interesting to hear what have other people observed on various database and hardware setups out there.