Programming

Soft deletion explained in PostgreSQL - and how to implement it

Markus

Oct 8, 2024 — 7 min read

In the world of database management, much of your data is most likely never actually deleted. The concept of deletion has evolved over time.

Enter soft deletion - a well-known technique that's changed how we think about data lifecycle management. This post explores the ins and outs of soft deletion, its benefits, and how to implement it effectively.

What is Soft Deletion?

Imagine you're managing a bustling digital library. Instead of permanently removing books from your shelves (hard deletion), soft deletion is like moving them to a special reserved section. They're out of the main circulation, but still accessible if needed.

In technical terms, soft deletion involves marking records as deleted in the database, rather than permanently removing them. This is typically achieved by adding a boolean flag or better, a timestamp to indicate when a record was "deleted."

The Advantages of Going Soft (on Deletion)

Data Recovery: Accidentally deleted important information? With soft deletion, recovery is just a database query away.
Audit Trails: Maintain a complete history of data changes, crucial for compliance and forensic analysis.
Performance Optimization: Avoid the potential performance hit of cascading deletes in large, complex datasets.
Trend Analysis: Analyze historical trends including "deleted" data, providing a more complete picture of your data over time.
Enhanced User Experience: Allow users to recover their own deleted data, adding a layer of forgiveness to your application.

How Soft Deletion Works

Deletion Flag: Add a boolean column (e.g., is_deleted) or a timestamp column (e.g., deleted_at) to your tables.
Modified Queries: Instead of using DELETE statements, update the deletion flag to mark records as deleted.
Filtered Results: Adjust SELECT queries to exclude "deleted" records by default.
Recovery Mechanism: Implement a process to "undelete" by resetting the deletion flag.

When dealing with related tables, it's important to ensure that soft deletes cascade properly. There are two main approaches:

Database Triggers: Create triggers that automatically propagate the soft delete to related records.
Application Logic: Handle cascading deletes in your application code, offering more flexibility but requiring more development effort.

For some tables, cascading may not even be required - but that again, depends on your application. If you find that your data is fine without cascades, then that's fine.

Implementing Soft Deletion in PostgreSQL

Let's look at a practical example of implementing soft deletion in PostgreSQL:

Create tables with a soft delete column:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    deleted_at TIMESTAMP
);

CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    title VARCHAR(200),
    content TEXT,
    deleted_at TIMESTAMP
);

Use soft deletion in your queries:

-- Soft delete a user
UPDATE users SET deleted_at = NOW() WHERE id = 1;

-- Select only non-deleted users
SELECT * FROM users WHERE deleted_at IS NULL;

-- Restore a soft-deleted user
UPDATE users SET deleted_at = NULL WHERE id = 1;

For most applications, these changed to your tables is probably enough. You can of course make it more complex and make related data automatically cascade (soft delete) as well.

Below you can see how that can be done, using triggers and functions!

Create a function for soft deletion:

CREATE OR REPLACE FUNCTION soft_delete()
RETURNS TRIGGER AS $$
BEGIN
    NEW.deleted_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

This little piece of code is the heart of our soft deletion system. When you create it, you're essentially telling PostgreSQL, "Hey, I've got a special way of handling deletes." Instead of actually removing a row, this function simply marks it as deleted by setting its deleted_at field to the current time. It's like putting a timestamp on a piece of paper to show when it was "thrown away", but you're really just putting it in a special drawer instead of the trash can.

Set up triggers for soft deletion:

CREATE TRIGGER soft_delete_user
BEFORE UPDATE ON users
FOR EACH ROW
WHEN (NEW.deleted_at IS NOT NULL AND OLD.deleted_at IS NULL)
EXECUTE FUNCTION soft_delete();

CREATE TRIGGER soft_delete_post
BEFORE UPDATE ON posts
FOR EACH ROW
WHEN (NEW.deleted_at IS NOT NULL AND OLD.deleted_at IS NULL)
EXECUTE FUNCTION soft_delete();

Now, onto the soft delete triggers. These are like vigilant gatekeepers for your tables. You set them up on your users and posts tables, and they stand guard, watching for any attempts to update the deleted_at field. When they see an update that's trying to change deleted_at from null to a timestamp, they spring into action. They call our soft delete function, which does the actual work of setting the timestamp. It's like having a librarian who, instead of letting you remove a book from the library, simply marks it as "checked out indefinitely" whenever you try to take it.

Implement cascading soft deletes:

CREATE OR REPLACE FUNCTION cascade_soft_delete_user()
RETURNS TRIGGER AS $$
BEGIN
    UPDATE posts SET deleted_at = NEW.deleted_at
    WHERE user_id = NEW.id AND deleted_at IS NULL;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cascade_soft_delete_user
AFTER UPDATE ON users
FOR EACH ROW
WHEN (NEW.deleted_at IS NOT NULL AND OLD.deleted_at IS NULL)
EXECUTE FUNCTION cascade_soft_delete_user();

Finally, we have the cascading soft delete function and trigger. This is where things get a bit more sophisticated. Imagine you have a user who has written several posts. If you delete the user, you probably want to delete their posts too, ...right? Actually.. keep it if it's a public forum please!

That's what this function does. When a user is soft deleted, this function goes through all the posts associated with that user and marks them as deleted too. The trigger is what kicks off this process. It's like telling the librarian, "Hey, when you mark this author as inactive, also mark all their books as unavailable."

Restoring data, one query at a time

Let's say you messed up and soft deleted a whole table. Or maybe you want to restore some rows using a point in time. Fortunately, your rows have the timestamps already marked!

You can use your SQL knowledge to make sufficient selects and restore the deleted rows.

-- Assume we want to restore users deleted on July 1, 2023 between 2:00 PM and 2:05 PM

-- First, let's see which users were deleted in this time range
SELECT id, name, email, deleted_at
FROM users
WHERE deleted_at BETWEEN '2023-07-01 14:00:00' AND '2023-07-01 14:05:00';

-- Now, let's restore these users by setting their deleted_at to NULL
UPDATE users
SET deleted_at = NULL
WHERE deleted_at BETWEEN '2023-07-01 14:00:00' AND '2023-07-01 14:05:00';

-- Verify the restoration
SELECT id, name, email, deleted_at
FROM users
WHERE deleted_at IS NULL
  AND id IN (
    SELECT id
    FROM users
    WHERE deleted_at BETWEEN '2023-07-01 14:00:00' AND '2023-07-01 14:05:00'
  );

-- If you need to restore associated data (like posts), you might do:
UPDATE posts
SET deleted_at = NULL
WHERE user_id IN (
    SELECT id
    FROM users
    WHERE deleted_at BETWEEN '2023-07-01 14:00:00' AND '2023-07-01 14:05:00'
);

Best Practices for Effective Soft Deletion

Indexing: Add an index to the deleted_at column to maintain query performance.

CREATE INDEX idx_users_deleted_at ON users (deleted_at);
CREATE INDEX idx_posts_deleted_at ON posts (deleted_at);

Indexing: Add an index to the `deleted_at` column

Partial Indexes: Utilize partial indexes to exclude soft-deleted records from your indexes.

-- Partial Indexes: Exclude soft-deleted records from indexes
CREATE INDEX idx_active_users_email ON users (email) WHERE deleted_at IS NULL;
CREATE INDEX idx_active_posts_title ON posts (title) WHERE deleted_at IS NULL;

Partial Indexes: Exclude soft-deleted records from indexes

Periodic Hard Deletes: Implement a process to permanently delete old soft-deleted records to manage database size. For example, to delete rows that have been deleted years ago.

-- Periodic Hard Deletes: Create a function to permanently delete old soft-deleted records
CREATE OR REPLACE FUNCTION permanently_delete_old_records()
RETURNS void AS $$
BEGIN
    -- Delete users soft-deleted more than 1 year ago
    DELETE FROM users WHERE deleted_at < NOW() - INTERVAL '1 year';
    
    -- Delete posts soft-deleted more than 6 months ago
    DELETE FROM posts WHERE deleted_at < NOW() - INTERVAL '6 months';
END;
$$ LANGUAGE plpgsql;

Schedule this function to run periodically (this would be done outside of PostgreSQL, e.g., using cron)

Unique Constraints: Be mindful of unique constraints when using soft deletes, especially if you plan to reuse unique values.

CREATE UNIQUE INDEX idx_active_users_email_unique ON users (email) WHERE deleted_at IS NULL;

Unique Constraints: Use partial unique indexes for active records

API Design: Design your API to handle soft deletes transparently, maintaining a consistent experience for consumers of your data.

CREATE VIEW active_users AS SELECT * FROM users WHERE deleted_at IS NULL;
CREATE VIEW active_posts AS SELECT * FROM posts WHERE deleted_at IS NULL;

-- Example API query using the view
SELECT * FROM active_users WHERE id = 1;

-- Example of restoring a soft-deleted record
UPDATE users SET deleted_at = NULL WHERE id = 1;

API Design: Create views for active records

Conclusion

Soft deletion is a powerful technique that can significantly enhance your data management strategy. It provides flexibility in data recovery, improves auditing capabilities, and can boost performance in large, interconnected datasets. While it does add some complexity to your database design and queries, the benefits often outweigh the costs for many applications.

Hope you learned something new today. Thanks, and please do share! ;)