EA Repository Disaster Recovery: Backup, Restore and Business Continuity Planning for Sparx EA

An enterprise architecture repository may contain years of governance decisions, approved designs, programme traceability artefacts, and compliance evidence. Losing it is not a technology problem — it is a business problem. EA repository disaster recovery requires the same planning discipline as any critical business system: regular verified backups, tested restore procedures, defined RTO and RPO targets, and a documented business continuity plan for when components fail. This article covers backup options for each database platform, why testing the restore matters as much as taking the backup, what happens when the PCS layer fails, and how to think about tiered impact when EA GraphLink is in the stack.

Key Takeaways

EA repository backup is critical — years of governance decisions, compliance evidence, and approved designs are at risk
Backup options vary by database: SQL Server Agent, pg_dump, mysqldump — each with cloud backup extensions
Active repositories should be backed up at minimum daily; programme-critical repositories should run continuous or near-continuous backup
A backup that has not been tested for restore is not a reliable backup
PCS failure affects architect access but has a read-only fallback; EA GraphLink failure affects AI/BI access but not core EA access — these are tiered, not equivalent
Platform Support provides proactive backup monitoring, incident response, and business continuity management

Why EA Repository Backup Is Different

Most enterprise systems have mature backup disciplines. The EA repository is frequently left out.

The reasons are partly organisational: the EA repository is often viewed as “the architects’ tool” rather than a business-critical system. It is not typically included in enterprise backup policy during initial deployment because it is small (at first), not customer-facing, and not immediately obvious as a compliance or regulatory concern.

The reasons this view is wrong become clear when you consider what accumulates in a mature EA repository:

Architecture decisions. Why was application A chosen over application B? What architectural constraints govern the programme? What trade-offs were accepted in the technology strategy? These decisions live in the repository as tagged values, element notes, decision records, and relationship structures. Losing them means losing the institutional knowledge of why the current architecture exists.

Programme traceability. Architecture artefacts produced for programme gates — approved designs, compliance mappings, technology standards assessments — may be referenced in programme documentation that exists outside the repository. If the repository is lost, those references become unresolvable.

Compliance evidence. In regulated industries, the architecture repository may contain evidence that specific designs were reviewed, approved, and compliant with regulatory requirements at specific points in time. This evidence cannot be reconstructed after the fact.

Years of invested effort. A mature EA practice represents thousands of hours of modelling, governance, review, and curation. This is not recoverable without the data.

Backup Options by Database Platform

SQL Server

SQL Server Agent. SQL Server Agent provides scheduled backup job capability. A full database backup job, scheduled daily, creates a backup file that can restore the complete repository. Transaction log backups (when the database is in full recovery model) provide point-in-time recovery capability — allowing restore to any point within the transaction log retention window, not just the last full backup.
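
As a rough illustration, the statements such an Agent job might run can be generated from a small script so that each backup file carries a timestamp. This is a sketch only: the database name EA_Repository and the folder D:\Backups are hypothetical, and the helper functions are invented for this example. BACKUP DATABASE, BACKUP LOG, and the CHECKSUM and INIT options are standard T-SQL.

```python
from datetime import datetime


def full_backup_sql(database: str, backup_dir: str, now: datetime) -> str:
    """Build a T-SQL statement for a full database backup."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    path = f"{backup_dir}\\{database}_full_{stamp}.bak"
    # CHECKSUM asks SQL Server to verify page checksums while writing the backup.
    return f"BACKUP DATABASE [{database}] TO DISK = N'{path}' WITH CHECKSUM, INIT;"


def log_backup_sql(database: str, backup_dir: str, now: datetime) -> str:
    """Build a T-SQL statement for a transaction log backup.

    Log backups only work when the database uses the FULL
    (or BULK_LOGGED) recovery model.
    """
    stamp = now.strftime("%Y%m%d_%H%M%S")
    path = f"{backup_dir}\\{database}_log_{stamp}.trn"
    return f"BACKUP LOG [{database}] TO DISK = N'{path}' WITH CHECKSUM, INIT;"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(full_backup_sql("EA_Repository", "D:\\Backups", datetime.now()))
    print(log_backup_sql("EA_Repository", "D:\\Backups", datetime.now()))
```

The generated statements can be pasted into an Agent job step or executed through whichever SQL client the team already uses.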

Azure Backup. For SQL Server running on an Azure VM, Azure Backup provides automated, policy-driven backup with configurable retention. Azure SQL Database does not need a separate backup job: automated backup is a built-in service feature, with full backups weekly, differential backups every 12 or 24 hours, and transaction log backups roughly every 10 minutes.

Recommended approach. Daily full backups via SQL Server Agent (or Azure Backup automated policy), with transaction log backups every hour for active repositories. Retain daily backups for 30 days minimum; retain weekly backups for 12 weeks.
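
The retention rule above (30 days of dailies, 12 weeks of weeklies) can be sketched as a small selection function. This is an illustration under stated assumptions, not a feature of SQL Server Agent: the function name is invented, and it assumes the weekly backup is the one taken on Sunday.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily_days=30, weekly_weeks=12,
                    weekly_weekday=6):
    """Return the set of backup dates to retain.

    weekly_weekday follows date.weekday(): Monday=0 ... Sunday=6.
    Assumption: the backup taken on that weekday is the weekly backup.
    """
    keep = set()
    for backup_date in backup_dates:
        age_days = (today - backup_date).days
        if age_days < daily_days:
            # Inside the daily retention window: keep every backup.
            keep.add(backup_date)
        elif age_days < weekly_weeks * 7 and backup_date.weekday() == weekly_weekday:
            # Older than the daily window: keep only the weekly backup.
            keep.add(backup_date)
    return keep


if __name__ == "__main__":
    today = date.today()
    history = [today - timedelta(days=i) for i in range(100)]
    retained = backups_to_keep(history, today)
    print(f"{len(retained)} of {len(history)} backups retained")
```

Anything not returned by the function is a candidate for deletion. In practice the pruning step should run only after the newest backup has been confirmed as written.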

PostgreSQL

pg_dump. pg_dump creates a logical backup of the PostgreSQL database as a SQL script or archive file. It can be run as a scheduled task (cron job) and produces a portable backup that can be restored to any compatible PostgreSQL instance. pg_dump is the standard backup approach for self-hosted PostgreSQL.
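
A scheduled job of this kind might look like the following sketch. The database name, host, user, and output folder are hypothetical, and the helper functions are invented for this example. The pg_dump options used (--host, --username, --format=custom, --file) are standard. The script assumes credentials are supplied outside the command line, for example through a .pgpass file.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def pg_dump_command(dbname, host, user, out_dir, now):
    """Build the pg_dump argument list for a timestamped archive."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    out_file = Path(out_dir) / f"{dbname}_{stamp}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format=custom",  # compressed archive, restored with pg_restore
        "--file", str(out_file),
        dbname,
    ]


def run_backup(command):
    """Run the backup; raises CalledProcessError if pg_dump exits non-zero."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    cmd = pg_dump_command("ea_repo", "db.example.internal", "ea_backup",
                          "/var/backups/ea", datetime.now())
    print(" ".join(cmd))
```

Building the argument list separately from running it keeps the command easy to log and to check before it is wired into cron.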

pg_basebackup. For continuous archiving configurations, pg_basebackup takes a filesystem-level base backup of the PostgreSQL data directory. Combined with WAL (write-ahead log) archiving, this enables point-in-time recovery to any point within the WAL archive retention window.

Managed cloud PostgreSQL backup. Amazon RDS for PostgreSQL, Azure Database for PostgreSQL, and Google Cloud SQL provide automated backup management as a service feature. RDS for PostgreSQL provides automated daily backups with point-in-time recovery up to the configured backup retention period (up to 35 days on RDS).

Recommended approach. Daily pg_dump backups for self-hosted instances, with WAL archiving for point-in-time recovery. Managed cloud PostgreSQL instances should have backup retention configured to a minimum of 14 days.

MySQL

mysqldump. mysqldump creates a logical backup of the MySQL database as a SQL script. Like pg_dump, it can be scheduled and produces a portable backup file. For small to medium EA repositories on MySQL, mysqldump on a daily schedule is adequate.
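
A daily mysqldump job could be assembled along these lines. This is a sketch under stated assumptions: the database name, host, user, and output path are hypothetical, and the repository tables are assumed to use InnoDB, which is what makes --single-transaction give a consistent dump without locking tables. Credentials are assumed to come from an option file rather than the command line.

```python
import subprocess


def mysqldump_command(database, host, user, out_file):
    """Build the mysqldump argument list for a logical backup."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot, no table locks
        "--routines",            # include stored procedures and functions
        "--triggers",            # include triggers (on by default, kept explicit)
        f"--result-file={out_file}",
        database,
    ]


def run_backup(command):
    """Run the backup; raises CalledProcessError if mysqldump exits non-zero."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    print(" ".join(mysqldump_command("ea_repo", "db.example.internal",
                                     "ea_backup", "/var/backups/ea/ea_repo.sql")))
```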

MySQL binary log. Binary logging enables point-in-time recovery from mysqldump backups. The binary log records all transactions, allowing restore to any point between the base backup and the current time.

Managed cloud MySQL backup. Amazon RDS for MySQL and Azure Database for MySQL provide automated backup with configurable retention, similar to their PostgreSQL equivalents.

EAP File Backup (Small or Solo Repositories)

For repositories still using the EAP file format (the local single-file repository), backup is simple: copy the EAP file to a backup location. EAP files should be backed up to a location outside the machine they reside on — at minimum a network share, ideally cloud storage. For EAP files in active use, daily copy-to-backup is appropriate.
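
The copy-to-backup step can be sketched as a short script that timestamps the copy and confirms it matches the source. The file and folder names are hypothetical. The sketch also assumes the file is not open in Sparx EA while it is copied, since a copy taken mid-write may not be consistent.

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path, now: datetime) -> Path:
    """Copy source into backup_dir under a timestamped name and verify it."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Checksum mismatch copying {source} to {target}")
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    source = Path("model.eap")
    if source.exists():
        print(backup_file(source, Path("backups"), datetime.now()))
```

Pointing backup_dir at a network share or a synced cloud folder satisfies the off-machine requirement described above.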

Backup Frequency: What “Active” Means

Backup frequency should match the rate at which the repository changes — specifically, the acceptable data loss window if a failure requires restore from the most recent backup.

Lightly used repositories (one to two architects, occasional additions): Daily backup is adequate. RPO is up to 24 hours of data — acceptable for repositories where daily progress is modest.

Actively used repositories (five or more architects, daily modelling activity): Daily backup as a minimum, with transaction log or WAL-based point-in-time recovery to reduce RPO to hours.

Programme-critical repositories (the repository holds programme delivery artefacts during a critical phase): Continuous or near-continuous backup with point-in-time recovery. In a programme gate review period, losing even four hours of repository work may be unacceptable. Frequent transaction log backups or continuous WAL archiving are appropriate.

Recovery Testing: The Step That Is Always Skipped

A backup that has not been tested for restore is not a backup you can rely on. This is not a theoretical concern — backup files can be corrupt, incomplete, or written to storage that is inaccessible when a recovery is actually needed.

What recovery testing means: take a recent backup file, restore it to a test database instance (staging or a dedicated recovery test environment), connect Sparx EA to the restored database, and confirm that the repository content is complete, accessible, and functional.

How frequently. For active repositories, a monthly restore test is appropriate practice. For repositories with automated cloud backup, quarterly restore tests validate that the automated backup process is producing recoverable output.

What to check in a restore test: element count and package structure match the production repository, diagrams render correctly, MDG profiles are intact, users can connect via PCS, reports generate correctly. A restore test that only confirms the database came up is not thorough enough.
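
The element-count part of that checklist can be partly automated. The sketch below compares row counts between two database connections. The table names t_object, t_package, t_diagram, and t_connector are assumed here to be the core tables of the EA repository schema and should be checked against your EA version. The functions accept any Python DB-API connection, so the driver in use is an assumption left to the reader.

```python
# Assumed core tables of the EA repository schema; verify for your version.
CORE_TABLES = ["t_object", "t_package", "t_diagram", "t_connector"]


def row_counts(conn, tables=CORE_TABLES):
    """Return {table: row count} using any DB-API connection."""
    counts = {}
    cursor = conn.cursor()
    for table in tables:
        # Table names come from the fixed list above, never from user input.
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    return counts


def count_mismatches(expected, restored):
    """Return {table: (expected, restored)} for every table that differs."""
    return {
        table: (expected[table], restored.get(table))
        for table in expected
        if expected[table] != restored.get(table)
    }
```

Because production keeps changing after the backup is taken, the fairest comparison is against counts recorded at backup time rather than live production counts. Matching counts do not replace the manual checks on diagrams, MDG profiles, and reports.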

RTO and RPO Targets for the EA Repository

RTO (Recovery Time Objective) is how long the repository can be offline before the impact is unacceptable. RPO (Recovery Point Objective) is how much data loss is acceptable — measured in time since the last recoverable state.

Repository Classification	Suggested RTO	Suggested RPO
Low-use repository (few architects, non-critical period)	24–48 hours	24 hours
Active practice repository	4–8 hours	4–8 hours
Programme-critical repository (active delivery period)	2–4 hours	1–2 hours
High-stakes compliance/regulated environment	1–2 hours	< 1 hour

These targets should drive backup frequency, infrastructure configuration (cloud vs on-premise, single vs replicated), and monitoring investment.
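
One way to connect the table above to backup frequency is a simple check that the interval between recoverable points does not exceed the RPO. The sketch below encodes the upper bound of each suggested range. The classification keys and function name are invented for this example.

```python
from datetime import timedelta

# Upper bounds of the suggested ranges in the table above: (RTO, RPO).
# The regulated tier's RPO is "< 1 hour", so treat one hour as a ceiling.
TARGETS = {
    "low-use": (timedelta(hours=48), timedelta(hours=24)),
    "active": (timedelta(hours=8), timedelta(hours=8)),
    "programme-critical": (timedelta(hours=4), timedelta(hours=2)),
    "regulated": (timedelta(hours=2), timedelta(hours=1)),
}


def meets_rpo(classification: str, backup_interval: timedelta) -> bool:
    """Worst-case data loss is roughly the gap between recoverable points."""
    _, rpo = TARGETS[classification]
    return backup_interval <= rpo


if __name__ == "__main__":
    daily = timedelta(hours=24)
    for name in TARGETS:
        print(name, "daily backup OK" if meets_rpo(name, daily) else "needs more frequent backup")
```

Under these figures a daily backup alone satisfies only the low-use tier. Every other tier needs log or WAL based recovery points in between.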

Business Continuity: Tiered Impact Scenarios

Different failure modes in the Sparx EA stack produce different impact profiles. Understanding the tiered nature of these impacts is important for prioritising continuity investment.

PCS Server Failure

When the Pro Cloud Server instance becomes unavailable, Sparx EA clients that connect through it cannot reach the shared repository, and all multi-user access is offline. Architects who have recently worked on content may have locally cached versions of recent diagrams, but the shared repository is not accessible through PCS for read or write.

Read-only fallback. If the repository database is still accessible, some organisations configure a direct database read connection as an emergency read-only access path. This is not a standard supported configuration for ongoing use — it bypasses PCS security controls — but it can allow critical content retrieval during a short PCS outage.

Recovery. PCS service restart is typically fast (minutes). If the PCS server itself has failed (hardware, OS), recovery requires starting a replacement PCS server and reconnecting it to the database. RTO for PCS failure is primarily determined by how quickly a new PCS instance can be brought online — a pre-configured standby instance reduces this significantly.

Database Server Failure

If the repository database becomes unavailable, the entire EA repository is inaccessible — PCS cannot broker connections to a database that does not respond. This is the more serious failure scenario.

Recovery. Database recovery from backup is the primary path. RTO depends on backup restore time (which depends on backup size and infrastructure), and RPO depends on backup frequency. For managed cloud databases (Azure SQL, RDS), failover to a standby replica can reduce RTO significantly.

EA GraphLink Layer Failure

If EA GraphLink is unavailable, AI-assisted querying, BI connectivity, and MCP-based AI agent access to the repository are offline. This is an additive layer failure — it does not affect core Sparx EA access.

Impact. Architects can continue working in Sparx EA. AI-assisted analysis and BI dashboard refresh are unavailable. Stakeholders relying on Kernaro Assist queries or EA GraphLink-powered dashboards are affected; architects working directly in the EA client are not.

Recovery. EA GraphLink service restart typically resolves the majority of availability issues. Configuration issues (API key expiry, network path changes) require investigation and may take longer.

The key point for business continuity planning: EA GraphLink failure is not an EA failure. Communicate the distinction to stakeholders so that an AI tool outage does not trigger a repository incident response.
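
The tiering described in this section can be written down as a small decision function, which is one way to keep monitoring alerts from escalating an AI tool outage into a repository incident. This is a sketch only: the three boolean inputs are assumed to come from whatever health checks are already in place, and the tier labels are invented for this example.

```python
def classify_outage(database_up: bool, pcs_up: bool, graphlink_up: bool) -> str:
    """Map component health to the most serious applicable tier."""
    if not database_up:
        # Nothing upstream can work without the repository database.
        return "repository-incident"
    if not pcs_up:
        # Data is intact, but architects cannot connect through PCS.
        return "access-incident"
    if not graphlink_up:
        # AI and BI access is offline; core Sparx EA access is unaffected.
        return "ai-bi-degraded"
    return "healthy"


if __name__ == "__main__":
    print(classify_outage(database_up=True, pcs_up=True, graphlink_up=False))
```

Checking the database first reflects the dependency order: a database failure makes the state of the other two layers irrelevant to the response.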

FAQ

How much storage do EA repository backups require? Backup storage requirements depend on repository size and retention policy. A typical active EA repository database might range from a few hundred megabytes to several gigabytes. With daily backups retained for 30 days and weekly backups retained for 12 weeks, storage requirements are modest, likely 50–200 GB for most practices. Cloud storage costs for this volume are low (typically a few dollars per month at standard-tier list prices on AWS S3, Azure Blob, or Google Cloud Storage).
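
The arithmetic behind an estimate like this is simple enough to sketch. The function below assumes every retained backup is a full, uncompressed copy of roughly the same size, which overstates the requirement when backups are compressed. The function name is invented for this example.

```python
def backup_storage_gb(backup_size_gb: float, daily_retained: int = 30,
                      weekly_retained: int = 12) -> float:
    """Estimate storage as backup size times the number of retained copies."""
    return backup_size_gb * (daily_retained + weekly_retained)


if __name__ == "__main__":
    for size_gb in (0.5, 2, 5):
        print(f"{size_gb} GB repository -> about {backup_storage_gb(size_gb)} GB of backups")
```

With the default retention, a 2 GB repository works out to 2 × 42 = 84 GB.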

Can Sparx EA be restored to a different database server than the original? Yes. The repository database can be restored to a different server instance of the same database type (SQL Server to SQL Server, PostgreSQL to PostgreSQL). After the restore, PCS must be reconfigured to point to the new database server address. For disaster recovery scenarios where the original server is unavailable, this is the standard recovery path.

What is the impact on users if we take the repository offline for a backup? For database-level backups (SQL Server Agent, pg_dump, mysqldump), the backup can typically run without taking the repository offline. These tools back up the database while it is in use; for mysqldump on InnoDB tables, use the --single-transaction option so the dump does not lock tables. The exception is very large repositories where backup I/O impacts database performance. In these cases, scheduling backups during low-activity periods (overnight) minimises impact on architects.

Should the EA repository backup be included in our enterprise backup policy? Yes, unambiguously. The EA repository should be treated as a business-critical system in your enterprise backup policy, equivalent to other systems that hold institutional knowledge, programme records, and compliance evidence. If it is not currently in scope, work with your IT infrastructure or platform team to include it.

What happens if we lose the PCS configuration during a server failure? PCS configuration — ODBC connections, authentication settings, repository definitions, port and security settings — should be documented and backed up separately from the repository database. PCS stores its configuration in files on the PCS server. These files should be included in server backup policy. Losing PCS configuration without a backup means manually reconstructing all settings — time-consuming and error-prone. Sparx Services Platform Support documents and maintains PCS configuration as part of the service.

Does Sparx Services Platform Support include backup monitoring and incident response? Yes. Platform Support includes proactive monitoring of backup job success and failure, periodic restore testing to validate backup viability, incident response for repository access failures, and PCS configuration management. For organisations that do not have the internal capacity to maintain these practices, Platform Support provides the operational discipline that keeps a high-value repository protected.

Is Your EA Repository as Protected as the Architecture Value It Contains?

Sparx Services Platform Support covers proactive backup monitoring, restore testing, incident response, and business continuity management for Sparx EA — so the repository your practice depends on is protected with the same rigour as any other business-critical system.

Talk to Sparx Services about Platform Support →
