Design Decisions#

ATLAS-Q’s design reflects deliberate choices balancing performance, usability, extensibility, and numerical stability. This document explains the key architectural decisions, their rationale, and alternative approaches considered.

Overview#

Design Philosophy#

ATLAS-Q follows these core principles:

  1. Performance by default: GPU acceleration and optimized operations without configuration

  2. Fail safely: Automatic numerical stability measures prevent silent errors

  3. Extensibility: Modular architecture allows adding new algorithms and backends

  4. User control: Override defaults when needed for advanced use cases

  5. Research-friendly: Comprehensive diagnostics and statistics for debugging

Key challenges addressed:

  • Numerical instability: SVD failures, truncation errors, ill-conditioning

  • Memory constraints: Large bond dimensions exceed GPU memory

  • Performance bottlenecks: SVD dominates runtime for χ>128

  • Usability: Quantum computing experts are not necessarily systems programmers

  • Heterogeneous hardware: CPUs, NVIDIA GPUs, future AMD/Intel GPUs

Alternative Approaches Considered#

Option 1: NumPy-based implementation

  • Pros: Simple, pure Python, widely understood

  • Cons: Poor GPU support, manual batching, slow for production

  • Decision: Rejected; GPU performance is critical beyond ~30 qubits

Option 2: Custom CUDA kernels

  • Pros: Maximum performance control

  • Cons: High development cost, maintenance burden, NVIDIA-only

  • Decision: Hybrid approach - PyTorch default, Triton for hot paths

Option 3: JAX-based

  • Pros: Excellent for automatic differentiation

  • Cons: Less mature cuBLAS/cuSOLVER integration than PyTorch

  • Decision: PyTorch chosen for its more mature GPU libraries and greater stability

Core Architecture#

MPS Representation#

Decision: Store MPS as list of PyTorch tensors, not unified tensor

class AdaptiveMPS:
    def __init__(self, num_qubits, bond_dim, device='cuda'):
        d = 2  # physical dimension per qubit
        # Bond dimensions: chi = 1 at the boundaries, bond_dim in the bulk
        chis = [1] + [bond_dim] * (num_qubits - 1) + [1]
        self.tensors = [
            torch.zeros(chis[i], d, chis[i + 1], dtype=torch.complex64, device=device)
            for i in range(num_qubits)
        ]

Rationale:

  1. Heterogeneous bond dimensions: Each bond can have different χ

  2. Memory efficiency: Only allocate needed dimensions

  3. Parallelization: Operations on different sites can be independent

  4. Compatibility: Standard format used by other MPS libraries

Alternative considered: Unified tensor with padding

  • Pros: Single memory allocation, potential BLAS optimization

  • Cons: Wasted memory for variable χ, complex indexing

  • Rejected: Memory waste significant for adaptive bond dimensions

Adaptive vs Fixed Bond Dimension#

Decision: Adaptive bond dimension as default

AdaptiveMPS (default):

mps = AdaptiveMPS(
    num_qubits=50,
    bond_dim=64,           # Initial χ
    chi_max_per_bond=256,  # Maximum allowed
    adaptive_mode=True     # Enable adaptation
)

Automatically adjusts χ based on truncation error.

MatrixProductStatePyTorch (fixed):

from atlas_q.mps_pytorch import MatrixProductStatePyTorch
mps = MatrixProductStatePyTorch(num_qubits=50, bond_dim=64)

Keeps χ=64 throughout simulation.

Rationale:

  • Adaptive: Better accuracy/performance tradeoff for most use cases

  • Fixed: Predictable memory usage, easier to reason about

  • Both: Different algorithms have different needs (QAOA→fixed, TDVP→adaptive)
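
As a rough illustration of how the adaptive mode can pick χ, truncation-based selection keeps only the singular values needed to stay below an error tolerance, capped by the per-bond maximum. A minimal sketch (hypothetical helper, not the exact AdaptiveMPS internals):

import torch

def choose_bond_dim(S, eps=1e-10, chi_max=256):
    """Sketch: smallest chi whose relative discarded weight stays below eps."""
    w = S.double() ** 2
    discarded = 1.0 - torch.cumsum(w, dim=0) / w.sum()   # weight lost if we keep k values
    keep = int((discarded > eps).sum().item()) + 1       # smallest k meeting the tolerance
    return min(keep, chi_max, len(S))

# Example: a rapidly decaying spectrum needs only the first few values
S = torch.tensor([1.0, 0.5, 0.25, 1e-6, 1e-9])
print(choose_bond_dim(S, eps=1e-10))  # -> 3 (the tail carries < eps of the weight)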

Statistics and Diagnostics#

Decision: Track statistics by default with minimal overhead

mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')

# Apply gates
for i in range(100):
    mps.apply_cnot(i % 29, (i % 29) + 1)

# Automatic statistics
stats = mps.statistics
print(f"Truncations: {stats.num_truncations}")
print(f"Global error: {stats.total_truncation_error:.2e}")
print(f"Max bond dimension: {max(s.shape[1] for s in mps.tensors)}")

Rationale:

  1. Observability: Users should know simulation accuracy

  2. Research: Essential for debugging and understanding algorithms

  3. Performance: <1% overhead using lightweight counters

  4. Optional: Can disable via track_statistics=False if needed
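
The lightweight counters mentioned in point 3 can be little more than a dataclass updated once per truncation; a sketch (hypothetical field names, not the exact statistics object):

from dataclasses import dataclass

@dataclass
class MPSStatistics:
    """Sketch of per-simulation counters updated in O(1) per truncation."""
    num_truncations: int = 0
    total_truncation_error: float = 0.0
    max_bond_dim_seen: int = 0

    def record_truncation(self, discarded_weight: float, chi: int) -> None:
        # Constant-time bookkeeping, negligible next to the SVD itself
        self.num_truncations += 1
        self.total_truncation_error += discarded_weight
        self.max_bond_dim_seen = max(self.max_bond_dim_seen, chi)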

Alternative: No default tracking, require explicit requests

  • Pros: Absolute minimum overhead

  • Cons: Users unaware of errors, poor research experience

  • Rejected: Observability too important for quantum simulation

API Design#

Device Agnostic Interface#

Decision: Single API works on CPU and GPU

# Same code works on both devices
mps_gpu = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')
mps_cpu = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cpu')

# Automatic fallback if CUDA unavailable
mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda:0')
# Falls back to CPU if no GPU
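
The fallback itself can be as simple as checking CUDA availability before placing tensors; a minimal sketch of that logic (illustrative, not necessarily how AdaptiveMPS resolves devices internally):

import torch

def resolve_device(requested: str = 'cuda') -> torch.device:
    """Return the requested device, or CPU if CUDA is unavailable (sketch)."""
    if requested.startswith('cuda') and not torch.cuda.is_available():
        return torch.device('cpu')
    return torch.device(requested)

print(resolve_device('cuda:0'))  # cuda:0 on a GPU machine, cpu otherwise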

Rationale:

  1. Portability: Same code runs on laptops (CPU) and servers (GPU)

  2. Testing: CI/CD can test on CPU without GPU runners

  3. Development: Prototyping on CPU before scaling to GPU

  4. Hybrid: Some operations better on CPU (e.g., small χ)

Lazy vs Eager Imports#

Decision: Support both for compatibility

Direct imports (recommended):

from atlas_q.adaptive_mps import AdaptiveMPS
from atlas_q.vqe_qaoa import VQE
from atlas_q.tdvp import TDVP

Lazy imports (legacy):

from atlas_q import get_adaptive_mps, get_vqe

AdaptiveMPS = get_adaptive_mps()['AdaptiveMPS']
VQE = get_vqe()['VQE']

Rationale:

  • Direct imports fail if dependencies missing (e.g., PySCF not installed)

  • Lazy imports allow partial functionality (core MPS without chemistry)

  • Both supported for backwards compatibility with existing code
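
One common pattern behind such lazy accessors is to defer the import until first use and report missing extras clearly; a sketch (illustrative only, hypothetical function name, not the shipped implementation):

def get_vqe_module():
    """Import the chemistry-dependent pieces only when first requested (sketch)."""
    try:
        from atlas_q.vqe_qaoa import VQE
    except ImportError as exc:  # e.g. PySCF not installed
        raise ImportError(
            "VQE requires optional chemistry dependencies; "
            "install them with: pip install atlas-q[chemistry]"
        ) from exc
    return {'VQE': VQE}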

Modern approach: Optional dependencies properly declared in setup.py

pip install atlas-q              # Core only
pip install atlas-q[chemistry]   # With PySCF
pip install atlas-q[all]         # Everything

Automatic vs Manual Canonicalization#

Decision: Automatic canonicalization before sensitive operations

# TDVP automatically canonicalizes
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05)
tdvp.evolve_step()  # Automatic canonicalization ensures stability

# Manual override if needed
mps.canonicalize(center=15)  # Explicit control

Rationale:

  1. Correctness: TDVP requires canonical form for numerical stability

  2. User-friendly: Don’t require quantum algorithm knowledge

  3. Performance: Only when needed (not after every gate)

  4. Control: Advanced users can manage manually

Alternative: Always manual, never automatic

  • Pros: Explicit control, no hidden operations

  • Cons: Easy to forget, leads to numerical errors

  • Rejected: Too error-prone for non-expert users

Backend Selection#

PyTorch as Foundation#

Decision: PyTorch for tensor operations and GPU support

Why PyTorch over alternatives:

| Feature     | PyTorch     | NumPy       | JAX        |
|-------------|-------------|-------------|------------|
| GPU Support | Excellent   | Poor (CuPy) | Good       |
| cuBLAS      | Native      | Via CuPy    | Via XLA    |
| SVD         | cuSOLVER    | Slow        | Good       |
| Ecosystem   | Large       | Largest     | Growing    |
| Maturity    | Very mature | Most mature | Maturing   |
| Triton      | Compatible  | No          | Compatible |

Rationale:

  1. cuBLAS/cuSOLVER: PyTorch has best integration

  2. Ecosystem: Largest deep learning ecosystem, many tools

  3. GPU memory: Excellent allocator, memory pooling

  4. JIT: TorchScript for optimization (not yet used, future)

  5. Community: Large user base, extensive documentation

JAX consideration:

JAX was seriously considered for its automatic differentiation support for gradients. However:

  • PyTorch autodiff sufficient for parameter-shift rule

  • PyTorch’s cuSOLVER integration more mature (critical for SVD)

  • Could add JAX backend in future if needed (modular design allows this)

Triton for Custom Kernels#

Decision: Triton for performance-critical kernels, PyTorch fallback

# Automatically uses Triton if available and beneficial (χ>64)
from atlas_q.triton_kernels.mps_complex import fused_two_qubit_gate

# Falls back to PyTorch if:
# - Triton not installed
# - χ too small (overhead dominates)
# - Operation not implemented in Triton
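
A rough sketch of that dispatch (hypothetical wrapper; the fused kernel's signature is assumed here, and the real selection logic lives inside the library):

import torch

try:
    from atlas_q.triton_kernels.mps_complex import fused_two_qubit_gate
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False

def apply_two_qubit_gate(theta_a, theta_b, gate, chi_threshold=64):
    """Use the fused Triton kernel for large bonds, plain PyTorch otherwise (sketch)."""
    chi = max(theta_a.shape[0], theta_b.shape[2])  # outer bonds set the contraction cost
    if HAS_TRITON and chi > chi_threshold:
        # Fused path (argument order assumed for illustration)
        return fused_two_qubit_gate(theta_a, theta_b, gate)
    # Fallback: contract the two site tensors with the 4x4 gate via einsum
    theta = torch.einsum('aib,bjc->aijc', theta_a, theta_b)
    gate4 = gate.reshape(2, 2, 2, 2)
    return torch.einsum('klij,aijc->aklc', gate4, theta)  # SVD split back into sites omitted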

Rationale:

  1. Performance: 1.5-3× speedup for χ>64 via fusion

  2. Maintainability: Python syntax easier than CUDA C++

  3. Portability: Triton compiles for different GPUs

  4. Optional: Not required, provides bonus performance

CUDA C++ consideration:

Could achieve 10-20% more performance, but:

  • Development time 5-10× longer

  • Maintenance burden much higher

  • NVIDIA-only (Triton can potentially target AMD ROCm as well)

cuQuantum Integration#

Decision: Optional cuQuantum backend for >2× SVD speedup

from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend

# Optional: Use cuQuantum for better performance
config = CuQuantumConfig(svd_algorithm='gesvdj')
backend = CuQuantumBackend(config, device='cuda')

mps = AdaptiveMPS(num_qubits=40, bond_dim=256, backend=backend)

Rationale:

  1. Performance: 2-3× faster SVD for χ>128 (dominant bottleneck)

  2. Optional: Works without cuQuantum (PyTorch fallback)

  3. NVIDIA-specific: Leverages cutting-edge GPU tensor network research

  4. Future-proof: Can add other backends (AMD, Intel) similarly

Alternative: Require cuQuantum

  • Pros: Always maximum performance

  • Cons: NVIDIA-only, complicates installation

  • Rejected: Portability and ease-of-install important

Numerical Stability#

Robust Linear Algebra#

Decision: Automatic fallbacks for ill-conditioned operations

from atlas_q.linalg_robust import robust_svd

# Automatic fallback sequence:
# 1. Try torch.linalg.svd on GPU
# 2. Add Tikhonov regularization if ill-conditioned
# 3. Fall back to CPU if GPU fails
# 4. Try different LAPACK drivers

U, S, Vh = robust_svd(tensor, threshold=1e-14)
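
A condensed sketch of such a fallback chain (illustrative only; the shipped robust_svd also handles regularization and thresholding, which are omitted here):

import warnings
import torch

def robust_svd_sketch(tensor):
    """GPU SVD with progressively more conservative fallbacks (sketch)."""
    try:
        return torch.linalg.svd(tensor, full_matrices=False)
    except RuntimeError:
        pass
    if tensor.is_cuda:
        try:
            # Retry with the slower but more robust QR-based cuSOLVER driver
            return torch.linalg.svd(tensor, full_matrices=False, driver='gesvd')
        except RuntimeError:
            pass
    # Last resort: CPU LAPACK, then move the factors back to the original device
    warnings.warn("GPU SVD failed, falling back to CPU")
    U, S, Vh = torch.linalg.svd(tensor.cpu(), full_matrices=False)
    return U.to(tensor.device), S.to(tensor.device), Vh.to(tensor.device)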

Rationale:

  1. Reliability: SVD failures crash simulations

  2. Automatic: Users shouldn’t debug linear algebra

  3. Performance: Only use fallbacks when necessary

  4. Transparency: Log warnings when using fallbacks

Alternative: Let SVD fail, user handles

  • Pros: Explicit, no hidden behavior

  • Cons: Most users can’t debug cuSOLVER failures

  • Rejected: Too difficult for typical users

Mixed Precision Strategy#

Decision: complex64 default, complex128 for stability

# Default: complex64 (faster, sufficient for most)
mps = AdaptiveMPS(num_qubits=50, bond_dim=64, dtype=torch.complex64)

# High precision when needed
mps_precise = AdaptiveMPS(num_qubits=30, bond_dim=128, dtype=torch.complex128)

# Automatic promotion policy
from atlas_q.adaptive_mps import DTypePolicy
policy = DTypePolicy(
    default=torch.complex64,
    promote_if_cond_gt=1e6  # Auto-upgrade on ill-conditioning
)

Rationale:

  1. Performance: complex64 is roughly 2× faster and uses half the memory

  2. Sufficient: Most simulations don’t need complex128 precision

  3. Safety: Auto-promotion prevents silent errors

  4. Control: Users can force complex128 if needed
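
For a sense of scale, per-element storage doubles from complex64 (8 bytes) to complex128 (16 bytes); a quick back-of-the-envelope check:

# Bulk MPS tensor of shape (chi, 2, chi) with chi = 256
chi, d = 256, 2
elements = chi * d * chi              # 131,072 elements
print(elements * 8 / 2**20)   # complex64:  ~1.0 MiB per tensor
print(elements * 16 / 2**20)  # complex128: ~2.0 MiB per tensor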

Canonical Form Management#

Decision: Maintain canonical form lazily, enforce before TDVP

# Gates don't canonicalize (fast)
for i in range(100):
    mps.apply_cnot(i % 29, (i % 29) + 1)

# Algorithms canonicalize when needed
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05)  # Canonicalizes here

# Manual if needed
mps.canonicalize(center=15)

Rationale:

  1. Performance: Canonicalization costs \(O(n \chi^3)\), which is expensive

  2. Not always needed: Gate application doesn’t require it

  3. Automatic for algorithms: TDVP, measurements require it

  4. User control: Can force when needed
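
For reference, moving the canonical center is a sweep of QR factorizations toward the target site, which is where the \(O(n \chi^3)\) cost comes from; a left-to-right sketch (illustrative, not the library's canonicalize):

import torch

def shift_center_right(tensors, center):
    """Left-canonicalize sites 0..center-1 by sweeping QR factors rightward (sketch)."""
    for i in range(center):
        chi_l, d, chi_r = tensors[i].shape
        # QR of the (chi_l * d, chi_r) matricization: Q is a left isometry
        Q, R = torch.linalg.qr(tensors[i].reshape(chi_l * d, chi_r))
        tensors[i] = Q.reshape(chi_l, d, Q.shape[1])
        # Absorb R into the next site, moving the orthogonality center one step right
        tensors[i + 1] = torch.einsum('ab,bjc->ajc', R, tensors[i + 1])
    return tensors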

Memory Management#

Adaptive Memory Budgets#

Decision: Global memory budgets with automatic χ reduction

mps = AdaptiveMPS(
    num_qubits=100,
    bond_dim=64,
    budget_global_mb=10 * 1024,  # 10 GB limit
    adaptive_mode=True
)

# Automatically reduces χ if budget exceeded
# Tracks truncation error from budget constraints
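
The budget check itself can be a simple footprint estimate compared against the limit; a sketch of the arithmetic (hypothetical helper, not the internal budgeting code):

def mps_footprint_mb(tensors, bytes_per_element=8):
    """Estimate MPS memory in MiB (8 bytes per element for complex64)."""
    elements = sum(t.numel() for t in tensors)
    return elements * bytes_per_element / 2**20

# If the footprint exceeds budget_global_mb, the adaptive layer tightens the chi cap
# and records the extra truncation error that results from the tighter cap.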

Rationale:

  1. Prevent OOM: GPU out-of-memory crashes entire process

  2. Graceful degradation: Reduce χ rather than crash

  3. Transparency: Track error from budget reduction

  4. Research: Can disable for maximum accuracy

Alternative: Crash on OOM

  • Pros: Explicit failure

  • Cons: Lose entire computation

  • Rejected: Graceful degradation better for long-running simulations

PyTorch Allocator Strategy#

Decision: Use PyTorch’s caching allocator, clear cache strategically

# PyTorch caches freed memory (fast)
# Manually clear when switching tasks
import torch
torch.cuda.empty_cache()  # Return memory to CUDA

Rationale:

  1. Performance: Allocation ~100× faster with cache

  2. Fragmentation: Large simulations fragment memory

  3. Control: User can clear when needed

  4. Automatic: PyTorch manages cache size

In-Place Operations#

Decision: Use in-place operations for large tensors when safe

# In-place (good for memory)
mps.tensors[i].mul_(factor)

# Out-of-place (creates copy)
mps.tensors[i] = mps.tensors[i] * factor

Rationale:

  1. Memory: Avoid temporary copies for large tensors

  2. Safety: Only when semantics allow (no dependencies)

  3. Performance: Reduces memory traffic

Modularity and Extension#

Backend Interface#

Decision: Abstract backend interface for different implementations

class TensorBackend:
    def svd(self, tensor): ...
    def qr(self, tensor): ...
    def contract(self, a, b): ...

# Implementations
class PyTorchBackend(TensorBackend): ...
class CuQuantumBackend(TensorBackend): ...
# Future: JAXBackend, NumPyBackend

Rationale:

  1. Extensibility: Add new backends (AMD, Intel) without touching core

  2. Testing: Mock backend for unit tests

  3. Comparison: Benchmark different backends easily

  4. Future-proof: Quantum hardware may have custom backends
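
As a concrete example of the testing point above, a mock backend can record calls while delegating to PyTorch; a sketch (assumes the TensorBackend interface shown above):

import torch

class MockBackend(TensorBackend):
    """Records which operations were called, for use in unit tests (sketch)."""
    def __init__(self):
        self.calls = []

    def svd(self, tensor):
        self.calls.append('svd')
        return torch.linalg.svd(tensor, full_matrices=False)

    def qr(self, tensor):
        self.calls.append('qr')
        return torch.linalg.qr(tensor)

    def contract(self, a, b):
        self.calls.append('contract')
        return torch.tensordot(a, b, dims=1)  # contract last index of a with first of b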

Algorithm Plugin Architecture#

Decision: Algorithms are separate classes, not methods on MPS

# Good: Algorithm as separate class
from atlas_q.tdvp import TDVP
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05)
tdvp.evolve_step()

# Bad: Algorithm as MPS method
# mps.tdvp_evolve(hamiltonian=H, dt=0.05)  # Not scalable

Rationale:

  1. Separation of concerns: MPS is data structure, algorithms use it

  2. Extensibility: Add algorithms without modifying MPS class

  3. Testing: Test algorithms independently

  4. State: Algorithms maintain their own state (environment tensors, etc.)

Alternative: Algorithms as MPS methods

  • Pros: Simpler API, fewer imports

  • Cons: MPS class becomes huge, hard to extend

  • Rejected: Poor separation of concerns

Flexible Gate Interface#

Decision: Support multiple gate specification formats

# Matrix format (custom gates as NumPy arrays)
import numpy as np
gate = np.array([[1, 0], [0, -1]])  # Pauli-Z
mps.apply_one_site_gate(gate, site=5)

# String format (convenience)
mps.apply_h(5)     # Hadamard
mps.apply_cnot(5, 6)  # CNOT

# Batch format
mps.apply_batch_single_gates('H', sites=[0, 2, 4, 6])

Rationale:

  1. Convenience: String format for common gates

  2. Flexibility: Matrix for custom gates

  3. Performance: Batch for many gates

  4. Compatibility: Different use cases prefer different formats

Error Handling#

Fail-Fast Philosophy#

Decision: Detect errors early, fail with clear messages

mps = AdaptiveMPS(num_qubits=50, bond_dim=64)

# Fail fast on invalid input
try:
    mps.apply_cnot(49, 50)  # Out of bounds
except ValueError as e:
    print(e)  # "qubit_b=50 out of range [0, 49]"

Rationale:

  1. Debugging: Catch errors at source, not later

  2. Clear messages: Include values and valid ranges

  3. Type safety: Validate inputs before expensive operations

Warnings for Numerical Issues#

Decision: Warn on numerical instability, don’t crash

import warnings

# Warn on high condition number
if condition_number > 1e6:
    warnings.warn(f"High condition number {condition_number:.2e}, consider complex128")

# Warn on large truncation error
if global_error > user_threshold:
    warnings.warn(f"Global error {global_error:.2e} exceeds threshold")

Rationale:

  1. Don’t crash: Simulations can continue

  2. Inform user: They can decide if error acceptable

  3. Actionable: Message suggests fix (use complex128, increase χ)

Graceful Degradation#

Decision: Fall back to slower but correct operations

# Try fast path
try:
    U, S, Vh = torch.linalg.svd(tensor)
except RuntimeError:
    # Fall back to CPU
    tensor_cpu = tensor.cpu()
    U, S, Vh = torch.linalg.svd(tensor_cpu)
    U, S, Vh = U.cuda(), S.cuda(), Vh.cuda()
    warnings.warn("SVD failed on GPU, fell back to CPU")

Rationale:

  1. Reliability: Simulation completes even if GPU SVD fails

  2. Transparency: User informed of performance degradation

  3. Research: Don’t lose hours of computation to rare GPU glitch

Future-Proofing#

Extensibility Points#

Designed for extension:

  1. Backends: Add JAX, NumPy, vendor-specific backends

  2. Algorithms: Plugin new algorithms (MERA, PEPS evolution)

  3. Gates: Custom gates via matrix interface

  4. Noise models: Extensible noise model framework

  5. Hamiltonians: Custom MPO construction

Version Compatibility#

Decision: Maintain backwards compatibility for 1.x releases

# Legacy APIs kept for compatibility
from atlas_q import get_adaptive_mps  # Old style
from atlas_q.adaptive_mps import AdaptiveMPS  # New style
# Both work

Rationale:

  1. User trust: Breaking changes hurt adoption

  2. Deprecation path: Warnings for 2 releases before removal

  3. Documentation: Clear migration guides

  4. Semantic versioning: Follow semver strictly

Configuration System#

Decision: Dataclasses for configuration, not dicts

from dataclasses import dataclass

@dataclass
class VQEConfig:
    max_iterations: int = 500
    optimizer: str = 'L-BFGS-B'
    tol: float = 1e-6
    bond_dim: int = 64

# Type-checked, autocomplete in IDEs
config = VQEConfig(max_iterations=1000, bond_dim=128)

Rationale:

  1. Type safety: Catch typos at development time

  2. Documentation: Self-documenting via type annotations

  3. IDE support: Autocomplete, type checking

  4. Validation: Can add validators to dataclass
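
Point 4 is straightforward to add with __post_init__; a sketch, repeating the fields from the example above:

from dataclasses import dataclass

@dataclass
class VQEConfig:
    max_iterations: int = 500
    optimizer: str = 'L-BFGS-B'
    tol: float = 1e-6
    bond_dim: int = 64

    def __post_init__(self):
        # Validate at construction time rather than deep inside the optimizer loop
        if self.max_iterations <= 0:
            raise ValueError(f"max_iterations must be positive, got {self.max_iterations}")
        if self.bond_dim < 1:
            raise ValueError(f"bond_dim must be >= 1, got {self.bond_dim}")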

Lessons Learned#

What Worked Well#

  1. PyTorch choice: GPU support and ecosystem exceeded expectations

  2. Adaptive bond dimension: Default for 90% of use cases

  3. Statistics tracking: Essential for research, <1% overhead

  4. Modular backends: Made cuQuantum integration trivial

  5. Device-agnostic API: Smooth development→production transition

What We’d Change#

  1. Earlier cuQuantum integration: Should have been day-one priority

  2. Dataclasses from start: Would have caught more bugs early

  3. More aggressive in-place: Could save 10-20% memory

  4. Benchmark suite: Should have built comprehensive benchmarks earlier

  5. JAX exploration: Should pilot JAX backend for comparison

Trade-Offs We Accept#

  1. PyTorch dependency: Heavy, but worth it for GPU support

  2. Statistics overhead: <1% performance for essential observability

  3. Automatic canonicalization: Hidden operation, but prevents errors

  4. Memory for speed: Cache allocations rather than minimize memory

  5. Complexity for performance: Triton kernels add complexity for 2-3× speedup

Summary#

ATLAS-Q’s design prioritizes:

Core Decisions:

  1. PyTorch foundation: Best GPU support and ecosystem

  2. Adaptive by default: Better accuracy/performance balance

  3. Fail safely: Automatic stability measures, graceful degradation

  4. Modular architecture: Clean separation, extensible backends

  5. Observable: Comprehensive statistics and diagnostics

Key Innovations:

  • Adaptive bond dimensions with memory budgets

  • Robust linear algebra with automatic fallbacks

  • Mixed precision with automatic promotion

  • Modular backend system for future extensibility

  • Comprehensive statistics with minimal overhead

Trade-Offs:

  • Heavy PyTorch dependency for GPU support

  • Some hidden operations (canonicalization) for stability

  • Memory overhead for performance (caching, statistics)

Design Principles:

  1. Performance by default: Fast without configuration

  2. Fail safely: Prevent silent numerical errors

  3. User control: Override defaults when needed

  4. Extensibility: Add new backends and algorithms easily

  5. Research-friendly: Observe simulation internals

These decisions enable ATLAS-Q to:

  • Simulate 100+ qubits on single GPU

  • Achieve 10-100× CPU speedup

  • Maintain numerical stability automatically

  • Support research with comprehensive diagnostics

  • Extend to new hardware and algorithms

For implementation details of specific design decisions, see: