Design Decisions#
ATLAS-Q’s design reflects deliberate choices balancing performance, usability, extensibility, and numerical stability. This document explains the key architectural decisions, their rationale, and alternative approaches considered.
Overview#
Design Philosophy#
ATLAS-Q follows these core principles:
Performance by default: GPU acceleration and optimized operations without configuration
Fail safely: Automatic numerical stability measures prevent silent errors
Extensibility: Modular architecture allows adding new algorithms and backends
User control: Override defaults when needed for advanced use cases
Research-friendly: Comprehensive diagnostics and statistics for debugging
Key challenges addressed:
Numerical instability: SVD failures, truncation errors, ill-conditioning
Memory constraints: Large bond dimensions exceed GPU memory
Performance bottlenecks: SVD dominates runtime for χ>128
Usability: Quantum computing experts may not be system programmers
Heterogeneous hardware: CPUs, NVIDIA GPUs, future AMD/Intel GPUs
Alternative Approaches Considered#
Option 1: NumPy-based implementation
Pros: Simple, pure Python, widely understood
Cons: Poor GPU support, manual batching, slow for production
Decision: Rejected; GPU performance is critical beyond ~30 qubits
Option 2: Custom CUDA kernels
Pros: Maximum performance control
Cons: High development cost, maintenance burden, NVIDIA-only
Decision: Hybrid approach - PyTorch default, Triton for hot paths
Option 3: JAX-based
Pros: Excellent for automatic differentiation
Cons: Less mature cuBLAS/cuSOLVER integration than PyTorch
Decision: PyTorch chosen for better GPU libraries, more stable
Core Architecture#
MPS Representation#
Decision: Store MPS as list of PyTorch tensors, not unified tensor
class AdaptiveMPS:
    def __init__(self, num_qubits, bond_dim, device='cuda'):
        # One rank-3 tensor per site, shaped (chi_left, d, chi_right),
        # where d is the physical dimension and the chis are per-bond dimensions.
        self.tensors = [
            torch.zeros(chi_left, d, chi_right, dtype=torch.complex64, device=device)
            for i in range(num_qubits)
        ]
Rationale:
Heterogeneous bond dimensions: Each bond can have different χ
Memory efficiency: Only allocate needed dimensions
Parallelization: Operations on different sites can be independent
Compatibility: Standard format used by other MPS libraries
Alternative considered: Unified tensor with padding
Pros: Single memory allocation, potential BLAS optimization
Cons: Wasted memory for variable χ, complex indexing
Rejected: Memory waste significant for adaptive bond dimensions
Adaptive vs Fixed Bond Dimension#
Decision: Adaptive bond dimension as default
AdaptiveMPS (default):
mps = AdaptiveMPS(
    num_qubits=50,
    bond_dim=64,            # Initial χ
    chi_max_per_bond=256,   # Maximum allowed
    adaptive_mode=True      # Enable adaptation
)
Automatically adjusts χ based on truncation error.
MatrixProductStatePyTorch (fixed):
from atlas_q.mps_pytorch import MatrixProductStatePyTorch
mps = MatrixProductStatePyTorch(num_qubits=50, bond_dim=64)
Keeps χ=64 throughout simulation.
Rationale:
Adaptive: Better accuracy/performance tradeoff for most use cases
Fixed: Predictable memory usage, easier to reason about
Both: Different algorithms have different needs (QAOA→fixed, TDVP→adaptive)
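To make the adaptive behavior concrete, the sketch below shows how SVD-based truncation typically picks χ from a truncation-error tolerance; the function name and signature are illustrative, not the library's internal API:
import torch

def truncate_adaptive(theta, chi_max=256, trunc_tol=1e-10):
    # Illustrative sketch: keep the smallest number of singular values whose
    # discarded weight stays below trunc_tol, capped at chi_max.
    U, S, Vh = torch.linalg.svd(theta, full_matrices=False)
    weights = (S ** 2) / torch.sum(S ** 2)                              # normalized spectral weights
    tail = torch.flip(torch.cumsum(torch.flip(weights, [0]), 0), [0])   # tail[k] = sum of weights[k:]
    chi = max(1, min(int(torch.sum(tail > trunc_tol).item()), chi_max))
    error = float(torch.sum(weights[chi:]).item())                      # truncation error actually incurred
    return U[:, :chi], S[:chi], Vh[:chi, :], error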
Statistics and Diagnostics#
Decision: Track statistics by default with minimal overhead
mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')
# Apply gates
for i in range(100):
    mps.apply_cnot(i % 29, (i % 29) + 1)
# Automatic statistics
stats = mps.statistics
print(f"Truncations: {stats.num_truncations}")
print(f"Global error: {stats.total_truncation_error:.2e}")
print(f"Max bond dimension: {max(t.shape[2] for t in mps.tensors)}")  # right bond of (chi_left, d, chi_right)
Rationale:
Observability: Users should know simulation accuracy
Research: Essential for debugging and understanding algorithms
Performance: <1% overhead using lightweight counters
Optional: Can disable via track_statistics=False if needed
Alternative: No default tracking, require explicit requests
Pros: Absolute minimum overhead
Cons: Users unaware of errors, poor research experience
Rejected: Observability too important for quantum simulation
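The statistics object itself can be as simple as a set of counters updated on each truncation, which is why the overhead stays small. A minimal sketch (field names are illustrative, not the actual statistics class):
from dataclasses import dataclass

@dataclass
class MPSStatistics:
    num_truncations: int = 0
    total_truncation_error: float = 0.0
    max_bond_dim_seen: int = 0

    def record_truncation(self, error: float, chi: int) -> None:
        # A few scalar updates per truncation; negligible next to the SVD itself.
        self.num_truncations += 1
        self.total_truncation_error += error
        self.max_bond_dim_seen = max(self.max_bond_dim_seen, chi)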
API Design#
Device Agnostic Interface#
Decision: Single API works on CPU and GPU
# Same code works on both devices
mps_gpu = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')
mps_cpu = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cpu')
# Automatic fallback if CUDA unavailable
mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda:0')
# Falls back to CPU if no GPU
Rationale:
Portability: Same code runs on laptops (CPU) and servers (GPU)
Testing: CI/CD can test on CPU without GPU runners
Development: Prototyping on CPU before scaling to GPU
Hybrid: Some operations better on CPU (e.g., small χ)
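One way such a fallback can be implemented (a sketch; resolve_device is not an ATLAS-Q function):
import warnings
import torch

def resolve_device(requested: str) -> torch.device:
    # Honor the request when CUDA is available; otherwise degrade to CPU
    # with a warning instead of raising.
    if requested.startswith('cuda') and not torch.cuda.is_available():
        warnings.warn(f"Requested {requested!r} but CUDA is unavailable; using CPU")
        return torch.device('cpu')
    return torch.device(requested)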
Lazy vs Eager Imports#
Decision: Support both for compatibility
Direct imports (recommended):
from atlas_q.adaptive_mps import AdaptiveMPS
from atlas_q.vqe_qaoa import VQE
from atlas_q.tdvp import TDVP
Lazy imports (legacy):
from atlas_q import get_adaptive_mps, get_vqe
AdaptiveMPS = get_adaptive_mps()['AdaptiveMPS']
VQE = get_vqe()['VQE']
Rationale:
Direct imports fail if dependencies missing (e.g., PySCF not installed)
Lazy imports allow partial functionality (core MPS without chemistry)
Both supported for backwards compatibility with existing code
Modern approach: Optional dependencies properly declared in setup.py
pip install atlas-q # Core only
pip install atlas-q[chemistry] # With PySCF
pip install atlas-q[all] # Everything
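Declaring the extras is a one-time change in the packaging metadata; a sketch of what the setup.py declaration could look like (package lists and pins are illustrative):
from setuptools import setup, find_packages

setup(
    name="atlas-q",
    packages=find_packages(),
    install_requires=["torch", "numpy"],   # core only
    extras_require={
        "chemistry": ["pyscf"],            # pip install atlas-q[chemistry]
        "all": ["pyscf", "triton"],        # pip install atlas-q[all]
    },
)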
Automatic vs Manual Canonicalization#
Decision: Automatic canonicalization before sensitive operations
# TDVP automatically canonicalizes
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05)
tdvp.evolve_step() # Automatic canonicalization ensures stability
# Manual override if needed
mps.canonicalize(center=15) # Explicit control
Rationale:
Correctness: TDVP requires canonical form for numerical stability
User-friendly: Users should not have to track canonical form themselves
Performance: Canonicalize only when needed (not after every gate)
Control: Advanced users can manage manually
Alternative: Always manual, never automatic
Pros: Explicit control, no hidden operations
Cons: Easy to forget, leads to numerical errors
Rejected: Too error-prone for non-expert users
Backend Selection#
PyTorch as Foundation#
Decision: PyTorch for tensor operations and GPU support
Why PyTorch over alternatives:
| Feature | PyTorch | NumPy | JAX |
|---|---|---|---|
| GPU support | Excellent | Poor (requires CuPy) | Good |
| cuBLAS | Native | Via CuPy | Via XLA |
| SVD | cuSOLVER | Slow | Good |
| Ecosystem | Large | Largest | Growing |
| Maturity | Very mature | Most mature | Maturing |
| Triton | Compatible | No | Compatible |
Rationale:
cuBLAS/cuSOLVER: PyTorch has best integration
Ecosystem: Largest deep learning ecosystem, many tools
GPU memory: Excellent allocator, memory pooling
JIT: TorchScript for optimization (not yet used, future)
Community: Large user base, extensive documentation
JAX consideration:
JAX was seriously considered due to automatic differentiation for gradients. However:
PyTorch autodiff sufficient for parameter-shift rule
PyTorch’s cuSOLVER integration more mature (critical for SVD)
Could add JAX backend in future if needed (modular design allows this)
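For reference, the parameter-shift rule needs only two extra energy evaluations per parameter rather than end-to-end autodiff; a minimal sketch, where energy_fn stands in for whatever evaluates ⟨H⟩ on the MPS for a given parameter vector (not an ATLAS-Q API):
import numpy as np

def parameter_shift_gradient(energy_fn, params, shift=np.pi / 2):
    # dE/dθ_k = [E(θ_k + π/2) - E(θ_k - π/2)] / 2 for gates generated by Pauli operators.
    grad = np.zeros_like(params)
    for k in range(len(params)):
        plus, minus = params.copy(), params.copy()
        plus[k] += shift
        minus[k] -= shift
        grad[k] = 0.5 * (energy_fn(plus) - energy_fn(minus))
    return grad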
Triton for Custom Kernels#
Decision: Triton for performance-critical kernels, PyTorch fallback
# Automatically uses Triton if available and beneficial (χ>64)
from atlas_q.triton_kernels.mps_complex import fused_two_qubit_gate
# Falls back to PyTorch if:
# - Triton not installed
# - χ too small (overhead dominates)
# - Operation not implemented in Triton
Rationale:
Performance: 1.5-3× speedup for χ>64 via fusion
Maintainability: Python syntax easier than CUDA C++
Portability: Triton compiles for different GPUs
Optional: Not required, provides bonus performance
CUDA C++ consideration:
Could achieve 10-20% more performance, but:
Development time 5-10× longer
Maintenance burden much higher
NVIDIA-only, whereas Triton can potentially target AMD ROCm as well
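The dispatch between Triton and PyTorch can then be a small wrapper. The sketch below assumes a hypothetical PyTorch contraction path and a call signature for the fused kernel named above; both are illustrative:
import torch

def _two_qubit_gate_pytorch(theta, gate):
    # Hypothetical PyTorch fallback: theta is (chi_l, 2, 2, chi_r),
    # gate is (2, 2, 2, 2) acting on the two physical legs.
    return torch.einsum('abcd,lcdr->labr', gate, theta)

def apply_two_qubit_gate(theta, gate, chi):
    # Use the fused Triton kernel only when it is installed and the bond
    # dimension is large enough to amortize kernel-launch overhead.
    try:
        from atlas_q.triton_kernels.mps_complex import fused_two_qubit_gate
    except ImportError:
        fused_two_qubit_gate = None
    if fused_two_qubit_gate is not None and chi > 64:
        return fused_two_qubit_gate(theta, gate)
    return _two_qubit_gate_pytorch(theta, gate)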
cuQuantum Integration#
Decision: Optional cuQuantum backend for >2× SVD speedup
from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend
# Optional: Use cuQuantum for better performance
config = CuQuantumConfig(svd_algorithm='gesvdj')
backend = CuQuantumBackend(config, device='cuda')
mps = AdaptiveMPS(num_qubits=40, bond_dim=256, backend=backend)
Rationale:
Performance: 2-3× faster SVD for χ>128 (dominant bottleneck)
Optional: Works without cuQuantum (PyTorch fallback)
NVIDIA-specific: Leverages cutting-edge GPU tensor network research
Future-proof: Can add other backends (AMD, Intel) similarly
Alternative: Require cuQuantum
Pros: Always maximum performance
Cons: NVIDIA-only, complicates installation
Rejected: Portability and ease-of-install important
Numerical Stability#
Robust Linear Algebra#
Decision: Automatic fallbacks for ill-conditioned operations
from atlas_q.linalg_robust import robust_svd
# Automatic fallback sequence:
# 1. Try torch.linalg.svd on GPU
# 2. Add Tikhonov regularization if ill-conditioned
# 3. Fall back to CPU if GPU fails
# 4. Try different LAPACK drivers
U, S, Vh = robust_svd(tensor, threshold=1e-14)
Rationale:
Reliability: SVD failures crash simulations
Automatic: Users shouldn’t debug linear algebra
Performance: Only use fallbacks when necessary
Transparency: Log warnings when using fallbacks
Alternative: Let SVD fail, user handles
Pros: Explicit, no hidden behavior
Cons: Most users can’t debug cuSOLVER failures
Rejected: Too difficult for typical users
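A condensed sketch of that fallback chain (not the actual robust_svd implementation; the regularization and LAPACK-driver handling in the library are more involved):
import warnings
import torch

def robust_svd_sketch(A, reg=1e-12):
    try:
        return torch.linalg.svd(A, full_matrices=False)               # 1. fast GPU path
    except RuntimeError:
        pass
    try:
        eye = torch.eye(A.shape[-2], A.shape[-1], dtype=A.dtype, device=A.device)
        return torch.linalg.svd(A + reg * eye, full_matrices=False)   # 2. regularized retry
    except RuntimeError:
        pass
    warnings.warn("GPU SVD failed; falling back to CPU")
    U, S, Vh = torch.linalg.svd(A.cpu(), full_matrices=False)         # 3. CPU fallback
    return U.to(A.device), S.to(A.device), Vh.to(A.device)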
Mixed Precision Strategy#
Decision: complex64 default, complex128 for stability
# Default: complex64 (faster, sufficient for most)
mps = AdaptiveMPS(num_qubits=50, bond_dim=64, dtype=torch.complex64)
# High precision when needed
mps_precise = AdaptiveMPS(num_qubits=30, bond_dim=128, dtype=torch.complex128)
# Automatic promotion policy
from atlas_q.adaptive_mps import DTypePolicy
policy = DTypePolicy(
    default=torch.complex64,
    promote_if_cond_gt=1e6  # Auto-upgrade on ill-conditioning
)
Rationale:
Performance: complex64 is 2× faster and uses half memory
Sufficient: Most simulations don’t need complex128 precision
Safety: Auto-promotion prevents silent errors
Control: Users can force complex128 if needed
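A rough sketch of what such a promotion check can look like (DTypePolicy's actual logic may differ; the conditioning estimate here is illustrative):
import torch

def maybe_promote(tensor, cond_threshold=1e6):
    # Estimate conditioning from the singular-value spread of a matricization
    # and upgrade to complex128 when it looks ill-conditioned.
    S = torch.linalg.svdvals(tensor.reshape(tensor.shape[0], -1))
    cond = (S[0] / S[-1].clamp_min(1e-30)).item()
    if cond > cond_threshold and tensor.dtype == torch.complex64:
        return tensor.to(torch.complex128)
    return tensor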
Canonical Form Management#
Decision: Maintain canonical form lazily, enforce before TDVP
# Gates don't canonicalize (fast)
for i in range(100):
    mps.apply_cnot(i % 29, (i % 29) + 1)
# Algorithms canonicalize when needed
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05) # Canonicalizes here
# Manual if needed
mps.canonicalize(center=15)
Rationale:
Performance: Canonicalization is \(O(n \chi^3)\) - expensive
Not always needed: Gate application doesn’t require it
Automatic for algorithms: TDVP, measurements require it
User control: Can force when needed
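For intuition, bringing the sites to the left of the center into left-canonical form is a sweep of QR decompositions; a sketch for tensors of shape (chi_left, d, chi_right) (a full canonicalization also sweeps right-to-left from the other end):
import torch

def left_canonicalize_sketch(tensors, center):
    for i in range(center):
        chi_l, d, chi_r = tensors[i].shape
        Q, R = torch.linalg.qr(tensors[i].reshape(chi_l * d, chi_r))
        tensors[i] = Q.reshape(chi_l, d, Q.shape[1])
        # Absorb R into the next site so the overall state is unchanged.
        tensors[i + 1] = torch.einsum('ab,bdc->adc', R, tensors[i + 1])
    return tensors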
Memory Management#
Adaptive Memory Budgets#
Decision: Global memory budgets with automatic χ reduction
mps = AdaptiveMPS(
    num_qubits=100,
    bond_dim=64,
    budget_global_mb=10 * 1024,  # 10 GB limit
    adaptive_mode=True
)
# Automatically reduces χ if budget exceeded
# Tracks truncation error from budget constraints
Rationale:
Prevent OOM: GPU out-of-memory crashes entire process
Graceful degradation: Reduce χ rather than crash
Transparency: Track error from budget reduction
Research: Can disable for maximum accuracy
Alternative: Crash on OOM
Pros: Explicit failure
Cons: Lose entire computation
Rejected: Graceful degradation better for long-running simulations
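The budget check itself is inexpensive arithmetic over tensor shapes; a sketch assuming complex64 (8 bytes per element) and a worst-case uniform χ (function names are illustrative):
def mps_memory_mb(shapes, bytes_per_element=8):
    # Sum of per-site tensor sizes, with shapes given as (chi_left, d, chi_right).
    return sum(cl * d * cr for (cl, d, cr) in shapes) * bytes_per_element / (1024 ** 2)

def cap_chi_for_budget(num_qubits, d, chi, budget_mb):
    # Halve chi until a uniform-chi MPS would fit the budget (worst-case estimate;
    # the real budget logic tracks the actual per-bond dimensions).
    while chi > 1 and mps_memory_mb([(chi, d, chi)] * num_qubits) > budget_mb:
        chi //= 2
    return chi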
PyTorch Allocator Strategy#
Decision: Use PyTorch’s caching allocator, clear cache strategically
# PyTorch caches freed memory (fast)
# Manually clear when switching tasks
import torch
torch.cuda.empty_cache() # Return memory to CUDA
Rationale:
Performance: Allocation ~100× faster with cache
Fragmentation: Large simulations fragment memory
Control: User can clear when needed
Automatic: PyTorch manages cache size
In-Place Operations#
Decision: Use in-place operations for large tensors when safe
# In-place (good for memory)
mps.tensors[i].mul_(factor)
# Out-of-place (creates copy)
mps.tensors[i] = mps.tensors[i] * factor
Rationale:
Memory: Avoid temporary copies for large tensors
Safety: Only when semantics allow (no dependencies)
Performance: Reduces memory traffic
Modularity and Extension#
Backend Interface#
Decision: Abstract backend interface for different implementations
class TensorBackend:
    def svd(self, tensor): ...
    def qr(self, tensor): ...
    def contract(self, a, b): ...

# Implementations
class PyTorchBackend(TensorBackend): ...
class CuQuantumBackend(TensorBackend): ...
# Future: JAXBackend, NumPyBackend
Rationale:
Extensibility: Add new backends (AMD, Intel) without touching core
Testing: Mock backend for unit tests
Comparison: Benchmark different backends easily
Future-proof: Quantum hardware may have custom backends
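A PyTorch implementation of that interface is essentially a thin wrapper over torch.linalg; a sketch (method bodies are illustrative, not the shipped PyTorchBackend):
import torch

class PyTorchBackendSketch(TensorBackend):
    def svd(self, tensor):
        return torch.linalg.svd(tensor, full_matrices=False)

    def qr(self, tensor):
        return torch.linalg.qr(tensor)

    def contract(self, a, b):
        # Contract the last index of `a` with the first index of `b`.
        return torch.tensordot(a, b, dims=1)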
Algorithm Plugin Architecture#
Decision: Algorithms are separate classes, not methods on MPS
# Good: Algorithm as separate class
from atlas_q.tdvp import TDVP
tdvp = TDVP(mps=mps, hamiltonian=H, dt=0.05)
tdvp.evolve_step()
# Bad: Algorithm as MPS method
# mps.tdvp_evolve(hamiltonian=H, dt=0.05) # Not scalable
Rationale:
Separation of concerns: MPS is data structure, algorithms use it
Extensibility: Add algorithms without modifying MPS class
Testing: Test algorithms independently
State: Algorithms maintain their own state (environment tensors, etc.)
Alternative: Algorithms as MPS methods
Pros: Simpler API, fewer imports
Cons: MPS class becomes huge, hard to extend
Rejected: Poor separation of concerns
Flexible Gate Interface#
Decision: Support multiple gate specification formats
# Matrix format
gate = np.array([[1, 0], [0, -1]])
mps.apply_one_site_gate(gate, site=5)
# String format (convenience)
mps.apply_h(5) # Hadamard
mps.apply_cnot(5, 6) # CNOT
# Batch format
mps.apply_batch_single_gates('H', sites=[0, 2, 4, 6])
Rationale:
Convenience: String format for common gates
Flexibility: Matrix for custom gates
Performance: Batch for many gates
Compatibility: Different use cases prefer different formats
Error Handling#
Fail-Fast Philosophy#
Decision: Detect errors early, fail with clear messages
mps = AdaptiveMPS(num_qubits=50, bond_dim=64)
# Fail fast on invalid input
try:
    mps.apply_cnot(49, 50)  # Out of bounds
except ValueError as e:
    print(e)  # "qubit_b=50 out of range [0, 49]"
Rationale:
Debugging: Catch errors at source, not later
Clear messages: Include values and valid ranges
Type safety: Validate inputs before expensive operations
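The check behind that message can be a one-line guard that runs before any tensor work; a sketch (the helper name is hypothetical):
def _check_qubit(index, num_qubits, name="qubit"):
    if not 0 <= index < num_qubits:
        raise ValueError(f"{name}={index} out of range [0, {num_qubits - 1}]")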
Warnings for Numerical Issues#
Decision: Warn on numerical instability, don’t crash
import warnings
# Warn on high condition number
if condition_number > 1e6:
    warnings.warn(f"High condition number {condition_number:.2e}, consider complex128")
# Warn on large truncation error
if global_error > user_threshold:
    warnings.warn(f"Global error {global_error:.2e} exceeds threshold")
Rationale:
Don’t crash: Simulations can continue
Inform user: They can decide if error acceptable
Actionable: Message suggests fix (use complex128, increase χ)
Graceful Degradation#
Decision: Fall back to slower but correct operations
# Try fast path
try:
    U, S, Vh = torch.linalg.svd(tensor)
except RuntimeError:
    # Fall back to CPU
    tensor_cpu = tensor.cpu()
    U, S, Vh = torch.linalg.svd(tensor_cpu)
    U, S, Vh = U.cuda(), S.cuda(), Vh.cuda()
    warnings.warn("SVD failed on GPU, fell back to CPU")
Rationale:
Reliability: Simulation completes even if GPU SVD fails
Transparency: User informed of performance degradation
Research: Don’t lose hours of computation to rare GPU glitch
Future-Proofing#
Extensibility Points#
Designed for extension:
Backends: Add JAX, NumPy, vendor-specific backends
Algorithms: Plugin new algorithms (MERA, PEPS evolution)
Gates: Custom gates via matrix interface
Noise models: Extensible noise model framework
Hamiltonians: Custom MPO construction
Version Compatibility#
Decision: Maintain backwards compatibility for 1.x releases
# Legacy APIs kept for compatibility
from atlas_q import get_adaptive_mps # Old style
from atlas_q.adaptive_mps import AdaptiveMPS # New style
# Both work
Rationale:
User trust: Breaking changes hurt adoption
Deprecation path: Warnings for 2 releases before removal
Documentation: Clear migration guides
Semantic versioning: Follow semver strictly
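A sketch of how a legacy accessor can follow that deprecation path (whether the shipped get_adaptive_mps already warns is an implementation detail; the pattern is what matters):
import warnings

def get_adaptive_mps():
    # Keep working, but steer callers toward the direct import before removal.
    warnings.warn(
        "get_adaptive_mps() is deprecated; use 'from atlas_q.adaptive_mps import AdaptiveMPS'",
        DeprecationWarning,
        stacklevel=2,
    )
    from atlas_q.adaptive_mps import AdaptiveMPS
    return {'AdaptiveMPS': AdaptiveMPS}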
Configuration System#
Decision: Dataclasses for configuration, not dicts
from dataclasses import dataclass
@dataclass
class VQEConfig:
    max_iterations: int = 500
    optimizer: str = 'L-BFGS-B'
    tol: float = 1e-6
    bond_dim: int = 64
# Type-checked, autocomplete in IDEs
config = VQEConfig(max_iterations=1000, bond_dim=128)
Rationale:
Type safety: Catch typos at development time
Documentation: Self-documenting via type annotations
IDE support: Autocomplete, type checking
Validation: Can add validators to dataclass
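For example, a validating variant can add a __post_init__ hook (shown here on an illustrative copy of the config rather than the shipped class):
from dataclasses import dataclass

@dataclass
class VQEConfigChecked:
    max_iterations: int = 500
    optimizer: str = 'L-BFGS-B'
    tol: float = 1e-6
    bond_dim: int = 64

    def __post_init__(self):
        # Reject obviously invalid values at construction time.
        if self.max_iterations <= 0:
            raise ValueError(f"max_iterations must be positive, got {self.max_iterations}")
        if self.bond_dim < 1:
            raise ValueError(f"bond_dim must be >= 1, got {self.bond_dim}")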
Lessons Learned#
What Worked Well#
PyTorch choice: GPU support and ecosystem exceeded expectations
Adaptive bond dimension: Default for 90% of use cases
Statistics tracking: Essential for research, <1% overhead
Modular backends: Made cuQuantum integration trivial
Device-agnostic API: Smooth development→production transition
What We’d Change#
Earlier cuQuantum integration: Should have been day-one priority
Dataclasses from start: Would have caught more bugs early
More aggressive in-place: Could save 10-20% memory
Benchmark suite: Should have built comprehensive benchmarks earlier
JAX exploration: Should have piloted a JAX backend for comparison
Trade-Offs We Accept#
PyTorch dependency: Heavy, but worth it for GPU support
Statistics overhead: <1% performance for essential observability
Automatic canonicalization: Hidden operation, but prevents errors
Memory for speed: Cache allocations rather than minimize memory
Complexity for performance: Triton kernels add complexity for 2-3× speedup
Summary#
ATLAS-Q’s design prioritizes:
Core Decisions:
PyTorch foundation: Best GPU support and ecosystem
Adaptive by default: Better accuracy/performance balance
Fail safely: Automatic stability measures, graceful degradation
Modular architecture: Clean separation, extensible backends
Observable: Comprehensive statistics and diagnostics
Key Innovations:
Adaptive bond dimensions with memory budgets
Robust linear algebra with automatic fallbacks
Mixed precision with automatic promotion
Modular backend system for future extensibility
Comprehensive statistics with minimal overhead
Trade-Offs:
Heavy PyTorch dependency for GPU support
Some hidden operations (canonicalization) for stability
Memory overhead for performance (caching, statistics)
Design Principles:
Performance by default: Fast without configuration
Fail safely: Prevent silent numerical errors
User control: Override defaults when needed
Extensibility: Add new backends and algorithms easily
Research-friendly: Observe simulation internals
These decisions enable ATLAS-Q to:
Simulate 100+ qubits on single GPU
Achieve 10-100× CPU speedup
Maintain numerical stability automatically
Support research with comprehensive diagnostics
Extend to new hardware and algorithms
For implementation details of specific design decisions, see:
GPU Acceleration - GPU backend implementation
Numerical Stability - Stability mechanisms
Performance Model - Performance implications of design choices
Comparisons - How ATLAS-Q compares to alternative designs