Performance Model#
Understanding the performance characteristics of MPS-based quantum simulation is essential for planning large-scale computations and optimizing resource usage. This document provides a comprehensive performance model for ATLAS-Q, including computational complexity, memory scaling, bottleneck analysis, and optimization strategies.
Overview#
Why Performance Modeling Matters#
MPS simulation replaces exponential statevector storage (\(O(2^n)\)) with a representation whose memory and gate costs are polynomial in the bond dimension \(\chi\). Understanding this tradeoff is critical for:
Resource planning: Estimate memory and time requirements before running
Algorithm selection: Choose between TDVP, VQE, QAOA based on scaling
Optimization: Identify bottlenecks and apply targeted improvements
Feasibility assessment: Determine if simulation is practical on available hardware
Key Performance Factors:
Number of qubits (n): Linear impact on memory and time
Bond dimension (χ): Cubic impact on time, quadratic on memory
Circuit depth: Determines total operations
Entanglement structure: Affects required χ for accuracy
Hardware: GPU vs CPU, memory bandwidth, compute capability
Performance Characteristics#
MPS advantages:
Linear scaling in n: Can simulate 100+ qubits vs 30-40 for statevector
Controllable accuracy: Trade χ for precision
Efficient for 1D systems: Area-law entanglement needs only moderate χ
GPU acceleration: 10-100× speedup over CPU
MPS limitations:
Cubic scaling in χ: Large χ (>256) becomes expensive
Poor for highly entangled states: χ grows exponentially with entanglement entropy
2D systems challenging: mapping a 2D lattice onto a 1D MPS requires χ exponential in the lattice width (PEPS are better suited)
SVD bottleneck: Dominates runtime for χ > 128
Computational Complexity#
Single-Qubit Gates#
Applying a single-qubit gate U to site i updates only that site tensor: \(\tilde{A}^{[i]}_{\alpha \sigma' \beta} = \sum_{\sigma} U_{\sigma' \sigma}\, A^{[i]}_{\alpha \sigma \beta}\), where \(A^{[i]}\) has shape \((\chi_L, d, \chi_R)\).
Complexity: \(O(\chi_L \chi_R d^2) = O(\chi^2 d^2)\)
For qubits (\(d=2\)):
Time: \(O(\chi^2)\)
Memory: \(O(\chi^2)\) for storing the result
Example:
# Single-qubit gate on 50-qubit MPS with χ=64
# Operations: 64² × 2² = 16,384 complex multiplications
# GPU time: ~0.01 ms
mps.apply_h(25) # Hadamard on qubit 25
Bottleneck: Memory bandwidth (loading/storing tensors dominates)
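For reference, the update is a single small tensor contraction. A minimal sketch in plain PyTorch on standalone tensors (illustrative only, not ATLAS-Q's internal implementation):

import torch

chi_l, d, chi_r = 64, 2, 64
A = torch.randn(chi_l, d, chi_r, dtype=torch.complex64)                # site tensor A[i]
H = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64) / 2**0.5   # Hadamard gate

# Contract the gate into the physical index: chi_l * chi_r * d^2 multiply-adds
A_new = torch.einsum('ab,lbr->lar', H, A)
print(A_new.shape)  # torch.Size([64, 2, 64])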
Two-Qubit Gates#
Applying a two-qubit gate to sites i and i+1 requires:
Contract: Merge \(A^{[i]}\) and \(A^{[i+1]}\) → \(\Theta\) of shape \((\chi_L, d, d, \chi_R)\)
Complexity: \(O(\chi^3 d^2)\)
Apply gate: \(\Theta' = G \cdot \Theta\)
Complexity: \(O(\chi^2 d^4)\)
SVD: Decompose \(\Theta'\) (\(\chi d \times \chi d\) matrix)
Complexity: \(O((\chi d)^3) = O(\chi^3 d^3)\)
Truncate: Keep largest \(\chi\) singular values
Complexity: \(O(\chi^2)\)
Total: \(O(\chi^3 d^3)\) dominated by SVD
For qubits:
Time: \(O(\chi^3 \times 8) = O(8\chi^3)\)
Memory: \(O(\chi^2 d^2) = O(4\chi^2)\) for \(\Theta\)
Example:
# Two-qubit gate with χ=64
# SVD operations: (64×2)³ = 2,097,152
# GPU time (A100): ~0.15 ms
mps.apply_cnot(10, 11)
Bottleneck: SVD computation (60-80% of time)
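The four steps can be sketched in plain PyTorch on standalone tensors (shapes and costs only; this is not ATLAS-Q's internal code, and the gate here is an identity placeholder):

import torch

chi, d = 64, 2
A1 = torch.randn(chi, d, chi, dtype=torch.complex64)
A2 = torch.randn(chi, d, chi, dtype=torch.complex64)
G = torch.eye(d * d, dtype=torch.complex64).reshape(d, d, d, d)   # placeholder two-qubit gate

# 1. Contract the two site tensors: Theta has shape (chi, d, d, chi)
theta = torch.einsum('lar,rbk->labk', A1, A2)
# 2. Apply the gate to the two physical indices
theta = torch.einsum('abcd,lcdk->labk', G, theta)
# 3. SVD of the (chi*d) x (d*chi) matrix, the O((chi*d)^3) bottleneck
U, S, Vh = torch.linalg.svd(theta.reshape(chi * d, d * chi), full_matrices=False)
# 4. Truncate back to bond dimension chi
U, S, Vh = U[:, :chi], S[:chi], Vh[:chi, :]
A1_new = U.reshape(chi, d, chi)
A2_new = (torch.diag(S.to(torch.complex64)) @ Vh).reshape(chi, d, chi)
print(A1_new.shape, A2_new.shape)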
Expectation Value Computation#
Computing \(\langle \psi | \hat{O} | \psi \rangle\) for an operator \(\hat{O}\) represented as an MPO:
Contract MPS, MPO, and conjugate MPS in a single site-by-site sweep: \(O(n \chi^3 \chi_{MPO} d^2)\)
For local operator (single-site measurement):
Complexity: \(O(\chi^2 d^2)\)
For Hamiltonian (nearest-neighbor MPO with \(\chi_{MPO} \approx 4-8\)):
Complexity: \(O(n \chi^3 d^2)\)
Example:
# Energy of 30-qubit Heisenberg chain, χ=64
# Operations: 30 × 64³ × 4 = 31,457,280
# GPU time: ~2 ms
from atlas_q.mpo_ops import build_heisenberg_hamiltonian
H_mpo = build_heisenberg_hamiltonian(30, J=[1.0, 1.0, 1.0], h=0.5)
energy = mps.expectation_value(H_mpo)
Memory Scaling#
MPS Storage#
An n-qubit MPS with bond dimension χ and local dimension d stores n site tensors of shape \((\chi, d, \chi)\), i.e. \(n \chi^2 d\) complex entries.
For uniform χ and complex128 (16 bytes per complex number): memory \(\approx 16\, n \chi^2 d\) bytes.
For qubits (d=2) and complex64 (8 bytes): memory \(\approx 16\, n \chi^2\) bytes.
Examples:
| Qubits n | Bond χ | Memory (GB) | Comparison |
|---|---|---|---|
| 50 | 32 | 0.08 | Tiny |
| 50 | 64 | 0.31 | Small |
| 100 | 64 | 0.62 | Moderate |
| 100 | 128 | 2.46 | Significant |
| 200 | 64 | 1.23 | Large |
| 200 | 128 | 4.92 | Very large |
Statevector comparison: 50-qubit statevector = \(2^{50} \times 16\) bytes = 18 PB (impossible)
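The exponential wall is easy to check directly:

# Statevector memory: 2^n complex128 amplitudes (16 bytes each)
for n in (30, 40, 50):
    print(f"{n} qubits: {2**n * 16 / 1e9:,.0f} GB")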
Working Memory#
Operations require temporary storage:
Single-qubit gate:
- Input tensor: \(\chi^2 d \times 16\) bytes
- Output tensor: \(\chi^2 d \times 16\) bytes
- Total: \(2 \chi^2 d \times 16\) bytes
Two-qubit gate:
- Merged tensor \(\Theta\): \(\chi^2 d^2 \times 16\) bytes
- SVD workspace: \(3(\chi d)^2 \times 16\) bytes (U, S, Vh)
- Total: \(\approx 4 \chi^2 d^2 \times 16\) bytes
For χ=256, d=2:
Two-qubit gate working memory: \(4 \times 256^2 \times 4 \times 16\) bytes ≈ 17 MB
GPU memory budget (16 GB A100):
MPS storage (200 qubits, χ=128): ~5 GB
Working memory (operations): ~1 GB
PyTorch allocator overhead: ~2 GB
Available: ~8 GB for multiple MPS / batching
MPO Storage#
A Matrix Product Operator for a nearest-neighbor Hamiltonian stores \(n \chi_{MPO}^2 d^2\) complex entries, i.e. \(16\, n \chi_{MPO}^2 d^2\) bytes in complex128.
For Heisenberg model (\(\chi_{MPO} = 5\)):
Memory: \(n \times 25 \times 4 \times 16 = 1600n\) bytes ≈ 1.6 KB per qubit
MPO memory is usually negligible compared to MPS.
Time Complexity#
TDVP Time Evolution#
1-Site TDVP:
Per time step, sweep through n sites:
Site evolution: \(O(\chi^3 d^2)\) for effective Hamiltonian application
Bond evolution: \(O(\chi^3)\) for backward sweep
Total per site: \(O(\chi^3 d^2)\)
Per time step: \(O(n \chi^3 d^2)\)
For Hamiltonian with \(\chi_{MPO} = k\):
Refined complexity: \(O(n \chi^3 k d^2)\)
2-Site TDVP:
Includes SVD at each bond:
Per time step: \(O(n \chi^3 d^3)\)
Extra factor of d from two-site evolution and SVD of \((\chi d) \times (\chi d)\) matrix.
Comparison:
For 50 qubits, χ=64, 100 time steps:
1-site: \(100 \times 50 \times 64^3 \times 4 \approx 5.2 \times 10^9\) operations
2-site: \(100 \times 50 \times 64^3 \times 8 \approx 10.5 \times 10^9\) operations (2× slower)
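A small helper (illustrative, not an ATLAS-Q API) that reproduces these counts from the scaling formulas above:

def tdvp_ops(n_steps, n_qubits, chi, d=2, two_site=False):
    """Rough operation count per the scaling above (prefactors omitted)."""
    per_site = chi**3 * (d**3 if two_site else d**2)
    return n_steps * n_qubits * per_site

print(f"{tdvp_ops(100, 50, 64):.2e}")                 # ~5.2e9 (1-site)
print(f"{tdvp_ops(100, 50, 64, two_site=True):.2e}")  # ~1.0e10 (2-site)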
VQE Optimization#
Energy evaluation:
Circuit application + Hamiltonian expectation: \(O(T \chi^3 d^3 + n \chi^3 d^2)\), where T = circuit depth (number of gates).
Gradient computation (parameter-shift rule):
For p parameters, the gradient needs 2p energy evaluations: \(O(2p\, (T \chi^3 d^3 + n \chi^3 d^2))\)
Full VQE:
For N iterations: \(O(N p\, (T \chi^3 d^3 + n \chi^3 d^2))\)
Example:
50 qubits, χ=64, depth T=50, p=100 parameters, N=200 iterations:
On GPU: ~100-500 seconds (depends on hardware, batch size)
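The same style of estimate for a full VQE run, modelling one energy evaluation as T two-qubit gates plus one MPO expectation value. The helper and its prefactors are illustrative assumptions, not measured constants:

def vqe_ops(n_iter, n_params, depth, n_qubits, chi, d=2):
    """Rough operation count: parameter-shift needs 2 energy evaluations per parameter."""
    energy_eval = depth * chi**3 * d**3 + n_qubits * chi**3 * d**2
    return n_iter * 2 * n_params * energy_eval

# 50 qubits, chi=64, depth T=50, p=100 parameters, N=200 iterations
print(f"{vqe_ops(200, 100, 50, 50, 64):.2e}")  # ~6e12 raw operations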
QAOA Performance#
QAOA with p layers evaluating cost function:
Per evaluation:
Apply 2p operators (problem + mixer): \(O(2p \cdot n \chi^3 d^2)\)
Hamiltonian expectation: \(O(n \chi^3 d^2)\)
Total per evaluation: \(O((2p+1) n \chi^3 d^2)\)
For optimization with N evaluations:
Total: \(O(N p n \chi^3 d^2)\)
QAOA characteristics:
Usually p=1-5 (shallow)
χ typically small (16-32) due to low entanglement
Fast convergence (N=50-200)
Example:
20 qubits, p=3, χ=32, N=100 iterations:
GPU time: ~5-10 seconds
Scaling Laws#
MPS vs Statevector#
Memory: an MPS stores roughly \(n \chi^2\) parameters versus \(2^n\) statevector amplitudes.
Compression: \(\frac{2^n}{n \chi^2} = \frac{2^{50}}{50 \times 64^2} \approx 5.5 \times 10^{9}\times\)
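This ratio can be verified in one line:

# Statevector amplitudes vs. MPS parameters for n=50, chi=64
n, chi = 50, 64
print(f"{2**n / (n * chi**2):.2e}")  # ≈ 5.50e+09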
Time (single two-qubit gate):
| Method | Time | 50 Qubits |
|---|---|---|
| Statevector | \(O(2^n)\) | Impossible |
| MPS | \(O(\chi^3)\) | ~0.15 ms (GPU) |
Entanglement Scaling#
Required bond dimension depends on entanglement:
Area Law (1D nearest-neighbor):
\(\chi \sim e^{S}\), where S is the entanglement entropy across the cut (see the sketch after the examples below).
For area law: \(S = O(1)\) → \(\chi = O(1)\) (constant)
Volume Law (all-to-all coupling, random circuits):
Requires \(\chi = O(2^{n/2})\) (exponential) → MPS inefficient
Practical examples:
Heisenberg chain: χ=16-32 sufficient for high accuracy
Quantum Fourier Transform: χ ~ O(√n) due to logarithmic entanglement
Random circuit (depth D): \(\chi \sim 2^{\min(D,\ n/2)}\)
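A quick way to turn an entanglement estimate into a bond-dimension budget, using the \(\chi \sim e^{S}\) relation above. Treat this as an order-of-magnitude guide only; the true requirement depends on the full entanglement spectrum and the target truncation error:

import math

def chi_estimate(entropy_nats):
    """Minimum bond dimension implied by entanglement entropy S (in nats)."""
    return math.ceil(math.exp(entropy_nats))

print(chi_estimate(1.0))                  # area-law-like: small, constant chi (3)
print(chi_estimate(0.5 * math.log(64)))   # S ~ (1/2) ln n for n=64: chi ~ sqrt(n) = 8
print(chi_estimate(10 * math.log(2)))     # volume-law-like, S = 10 bits: chi = 1024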
Asymptotic Behavior#
Best case (product state):
χ=1 sufficient
All operations \(O(n)\) (linear)
Typical case (1D area law):
χ=32-128 sufficient
Operations \(O(n \chi^3)\): linear in n for fixed χ
Worst case (volume law):
χ ~ \(2^{n/2}\) required
Operations \(O(n \times 2^{3n/2})\) (exponential)
MPS provides no advantage
Performance Predictions#
Gate Throughput#
GPU (A100), complex64:
| Operation | Bond χ | Time (ms) | Throughput |
|---|---|---|---|
| Single-qubit gate | 32 | 0.005 | 200K ops/sec |
| Single-qubit gate | 64 | 0.008 | 125K ops/sec |
| Single-qubit gate | 128 | 0.020 | 50K ops/sec |
| Two-qubit gate | 32 | 0.05 | 20K ops/sec |
| Two-qubit gate | 64 | 0.13 | 7.7K ops/sec |
| Two-qubit gate | 128 | 0.50 | 2K ops/sec |
| Two-qubit gate | 256 | 2.0 | 500 ops/sec |
Observations:
Single-qubit gates ~10-20× faster than two-qubit
Time scales as \(\chi^3\) for two-qubit gates (SVD dominated)
cuQuantum provides 2-3× additional speedup for χ>128
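These figures are hardware-dependent, so it is worth reproducing them locally. A minimal timing harness using the same AdaptiveMPS calls shown elsewhere on this page, assuming gates execute eagerly on the GPU:

import time
import torch
from atlas_q.adaptive_mps import AdaptiveMPS

mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')

# Warm up, then time 100 two-qubit gates
for i in range(10):
    mps.apply_cnot(i % 29, (i % 29) + 1)
torch.cuda.synchronize()

start = time.perf_counter()
for i in range(100):
    mps.apply_cnot(i % 29, (i % 29) + 1)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / 100:.3f} ms per two-qubit gate")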
Algorithm Runtime Estimates#
VQE (hardware-efficient ansatz):
Setup: 10 qubits, 3 layers (30 parameters), χ=32
Circuit depth: ~30 gates
Energy evaluation: ~5 ms
Gradient (parameter-shift): 30 params × 2 evaluations = 60 × 5 ms = 300 ms
Optimization: 100 iterations × 300 ms = 30 seconds
TDVP (Heisenberg chain):
Setup: 50 qubits, χ=64, dt=0.05
Time step (1-site): 50 sites × 0.5 ms = 25 ms
Evolution: 1000 steps × 25 ms = 25 seconds
QAOA (MaxCut):
Setup: 20 qubits, p=3, χ=32
Cost evaluation: ~10 ms
Optimization: 150 iterations × 10 ms = 1.5 seconds
Memory Requirements#
Practical limits (16 GB GPU):
| Qubits | Bond χ | MPS Memory | Max Qubits |
|---|---|---|---|
| Any | 32 | 0.08 GB / 50q | ~1000 |
| Any | 64 | 0.31 GB / 50q | ~400 |
| Any | 128 | 1.23 GB / 50q | ~100 |
| Any | 256 | 4.92 GB / 50q | ~25 |
Actual limits lower due to:
Working memory for operations (~20% overhead)
PyTorch allocator fragmentation (~30% overhead)
Multiple MPS instances or batching
Rule of thumb: Allocate 2× MPS storage for total memory budget.
Bottleneck Analysis#
Identifying Bottlenecks#
Use profiling to identify where time is spent:
import torch
from atlas_q.adaptive_mps import AdaptiveMPS

mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')

# Profile gate applications
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    for i in range(100):
        mps.apply_cnot(i % 29, (i % 29) + 1)

# Print breakdown
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
Common bottlenecks:
SVD (40-70% of time for χ>64)
Memory bandwidth (20-30% for large tensors)
Kernel launch overhead (5-10% for many small operations)
Truncation logic (5-10% adaptive bond dimension selection)
SVD Bottleneck#
Problem: SVD dominates for χ>128
Solutions:
Use cuQuantum: gesvdj algorithm 2-3× faster
from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend

config = CuQuantumConfig(
    use_cutensornet=True,
    svd_algorithm='gesvdj'  # Jacobi SVD
)
backend = CuQuantumBackend(config, device='cuda')
mps = AdaptiveMPS(
    num_qubits=40,
    bond_dim=256,
    backend=backend,
    device='cuda'
)
Reduce truncation frequency: Compress every 10 gates instead of every gate
for i in range(100):
    for j in range(10):
        mps.apply_cnot(j % 29, j % 29 + 1, truncate=False)
    mps.compress()  # Single truncation after 10 gates
Use complex64: 2× faster SVD than complex128
Memory Bandwidth Bottleneck#
Problem: Large tensors bottlenecked by memory transfer
Solutions:
Fused kernels: Use Triton kernels for χ>64
Increase arithmetic intensity: More compute per memory access
Optimize memory layout: Ensure coalesced access patterns
# Triton kernels automatically used for χ>64
from atlas_q.triton_kernels.mps_complex import use_triton_kernels
use_triton_kernels(True) # Enable fused operations
Launch Overhead Bottleneck#
Problem: Many small operations have high launch overhead
Solutions:
Batch operations: Group multiple gates
# Bad: Launch 100 separate kernels
for i in range(100):
    mps.apply_h(i % 50)

# Good: Batch into single call
sites = list(range(50)) * 2
mps.apply_batch_single_gates('H', sites)
Use CUDA streams: Overlap compute and transfer
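A sketch of the stream pattern in plain PyTorch (illustrative; whether and how ATLAS-Q overlaps transfers internally depends on the backend):

import torch

copy_stream = torch.cuda.Stream()
host_batch = torch.randn(1024, 1024, pin_memory=True)   # pinned memory enables async copy
device_buf = torch.empty(1024, 1024, device='cuda')

# Stage the next batch on a side stream while the default stream computes
with torch.cuda.stream(copy_stream):
    device_buf.copy_(host_batch, non_blocking=True)

# ... launch compute on the default stream here ...

torch.cuda.current_stream().wait_stream(copy_stream)   # sync before using device_buf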
Optimization Strategies#
Systematic Optimization Process#
Profile: Identify bottlenecks
Optimize the hot path: Focus on the operations consuming ~80% of the time
Measure: Verify improvement
Iterate: Repeat until target met
Algorithm-Level Optimizations#
Choose right algorithm:
Small χ needed: Use 1-site TDVP or QAOA
High accuracy needed: Use 2-site TDVP or increase χ
Clifford subcircuit: Use stabilizer backend (20× faster)
Reduce operations:
Circuit optimization: Merge gates, remove redundancy
Adaptive parameters: Vary dt or optimizer tolerance
Early stopping: Terminate when converged
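A sketch of the early-stopping pattern as a generic wrapper around a user-supplied energy(params) callable and optimizer step; the tolerance and patience values are illustrative:

def minimize_with_early_stop(energy, params, step, tol=1e-6, patience=10, max_iter=500):
    """Stop once the energy improvement stays below tol for `patience` iterations."""
    best, stale = float('inf'), 0
    for _ in range(max_iter):
        params = step(params)          # one optimizer update (user-supplied)
        e = energy(params)
        if best - e > tol:
            best, stale = e, 0
        else:
            stale += 1
        if stale >= patience:
            break
    return best, params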
Implementation-Level Optimizations#
GPU acceleration:
import torch
torch.backends.cuda.matmul.allow_tf32 = True # Enable tensor cores
mps = AdaptiveMPS(
num_qubits=50,
bond_dim=128,
dtype=torch.complex64, # Use single precision
device='cuda'
)
cuQuantum integration:
from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend
config = CuQuantumConfig(
use_cutensornet=True,
svd_algorithm='gesvdj',
enable_async=True
)
backend = CuQuantumBackend(config, device='cuda')
Multi-GPU:
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig
config = DistributedConfig(
mode='bond_parallel',
world_size=4
)
mps = DistributedMPS(num_qubits=80, bond_dim=512, config=config)
Case Studies#
Case Study 1: Deep VQE Circuit#
Problem: VQE with 20 qubits, 10 layers, 200 parameters taking 10 minutes per iteration
Analysis:
Circuit depth: ~200 gates
Time per iteration: 600 seconds
Most time in energy evaluation (500s) and gradient (100s)
Optimization:
Reduced χ from 128 to 64 (entanglement analysis showed χ=64 sufficient)
Used complex64 instead of complex128
Enabled cuQuantum
Result:
Time per iteration: 60 seconds (10× improvement)
Accuracy maintained (energy error < 1e-6)
Case Study 2: Long-Time TDVP#
Problem: TDVP evolution of 100-qubit chain for 10,000 time steps running out of memory
Analysis:
χ=256 → 50 qubits max on 16 GB GPU
Need to reduce memory footprint
Optimization:
Analyzed entanglement: χ=128 sufficient for middle section, χ=64 for edges
Implemented adaptive per-bond χ
Used memory budgets
Result:
Fit 100 qubits with adaptive χ (64-128)
Simulation completed in 2 hours
Accuracy within tolerance (global error < 1e-5)
Case Study 3: QAOA for Large Graph#
Problem: QAOA on 100-node graph converging slowly (1000+ iterations)
Analysis:
Standard QAOA using random initialization
Many local minima causing slow convergence
Optimization:
Used warm-start from p=1 solution to initialize p=3
Linear ramp initialization for parameters
Switched to COBYLA optimizer (gradient-free, better for local minima)
Result:
Converged in 150 iterations (7× fewer)
Reached 0.85 approximation ratio
Total time: 30 seconds vs 5 minutes
Best Practices#
Performance Checklist#
Before running large simulation:
Estimate requirements (see the sketch after this checklist):
- Memory: \(16 n \chi^2\) bytes
- Time: \(O(T \chi^3)\) for T operations
- Check against available resources
Profile a small version:
- Run on 10-20 qubits first
- Measure actual throughput
- Extrapolate to the full problem
Optimize configuration:
- Start with χ=32, increase if needed
- Use complex64 unless high precision is required
- Enable GPU and cuQuantum
Monitor during the run:
- Track memory usage
- Check truncation errors
- Verify convergence
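A minimal pre-run estimator following item 1 of the checklist (MPS storage only; take the per-gate time from the throughput tables above, and add working memory and allocator overhead per the Memory Scaling section). The helper is illustrative, not an ATLAS-Q API:

def estimate_requirements(n_qubits, chi, n_two_qubit_gates, seconds_per_gate):
    """Checklist item 1: memory ~ 16 * n * chi^2 bytes; time ~ gates x per-gate cost."""
    memory_gb = 16 * n_qubits * chi**2 / 1e9
    runtime_s = n_two_qubit_gates * seconds_per_gate
    return memory_gb, runtime_s

# 100 qubits, chi=128, 50,000 two-qubit gates at ~0.5 ms each (A100 table above)
mem, sec = estimate_requirements(100, 128, 50_000, seconds_per_gate=0.5e-3)
print(f"~{mem:.2f} GB MPS storage, ~{sec:.0f} s of two-qubit gate time")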
Avoiding Common Pitfalls#
Don’t:
Assume χ=256 always needed (often χ=64 sufficient)
Use complex128 by default (2× slower than complex64)
Apply gates one-by-one (batch when possible)
Ignore memory budget warnings (will OOM later)
Do:
Analyze entanglement to determine required χ
Profile before optimizing
Use adaptive truncation thresholds
Check accuracy periodically
Summary#
ATLAS-Q’s performance model provides predictable scaling:
Computational Complexity:
Single-qubit gates: \(O(\chi^2)\)
Two-qubit gates: \(O(\chi^3)\) (SVD dominated)
TDVP time step: \(O(n \chi^3 d^2)\)
VQE iteration: \(O(p T \chi^3 d^2)\)
Memory Scaling:
MPS storage: \(O(n \chi^2 d)\) - linear in qubits
Working memory: \(O(\chi^2 d^2)\) - for operations
Practical limits: 100+ qubits with χ=128 on 16 GB GPU
Key Insights:
χ dominates: Cubic scaling in bond dimension for time
SVD bottleneck: 40-70% of runtime for χ>64
Linear in n: Number of qubits has linear impact
GPU essential: 10-100× speedup over CPU
Entanglement matters: Required χ depends on problem structure
Optimization Priority:
Choose right χ: Analyze entanglement, don’t over-provision
Use GPU: Always for production runs
Enable cuQuantum: 2-3× speedup for χ>128
Batch operations: Reduce kernel launch overhead
Profile and iterate: Measure, optimize hot path, repeat
Practical Performance (A100 GPU):
50 qubits, χ=64: ~0.15 ms per two-qubit gate
VQE (10 qubits, 3 layers): ~30 seconds for 100 iterations
TDVP (50 qubits, χ=64): ~25 seconds for 1000 time steps
QAOA (20 qubits, p=3): ~1.5 seconds for 150 iterations
For detailed optimization techniques, see:
GPU Acceleration - GPU performance optimization
Numerical Stability - Accuracy vs performance tradeoffs
How to Optimize Performance - Practical optimization guide
Benchmark Comparison - Benchmarking methodology