Performance Model#

Understanding the performance characteristics of MPS-based quantum simulation is essential for planning large-scale computations and optimizing resource usage. This document provides a comprehensive performance model for ATLAS-Q, including computational complexity, memory scaling, bottleneck analysis, and optimization strategies.

Overview #

Why Performance Modeling Matters #

MPS simulations trade exponential statevector memory (\(O(2^n)\)) for polynomial time complexity in bond dimension \(\chi\). Understanding this tradeoff is critical for:

Resource planning: Estimate memory and time requirements before running
Algorithm selection: Choose between TDVP, VQE, QAOA based on scaling
Optimization: Identify bottlenecks and apply targeted improvements
Feasibility assessment: Determine if simulation is practical on available hardware

Key Performance Factors:

Number of qubits (n): Linear impact on memory and time
Bond dimension (χ): Cubic impact on time, quadratic on memory
Circuit depth: Determines total operations
Entanglement structure: Affects required χ for accuracy
Hardware: GPU vs CPU, memory bandwidth, compute capability

Performance Characteristics #

MPS advantages:

Linear scaling in n: Can simulate 100+ qubits vs 30-40 for statevector
Controllable accuracy: Trade χ for precision
Efficient for 1D systems: Area-law entanglement needs only moderate χ
GPU acceleration: 10-100× speedup over CPU

MPS limitations:

Cubic scaling in χ: Large χ (>256) becomes expensive
Poor for highly entangled states: χ grows exponentially with entanglement
2D systems challenging: Volume-law scaling unless PEPS used
SVD bottleneck: Dominates runtime for χ > 128

Computational Complexity #

Single-Qubit Gates #

Applying single-qubit gate to site i:

\[A^{[i]}_{new} = \sum_s G_{s,s'} A^{[i]}_{s'}\]

where \(A^{[i]}\) has shape \((\chi_L, d, \chi_R)\).

Complexity: \(O(\chi_L \chi_R d^2) = O(\chi^2 d^2)\)

For qubits \(d=2\):

Time: \(O(\chi^2)\) Memory: \(O(\chi^2)\) for storing result

Example:

# Single-qubit gate on 50-qubit MPS with χ=64
# Operations: 64² × 2² = 16,384 complex multiplications
# GPU time: ~0.01 ms

mps.apply_h(25)  # Hadamard on qubit 25

Bottleneck: Memory bandwidth (loading/storing tensors dominates)

Two-Qubit Gates #

Applying two-qubit gate to sites i and i+1 requires:

Contract: Merge \(A^{[i]}\) and \(A^{[i+1]}\) → \(\Theta\) of shape \((\chi_L, d, d, \chi_R)\)
- Complexity: \(O(\chi^2 d^2)\)
Apply gate: \(\Theta' = G \cdot \Theta\)
- Complexity: \(O(\chi^2 d^4)\)
SVD: Decompose \(\Theta'\) (\(\chi d \times \chi d\) matrix)
- Complexity: \(O((\chi d)^3) = O(\chi^3 d^3)\)
Truncate: Keep largest \(\chi\) singular values
- Complexity: \(O(\chi^2)\)

Total: \(O(\chi^3 d^3)\) dominated by SVD

For qubits:

Time: \(O(\chi^3 \times 8) = O(8\chi^3)\) Memory: \(O(\chi^2 d^2) = O(4\chi^2)\) for \(\Theta\)

Example:

# Two-qubit gate with χ=64
# SVD operations: (64×2)³ = 2,097,152
# GPU time (A100): ~0.15 ms

mps.apply_cnot(10, 11)

Bottleneck: SVD computation (60-80% of time)

Expectation Value Computation #

Computing \(\langle \psi | \hat{O} | \psi \rangle\) for operator \(\hat{O}\) represented as MPO:

Contract MPS with MPO: \(O(n \chi^2 \chi_{MPO} d^2)\)
Contract result with MPS: \(O(n \chi^2)\)

For local operator (single-site measurement):

Complexity: \(O(\chi^2 d^2)\)

For Hamiltonian (nearest-neighbor MPO with \(\chi_{MPO} \approx 4-8\)):

Complexity: \(O(n \chi^3 d^2)\)

Example:

# Energy of 30-qubit Heisenberg chain, χ=64
# Operations: 30 × 64³ × 4 = 31,457,280
# GPU time: ~2 ms

from atlas_q.mpo_ops import build_heisenberg_hamiltonian
H_mpo = build_heisenberg_hamiltonian(30, J=[1.0, 1.0, 1.0], h=0.5)
energy = mps.expectation_value(H_mpo)

Memory Scaling #

MPS Storage #

An n-qubit MPS with bond dimension χ and local dimension d stores:

\[\text{Memory} = \sum_{i=1}^{n} \chi_{i-1} \times d \times \chi_i \times \text{bytes}\]

For uniform χ and complex128 (16 bytes per complex number):

\[\text{Memory} \approx n \chi^2 d \times 16 \text{ bytes} = 16 n \chi^2 d \text{ bytes}\]

For qubits (d=2) and complex64 (8 bytes):

\[\text{Memory} \approx 16 n \chi^2 \text{ bytes}\]

Examples:

Qubits n	Bond χ	Memory (GB)	Comparison
50	32	0.08	Tiny
50	64	0.31	Small
100	64	0.62	Moderate
100	128	2.46	Significant
200	64	1.23	Large
200	128	4.92	Very large

Statevector comparison: 50-qubit statevector = \(2^{50} \times 16\) bytes = 18 PB (impossible)

Working Memory #

Operations require temporary storage:

Single-qubit gate: - Input tensor: \(\chi^2 d \times 16\) bytes - Output tensor: \(\chi^2 d \times 16\) bytes - Total: \(2 \chi^2 d \times 16\) bytes

Two-qubit gate: - Merged tensor \(\Theta\): \(\chi^2 d^2 \times 16\) bytes - SVD workspace: \(3(\chi d)^2 \times 16\) bytes (U, S, Vh) - Total: \(\approx 4 \chi^2 d^2 \times 16\) bytes

For χ=256, d=2:

Two-qubit gate working memory: \(4 \times 256^2 \times 4 \times 16\) = 64 MB

GPU memory budget (16 GB A100):

MPS storage (200 qubits, χ=128): ~5 GB
Working memory (operations): ~1 GB
PyTorch allocator overhead: ~2 GB
Available: ~8 GB for multiple MPS / batching

MPO Storage #

Matrix Product Operator for nearest-neighbor Hamiltonian:

\[\text{MPO Memory} = n \times \chi_{MPO}^2 \times d^2 \times 16 \text{ bytes}\]

For Heisenberg model (\(\chi_{MPO} = 5\)):

Memory: \(n \times 25 \times 4 \times 16 = 1600n\) bytes ≈ 1.6 KB per qubit

MPO memory is usually negligible compared to MPS.

Time Complexity #

TDVP Time Evolution #

1-Site TDVP:

Per time step, sweep through n sites:

Site evolution: \(O(\chi^3 d^2)\) for effective Hamiltonian application
Bond evolution: \(O(\chi^3)\) for backward sweep
Total per site: \(O(\chi^3 d^2)\)

Per time step: \(O(n \chi^3 d^2)\)

For Hamiltonian with \(\chi_{MPO} = k\):

Refined complexity: \(O(n \chi^3 k d^2)\)

2-Site TDVP:

Includes SVD at each bond:

Per time step: \(O(n \chi^3 d^3)\)

Extra factor of d from two-site evolution and SVD of \((\chi d) \times (\chi d)\) matrix.

Comparison:

For 50 qubits, χ=64, 100 time steps:

1-site: \(100 \times 50 \times 64^3 \times 4 \approx 5.2 \times 10^9\) operations
2-site: \(100 \times 50 \times 64^3 \times 8 \approx 10.5 \times 10^9\) operations (2× slower)

VQE Optimization #

Energy evaluation:

Circuit application + Hamiltonian expectation:

\[\text{Cost}_{\text{energy}} = O(T \chi^3 d^2 + n \chi^3 k d^2)\]

where T = circuit depth.

Gradient computation (parameter-shift rule):

For p parameters:

\[\text{Cost}_{\text{gradient}} = O(2p \times \text{Cost}_{\text{energy}}) = O(p T \chi^3 d^2)\]

Full VQE:

For N iterations:

\[\text{Total} = O(N \cdot p \cdot T \cdot \chi^3 d^2)\]

Example:

50 qubits, χ=64, depth T=50, p=100 parameters, N=200 iterations:

\[\text{Operations} = 200 \times 100 \times 50 \times 64^3 \times 4 \approx 5.2 \times 10^{12}\]

On GPU: ~100-500 seconds (depends on hardware, batch size)

QAOA Performance #

QAOA with p layers evaluating cost function:

Per evaluation:

Apply 2p operators (problem + mixer): \(O(2p \cdot n \chi^3 d^2)\)
Hamiltonian expectation: \(O(n \chi^3 d^2)\)

Total per evaluation: \(O((2p+1) n \chi^3 d^2)\)

For optimization with N evaluations:

Total: \(O(N p n \chi^3 d^2)\)

QAOA characteristics:

Usually p=1-5 (shallow)
χ typically small (16-32) due to low entanglement
Fast convergence (N=50-200)

Example:

20 qubits, p=3, χ=32, N=100 iterations:

\[\text{Operations} = 100 \times 3 \times 20 \times 32^3 \times 4 \approx 7.9 \times 10^9\]

GPU time: ~5-10 seconds

Scaling Laws #

MPS vs Statevector #

Memory:

Compression: \(\frac{2^n}{n \chi^2} = \frac{2^{50}}{50 \times 64^2} \approx 5.5 \times 10^{12} \times\)

Time (single two-qubit gate):

Method	Time	50 Qubits
Statevector	\(O(2^n)\)	Impossible
MPS	\(O(\chi^3)\)	~0.15 ms (GPU)

Entanglement Scaling #

Required bond dimension depends on entanglement:

Area Law (1D nearest-neighbor):

\[\chi \sim \exp(c \cdot S)\]

where S is entanglement entropy.

For area law: \(S = O(1)\) → \(\chi = O(1)\) (constant)

Volume Law (all-to-all coupling, random circuits):

\[S = O(n)\]

Requires \(\chi = O(2^{n/2})\) (exponential) → MPS inefficient

Practical examples:

Heisenberg chain: χ=16-32 sufficient for high accuracy
Quantum Fourier Transform: χ ~ O(√n) due to logarithmic entanglement
Random circuit (depth d): χ ~ O(2^{min(d, n/2)})

Asymptotic Behavior #

Best case (product state):

χ=1 sufficient
All operations \(O(n)\) (linear)

Typical case (1D area law):

χ=32-128 sufficient
Operations \(O(n \chi^3) \approx O(n)\) (quasi-linear)

Worst case (volume law):

χ ~ \(2^{n/2}\) required
Operations \(O(n \times 2^{3n/2})\) (exponential)
MPS provides no advantage

Performance Predictions #

Gate Throughput #

GPU (A100), complex64:

Operation	Bond χ	Time (ms)	Throughput
Single-qubit gate	32	0.005	200K ops/sec
Single-qubit gate	64	0.008	125K ops/sec
Single-qubit gate	128	0.020	50K ops/sec
Two-qubit gate	32	0.05	20K ops/sec
Two-qubit gate	64	0.13	7.7K ops/sec
Two-qubit gate	128	0.50	2K ops/sec
Two-qubit gate	256	2.0	500 ops/sec

Observations:

Single-qubit gates ~10-20× faster than two-qubit
Time scales as \(\chi^3\) for two-qubit gates (SVD dominated)
cuQuantum provides 2-3× additional speedup for χ>128

Algorithm Runtime Estimates #

VQE (hardware-efficient ansatz):

Setup: 10 qubits, 3 layers (30 parameters), χ=32
Circuit depth: ~30 gates
Energy evaluation: ~5 ms
Gradient (parameter-shift): 30 params × 2 evaluations = 60 × 5 ms = 300 ms
Optimization: 100 iterations × 300 ms = 30 seconds

TDVP (Heisenberg chain):

Setup: 50 qubits, χ=64, dt=0.05
Time step (1-site): 50 sites × 0.5 ms = 25 ms
Evolution: 1000 steps × 25 ms = 25 seconds

QAOA (MaxCut):

Setup: 20 qubits, p=3, χ=32
Cost evaluation: ~10 ms
Optimization: 150 iterations × 10 ms = 1.5 seconds

Memory Requirements #

Practical limits (16 GB GPU):

Qubits	Bond χ	MPS Memory	Max Qubits
Any	32	0.08 GB / 50q	~1000
Any	64	0.31 GB / 50q	~400
Any	128	1.23 GB / 50q	~100
Any	256	4.92 GB / 50q	~25

Actual limits lower due to:

Working memory for operations (~20% overhead)
PyTorch allocator fragmentation (~30% overhead)
Multiple MPS instances or batching

Rule of thumb: Allocate 2× MPS storage for total memory budget.

Bottleneck Analysis #

Identifying Bottlenecks #

Use profiling to identify where time is spent:

import torch
from atlas_q.adaptive_mps import AdaptiveMPS

mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')

# Profile gate applications
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    for i in range(100):
        mps.apply_cnot(i % 29, (i % 29) + 1)

# Print breakdown
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Common bottlenecks:

SVD (40-70% of time for χ>64)
Memory bandwidth (20-30% for large tensors)
Kernel launch overhead (5-10% for many small operations)
Truncation logic (5-10% adaptive bond dimension selection)

SVD Bottleneck #

Problem: SVD dominates for χ>128

Solutions:

Use cuQuantum: gesvdj algorithm 2-3× faster

from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend

config = CuQuantumConfig(
    use_cutensornet=True,
    svd_algorithm='gesvdj'  # Jacobi SVD
)
backend = CuQuantumBackend(config, device='cuda')

mps = AdaptiveMPS(
    num_qubits=40,
    bond_dim=256,
    backend=backend,
    device='cuda'
)

Reduce truncation frequency: Compress every 10 gates instead of every gate

for i in range(100):
    for j in range(10):
        mps.apply_cnot(j % 29, j % 29 + 1, truncate=False)
    mps.compress()  # Single truncation after 10 gates

Use complex64: 2× faster SVD than complex128

Memory Bandwidth Bottleneck #

Problem: Large tensors bottlenecked by memory transfer

Solutions:

Fused kernels: Use Triton kernels for χ>64
Increase arithmetic intensity: More compute per memory access
Optimize memory layout: Ensure coalesced access patterns

# Triton kernels automatically used for χ>64
from atlas_q.triton_kernels.mps_complex import use_triton_kernels
use_triton_kernels(True)  # Enable fused operations

Launch Overhead Bottleneck #

Problem: Many small operations have high launch overhead

Solutions:

Batch operations: Group multiple gates

# Bad: Launch 100 separate kernels
for i in range(100):
    mps.apply_h(i % 50)

# Good: Batch into single call
sites = list(range(50)) * 2
mps.apply_batch_single_gates('H', sites)

Use CUDA streams: Overlap compute and transfer

Optimization Strategies #

Systematic Optimization Process #

Profile: Identify bottlenecks
Optimize hot path: Focus on 80% of time
Measure: Verify improvement
Iterate: Repeat until target met

Algorithm-Level Optimizations #

Choose right algorithm:

Small χ needed: Use 1-site TDVP or QAOA
High accuracy needed: Use 2-site TDVP or increase χ
Clifford subcircuit: Use stabilizer backend (20× faster)

Reduce operations:

Circuit optimization: Merge gates, remove redundancy
Adaptive parameters: Vary dt or optimizer tolerance
Early stopping: Terminate when converged

Implementation-Level Optimizations #

GPU acceleration:

import torch
torch.backends.cuda.matmul.allow_tf32 = True  # Enable tensor cores

mps = AdaptiveMPS(
    num_qubits=50,
    bond_dim=128,
    dtype=torch.complex64,  # Use single precision
    device='cuda'
)

cuQuantum integration:

from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend

config = CuQuantumConfig(
    use_cutensornet=True,
    svd_algorithm='gesvdj',
    enable_async=True
)
backend = CuQuantumBackend(config, device='cuda')

Multi-GPU:

from atlas_q.distributed_mps import DistributedMPS, DistributedConfig

config = DistributedConfig(
    mode='bond_parallel',
    world_size=4
)
mps = DistributedMPS(num_qubits=80, bond_dim=512, config=config)

Case Studies #

Case Study 1: Deep VQE Circuit #

Problem: VQE with 20 qubits, 10 layers, 200 parameters taking 10 minutes per iteration

Analysis:

Circuit depth: ~200 gates
Time per iteration: 600 seconds
Most time in energy evaluation (500s) and gradient (100s)

Optimization:

Reduced χ from 128 to 64 (entanglement analysis showed χ=64 sufficient)
Used complex64 instead of complex128
Enabled cuQuantum

Result:

Time per iteration: 60 seconds (10× improvement)
Accuracy maintained (energy error < 1e-6)

Case Study 2: Long-Time TDVP #

Problem: TDVP evolution of 100-qubit chain for 10,000 time steps running out of memory

Analysis:

χ=256 → 50 qubits max on 16 GB GPU
Need to reduce memory footprint

Optimization:

Analyzed entanglement: χ=128 sufficient for middle section, χ=64 for edges
Implemented adaptive per-bond χ
Used memory budgets

Result:

Fit 100 qubits with adaptive χ (64-128)
Simulation completed in 2 hours
Accuracy within tolerance (global error < 1e-5)

Case Study 3: QAOA for Large Graph #

Problem: QAOA on 100-node graph converging slowly (1000+ iterations)

Analysis:

Standard QAOA using random initialization
Many local minima causing slow convergence

Optimization:

Used warm-start from p=1 solution to initialize p=3
Linear ramp initialization for parameters
Switched to COBYLA optimizer (gradient-free, better for local minima)

Result:

Converged in 150 iterations (7× fewer)
Reached 0.85 approximation ratio
Total time: 30 seconds vs 5 minutes

Best Practices #

Performance Checklist #

Before running large simulation:

Estimate requirements: - Memory: \(16 n \chi^2\) bytes - Time: \(O(T \chi^3)\) for T operations - Check against available resources
Profile small version: - Run on 10-20 qubits first - Measure actual throughput - Extrapolate to full problem
Optimize configuration: - Start with χ=32, increase if needed - Use complex64 unless high precision required - Enable GPU and cuQuantum
Monitor during run: - Track memory usage - Check truncation errors - Verify convergence

Avoiding Common Pitfalls #

Don’t:

Assume χ=256 always needed (often χ=64 sufficient)
Use complex128 by default (2× slower than complex64)
Apply gates one-by-one (batch when possible)
Ignore memory budget warnings (will OOM later)

Do:

Analyze entanglement to determine required χ
Profile before optimizing
Use adaptive truncation thresholds
Check accuracy periodically

Summary #

ATLAS-Q’s performance model provides predictable scaling:

Computational Complexity:

Single-qubit gates: \(O(\chi^2)\)
Two-qubit gates: \(O(\chi^3)\) (SVD dominated)
TDVP time step: \(O(n \chi^3 d^2)\)
VQE iteration: \(O(p T \chi^3 d^2)\)

Memory Scaling:

MPS storage: \(O(n \chi^2 d)\) - linear in qubits
Working memory: \(O(\chi^2 d^2)\) - for operations
Practical limits: 100+ qubits with χ=128 on 16 GB GPU

Key Insights:

χ dominates: Cubic scaling in bond dimension for time
SVD bottleneck: 40-70% of runtime for χ>64
Linear in n: Number of qubits has linear impact
GPU essential: 10-100× speedup over CPU
Entanglement matters: Required χ depends on problem structure

Optimization Priority:

Choose right χ: Analyze entanglement, don’t over-provision
Use GPU: Always for production runs
Enable cuQuantum: 2-3× speedup for χ>128
Batch operations: Reduce kernel launch overhead
Profile and iterate: Measure, optimize hot path, repeat

Practical Performance (A100 GPU):

50 qubits, χ=64: ~10 ms per two-qubit gate
VQE (10 qubits, 3 layers): ~30 seconds for 100 iterations
TDVP (50 qubits, χ=64): ~25 seconds for 1000 time steps
QAOA (20 qubits, p=3): ~1.5 seconds for 150 iterations

For detailed optimization techniques, see:

GPU Acceleration - GPU performance optimization
Numerical Stability - Accuracy vs performance tradeoffs
How to Optimize Performance - Practical optimization guide
Benchmark Comparison - Benchmarking methodology