Performance Model#

Understanding the performance characteristics of MPS-based quantum simulation is essential for planning large-scale computations and optimizing resource usage. This document provides a comprehensive performance model for ATLAS-Q, including computational complexity, memory scaling, bottleneck analysis, and optimization strategies.

Overview#

Why Performance Modeling Matters#

MPS simulations trade exponential statevector memory (\(O(2^n)\)) for polynomial time complexity in bond dimension \(\chi\). Understanding this tradeoff is critical for:

  1. Resource planning: Estimate memory and time requirements before running

  2. Algorithm selection: Choose between TDVP, VQE, QAOA based on scaling

  3. Optimization: Identify bottlenecks and apply targeted improvements

  4. Feasibility assessment: Determine if simulation is practical on available hardware

Key Performance Factors:

  • Number of qubits (n): Linear impact on memory and time

  • Bond dimension (χ): Cubic impact on time, quadratic on memory

  • Circuit depth: Determines total operations

  • Entanglement structure: Affects required χ for accuracy

  • Hardware: GPU vs CPU, memory bandwidth, compute capability

Performance Characteristics#

MPS advantages:

  • Linear scaling in n: Can simulate 100+ qubits vs 30-40 for statevector

  • Controllable accuracy: Trade χ for precision

  • Efficient for 1D systems: Area-law entanglement needs only moderate χ

  • GPU acceleration: 10-100× speedup over CPU

MPS limitations:

  • Cubic scaling in χ: Large χ (>256) becomes expensive

  • Poor for highly entangled states: χ grows exponentially with entanglement

  • 2D systems challenging: Volume-law scaling unless PEPS used

  • SVD bottleneck: Dominates runtime for χ > 128

Computational Complexity#

Single-Qubit Gates#

Applying single-qubit gate to site i:

\[A^{[i]}_{new} = \sum_s G_{s,s'} A^{[i]}_{s'}\]

where \(A^{[i]}\) has shape \((\chi_L, d, \chi_R)\).

Complexity: \(O(\chi_L \chi_R d^2) = O(\chi^2 d^2)\)

For qubits \(d=2\):

Time: \(O(\chi^2)\) Memory: \(O(\chi^2)\) for storing result

Example:

# Single-qubit gate on 50-qubit MPS with χ=64
# Operations: 64² × 2² = 16,384 complex multiplications
# GPU time: ~0.01 ms

mps.apply_h(25)  # Hadamard on qubit 25

Bottleneck: Memory bandwidth (loading/storing tensors dominates)

Two-Qubit Gates#

Applying two-qubit gate to sites i and i+1 requires:

  1. Contract: Merge \(A^{[i]}\) and \(A^{[i+1]}\)\(\Theta\) of shape \((\chi_L, d, d, \chi_R)\)

    • Complexity: \(O(\chi^2 d^2)\)

  2. Apply gate: \(\Theta' = G \cdot \Theta\)

    • Complexity: \(O(\chi^2 d^4)\)

  3. SVD: Decompose \(\Theta'\) (\(\chi d \times \chi d\) matrix)

    • Complexity: \(O((\chi d)^3) = O(\chi^3 d^3)\)

  4. Truncate: Keep largest \(\chi\) singular values

    • Complexity: \(O(\chi^2)\)

Total: \(O(\chi^3 d^3)\) dominated by SVD

For qubits:

Time: \(O(\chi^3 \times 8) = O(8\chi^3)\) Memory: \(O(\chi^2 d^2) = O(4\chi^2)\) for \(\Theta\)

Example:

# Two-qubit gate with χ=64
# SVD operations: (64×2)³ = 2,097,152
# GPU time (A100): ~0.15 ms

mps.apply_cnot(10, 11)

Bottleneck: SVD computation (60-80% of time)

Expectation Value Computation#

Computing \(\langle \psi | \hat{O} | \psi \rangle\) for operator \(\hat{O}\) represented as MPO:

  1. Contract MPS with MPO: \(O(n \chi^2 \chi_{MPO} d^2)\)

  2. Contract result with MPS: \(O(n \chi^2)\)

For local operator (single-site measurement):

Complexity: \(O(\chi^2 d^2)\)

For Hamiltonian (nearest-neighbor MPO with \(\chi_{MPO} \approx 4-8\)):

Complexity: \(O(n \chi^3 d^2)\)

Example:

# Energy of 30-qubit Heisenberg chain, χ=64
# Operations: 30 × 64³ × 4 = 31,457,280
# GPU time: ~2 ms

from atlas_q.mpo_ops import build_heisenberg_hamiltonian
H_mpo = build_heisenberg_hamiltonian(30, J=[1.0, 1.0, 1.0], h=0.5)
energy = mps.expectation_value(H_mpo)

Memory Scaling#

MPS Storage#

An n-qubit MPS with bond dimension χ and local dimension d stores:

\[\text{Memory} = \sum_{i=1}^{n} \chi_{i-1} \times d \times \chi_i \times \text{bytes}\]

For uniform χ and complex128 (16 bytes per complex number):

\[\text{Memory} \approx n \chi^2 d \times 16 \text{ bytes} = 16 n \chi^2 d \text{ bytes}\]

For qubits (d=2) and complex64 (8 bytes):

\[\text{Memory} \approx 16 n \chi^2 \text{ bytes}\]

Examples:

Qubits n

Bond χ

Memory (GB)

Comparison

50

32

0.08

Tiny

50

64

0.31

Small

100

64

0.62

Moderate

100

128

2.46

Significant

200

64

1.23

Large

200

128

4.92

Very large

Statevector comparison: 50-qubit statevector = \(2^{50} \times 16\) bytes = 18 PB (impossible)

Working Memory#

Operations require temporary storage:

Single-qubit gate: - Input tensor: \(\chi^2 d \times 16\) bytes - Output tensor: \(\chi^2 d \times 16\) bytes - Total: \(2 \chi^2 d \times 16\) bytes

Two-qubit gate: - Merged tensor \(\Theta\): \(\chi^2 d^2 \times 16\) bytes - SVD workspace: \(3(\chi d)^2 \times 16\) bytes (U, S, Vh) - Total: \(\approx 4 \chi^2 d^2 \times 16\) bytes

For χ=256, d=2:

  • Two-qubit gate working memory: \(4 \times 256^2 \times 4 \times 16\) = 64 MB

GPU memory budget (16 GB A100):

  • MPS storage (200 qubits, χ=128): ~5 GB

  • Working memory (operations): ~1 GB

  • PyTorch allocator overhead: ~2 GB

  • Available: ~8 GB for multiple MPS / batching

MPO Storage#

Matrix Product Operator for nearest-neighbor Hamiltonian:

\[\text{MPO Memory} = n \times \chi_{MPO}^2 \times d^2 \times 16 \text{ bytes}\]

For Heisenberg model (\(\chi_{MPO} = 5\)):

Memory: \(n \times 25 \times 4 \times 16 = 1600n\) bytes ≈ 1.6 KB per qubit

MPO memory is usually negligible compared to MPS.

Time Complexity#

TDVP Time Evolution#

1-Site TDVP:

Per time step, sweep through n sites:

  1. Site evolution: \(O(\chi^3 d^2)\) for effective Hamiltonian application

  2. Bond evolution: \(O(\chi^3)\) for backward sweep

  3. Total per site: \(O(\chi^3 d^2)\)

Per time step: \(O(n \chi^3 d^2)\)

For Hamiltonian with \(\chi_{MPO} = k\):

Refined complexity: \(O(n \chi^3 k d^2)\)

2-Site TDVP:

Includes SVD at each bond:

Per time step: \(O(n \chi^3 d^3)\)

Extra factor of d from two-site evolution and SVD of \((\chi d) \times (\chi d)\) matrix.

Comparison:

For 50 qubits, χ=64, 100 time steps:

  • 1-site: \(100 \times 50 \times 64^3 \times 4 \approx 5.2 \times 10^9\) operations

  • 2-site: \(100 \times 50 \times 64^3 \times 8 \approx 10.5 \times 10^9\) operations (2× slower)

VQE Optimization#

Energy evaluation:

Circuit application + Hamiltonian expectation:

\[\text{Cost}_{\text{energy}} = O(T \chi^3 d^2 + n \chi^3 k d^2)\]

where T = circuit depth.

Gradient computation (parameter-shift rule):

For p parameters:

\[\text{Cost}_{\text{gradient}} = O(2p \times \text{Cost}_{\text{energy}}) = O(p T \chi^3 d^2)\]

Full VQE:

For N iterations:

\[\text{Total} = O(N \cdot p \cdot T \cdot \chi^3 d^2)\]

Example:

50 qubits, χ=64, depth T=50, p=100 parameters, N=200 iterations:

\[\text{Operations} = 200 \times 100 \times 50 \times 64^3 \times 4 \approx 5.2 \times 10^{12}\]

On GPU: ~100-500 seconds (depends on hardware, batch size)

QAOA Performance#

QAOA with p layers evaluating cost function:

Per evaluation:

  • Apply 2p operators (problem + mixer): \(O(2p \cdot n \chi^3 d^2)\)

  • Hamiltonian expectation: \(O(n \chi^3 d^2)\)

Total per evaluation: \(O((2p+1) n \chi^3 d^2)\)

For optimization with N evaluations:

Total: \(O(N p n \chi^3 d^2)\)

QAOA characteristics:

  • Usually p=1-5 (shallow)

  • χ typically small (16-32) due to low entanglement

  • Fast convergence (N=50-200)

Example:

20 qubits, p=3, χ=32, N=100 iterations:

\[\text{Operations} = 100 \times 3 \times 20 \times 32^3 \times 4 \approx 7.9 \times 10^9\]

GPU time: ~5-10 seconds

Scaling Laws#

MPS vs Statevector#

Memory:

Compression: \(\frac{2^n}{n \chi^2} = \frac{2^{50}}{50 \times 64^2} \approx 5.5 \times 10^{12} \times\)

Time (single two-qubit gate):

Method

Time

50 Qubits

Statevector

\(O(2^n)\)

Impossible

MPS

\(O(\chi^3)\)

~0.15 ms (GPU)

Entanglement Scaling#

Required bond dimension depends on entanglement:

Area Law (1D nearest-neighbor):

\[\chi \sim \exp(c \cdot S)\]

where S is entanglement entropy.

For area law: \(S = O(1)\)\(\chi = O(1)\) (constant)

Volume Law (all-to-all coupling, random circuits):

\[S = O(n)\]

Requires \(\chi = O(2^{n/2})\) (exponential) → MPS inefficient

Practical examples:

  • Heisenberg chain: χ=16-32 sufficient for high accuracy

  • Quantum Fourier Transform: χ ~ O(√n) due to logarithmic entanglement

  • Random circuit (depth d): χ ~ O(2^{min(d, n/2)})

Asymptotic Behavior#

Best case (product state):

  • χ=1 sufficient

  • All operations \(O(n)\) (linear)

Typical case (1D area law):

  • χ=32-128 sufficient

  • Operations \(O(n \chi^3) \approx O(n)\) (quasi-linear)

Worst case (volume law):

  • χ ~ \(2^{n/2}\) required

  • Operations \(O(n \times 2^{3n/2})\) (exponential)

  • MPS provides no advantage

Performance Predictions#

Gate Throughput#

GPU (A100), complex64:

Operation

Bond χ

Time (ms)

Throughput

Single-qubit gate

32

0.005

200K ops/sec

Single-qubit gate

64

0.008

125K ops/sec

Single-qubit gate

128

0.020

50K ops/sec

Two-qubit gate

32

0.05

20K ops/sec

Two-qubit gate

64

0.13

7.7K ops/sec

Two-qubit gate

128

0.50

2K ops/sec

Two-qubit gate

256

2.0

500 ops/sec

Observations:

  • Single-qubit gates ~10-20× faster than two-qubit

  • Time scales as \(\chi^3\) for two-qubit gates (SVD dominated)

  • cuQuantum provides 2-3× additional speedup for χ>128

Algorithm Runtime Estimates#

VQE (hardware-efficient ansatz):

  • Setup: 10 qubits, 3 layers (30 parameters), χ=32

  • Circuit depth: ~30 gates

  • Energy evaluation: ~5 ms

  • Gradient (parameter-shift): 30 params × 2 evaluations = 60 × 5 ms = 300 ms

  • Optimization: 100 iterations × 300 ms = 30 seconds

TDVP (Heisenberg chain):

  • Setup: 50 qubits, χ=64, dt=0.05

  • Time step (1-site): 50 sites × 0.5 ms = 25 ms

  • Evolution: 1000 steps × 25 ms = 25 seconds

QAOA (MaxCut):

  • Setup: 20 qubits, p=3, χ=32

  • Cost evaluation: ~10 ms

  • Optimization: 150 iterations × 10 ms = 1.5 seconds

Memory Requirements#

Practical limits (16 GB GPU):

Qubits

Bond χ

MPS Memory

Max Qubits

Any

32

0.08 GB / 50q

~1000

Any

64

0.31 GB / 50q

~400

Any

128

1.23 GB / 50q

~100

Any

256

4.92 GB / 50q

~25

Actual limits lower due to:

  • Working memory for operations (~20% overhead)

  • PyTorch allocator fragmentation (~30% overhead)

  • Multiple MPS instances or batching

Rule of thumb: Allocate 2× MPS storage for total memory budget.

Bottleneck Analysis#

Identifying Bottlenecks#

Use profiling to identify where time is spent:

import torch
from atlas_q.adaptive_mps import AdaptiveMPS

mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda')

# Profile gate applications
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    for i in range(100):
        mps.apply_cnot(i % 29, (i % 29) + 1)

# Print breakdown
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Common bottlenecks:

  1. SVD (40-70% of time for χ>64)

  2. Memory bandwidth (20-30% for large tensors)

  3. Kernel launch overhead (5-10% for many small operations)

  4. Truncation logic (5-10% adaptive bond dimension selection)

SVD Bottleneck#

Problem: SVD dominates for χ>128

Solutions:

  1. Use cuQuantum: gesvdj algorithm 2-3× faster

    from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend
    
    config = CuQuantumConfig(
        use_cutensornet=True,
        svd_algorithm='gesvdj'  # Jacobi SVD
    )
    backend = CuQuantumBackend(config, device='cuda')
    
    mps = AdaptiveMPS(
        num_qubits=40,
        bond_dim=256,
        backend=backend,
        device='cuda'
    )
    
  2. Reduce truncation frequency: Compress every 10 gates instead of every gate

    for i in range(100):
        for j in range(10):
            mps.apply_cnot(j % 29, j % 29 + 1, truncate=False)
        mps.compress()  # Single truncation after 10 gates
    
  3. Use complex64: 2× faster SVD than complex128

Memory Bandwidth Bottleneck#

Problem: Large tensors bottlenecked by memory transfer

Solutions:

  1. Fused kernels: Use Triton kernels for χ>64

  2. Increase arithmetic intensity: More compute per memory access

  3. Optimize memory layout: Ensure coalesced access patterns

# Triton kernels automatically used for χ>64
from atlas_q.triton_kernels.mps_complex import use_triton_kernels
use_triton_kernels(True)  # Enable fused operations

Launch Overhead Bottleneck#

Problem: Many small operations have high launch overhead

Solutions:

  1. Batch operations: Group multiple gates

    # Bad: Launch 100 separate kernels
    for i in range(100):
        mps.apply_h(i % 50)
    
    # Good: Batch into single call
    sites = list(range(50)) * 2
    mps.apply_batch_single_gates('H', sites)
    
  2. Use CUDA streams: Overlap compute and transfer

Optimization Strategies#

Systematic Optimization Process#

  1. Profile: Identify bottlenecks

  2. Optimize hot path: Focus on 80% of time

  3. Measure: Verify improvement

  4. Iterate: Repeat until target met

Algorithm-Level Optimizations#

Choose right algorithm:

  • Small χ needed: Use 1-site TDVP or QAOA

  • High accuracy needed: Use 2-site TDVP or increase χ

  • Clifford subcircuit: Use stabilizer backend (20× faster)

Reduce operations:

  • Circuit optimization: Merge gates, remove redundancy

  • Adaptive parameters: Vary dt or optimizer tolerance

  • Early stopping: Terminate when converged

Implementation-Level Optimizations#

GPU acceleration:

import torch
torch.backends.cuda.matmul.allow_tf32 = True  # Enable tensor cores

mps = AdaptiveMPS(
    num_qubits=50,
    bond_dim=128,
    dtype=torch.complex64,  # Use single precision
    device='cuda'
)

cuQuantum integration:

from atlas_q.cuquantum_backend import CuQuantumConfig, CuQuantumBackend

config = CuQuantumConfig(
    use_cutensornet=True,
    svd_algorithm='gesvdj',
    enable_async=True
)
backend = CuQuantumBackend(config, device='cuda')

Multi-GPU:

from atlas_q.distributed_mps import DistributedMPS, DistributedConfig

config = DistributedConfig(
    mode='bond_parallel',
    world_size=4
)
mps = DistributedMPS(num_qubits=80, bond_dim=512, config=config)

Case Studies#

Case Study 1: Deep VQE Circuit#

Problem: VQE with 20 qubits, 10 layers, 200 parameters taking 10 minutes per iteration

Analysis:

  • Circuit depth: ~200 gates

  • Time per iteration: 600 seconds

  • Most time in energy evaluation (500s) and gradient (100s)

Optimization:

  1. Reduced χ from 128 to 64 (entanglement analysis showed χ=64 sufficient)

  2. Used complex64 instead of complex128

  3. Enabled cuQuantum

Result:

  • Time per iteration: 60 seconds (10× improvement)

  • Accuracy maintained (energy error < 1e-6)

Case Study 2: Long-Time TDVP#

Problem: TDVP evolution of 100-qubit chain for 10,000 time steps running out of memory

Analysis:

  • χ=256 → 50 qubits max on 16 GB GPU

  • Need to reduce memory footprint

Optimization:

  1. Analyzed entanglement: χ=128 sufficient for middle section, χ=64 for edges

  2. Implemented adaptive per-bond χ

  3. Used memory budgets

Result:

  • Fit 100 qubits with adaptive χ (64-128)

  • Simulation completed in 2 hours

  • Accuracy within tolerance (global error < 1e-5)

Case Study 3: QAOA for Large Graph#

Problem: QAOA on 100-node graph converging slowly (1000+ iterations)

Analysis:

  • Standard QAOA using random initialization

  • Many local minima causing slow convergence

Optimization:

  1. Used warm-start from p=1 solution to initialize p=3

  2. Linear ramp initialization for parameters

  3. Switched to COBYLA optimizer (gradient-free, better for local minima)

Result:

  • Converged in 150 iterations (7× fewer)

  • Reached 0.85 approximation ratio

  • Total time: 30 seconds vs 5 minutes

Best Practices#

Performance Checklist#

Before running large simulation:

  1. Estimate requirements: - Memory: \(16 n \chi^2\) bytes - Time: \(O(T \chi^3)\) for T operations - Check against available resources

  2. Profile small version: - Run on 10-20 qubits first - Measure actual throughput - Extrapolate to full problem

  3. Optimize configuration: - Start with χ=32, increase if needed - Use complex64 unless high precision required - Enable GPU and cuQuantum

  4. Monitor during run: - Track memory usage - Check truncation errors - Verify convergence

Avoiding Common Pitfalls#

Don’t:

  • Assume χ=256 always needed (often χ=64 sufficient)

  • Use complex128 by default (2× slower than complex64)

  • Apply gates one-by-one (batch when possible)

  • Ignore memory budget warnings (will OOM later)

Do:

  • Analyze entanglement to determine required χ

  • Profile before optimizing

  • Use adaptive truncation thresholds

  • Check accuracy periodically

Summary#

ATLAS-Q’s performance model provides predictable scaling:

Computational Complexity:

  • Single-qubit gates: \(O(\chi^2)\)

  • Two-qubit gates: \(O(\chi^3)\) (SVD dominated)

  • TDVP time step: \(O(n \chi^3 d^2)\)

  • VQE iteration: \(O(p T \chi^3 d^2)\)

Memory Scaling:

  • MPS storage: \(O(n \chi^2 d)\) - linear in qubits

  • Working memory: \(O(\chi^2 d^2)\) - for operations

  • Practical limits: 100+ qubits with χ=128 on 16 GB GPU

Key Insights:

  1. χ dominates: Cubic scaling in bond dimension for time

  2. SVD bottleneck: 40-70% of runtime for χ>64

  3. Linear in n: Number of qubits has linear impact

  4. GPU essential: 10-100× speedup over CPU

  5. Entanglement matters: Required χ depends on problem structure

Optimization Priority:

  1. Choose right χ: Analyze entanglement, don’t over-provision

  2. Use GPU: Always for production runs

  3. Enable cuQuantum: 2-3× speedup for χ>128

  4. Batch operations: Reduce kernel launch overhead

  5. Profile and iterate: Measure, optimize hot path, repeat

Practical Performance (A100 GPU):

  • 50 qubits, χ=64: ~10 ms per two-qubit gate

  • VQE (10 qubits, 3 layers): ~30 seconds for 100 iterations

  • TDVP (50 qubits, χ=64): ~25 seconds for 1000 time steps

  • QAOA (20 qubits, p=3): ~1.5 seconds for 150 iterations

For detailed optimization techniques, see: