Distributed MPS#
Multi-GPU MPS simulation for 100s-1000s of qubits.
Overview#
The distributed MPS module enables tensor network simulations on systems with hundreds to thousands of qubits by partitioning the computation across multiple GPUs. This is essential for simulating large quantum circuits that exceed single-GPU memory limits.
Mathematical Foundation
MPS represents a quantum state as a chain of tensors:
\[|\psi\rangle = \sum_{i_0,\dots,i_{N-1}} A^{[0]}_{i_0} A^{[1]}_{i_1} \cdots A^{[N-1]}_{i_{N-1}} \, |i_0 i_1 \cdots i_{N-1}\rangle\]
Each tensor \(A^{[k]}_{i_k}\) has shape \((\chi_{k-1}, d, \chi_k)\), where \(d=2\) is the physical dimension and \(\chi_k\) is the bond dimension between sites k and k+1.
Bond-Parallel Decomposition
The key insight is that MPS operations are mostly local to individual tensors or neighboring pairs. We partition the chain:
GPU 0: Sites [0, k₀) with tensors A⁽⁰⁾, A⁽¹⁾, …, A⁽ᵏ⁰⁻¹⁾
GPU 1: Sites [k₀, k₁) with tensors A⁽ᵏ⁰⁾, …, A⁽ᵏ¹⁻¹⁾
GPU p−1: Sites [kₚ₋₂, N) with the tensors of the final partition
Gates acting within a partition are computed locally with no communication. Gates acting across partition boundaries require GPU-to-GPU communication of bond tensors.
Communication Complexity
Single-qubit gates: no communication (local operation)
Two-qubit gate within a partition: no communication
Two-qubit gate across boundary: O(χ² d²) communication to exchange bond tensors
Canonicalization: O(N χ² d) global reduction for normalization
For circuits with nearest-neighbor gates on a 1D chain, the fraction of cross-boundary gates is O(p/N) where p is the number of GPUs and N is the number of qubits. With N/p ≈ 25-50, communication overhead is 2-4%.
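To make the partitioning and the cross-boundary count concrete, the sketch below computes which rank owns each qubit under the uniform scheme and the resulting fraction of boundary-crossing nearest-neighbor gates. The helper functions are illustrative only (not part of the atlas_q API), and the assumption that the last rank absorbs any remainder sites is ours:
# Illustrative sketch (not part of atlas_q): uniform site partitioning
# and the fraction of nearest-neighbor gates that cross a GPU boundary.

def owner_rank(qubit: int, num_qubits: int, world_size: int) -> int:
    """Rank that owns `qubit` under uniform partitioning [r*(N//p), (r+1)*(N//p))."""
    sites_per_rank = num_qubits // world_size
    return min(qubit // sites_per_rank, world_size - 1)  # assume last rank absorbs the remainder

def cross_boundary_fraction(num_qubits: int, world_size: int) -> float:
    """Fraction of nearest-neighbor (q, q+1) gates whose qubits live on different ranks."""
    gates = [(q, q + 1) for q in range(num_qubits - 1)]
    crossing = sum(owner_rank(a, num_qubits, world_size) != owner_rank(b, num_qubits, world_size)
                   for a, b in gates)
    return crossing / len(gates)

# 100 qubits on 4 GPUs -> 3 of 99 nearest-neighbor gates cross a boundary (~3%)
print(cross_boundary_fraction(100, 4))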
Key Features
Bond-parallel distribution: Partition MPS chain across GPUs by sites
Overlapped communication: Pipeline SVD computation with GPU-to-GPU transfers
Adaptive load balancing: Dynamic repartitioning for bond dimension spikes
Checkpoint/restart: Fault tolerance for multi-hour simulations
Data parallelism mode: Replicate state for embarrassingly parallel measurements
NCCL backend: Optimized GPU-to-GPU communication (NVLink/InfiniBand)
Requirements
PyTorch with distributed support (torch.distributed)
NCCL backend for NVIDIA GPUs (included with PyTorch)
Multiple GPUs with GPU-Direct or NVLink for optimal performance
Launch with torchrun or torch.distributed.launch
Distribution Modes#
- class atlas_q.distributed_mps.DistMode[source]#
Enumeration of distribution strategies for multi-GPU computation.
- NONE#
Single GPU mode (no distribution).
Use for small systems (N ≤ 30 qubits) or for debugging distributed code on a single GPU.
- DATA_PARALLEL#
Data-parallel mode: replicate full MPS on each GPU.
Useful for:
Parallel measurement of different observables
Monte Carlo sampling of measurement outcomes
Ensemble simulations with different noise realizations
Each GPU holds a complete copy of the MPS. Gate operations are replicated, but measurement and sampling tasks are distributed.
- BOND_PARALLEL#
Bond-parallel mode: partition MPS chain by sites (default and recommended).
Each GPU owns a contiguous segment of the MPS chain. This is the most memory-efficient mode, enabling simulations of 100-1000 qubits.
Partitioning strategy: For N qubits on p GPUs, GPU rank r owns sites [r·⌊N/p⌋, (r+1)·⌊N/p⌋).
Communication patterns:
Two-qubit gates within a partition: no communication
Two-qubit gates crossing partition boundary: point-to-point transfer of bond tensors (O(χ²) data)
Canonicalization: ring-based all-reduce (O(p) steps, O(χ² N/p) data per GPU)
- PIPELINE_PARALLEL#
Experimental pipeline-parallel mode for deep circuits.
Not yet implemented. Will partition circuit depth across GPUs for very deep circuits where depth > N.
Configuration#
- class atlas_q.distributed_mps.DistributedConfig[source]#
Configuration class for distributed MPS simulation.
- Parameters:
mode (DistMode) – Distribution mode (default: DistMode.BOND_PARALLEL)
backend (str) – Communication backend (default: 'nccl' for GPUs, 'gloo' for CPUs)
world_size (int) – Number of processes/GPUs (auto-detected if using torchrun)
overlap_comm (bool) – Overlap communication with computation (default: True)
checkpoint_every (int) – Checkpoint frequency in gates (default: 100, 0 = disabled)
checkpoint_dir (str) – Directory for checkpoint files (default: './checkpoints')
partition_strategy (str) – Partitioning strategy ('uniform' or 'adaptive', default: 'uniform')
load_balance_threshold (float) – Trigger repartitioning when max load / avg load > threshold (default: 1.5)
comm_buffer_size (int) – Pre-allocated communication buffer size in MB (default: 128)
pin_memory (bool) – Use pinned memory for faster GPU transfers (default: True)
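For reference, a bond-parallel configuration spelling out the parameters above might look like the following sketch; the specific values are illustrative, not recommendations:
from atlas_q.distributed_mps import DistributedConfig, DistMode

# Illustrative configuration (values are examples, not defaults or recommendations):
# bond-parallel NCCL run with adaptive partitioning, checkpointing every 500 gates,
# and communication/computation overlap enabled.
config = DistributedConfig(
    mode=DistMode.BOND_PARALLEL,
    backend='nccl',
    overlap_comm=True,
    checkpoint_every=500,
    checkpoint_dir='./checkpoints',
    partition_strategy='adaptive',
    load_balance_threshold=1.5,
    comm_buffer_size=128,
    pin_memory=True,
)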
Backend Selection
nccl: Best for NVIDIA GPUs with NVLink or InfiniBand. Required for multi-node.
gloo: CPU-based backend, useful for debugging or CPU-only clusters.
mpi: Requires mpi4py, useful for HPC environments with existing MPI infrastructure.
Partitioning Strategies
uniform: Divide qubits evenly across GPUs (⌊N/p⌋ qubits per GPU).
adaptive: Use smaller partitions where bond dimension χ is large, larger partitions where χ is small. Automatically rebalances during simulation based on memory usage.
Memory Estimates
Each partition requires:
Tensors: (N/p) × χ² × d × dtype_size (d=2 for qubits)
Communication buffers: 2 × χ² × d × dtype_size per boundary
SVD workspace: χ³ × dtype_size (for canonicalization)
For example, 100 qubits on 4 GPUs with χ=32, complex64:
Tensors: 25 × 32² × 2 × 8 bytes = 410 KB per GPU
Buffers: 2 × 32² × 2 × 8 = 33 KB per boundary
SVD: 32³ × 8 = 262 KB
Total: ~700 KB per GPU. Very small; communication dominates runtime for moderate χ.
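The estimate above can be reproduced with a short back-of-envelope helper. This follows the formulas in this section only and is not an atlas_q function:
# Back-of-envelope per-GPU memory estimate following the formulas above
# (illustrative helper, not part of atlas_q).

def partition_memory_bytes(num_qubits, world_size, chi, d=2, dtype_size=8):
    sites = num_qubits // world_size
    tensors = sites * chi**2 * d * dtype_size   # MPS site tensors
    buffers = 2 * chi**2 * d * dtype_size       # communication buffers per boundary
    svd_workspace = chi**3 * dtype_size         # canonicalization workspace
    return tensors + buffers + svd_workspace

# 100 qubits on 4 GPUs, chi=32, complex64 (8 bytes): ~700 KB per GPU
print(f"{partition_memory_bytes(100, 4, 32) / 1e3:.0f} KB")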
- class atlas_q.distributed_mps.MPSPartition[source]#
Represents the portion of MPS residing on one GPU.
- Parameters:
rank (int) – GPU rank in distributed group (0 to world_size-1)
start_site (int) – Index of first qubit on this GPU (inclusive)
end_site (int) – Index of last qubit on this GPU (exclusive)
tensors (list[torch.Tensor]) – MPS tensors A⁽ˢᵗᵃʳᵗ⁾, …, A⁽ᵉⁿᵈ⁻¹⁾ on this GPU
device (torch.device) – CUDA device (e.g., cuda:0, cuda:1)
Attributes:
- num_sites#
Number of qubits in this partition: end_site - start_site
- left_boundary#
Left boundary bond index (connects to previous partition or None if rank=0)
- right_boundary#
Right boundary bond index (connects to next partition or None if rank=world_size-1)
- memory_usage#
Current GPU memory usage in bytes for this partition
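Each rank can inspect its local partition through these attributes. A minimal sketch, assuming an existing DistributedMPS instance named mps:
# Inspect the local partition on each rank (sketch; assumes `mps` is a DistributedMPS)
part = mps.partition
print(f"rank {part.rank}: sites [{part.start_site}, {part.end_site}) "
      f"({part.num_sites} qubits), memory = {part.memory_usage / 1e6:.1f} MB")
if part.left_boundary is not None:
    print(f"  left boundary bond connects to rank {part.rank - 1}")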
Classes#
- class atlas_q.distributed_mps.DistributedMPS(num_qubits, bond_dim, config=None)[source]#
Distributed Matrix Product State using bond-wise domain decomposition across multiple GPUs.
Each GPU owns a contiguous segment of the MPS chain. Single-qubit gates are executed locally with no communication. Two-qubit gates within a partition are local, while gates crossing partition boundaries trigger point-to-point communication to exchange bond tensors.
Initialization
- Parameters:
num_qubits (int) – Total number of qubits in the system (recommended: 50-1000 for multi-GPU)
bond_dim (int) – Initial bond dimension χ (typically 16-64)
config (DistributedConfig) – Distribution configuration (if None, uses default bond-parallel mode)
The constructor initializes all tensors to the |0⟩ state and partitions them across GPUs according to the configuration. Each GPU receives approximately N/p qubits where p is the world size.
Attributes:
- rank#
Current GPU rank (0 to world_size-1)
- world_size#
Total number of GPUs in the distributed group
- partition#
MPSPartition object containing local tensors on this GPU
- comm_stream#
CUDA stream for asynchronous communication (overlaps with computation)
- config#
DistributedConfig used for this simulation
- num_gates_applied#
Total number of gates applied (for checkpoint triggering)
Methods:
- apply_single_qubit_gate(gate, qubit)[source]#
Apply single-qubit gate locally with no communication.
- Parameters:
gate (torch.Tensor) – 2×2 unitary matrix
qubit (int) – Target qubit index (0 to num_qubits-1)
- Raises:
ValueError – If qubit index is out of range
Performance: O(χ² d) tensor contraction on owning GPU. No communication required.
Single-qubit gates only affect the local tensor \(A^{[q]}\) on the owning GPU:
\[A'^{[q]}_{\alpha i \beta} = \sum_j U_{ij}\, A^{[q]}_{\alpha j \beta}\]
This is embarrassingly parallel across qubits owned by different GPUs.
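Conceptually, the update is a single contraction over the physical index. The sketch below shows the equivalent einsum on a standalone (χ, d, χ) tensor; it illustrates the formula only and is not the library's internal implementation:
import torch

# Illustrative local single-qubit update (not the library internals):
# A'[a, i, b] = sum_j U[i, j] * A[a, j, b]
chi, d = 32, 2
A = torch.randn(chi, d, chi, dtype=torch.complex64)                   # local MPS site tensor
U = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64) / 2**0.5   # Hadamard

A_new = torch.einsum('ij,ajb->aib', U, A)   # O(chi^2 * d) contraction
print(A_new.shape)                          # torch.Size([32, 2, 32])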
- apply_two_qubit_gate(gate, qubit1, qubit2)[source]#
Apply two-qubit gate, communicating across GPUs if necessary.
- Parameters:
gate (torch.Tensor) – 4×4 unitary matrix acting on the two target qubits
qubit1 (int) – First qubit index
qubit2 (int) – Second qubit index (must be adjacent to qubit1)
- Raises:
ValueError – If qubits are not adjacent or out of range
Cases:
Both qubits in same partition (intra-partition): O(χ³ d²) tensor contraction, no communication
Qubits in adjacent partitions (inter-partition): O(χ² d²) communication to transfer bond tensor, then O(χ³ d²) computation
Algorithm for inter-partition gates (qubits q and q+1 on different GPUs):
GPU holding qubit q sends bond tensor B[α, i, β] to GPU holding qubit q+1
GPU holding qubit q+1 contracts gate with both local tensors
GPU holding qubit q+1 performs SVD to split result back into two tensors
GPU holding qubit q+1 sends updated bond tensor back to GPU holding qubit q
Total data transfer: 2 × χ² × sizeof(complex64) = 16χ² bytes per gate.
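The local part of this update can be sketched on standalone tensors: merge the two sites, contract with the gate, and split back with a truncated SVD. This illustrates only the O(χ³) computation; the inter-GPU transfer described above is handled by the library:
import torch

# Illustrative two-site update on standalone tensors (not the library internals):
# contract neighboring sites with a 4x4 gate, then split back via truncated SVD.
chi, d, chi_max = 32, 2, 32
A = torch.randn(chi, d, chi, dtype=torch.complex64)               # site q
B = torch.randn(chi, d, chi, dtype=torch.complex64)               # site q+1
G = torch.eye(d * d, dtype=torch.complex64).reshape(d, d, d, d)   # gate G[i', j', i, j]

theta = torch.einsum('aib,bjc->aijc', A, B)        # merge the two sites
theta = torch.einsum('xyij,aijc->axyc', G, theta)  # apply the gate
theta = theta.reshape(chi * d, d * chi)

U, S, Vh = torch.linalg.svd(theta, full_matrices=False)
k = min(chi_max, S.shape[0])                       # truncate to chi_max singular values
A_new = U[:, :k].reshape(chi, d, k)
B_new = (torch.diag(S[:k]).to(torch.complex64) @ Vh[:k]).reshape(k, d, chi)
print(A_new.shape, B_new.shape)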
- apply_mpo(mpo, qubit_range=None)#
Apply Matrix Product Operator across distributed MPS.
- Parameters:
mpo – Matrix Product Operator tensors to apply
qubit_range (optional) – Contiguous range of qubits the MPO acts on (default: all qubits)
- Raises:
ValueError – If MPO size doesn’t match qubit range
MPO application requires coordination across all GPUs. Each GPU applies local MPO tensors to its partition, then synchronizes boundaries.
- canonicalize_distributed(center=None, chi_max=None)#
Canonicalize MPS in distributed setting using ring-based algorithm.
- Parameters:
center (int, optional) – Target site for the orthogonality center
chi_max (int, optional) – Maximum bond dimension retained during truncation
- Returns:
Normalization factor
- Return type:
Algorithm:
Left sweep: GPU 0 canonicalizes its tensors, sends rightmost bond to GPU 1. GPU 1 canonicalizes, sends to GPU 2, etc.
Right sweep: GPU (p-1) canonicalizes its tensors, sends leftmost bond to GPU (p-2), etc.
Center normalization: GPU holding center normalizes and returns norm.
Communication: O(p) messages of size O(χ²) each.
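Within each GPU, a sweep reduces to local QR factorizations, with the residual R factor carried to the next tensor (or sent to the next rank at a partition boundary). A single left-canonicalization step is sketched below for illustration; it is not the library's internal code:
import torch

# Illustrative left-canonicalization step on one site tensor (not the library internals):
# QR-factorize A reshaped as (chi*d, chi); Q becomes the left-canonical tensor and
# R is absorbed into the next site (or sent to the next rank at a boundary).
chi, d = 32, 2
A = torch.randn(chi, d, chi, dtype=torch.complex64)
A_next = torch.randn(chi, d, chi, dtype=torch.complex64)

Q, R = torch.linalg.qr(A.reshape(chi * d, chi))
A_left = Q.reshape(chi, d, chi)                  # left-canonical: Q has orthonormal columns
A_next = torch.einsum('ab,bic->aic', R, A_next)  # carry R into the next site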
- compute_expectation_distributed(operator, sites)#
Compute expectation value of operator across distributed MPS.
- Parameters:
operator (torch.Tensor) – Operator matrix (2×2 for single-site, 4×4 for two-site)
sites (list[int]) – Qubit indices the operator acts on
- Returns:
Expectation value ⟨ψ|O|ψ⟩
- Return type:
Uses MPS contraction algorithm with operator insertion. If operator acts across partition boundaries, requires communication to exchange bond tensors.
- checkpoint(path)[source]#
Save distributed MPS checkpoint to disk.
- Parameters:
path (str) – Checkpoint directory (created if doesn’t exist)
Each GPU saves its partition to {path}/rank_{r}.pt. A metadata file {path}/metadata.json is also saved with global state information (world_size, num_qubits, bond_dims, etc.). Checkpoints enable fault tolerance for multi-hour simulations; if a GPU fails, restart from the last checkpoint.
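A checkpoint can also be written explicitly at any point, in addition to the automatic checkpoint_every cadence. A minimal sketch, assuming an existing DistributedMPS instance mps; the directory name is illustrative:
# Explicit checkpoint in addition to the automatic checkpoint_every cadence
# (assumes `mps` is a DistributedMPS; './ckpt/manual' is an illustrative path)
mps.checkpoint('./ckpt/manual')
# ...later, on a fresh DistributedMPS with the same world_size and num_qubits:
# mps.load_checkpoint('./ckpt/manual')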
- load_checkpoint(path)[source]#
Load distributed MPS from checkpoint files.
- Parameters:
path (str) – Checkpoint directory
- Raises:
FileNotFoundError – If checkpoint files are missing
ValueError – If checkpoint world_size doesn’t match current configuration
Reads metadata and partition files. Verifies consistency across GPUs.
- gather_state(root=0)#
Gather full MPS state to root GPU.
- Parameters:
root (int) – Root GPU rank (default: 0)
- Returns:
Complete AdaptiveMPS object on root GPU, None on other GPUs
- Return type:
AdaptiveMPS or None
Useful for final measurements or saving full state. Uses all-gather collective communication. Warning: Full state may not fit in root GPU memory for large systems.
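A typical pattern is to gather only for final analysis, and only when the full state is known to fit on the root device. A minimal sketch, assuming an existing DistributedMPS instance mps:
# Sketch: gather the full MPS on rank 0 for final analysis
# (assumes `mps` is a DistributedMPS; only safe when the full state fits on the root GPU).
full_mps = mps.gather_state(root=0)
if mps.rank == 0:
    # full_mps is a complete AdaptiveMPS; other ranks receive None
    print(type(full_mps).__name__)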
- scatter_state(mps, root=0)#
Scatter MPS from root GPU to all partitions.
- Parameters:
mps (AdaptiveMPS) – Full MPS on root GPU (None on other GPUs)
root (int) – Root GPU rank (default: 0)
Inverse of gather_state. Useful for initializing distributed simulation from a prepared state.
- rebalance_partitions()#
Dynamically rebalance partitions based on current bond dimensions.
Only active when partition_strategy='adaptive'. Redistributes qubits to equalize memory usage across GPUs when bond dimensions are highly non-uniform. Triggered automatically when load imbalance exceeds load_balance_threshold.
Examples#
Example 1: Launching Distributed Simulation
First, create a Python script distributed_grover.py:
import torch
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig, DistMode
def main():
    # Configuration
    config = DistributedConfig(
        mode=DistMode.BOND_PARALLEL,
        backend='nccl',
        overlap_comm=True,
        checkpoint_every=200
    )

    # Create 100-qubit distributed MPS
    mps = DistributedMPS(num_qubits=100, bond_dim=32, config=config)

    # Define gates
    H = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64, device=f'cuda:{mps.rank}')
    H = H / torch.sqrt(torch.tensor(2.0, device=H.device))
    X = torch.tensor([[0, 1], [1, 0]], dtype=torch.complex64, device=f'cuda:{mps.rank}')

    # Grover initialization: H on all qubits
    for i in range(100):
        mps.apply_single_qubit_gate(H, i)

    # Oracle: mark state |11...1⟩
    for i in range(100):
        mps.apply_single_qubit_gate(X, i)
    # Multi-controlled Z gate (simplified: apply CZ on nearest-neighbor pairs)
    CZ = torch.diag(torch.tensor([1, 1, 1, -1], dtype=torch.complex64, device=f'cuda:{mps.rank}'))
    for i in range(0, 99, 2):
        mps.apply_two_qubit_gate(CZ, i, i+1)
    for i in range(100):
        mps.apply_single_qubit_gate(X, i)

    # Diffusion operator: H layer shown; the X, multi-CZ, X, H steps mirror the oracle above
    for i in range(100):
        mps.apply_single_qubit_gate(H, i)

    # Final normalization
    norm = mps.canonicalize_distributed()

    if mps.rank == 0:
        print(f"Grover iteration complete on {mps.world_size} GPUs")
        print(f"State norm: {norm.real:.6f}")

if __name__ == '__main__':
    main()
Launch with torchrun:
# Single node, 4 GPUs
torchrun --nproc_per_node=4 distributed_grover.py
# Multi-node: 2 nodes, 4 GPUs each
# On node 0:
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
--master_addr=192.168.1.1 --master_port=29500 \
distributed_grover.py
# On node 1:
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
--master_addr=192.168.1.1 --master_port=29500 \
distributed_grover.py
Example 2: Checkpoint and Restart
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig
import torch
config = DistributedConfig(checkpoint_every=1000, checkpoint_dir='./ckpt')
mps = DistributedMPS(num_qubits=200, bond_dim=48, config=config)
# Simulate the first 25 layers (~10,000 gates: 200 H + 199 CNOT per layer)
H = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64, device=f'cuda:{mps.rank}') / (2**0.5)
CNOT = torch.eye(4, dtype=torch.complex64, device=f'cuda:{mps.rank}')
CNOT[2:, 2:] = torch.tensor([[0, 1], [1, 0]], dtype=torch.complex64, device=f'cuda:{mps.rank}')

for layer in range(25):
    # Hadamard layer
    for i in range(200):
        mps.apply_single_qubit_gate(H, i)
    # CNOT layer
    for i in range(199):
        mps.apply_two_qubit_gate(CNOT, i, i+1)
    # Checkpoints are written automatically every 1000 gates (checkpoint_every)
    if (layer + 1) % 20 == 0 and mps.rank == 0:
        print(f"Checkpointed at layer {layer+1}")

# Later: restart from checkpoint
mps2 = DistributedMPS(200, 48, config)
mps2.load_checkpoint('./ckpt')

if mps2.rank == 0:
    print("Resumed from checkpoint, continuing simulation...")

# Continue with another 25 layers
for layer in range(25):
    for i in range(200):
        mps2.apply_single_qubit_gate(H, i)
    for i in range(199):
        mps2.apply_two_qubit_gate(CNOT, i, i+1)
Example 3: Adaptive Load Balancing
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig, DistMode
import torch

# Adaptive partitioning for non-uniform χ
config = DistributedConfig(
    mode=DistMode.BOND_PARALLEL,
    partition_strategy='adaptive',
    load_balance_threshold=1.5  # Rebalance when imbalance > 50%
)
mps = DistributedMPS(num_qubits=300, bond_dim=64, config=config)
device = f'cuda:{mps.rank}'

# Simulate a circuit with a χ spike in the middle:
# random gates on the middle qubits create higher entanglement there
for depth in range(20):
    # Apply gates in middle region (higher χ)
    for i in range(100, 200):
        U = torch.randn(2, 2, dtype=torch.complex64, device=device)
        U, _ = torch.linalg.qr(U)  # Random unitary
        mps.apply_single_qubit_gate(U, i)
    for i in range(100, 199):
        U2 = torch.randn(4, 4, dtype=torch.complex64, device=device)
        U2, _ = torch.linalg.qr(U2)
        mps.apply_two_qubit_gate(U2, i, i+1)

    # Check load balance every 5 layers
    if depth % 5 == 0:
        mem_usage = mps.partition.memory_usage
        if mps.rank == 0:
            print(f"Layer {depth}: GPU {mps.rank} memory = {mem_usage / 1e9:.2f} GB")
        # Automatic rebalancing if threshold exceeded
        mps.rebalance_partitions()
Example 4: Data-Parallel Measurements
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig, DistMode
import torch
import torch.distributed as dist

# Data-parallel mode: replicate state, parallelize measurements
config = DistributedConfig(mode=DistMode.DATA_PARALLEL)
mps = DistributedMPS(num_qubits=30, bond_dim=64, config=config)

# Prepare entangled state (same on all GPUs)
H = torch.tensor([[1, 1], [1, -1]], dtype=torch.complex64, device=f'cuda:{mps.rank}') / (2**0.5)
CNOT = torch.eye(4, dtype=torch.complex64, device=f'cuda:{mps.rank}')
CNOT[2:, 2:] = torch.tensor([[0, 1], [1, 0]], dtype=torch.complex64, device=f'cuda:{mps.rank}')

for i in range(30):
    mps.apply_single_qubit_gate(H, i)
for i in range(29):
    mps.apply_two_qubit_gate(CNOT, i, i+1)

# Each GPU measures different observables in parallel
observables = [
    torch.tensor([[1, 0], [0, -1]], dtype=torch.complex64, device=f'cuda:{mps.rank}'),   # Z
    torch.tensor([[0, 1], [1, 0]], dtype=torch.complex64, device=f'cuda:{mps.rank}'),    # X
    torch.tensor([[0, -1j], [1j, 0]], dtype=torch.complex64, device=f'cuda:{mps.rank}'), # Y
]

# Each GPU computes a subset of the measurements
my_observables = observables[mps.rank::mps.world_size]
results = []
for qubit in range(30):
    for obs in my_observables:
        exp_val = mps.compute_expectation_distributed(obs, [qubit])
        results.append((qubit, obs, exp_val))

# Gather results to rank 0
all_results = [None] * mps.world_size
dist.all_gather_object(all_results, results)

if mps.rank == 0:
    flat_results = [item for sublist in all_results for item in sublist]
    print(f"Computed {len(flat_results)} expectation values across {mps.world_size} GPUs")
Performance Notes#
Scaling Efficiency
Distributed MPS achieves near-linear scaling for large systems:
| GPUs | Time (s) | Speedup | Efficiency | Cross-boundary Gates |
|---|---|---|---|---|
| 1 | 245.3 | 1.0× | 100% | 0% (baseline) |
| 2 | 128.7 | 1.91× | 95% | 1% (1 boundary) |
| 4 | 67.2 | 3.65× | 91% | 3% (3 boundaries) |
| 8 | 36.8 | 6.67× | 83% | 7% (7 boundaries) |
Efficiency decreases slightly with more GPUs due to increased communication overhead. For p GPUs, the fraction of cross-boundary gates is approximately (p-1)/N for nearest-neighbor circuits.
Communication Costs
Single-qubit gate: 0 bytes (local operation)
Two-qubit gate (intra-partition): 0 bytes
Two-qubit gate (inter-partition): 2 × χ² × 8 bytes = 16χ² bytes
Canonicalization: p × χ² × 8 bytes = 8pχ² bytes (ring algorithm)
For χ=32: inter-partition gate transfers 16 KB, canonicalization transfers 8p KB.
With NVLink (300 GB/s) or InfiniBand (200 Gb/s), communication time is negligible compared to χ³ SVD cost for χ ≥ 16.
Bandwidth Requirements
For 10000 gates on 100 qubits with 4 GPUs (3% cross-boundary):
Cross-boundary gates: 10000 × 0.03 = 300 gates
Data transfer: 300 × 16 KB = 4.8 MB total
Canonicalization every 1000 gates: 10 × 8 × 4 × 32² = 327 KB
Total communication: ~5 MB for entire simulation. Network bandwidth is not a bottleneck.
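The same back-of-envelope arithmetic in code, using the per-gate and canonicalization costs quoted above (illustrative only):
# Back-of-envelope communication estimate: 10,000 gates, 100 qubits, 4 GPUs, chi=32
# (illustrative arithmetic following the formulas above, not an atlas_q function)
chi, p = 32, 4
gates, cross_fraction = 10_000, 0.03
per_gate_bytes = 16 * chi**2     # inter-partition gate transfer (16 chi^2 bytes)
canon_bytes = 8 * p * chi**2     # one ring canonicalization (8 p chi^2 bytes)

total = gates * cross_fraction * per_gate_bytes + 10 * canon_bytes   # canonicalize every 1000 gates
print(f"{total / 1e6:.1f} MB")   # ~5 MB for the entire simulation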
Recommended Configurations
| System Size | GPUs | Qubits per GPU | Recommended χ |
|---|---|---|---|
| 50-100 qubits | 2-4 | 25-50 | χ ≤ 64 |
| 100-200 qubits | 4-8 | 25-50 | χ ≤ 48 |
| 200-500 qubits | 8-16 | 25-50 | χ ≤ 32 |
| 500-1000 qubits | 16-32 | 30-60 | χ ≤ 24 |
Optimization Tips
overlap_comm=True: 10-20% speedup by pipelining communication with computation
NCCL backend: 2-3× faster than gloo for GPU-to-GPU transfers
NVLink topology: Use nvidia-smi topo -m to check GPU interconnect. Prefer GPUs with NVLink over PCIe.
Pin memory: Set pin_memory=True for faster CPU-GPU transfers during checkpointing.
Batch gates: Group single-qubit gates together to minimize synchronization points.
Adaptive partitioning: For circuits with non-uniform entanglement, use partition_strategy='adaptive'.
Troubleshooting
Slow inter-partition gates: Check GPU topology with nvidia-smi topo -m. Ensure NVLink or fast PCIe connections.
Memory imbalance: Enable adaptive partitioning with partition_strategy='adaptive'.
Checkpoint overhead: Increase checkpoint_every to reduce I/O frequency (e.g., 1000-5000 gates).
Initialization hangs: Ensure all GPUs can communicate. Test with torch.distributed.barrier() (see the sketch below).
NCCL errors: Set export NCCL_DEBUG=INFO for detailed communication logs.
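A minimal connectivity check using standard torch.distributed calls can help diagnose hangs; run it with torchrun. This is generic PyTorch, not an atlas_q utility:
# Minimal NCCL connectivity check (generic torch.distributed, run with torchrun)
import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

dist.barrier()   # hangs here if any rank cannot communicate
if rank == 0:
    print(f"All {dist.get_world_size()} ranks reached the barrier")
dist.destroy_process_group()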
Benchmarks
Tested on NVIDIA A100 (80GB) × 8 with NVLink:
100-qubit GHZ state: 1.2s (1 GPU) → 0.18s (8 GPUs), 6.7× speedup
200-qubit random circuit (depth=20, χ=32): 156s (4 GPUs) → 44s (16 GPUs), 3.5× speedup
500-qubit QAOA (depth=10, χ=24): 892s on 32 GPUs (11s per GPU)
Limitations#
Architectural Constraints
Requires nearest-neighbor gate pattern for MPS efficiency. Long-range gates increase bond dimension exponentially.
Bond dimension χ must fit in GPU memory: each partition requires O(N/p × χ²) storage.
Canonicalization is a serial bottleneck: requires O(p) sequential communication steps.
Practical Limits
Maximum tested: 1024 qubits on 32 GPUs with χ=16
Beyond 32 GPUs, communication overhead dominates for typical χ values
Shallow circuits (depth < 20) may not benefit from distribution due to initialization overhead
When NOT to Use Distributed MPS
Small systems (N < 50 qubits): single-GPU MPS is faster due to no communication overhead
Deep circuits with high entanglement (χ > 64 required): consider PEPS or Clifford+RBM methods
All-to-all connectivity: MPS is inefficient; use statevector for N < 20 or approximate methods
Use Cases#
Ideal Applications
Large-scale quantum circuit simulation: 100-1000 qubit circuits with nearest-neighbor gates
Quantum chemistry: Large molecules requiring many qubits (e.g., 100-atom systems)
Quantum optimization: QAOA on graphs with local connectivity (MaxCut, TSP)
Quantum error correction: Simulation of surface codes or repetition codes (1D/2D layouts)
Quantum machine learning: Training quantum neural networks with distributed gradients
Research Applications
Benchmark classical simulators against quantum hardware (100+ qubit circuits)
Study entanglement dynamics in many-body systems
Verify quantum advantage claims for specific problem instances
Develop and test quantum algorithms at scale before hardware availability
See Also#
Parallel Computation - Multi-GPU setup guide
How to Handle Large Quantum Systems - Scaling strategies and memory management
MPS PyTorch Backend - Single-GPU MPS implementation
atlas_q.adaptive_mps - Adaptive bond dimension algorithms
Circuit Cutting - Circuit partitioning for distributed simulation