Parallel Computation#
Problem#
Quantum simulations are computationally expensive, but many tasks are parallelizable:
Large bond dimensions: Single GPU insufficient for χ > 512
Parameter sweeps: VQE/QAOA hyperparameter optimization requires many trials
Statistical sampling: Need 10,000+ measurement shots for accurate distributions
Ensemble averaging: Molecular dynamics, noise averaging, and other ensemble studies
Multi-task workflows: Benchmark different algorithms or Hamiltonians simultaneously
This guide covers parallel computation strategies for ATLAS-Q, including multi-GPU distributed MPS, data parallelism for independent simulations, and task parallelism for parameter sweeps and ensemble studies.
See also
How to Handle Large Quantum Systems for distributed MPS details, Integrate cuQuantum for GPU-accelerated operations, How to Optimize Performance for single-GPU performance tuning, ../reference/distributed_mps for API reference.
Prerequisites#
You need the following (a quick verification sketch follows this list):
Multiple GPUs: 2+ NVIDIA GPUs (same or different models)
PyTorch distributed: Installed with CUDA support
Network: For multi-node, InfiniBand or 10+ Gbps Ethernet
NCCL backend: For efficient GPU-to-GPU communication
Storage: Shared filesystem for multi-node checkpoints
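Before launching any of the strategies below, it is worth confirming that PyTorch can see all GPUs and that the NCCL backend is available. A minimal check using only standard PyTorch calls:
import torch
import torch.distributed as dist

# Number of visible CUDA devices (2+ required for the multi-GPU strategies)
num_gpus = torch.cuda.device_count()
print(f"Visible GPUs: {num_gpus}")
for i in range(num_gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")

# NCCL backend availability (needed for efficient GPU-to-GPU communication)
print(f"NCCL available: {dist.is_nccl_available()}")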
Strategies#
Strategy 1: Multi-GPU Distributed MPS (Bond-Parallel)#
Split MPS bond dimension across multiple GPUs for large-scale simulations.
Bond-parallel MPS setup:
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig
import torch.distributed as dist
import os

# Initialize distributed backend
dist.init_process_group(
    backend='nccl',
    init_method='env://',  # Use environment variables set by torchrun
    world_size=4,          # 4 GPUs
    rank=int(os.environ['RANK'])
)

# Configure bond-parallel distribution
config = DistributedConfig(
    mode='bond_parallel',  # Split bond dimension
    world_size=4,
    backend='nccl',
    device_ids=[0, 1, 2, 3]
)

# Create distributed MPS
# Bond dimension 512 split across 4 GPUs = 128 per GPU
mps = DistributedMPS(
    num_qubits=80,
    bond_dim=512,
    config=config
)

rank = dist.get_rank()
print(f"Rank {rank}: Managing bond dim partition of {512 // 4}")

# Apply gates - automatically distributed
for i in range(79):
    mps.apply_cnot(i, i + 1)

dist.barrier()  # Synchronize all processes
print(f"Rank {rank}: Completed distributed gate sequence")
Launch distributed job:
# Single-node multi-GPU (4 GPUs)
torchrun \
--nproc_per_node=4 \
--master_port=29500 \
distributed_mps_script.py
# Multi-node (2 nodes, 4 GPUs each = 8 GPUs total)
# On node 0:
torchrun \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=0 \
--master_addr=node0.example.com \
--master_port=29500 \
distributed_mps_script.py
# On node 1:
torchrun \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=1 \
--master_addr=node0.example.com \
--master_port=29500 \
distributed_mps_script.py
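torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker process. On multi-GPU nodes each process should bind to its local GPU rather than GPU 0; a minimal sketch of the usual pattern (the environment variables come from torchrun, not from ATLAS-Q):
import os
import torch
import torch.distributed as dist

# torchrun sets these for every worker process
rank = int(os.environ['RANK'])              # global rank across all nodes
local_rank = int(os.environ['LOCAL_RANK'])  # rank within this node
world_size = int(os.environ['WORLD_SIZE'])  # total number of processes

# Bind this process to its own GPU before allocating any CUDA tensors
torch.cuda.set_device(local_rank)

dist.init_process_group(backend='nccl', init_method='env://')
print(f"Rank {rank}/{world_size} bound to cuda:{local_rank}")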
Distributed TDVP example:
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig
from atlas_q.tdvp import TDVP
import torch.distributed as dist

# Initialize distributed backend
dist.init_process_group(backend='nccl', init_method='env://')
rank = dist.get_rank()
world_size = dist.get_world_size()

# Configure distributed MPS
config = DistributedConfig(
    mode='bond_parallel',
    world_size=world_size,
    backend='nccl'
)

# Create large distributed MPS
mps = DistributedMPS(
    num_qubits=100,
    bond_dim=1024,  # Split across GPUs
    config=config
)

# TDVP time evolution on distributed MPS (H is the problem Hamiltonian, built elsewhere)
tdvp = TDVP(hamiltonian=H, mps=mps, dt=0.01, device=f'cuda:{rank}')

for step in range(1000):
    E = tdvp.evolve_step()
    if rank == 0 and step % 100 == 0:
        print(f"[Step {step}] Energy: {E:.8f}")

dist.barrier()
dist.destroy_process_group()
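In bond-parallel mode each rank holds only a slice of every tensor, so scalar observables such as the energy must be combined across ranks. DistributedMPS is expected to handle this internally; the underlying primitive is an NCCL all-reduce, sketched here with plain torch.distributed calls and a hypothetical local_energy value:
import torch
import torch.distributed as dist

# Each rank's partial contribution to <psi|H|psi> (placeholder value)
local_energy = torch.zeros(1, device='cuda')

# Sum the partial contributions; every rank receives the total
dist.all_reduce(local_energy, op=dist.ReduceOp.SUM)
print(f"Rank {dist.get_rank()}: total energy = {local_energy.item():.8f}")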
Strategy 2: Data Parallelism for Independent Simulations#
Run multiple independent simulations in parallel for parameter sweeps or ensemble averaging.
Parallel VQE hyperparameter search:
import torch.multiprocessing as mp
from atlas_q.vqe_qaoa import VQE, VQEConfig
import torch

def run_vqe_trial(rank, H, learning_rates, return_dict):
    """
    Run VQE with a specific learning rate on the assigned GPU.

    Parameters
    ----------
    rank : int
        GPU index
    H : Hamiltonian
        Problem Hamiltonian
    learning_rates : list
        Learning rates to try (one per GPU)
    return_dict : multiprocessing.Manager.dict
        Shared dictionary for results
    """
    torch.cuda.set_device(rank)
    lr = learning_rates[rank]

    config = VQEConfig(
        max_iterations=500,
        optimizer='adam',
        learning_rate=lr
    )
    vqe = VQE(hamiltonian=H, config=config, device=f'cuda:{rank}')
    energy, params = vqe.optimize()

    # Store results
    return_dict[rank] = {
        'learning_rate': lr,
        'final_energy': energy,
        'num_iterations': len(vqe.energies),
        'convergence': vqe.energies
    }
    print(f"GPU {rank}: lr={lr:.4f}, E={energy:.8f}")

if __name__ == '__main__':
    # Hyperparameter grid (H is the problem Hamiltonian, built elsewhere)
    learning_rates = [0.001, 0.005, 0.01, 0.05]  # 4 trials for 4 GPUs
    num_gpus = len(learning_rates)

    # Shared result storage
    manager = mp.Manager()
    return_dict = manager.dict()

    # Launch parallel VQE trials
    mp.spawn(
        run_vqe_trial,
        args=(H, learning_rates, return_dict),
        nprocs=num_gpus,
        join=True
    )

    # Analyze results
    best_energy = float('inf')
    best_lr = None
    for rank, result in return_dict.items():
        lr = result['learning_rate']
        energy = result['final_energy']
        print(f"lr={lr:.4f}: E={energy:.8f}")
        if energy < best_energy:
            best_energy = energy
            best_lr = lr

    print(f"\nBest learning rate: {best_lr:.4f} (E={best_energy:.8f})")
Parallel QAOA with different ansatz depths:
from atlas_q.vqe_qaoa import QAOA, QAOAConfig
import torch.multiprocessing as mp
import torch

def run_qaoa_depth(rank, H, depths, results):
    """Run QAOA with a specific depth p."""
    torch.cuda.set_device(rank)
    p = depths[rank]

    config = QAOAConfig(
        p=p,
        optimizer='adam',
        learning_rate=0.02,
        max_iterations=500
    )
    qaoa = QAOA(hamiltonian=H, config=config, device=f'cuda:{rank}')
    energy, params = qaoa.optimize()

    results[rank] = {
        'p': p,
        'energy': energy,
        'params': params.cpu()
    }
    print(f"GPU {rank}: p={p}, E={energy:.8f}")

if __name__ == '__main__':
    # Test different QAOA depths (H_maxcut is the MaxCut Hamiltonian, built elsewhere)
    depths = [1, 2, 3, 4, 5, 6, 7, 8]  # 8 trials for 8 GPUs
    num_gpus = min(8, torch.cuda.device_count())

    manager = mp.Manager()
    results = manager.dict()

    # Run in parallel
    mp.spawn(
        run_qaoa_depth,
        args=(H_maxcut, depths[:num_gpus], results),
        nprocs=num_gpus,
        join=True
    )

    # Plot depth vs energy
    import matplotlib.pyplot as plt
    ps = [results[i]['p'] for i in range(num_gpus)]
    energies = [results[i]['energy'] for i in range(num_gpus)]

    plt.plot(ps, energies, 'o-')
    plt.xlabel('QAOA depth p')
    plt.ylabel('Final energy')
    plt.title('QAOA performance vs ansatz depth')
    plt.savefig('qaoa_depth_comparison.png')
Strategy 3: Task Parallelism for Ensemble Studies#
Parallelize ensemble averaging or multi-molecule simulations.
Parallel molecular VQE ensemble:
from atlas_q.vqe_qaoa import VQE, VQEConfig
import torch.multiprocessing as mp
import torch

# Molecular Hamiltonians (e.g., H2 at different bond lengths), built elsewhere
molecules = [
    ('H2_0.5A', H_h2_05),
    ('H2_0.7A', H_h2_07),
    ('H2_0.9A', H_h2_09),
    ('H2_1.1A', H_h2_11)
]

def compute_molecule_energy(rank, molecule, results):
    """Compute ground state energy for a molecule."""
    name, H = molecule
    torch.cuda.set_device(rank)

    config = VQEConfig(
        max_iterations=1000,
        optimizer='lbfgs',
        convergence_threshold=1e-8
    )
    vqe = VQE(hamiltonian=H, config=config, device=f'cuda:{rank}')
    energy, params = vqe.optimize()

    results[name] = {
        'energy': energy,
        'params': params.cpu()
    }
    print(f"{name} on GPU {rank}: E={energy:.10f} Hartree")

if __name__ == '__main__':
    # CUDA does not survive fork(); use the spawn start method for worker processes
    mp.set_start_method('spawn', force=True)

    num_gpus = len(molecules)
    manager = mp.Manager()
    results = manager.dict()

    # Compute all molecules in parallel
    processes = []
    for rank, molecule in enumerate(molecules):
        p = mp.Process(
            target=compute_molecule_energy,
            args=(rank, molecule, results)
        )
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

    # Plot potential energy curve
    import matplotlib.pyplot as plt
    bond_lengths = [0.5, 0.7, 0.9, 1.1]
    energies = [results[name]['energy'] for name, _ in molecules]

    plt.plot(bond_lengths, energies, 'o-')
    plt.xlabel('Bond length (Å)')
    plt.ylabel('Energy (Hartree)')
    plt.title('H2 potential energy curve')
    plt.savefig('h2_pes.png')
Strategy 4: Parallel Measurement Sampling#
Parallelize measurement shots across multiple GPUs for faster sampling.
Parallel shot sampling:
from atlas_q.adaptive_mps import AdaptiveMPS
import torch.multiprocessing as mp
import torch

def sample_on_gpu(rank, mps_state, num_shots, results):
    """Sample from the MPS on a specific GPU."""
    torch.cuda.set_device(rank)

    # Reconstruct MPS on this GPU
    mps = AdaptiveMPS(
        num_qubits=mps_state['num_qubits'],
        bond_dim=max(mps_state['bond_dims']),
        device=f'cuda:{rank}'
    )
    mps.tensors = [t.to(f'cuda:{rank}') for t in mps_state['tensors']]
    mps.bond_dims = mps_state['bond_dims']

    # Sample num_shots on this GPU
    samples = mps.sample(num_shots=num_shots)

    results[rank] = samples.cpu()
    print(f"GPU {rank}: Sampled {num_shots} shots")

if __name__ == '__main__':
    # Prepare MPS state
    mps = AdaptiveMPS(num_qubits=30, bond_dim=64, device='cuda:0')
    # ... apply circuit ...

    # Save state for distribution (move tensors to CPU so they can be sent to workers)
    mps_state = {
        'num_qubits': mps.num_qubits,
        'tensors': [t.cpu() for t in mps.tensors],
        'bond_dims': mps.bond_dims
    }

    # Parallel sampling: 40,000 shots across 4 GPUs = 10,000 per GPU
    total_shots = 40000
    num_gpus = 4
    shots_per_gpu = total_shots // num_gpus

    manager = mp.Manager()
    results = manager.dict()

    # Launch sampling on each GPU
    mp.spawn(
        sample_on_gpu,
        args=(mps_state, shots_per_gpu, results),
        nprocs=num_gpus,
        join=True
    )

    # Combine samples
    all_samples = torch.cat([results[i] for i in range(num_gpus)], dim=0)
    print(f"Total samples: {all_samples.shape[0]}")

    # Compute expectation values from samples
    # ... analysis ...
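As a sketch of the analysis step, assuming mps.sample returns a (shots × qubits) tensor of 0/1 computational-basis outcomes, per-qubit ⟨Z⟩ estimates follow directly from the bit averages:
# Map bits {0, 1} to Z eigenvalues {+1, -1}
z_values = 1.0 - 2.0 * all_samples.float()

# Per-qubit <Z_i> estimates with standard errors
z_mean = z_values.mean(dim=0)
z_stderr = z_values.std(dim=0) / (z_values.shape[0] ** 0.5)

for i in range(z_mean.shape[0]):
    print(f"<Z_{i}> = {z_mean[i]:.4f} ± {z_stderr[i]:.4f}")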
Strategy 5: Circuit Batching Across GPUs#
Execute different circuits on different GPUs for benchmarking or algorithm comparison.
Parallel circuit execution:
from atlas_q.adaptive_mps import AdaptiveMPS
import torch.multiprocessing as mp
import torch

def execute_circuit(rank, circuit, num_qubits, bond_dim, results):
    """Execute a circuit on the assigned GPU."""
    torch.cuda.set_device(rank)

    mps = AdaptiveMPS(
        num_qubits=num_qubits,
        bond_dim=bond_dim,
        device=f'cuda:{rank}'
    )

    # Apply circuit gates
    for gate in circuit:
        if gate['type'] == 'H':
            mps.apply_hadamard(gate['qubit'])
        elif gate['type'] == 'CNOT':
            mps.apply_cnot(gate['control'], gate['target'])
        # ... other gates ...

    # Compute final observables ('hamiltonian' is assumed to be defined at module level)
    samples = mps.sample(num_shots=10000)
    energy = mps.expectation_value(hamiltonian)

    results[rank] = {
        'samples': samples.cpu(),
        'energy': energy,
        'final_chi': max(mps.bond_dims)
    }
    print(f"GPU {rank}: Circuit completed, E={energy:.6f}")

if __name__ == '__main__':
    # CUDA does not survive fork(); use the spawn start method for worker processes
    mp.set_start_method('spawn', force=True)

    # Different circuits to compare
    circuits = [
        generate_qaoa_circuit(p=1),
        generate_qaoa_circuit(p=2),
        generate_vqe_circuit(layers=4),
        generate_vqe_circuit(layers=8)
    ]
    num_gpus = len(circuits)

    manager = mp.Manager()
    results = manager.dict()

    # Execute in parallel
    processes = []
    for rank, circuit in enumerate(circuits):
        p = mp.Process(
            target=execute_circuit,
            args=(rank, circuit, 20, 64, results)
        )
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

    # Compare results
    for rank, result in results.items():
        print(f"Circuit {rank}: E={result['energy']:.6f}, χ_max={result['final_chi']}")
Strategy 6: Multi-Node Distributed Computing#
Scale to multiple compute nodes with InfiniBand or high-speed network.
Multi-node setup:
# multi_node_simulation.py
from atlas_q.distributed_mps import DistributedMPS, DistributedConfig
import torch.distributed as dist
import torch
import os

def main():
    # Initialize distributed backend from environment variables
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    master_addr = os.environ['MASTER_ADDR']
    master_port = os.environ['MASTER_PORT']

    dist.init_process_group(
        backend='nccl',
        init_method=f'tcp://{master_addr}:{master_port}',
        world_size=world_size,
        rank=rank
    )

    local_rank = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    print(f"Rank {rank}/{world_size} initialized on GPU {local_rank}")

    # Configure distributed MPS
    # 4 nodes × 4 GPUs/node = 16 GPUs total
    config = DistributedConfig(
        mode='bond_parallel',
        world_size=world_size,
        backend='nccl'
    )

    # Extremely large MPS: 150 qubits, χ = 2048 split across 16 GPUs
    mps = DistributedMPS(
        num_qubits=150,
        bond_dim=2048,  # 128 per GPU
        config=config
    )
    print(f"Rank {rank}: Created distributed MPS")

    # Run simulation (H is the problem Hamiltonian, built elsewhere)
    from atlas_q.tdvp import TDVP
    tdvp = TDVP(hamiltonian=H, mps=mps, dt=0.001, device=f'cuda:{local_rank}')

    for step in range(10000):
        E = tdvp.evolve_step()
        if rank == 0 and step % 100 == 0:
            print(f"[Step {step}] Energy: {E:.10f}")

    dist.barrier()
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
SLURM job script for multi-node:
#!/bin/bash
#SBATCH --job-name=atlas_q_multi_node
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=24:00:00

# Load modules
module load cuda/12.1
module load nccl/2.18

# Rendezvous endpoint: first node in the allocation
export MASTER_ADDR=$(scontrol show hostnames $SLURM_NODELIST | head -n 1)
export MASTER_PORT=29500

# Launch one process per GPU; map SLURM's per-task variables onto the
# RANK/WORLD_SIZE variables that multi_node_simulation.py reads
srun bash -c 'RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python multi_node_simulation.py'
Troubleshooting#
NCCL Initialization Fails#
Problem: RuntimeError: NCCL error: unhandled system error.
Solution: Check network connectivity and NCCL installation.
# Test GPU-to-GPU communication with NVIDIA's nccl-tests suite
# (https://github.com/NVIDIA/nccl-tests, built separately)
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2   # all-reduce across 2 GPUs
# Check network
ping -c 5 <other_node>
# Verify NCCL environment
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Re-run with verbose logging
Process Group Timeout#
Problem: RuntimeError: Timed out initializing process group.
Solution: Increase timeout or check master address/port.
import torch.distributed as dist
import datetime

# Increase timeout to 30 minutes
dist.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(minutes=30)
)
Out of Sync Errors#
Problem: GPUs out of sync, incorrect results.
Solution: Add explicit barriers and synchronization.
import torch
import torch.distributed as dist

# Synchronize before and after critical sections
dist.barrier()
# ... perform distributed operation ...
dist.barrier()

# Ensure CUDA operations complete
torch.cuda.synchronize()
Memory Imbalance Across GPUs#
Problem: Some GPUs run out of memory while others idle.
Solution: Balance workload or adjust partitioning.
import torch

# Check memory usage per GPU
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    total = torch.cuda.get_device_properties(i).total_memory / 1024**3
    print(f"GPU {i}: {allocated:.2f} / {total:.2f} GB")

# Adjust the bond dimension partitioning,
# e.g., assign larger partitions to GPUs with more memory (see the sketch below)
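One way to derive an uneven partition is to split the bond dimension in proportion to each GPU's total memory. Whether DistributedConfig accepts an explicit partition depends on the ATLAS-Q version, so treat this as a sketch of the arithmetic only:
import torch

bond_dim = 1024
num_gpus = torch.cuda.device_count()

# Total memory per GPU in bytes
mem = [torch.cuda.get_device_properties(i).total_memory for i in range(num_gpus)]

# Bond-dimension share proportional to memory, rounded down
partition = [int(bond_dim * m / sum(mem)) for m in mem]
# Assign any rounding remainder to the GPU with the most memory
partition[mem.index(max(mem))] += bond_dim - sum(partition)

for i, chi in enumerate(partition):
    print(f"GPU {i}: bond-dimension partition {chi}")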
Summary#
Parallel computation strategies for ATLAS-Q:
Bond-parallel MPS: Split bond dimension across GPUs for χ > 512
Data parallelism: Run independent simulations for hyperparameter search
Task parallelism: Parallelize ensemble studies or multi-molecule VQE
Parallel sampling: Distribute measurement shots across GPUs
Circuit batching: Execute different circuits on different GPUs
Multi-node: Scale to 10+ GPUs across multiple compute nodes
Speedup expectations:
Bond-parallel MPS: Near-linear scaling up to roughly 8 GPUs, after which communication overhead dominates
Data parallelism: Effectively linear scaling (N GPUs ≈ N× throughput), since trials are independent
Parallel sampling: Linear scaling for large shot counts
Multi-node: 70-90% efficiency with InfiniBand, 40-60% with Ethernet (e.g., 16 GPUs at 80% efficiency give roughly a 12.8× speedup)
When to use each strategy:
Bond-parallel: Single large simulation (n > 80, χ > 512)
Data parallelism: Hyperparameter search, ensemble averaging
Circuit batching: Algorithm comparison, benchmarking
Multi-node: Extremely large systems (n > 100, χ > 1024)
See Also#
How to Handle Large Quantum Systems: Distributed MPS implementation details
Integrate cuQuantum: GPU acceleration with cuQuantum
How to Optimize Performance: Single-GPU optimization before scaling
../reference/distributed_mps: Distributed MPS API reference
Benchmark Comparison: Parallel benchmarking strategies