Unexpected `nan` values in `TDVPSchmitt` with specific `n_samples` values #1959

Daniel-Haas-B · 2024-11-12T09:24:26Z

Daniel-Haas-B
Nov 12, 2024

I'm experiencing an issue with TDVPSchmitt time evolution in NetKet where the algorithm produces nan values for certain values of n_samples. Specifically, when n_samples is one of several powers of two (e.g., 128, 256, 512, 1024), the observables computed during the time evolution become nan. However, when I increment these sample sizes by 1 (e.g., 129, 257, 513), the problem disappears, and the observables converge as expected.

Code to Reproduce:

import numpy as np
import netket as nk
import netket.experimental as nkx
import copy

# System parameters
L = 2
hi = nk.hilbert.Spin(0.5, L)

# Hamiltonian setup
h = 1.0  
J = 1.0  
h_eff = h + J
H1 = nk.operator.LocalOperator(hi, dtype=np.complex128)
for i in range(L):
    H1 -= h_eff * nk.operator.spin.sigmaz(hi, i)

# Sample sizes (powers of two)
n_samples = [2**i for i in range(7, 15)] 

for samples in n_samples:
    print(f"=========== samples: {samples} ===========")
    model = nk.models.LogStateVector(hi, param_dtype=complex)

    sa = nk.sampler.ExactSampler(hi)
    vs1 = nk.vqs.MCState(
        model=model, 
        sampler=sa, 
        n_samples=samples, 
        seed=214748364,
    )
    vs2 = copy.deepcopy(vs1)

    # Observables
    obs = {
        "sum_sx": sum(nk.operator.spin.sigmax(hi, i) for i in range(L)),
        "sum_sy": sum(nk.operator.spin.sigmay(hi, i) for i in range(L)),
        "sum_sz": sum(nk.operator.spin.sigmaz(hi, i) for i in range(L)),
    }

    # Time evolution parameters
    dt = 0.001
    integrator = nkx.dynamics.Euler(dt=dt)
    qgt = nk.optimizer.qgt.QGTJacobianDense(holomorphic=True)
    total_time = 0.5

    # TDVP Schmitt time evolution
    te1 = nkx.driver.TDVPSchmitt(
        operator=H1, 
        variational_state=vs1, 
        ode_solver=integrator, 
        holomorphic=True
    )
    te1.run(T=total_time, obs=obs)

    print("sz", te1.state.expect(obs["sum_sz"]))
    print("sx", te1.state.expect(obs["sum_sx"]))

    # Standard TDVP time evolution for comparison
    te2 = nkx.driver.TDVP(
        operator=H1, 
        variational_state=vs2, 
        ode_solver=integrator, 
        qgt=qgt
    )
    te2.run(T=total_time, obs=obs)

Observed Behavior:

For several n_samples that are powers of two, the observables from TDVPSchmitt suddenly become nan:

=========== samples: 128 ===========
100%|█████████████████████████████| 0.50/0.50 [00:01<00:00, 3.80s/it]
sz nan+nanj ± nan [σ²=nan]
sx nan+nanj ± nan [σ²=nan]

What is very strange to me, however, is that if we instead use n_samples incremented by 1 ([2**i + 1 for i in range(7, 15)]), the problem does not occur for any of the cases mentioned:

=========== samples: 129 ===========
100%|█████████████████████████████| 0.50/0.50 [00:01<00:00, 3.67s/it]
sz -0.05+0.00j ± 0.10 [σ²=1.77]
sx -0.825+0.042j ± 0.093 [σ²=1.465]
...

The standard TDVP evolution works correctly for all n_samples, and plotting the observables over time shows that TDVPSchmitt starts by agreeing with TDVP but then produces nan values abruptly.

To be clear, this does not happen for every power of two, but for several of them (with the code here, all n_samples = [2**i for i in range(7, 15)] except 2048 and 8192). I simply want to understand this better, since the fact that it never happens for n_samples_plus_one = [2**i + 1 for i in range(7, 15)] seems very specific. I have tried adjusting the rcond values and diagonal shifts, as well as different integrators and time steps, with little to no change and certainly no intuition behind it. Different seeds do somewhat change which powers of two converge, but the overall behaviour is the same.

I think the issue is related to the Hamiltonian in this case. We do not get this behaviour with a regular nk.operator.Ising(hi, graph=g, h=1.0, J=1.0), but I was trying to address issue #1552, and for that I was testing this simplified "mean-field" model of sorts.

Context/env:

NetKet 3.14.4.post1
OS: Mac M2 Pro w/ Sonoma 14.4.1
Python 3.12.6


PhilipVinc · 2024-11-12T11:06:35Z

PhilipVinc
Nov 12, 2024
Maintainer

Oh, wow! Thank you for the clear reproducible example!

I tried to modify it by stopping it as soon as we get a nan, by inserting

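import jax.numpy as jnp

def stopmecb(step, logdata, driver):
    # Keep running until the driver has computed its first update.
    if driver._dw is None:
        return True
    # Flatten the update pytree into a single vector.
    dw, _ = nk.jax.tree_ravel(driver._dw)
    # Returning False stops the run as soon as any entry is NaN.
    return not bool(jnp.any(jnp.isnan(dw)))
...
te1.run(T=total_time, obs=obs, callback=stopmecb)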

which will stop when the update is NaN. Then we can check the eigenvalues...

e,s=jnp.linalg.eigh(te1._S.to_dense())
print(e)

and I see that there is reliably a numerical zero (1e-17).
Maybe it's related to that?
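A minimal NumPy sketch of how such a numerical zero can be flagged, assuming S is the Gram matrix of centered log-derivatives. The matrix `O`, its duplicated column, and the `rcond` threshold are made up for illustration and are not taken from the driver:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 128, 4

# Hypothetical centered log-derivatives. The last column duplicates the first,
# so S = O^dagger O / n_samples is rank deficient by construction.
O = rng.normal(size=(n_samples, n_params)) + 1j * rng.normal(size=(n_samples, n_params))
O[:, -1] = O[:, 0]
O -= O.mean(axis=0)

S = O.conj().T @ O / n_samples
ev = np.linalg.eigvalsh(S)  # real eigenvalues in ascending order

# Flag eigenvalues that are zero relative to the largest one.
rcond = 1e-12  # illustrative threshold, not the driver's default
null_mask = np.abs(ev) < rcond * ev[-1]
print(ev, null_mask)
```

A relative threshold is used here because round-off can leave the smallest eigenvalue slightly positive or slightly negative instead of exactly zero.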

I then ran the algorithm in

netket/netket/experimental/driver/tdvp_schmitt.py

Line 226 in 6513139

rho = V.conj().T @ F

by hand with the variational state and samples at that point, and I saw that rho gets a zero

>>> rho
Array([ 0.00000000e+00+0.j, -5.55111512e-17+0.j, -1.54679608e+00+0.j,
        3.21964677e-15+0.j], dtype=complex128)

so snr becomes nan

>>> snr
Array([           nan, 1.36241825e-13, 1.59092599e+03, 3.78584085e-12],      dtype=float64)

So I guess it might be that we have to sanitise rho in here?
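One way the zero eigenvalue, the zero in rho, and the nan in snr could be connected, sketched in NumPy under stated assumptions: if S = O†O / n_samples has an eigenvector v with eigenvalue zero, then O v vanishes on every sample, so per-sample data projected onto v has both zero mean and zero variance. The array `QEdata` below is synthetic, with one identically zero column standing in for that direction; only the snr expression is copied from the thread.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_dirs = 128, 4

# Synthetic per-sample data in the eigenbasis of S (illustrative, not NetKet output).
QEdata = rng.normal(size=(n_samples, n_dirs)) + 1j * rng.normal(size=(n_samples, n_dirs))
QEdata[:, 0] = 0.0  # null direction: every sample contributes exactly zero

rho = QEdata.mean(axis=0)
sigma_k = np.sqrt(QEdata.var(axis=0))  # var of a complex array is real

# Same expression as in the thread; entry 0 evaluates 0 / 0.
with np.errstate(divide="ignore", invalid="ignore"):
    snr = np.abs(rho) * np.sqrt(n_samples) / sigma_k

print(rho[0], sigma_k[0], snr)
```

If this picture applies to the driver, the nan comes from a 0/0 in a direction that carries no information, so guarding the denominator would be enough to keep it out of the update.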

8 replies

Daniel-Haas-B Nov 12, 2024
Author

My impression is that E_loc from _impl(parameters, n_samples, E_loc, S, rhs_coeff, rcond, rcond_smooth, snr_atol) becomes nan first. At least that is what I see if I crudely comment out the jit decorator and print stuff...

PhilipVinc Nov 12, 2024
Maintainer

How can an Eloc be nan?
This can only happen if the wave function is zero on a sample, but such a configuration can't be sampled...

PhilipVinc Nov 12, 2024
Maintainer

@Daniel-Haas-B , in the case I was checking above, the E_loc is not nan:

Array([[-4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,
        -4.+0.j,  0.+0.j,  4.+0.j,  0.+0.j,  4.+0.j,  0.+0.j,  4.+0.j,
        -4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,  0.+0.j,  0.+0.j, -4.+0.j,
         0.+0.j, -4.+0.j, -4.+0.j,  4.+0.j, -4.+0.j,  4.+0.j,  4.+0.j,
         4.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  4.+0.j, -4.+0.j,  4.+0.j,
         4.+0.j, -4.+0.j,  0.+0.j,  0.+0.j,  4.+0.j,  4.+0.j, -4.+0.j,
         4.+0.j,  4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,
         0.+0.j, -4.+0.j,  4.+0.j, -4.+0.j,  0.+0.j,  4.+0.j, -4.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j,  4.+0.j, -4.+0.j,  0.+0.j,  0.+0.j,
         4.+0.j,  0.+0.j,  4.+0.j, -4.+0.j,  4.+0.j, -4.+0.j, -4.+0.j,
         4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,  0.+0.j,  0.+0.j, -4.+0.j,
        -4.+0.j, -4.+0.j,  0.+0.j,  0.+0.j, -4.+0.j,  4.+0.j,  4.+0.j,
        -4.+0.j, -4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j,  4.+0.j, -4.+0.j,  4.+0.j,  4.+0.j,
         4.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  0.+0.j,  0.+0.j, -4.+0.j,  0.+0.j,  0.+0.j,  0.+0.j,
         0.+0.j,  4.+0.j,  4.+0.j,  4.+0.j,  4.+0.j, -4.+0.j,  0.+0.j,
         4.+0.j,  4.+0.j,  4.+0.j,  0.+0.j,  4.+0.j,  0.+0.j,  0.+0.j,
        -4.+0.j,  0.+0.j]], dtype=complex128)
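As a sanity check on these values, here is a plain-Python sketch, assuming NetKet's sigmaz has eigenvalues ±1 on Spin(0.5). Since H1 is diagonal in the sigma^z basis, the local energy of a basis state does not depend on the wave function, and for L = 2 and h_eff = 2 it can only be -4, 0, or 4:

```python
from itertools import product

L = 2
h_eff = 2.0  # h + J with h = J = 1.0, as in the reproduction script

# For the diagonal H1 = -h_eff * sum_i sigma^z_i,
# E_loc(sigma) = <sigma|H1|psi> / <sigma|psi> = -h_eff * sum_i sigma_i.
local_energies = {
    sigma: -h_eff * sum(sigma) for sigma in product((-1, 1), repeat=L)
}

for sigma, e_loc in local_energies.items():
    print(sigma, e_loc)
```

Because the amplitude cancels in the ratio, a nan in E_loc would need a sample with zero amplitude, which matches the argument that E_loc is not where the nan originates.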

So, the actual problem is that

# Compute the SNR according to Eq. 21
>>> jnp.sqrt(stats.var(QEdata, axis=0))
Array([0.        , 0.00460972, 0.01099988, 0.00962168], dtype=float64)

the variance of QE is 0, and since we divide by that, it breaks down everything.

This comes from inverting this equation

which of course is wrong if the variance is zero, which essentially means that this direction is not relevant?
Possibly in this case we can simply put $$\dot\eta_k = 0$$, so maybe a better implementation of

    snr = jnp.abs(rho) * jnp.sqrt(n_samples) / jnp.sqrt(stats.var(QEdata, axis=0))

might be

    sigma_k = jnp.sqrt(stats.var(QEdata, axis=0))
    snr = jnp.where(jnp.iszero(sigma_k), jnp.zeros_like(rho),  jnp.abs(rho) * jnp.sqrt(n_samples) / sigma_k)

?
cc @markusschmitt

Answer selected by Daniel-Haas-B

Daniel-Haas-B Nov 12, 2024
Author

... in the case I was checking above, the E_loc is not nan

Indeed, I was sloppy in my debugging!

Daniel-Haas-B Nov 12, 2024
Author

... maybe a better implementation of

    snr = jnp.abs(rho) * jnp.sqrt(n_samples) / jnp.sqrt(stats.var(QEdata, axis=0))

might be

    sigma_k = jnp.sqrt(stats.var(QEdata, axis=0))
    snr = jnp.where(jnp.iszero(sigma_k), jnp.zeros_like(rho),  jnp.abs(rho) * jnp.sqrt(n_samples) / sigma_k)

I guess jnp.zeros_like(rho) will explode the soft cutoff from

    regularizer2 = regularizer * (1.0 / (1.0 + (snr_atol / snr) ** 6))

but jnp.ones_like works, if that makes sense - and using isclose instead of iszero :)

    snr = jnp.where(
        jnp.isclose(sigma_k, 0), jnp.ones_like(rho), jnp.abs(rho) * jnp.sqrt(n_samples) / sigma_k
    )
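A NumPy sketch of the guarded computation, using the rho and sigma_k values printed earlier in this thread. The helper name `guarded_snr`, the `fill` argument, and the `atol` value are hypothetical; `fill` is left as a parameter because the thread has not settled on whether 0 or 1 is the right replacement:

```python
import numpy as np

def guarded_snr(rho, sigma_k, n_samples, fill=0.0, atol=1e-12):
    """SNR with directions of (near-)zero variance replaced by `fill`."""
    rho = np.asarray(rho)
    sigma_k = np.asarray(sigma_k, dtype=float)
    degenerate = np.isclose(sigma_k, 0.0, atol=atol)
    # np.where evaluates both branches, so make the denominator safe first.
    safe_sigma = np.where(degenerate, 1.0, sigma_k)
    snr = np.abs(rho) * np.sqrt(n_samples) / safe_sigma
    return np.where(degenerate, fill, snr)

# Values copied from the outputs shown in this thread (n_samples = 128).
rho = np.array([0.0, -5.55111512e-17, -1.54679608e00, 3.21964677e-15], dtype=complex)
sigma_k = np.array([0.0, 0.00460972, 0.01099988, 0.00962168])

print(guarded_snr(rho, sigma_k, 128))            # degenerate entry set to 0
print(guarded_snr(rho, sigma_k, 128, fill=1.0))  # degenerate entry set to 1
```

Replacing the denominator before dividing avoids computing 0/0 at all, and it is the same pattern commonly used with jnp.where so that the discarded branch never holds a nan.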

Thanks for the fast response by the way!

PhilipVinc Nov 12, 2024
Maintainer

Nice.

You should think for a moment about what the right way to regularise this is...

Is it setting this to 1?
I thought that if the SNR is 0, then we should not move along that direction..?
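To make the choice concrete, here is a NumPy sketch that evaluates only the soft-cutoff factor from the expression quoted above, assuming snr_atol = 1.0 (an illustrative value, not checked against the driver's default). It relies on IEEE-754 arithmetic as NumPy implements it; whether the jitted JAX code behaves identically is not verified here:

```python
import numpy as np

def soft_cutoff(snr, snr_atol=1.0):
    """Factor 1 / (1 + (snr_atol / snr)**6) applied on top of the regularizer."""
    snr = np.asarray(snr, dtype=float)
    # snr == 0 gives snr_atol / 0 = inf, and 1 / (1 + inf) = 0.
    with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
        return 1.0 / (1.0 + (snr_atol / snr) ** 6)

snr_values = np.array([0.0, 0.5, 1.0, 2.0, 1e3, np.nan])
factors = soft_cutoff(snr_values)

for s, f in zip(snr_values, factors):
    print(f"snr={s:g} -> factor={f:g}")
```

Under these assumptions, an snr of 0 switches the direction off completely, an snr of 1 keeps half of it, and only a nan propagates. Filling degenerate directions with 0 would then match the intuition of not moving along them, while filling with 1 would keep a partial contribution.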

PhilipVinc Nov 12, 2024
Maintainer

Another issue seems to be that rho is complex.
I think it should be real, but I should double check...

Daniel-Haas-B · 2024-11-14T10:37:48Z

Daniel-Haas-B
Nov 14, 2024
Author

Marking as answered since the reason behind the unexpected behaviour was found and explained. @PhilipVinc let me know if you want me to open an issue or attempt a PR :)

2 replies

PhilipVinc Nov 14, 2024
Maintainer

I'd highly appreciate a PR to fix the underlying issue.
If you open something (with a test) I'll help you get it merged

gcarleo Nov 14, 2024
Maintainer

Just out of curiosity, is this driver widely used? Could we think of moving it to a separate repository if it is not?
