
Scheduler fails with Mongo server timeout error #321

Open
mbthornton-lbl opened this issue Dec 5, 2024 · 1 comment

Comments

@mbthornton-lbl
Contributor

Log

INFO:root:Initializing Scheduler
INFO:root:Found 1 new jobs for nmdc:wfrqc-12-3m7yhn78.1
INFO:root:JOB RECORD: nmdc:34c45a92-ad12-11ef-9831-eee1651c58aa
INFO:root:Found 1 new jobs for nmdc:wfmgas-12-1khd9q66.1
INFO:root:JOB RECORD: nmdc:419b4a96-ad12-11ef-9831-eee1651c58aa
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/src/nmdc_automation/workflow_automation/sched.py", line 348, in <module>
    main(site_conf=sys.argv[1], wf_file=sys.argv[2])
  File "/src/nmdc_automation/workflow_automation/sched.py", line 339, in main
    sched.cycle(dryrun=dryrun, skiplist=skiplist, allowlist=allowlist)
  File "/src/nmdc_automation/workflow_automation/sched.py", line 286, in cycle
    wfp_nodes = load_workflow_process_nodes(self.db, self.workflows, allowlist)
  File "/src/nmdc_automation/workflow_automation/workflow_process.py", line 257, in load_workflow_process_nodes
    data_object_map = get_required_data_objects_map(db, workflows)
  File "/src/nmdc_automation/workflow_automation/workflow_process.py", line 29, in get_required_data_objects_map
    for rec in db.data_object_set.find({"data_object_type": {"$ne": None}}):
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/cursor.py", line 1281, in __next__
    return self.next()
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/cursor.py", line 1257, in next
    if len(self._data) or self._refresh():
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/cursor.py", line 1205, in _refresh
    self._send_message(q)
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/cursor.py", line 1100, in _send_message
    response = client._run_operation(
  File "/usr/local/lib/python3.9/site-packages/pymongo/_csot.py", line 119, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 1754, in _run_operation
    return self._retryable_read(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 1863, in _retryable_read
    return self._retry_internal(
  File "/usr/local/lib/python3.9/site-packages/pymongo/_csot.py", line 119, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 1819, in _retry_internal
    return _ClientConnectionRetryable(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 2554, in run
    return self._read() if self._is_read else self._write()
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 2689, in _read
    self._server = self._get_server()
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 2645, in _get_server
    return self._client._select_server(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/mongo_client.py", line 1649, in _select_server
    server = topology.select_server(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/topology.py", line 398, in select_server
    server = self._select_server(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/topology.py", line 376, in _select_server
    servers = self.select_servers(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/topology.py", line 283, in select_servers
    server_descriptions = self._select_servers_loop(
  File "/usr/local/lib/python3.9/site-packages/pymongo/synchronous/topology.py", line 333, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: mongo:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms), Timeout: 30s, Topology Description: <TopologyDescription id: 6747a26e54a5daaaddddb172, topology_type: Single, servers: [<ServerDescription ('mongo', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('mongo:27017: [Errno 111] Connection refused (configured timeouts: socketTimeoutMS: 20000.0ms, connectTimeoutMS: 20000.0ms)')>]>

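The traceback shows the error surfaced while iterating the cursor returned by `db.data_object_set.find(...)` in `get_required_data_objects_map`, not when `find()` was called: PyMongo cursors are lazy and only contact the server on the first `next()`. `Connection refused` means nothing was listening at `mongo:27017` at that moment, and `Timeout: 30s` is PyMongo's default server-selection timeout. A retry therefore has to wrap the iteration itself. Below is a minimal sketch of that idea; `read_with_retry` is a hypothetical helper, not part of nmdc_automation, and it takes the exception types as a parameter so it does not import pymongo (in the scheduler you would pass `pymongo.errors.ServerSelectionTimeoutError`).

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def read_with_retry(
    make_cursor: Callable[[], Iterable[T]],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Build a fresh cursor and read it to the end, retrying on transient errors.

    A new cursor is created on every attempt, because a cursor that failed
    partway through cannot be resumed.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            # The server is contacted while iterating, so list() must be
            # inside the try block, not just the call that creates the cursor.
            return list(make_cursor())
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the real error
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            logging.warning(
                "Read failed (%s); retrying in %.1fs (attempt %d of %d)",
                exc, delay, attempt + 1, attempts,
            )
            sleep(delay)
    raise AssertionError("unreachable")


# Hypothetical use in get_required_data_objects_map (not the project's code):
# from pymongo.errors import ServerSelectionTimeoutError
# records = read_with_retry(
#     lambda: db.data_object_set.find({"data_object_type": {"$ne": None}}),
#     retry_on=(ServerSelectionTimeoutError,),
# )
```

Reading the whole result with `list()` gives a clean retry boundary at the cost of holding every matching record in memory; for a very large `data_object_set` that trade-off may not be acceptable.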
@aclum
Contributor

aclum commented Dec 13, 2024

@mbthornton-lbl Was this addressed by any of the PRs this week?
SPIN cycles through nodes often enough that we do need to be able to handle this; it just happened to me today on prod.
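Because SPIN can restart or move the Mongo service at any time, one option is to catch the error around each call to `sched.cycle(...)`, so a failed cycle is logged and retried on the next pass instead of propagating out of `main` and ending the process. A minimal sketch, assuming a hypothetical `run_scheduler_loop` helper (not existing nmdc_automation code) with a made-up polling interval and failure cap:

```python
import logging
import time
from typing import Callable, Optional, Tuple, Type


def run_scheduler_loop(
    cycle: Callable[[], None],
    retry_on: Tuple[Type[BaseException], ...],
    interval: float = 60.0,
    max_consecutive_failures: int = 10,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call cycle() repeatedly, surviving transient failures.

    Returns the number of cycles that completed successfully. Re-raises once
    max_consecutive_failures transient errors occur in a row, so a database
    that never comes back still surfaces as a crash.
    """
    successes = 0
    failures = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        cycles += 1
        try:
            cycle()
        except retry_on as exc:
            failures += 1
            logging.error("Cycle failed (%d in a row): %s", failures, exc)
            if failures >= max_consecutive_failures:
                raise
        else:
            successes += 1
            failures = 0  # a good cycle resets the failure streak
        sleep(interval)
    return successes


# Hypothetical wiring in sched.py's main (names taken from the traceback above):
# from pymongo.errors import ServerSelectionTimeoutError
# run_scheduler_loop(
#     lambda: sched.cycle(dryrun=dryrun, skiplist=skiplist, allowlist=allowlist),
#     retry_on=(ServerSelectionTimeoutError,),
# )
```

The cap on consecutive failures is a design choice: it keeps short node restarts from killing the scheduler while still letting a database that stays down fail loudly.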
