Shared Server Deployment (HPC / Slurm)
Overview
The app supports a shared-server deployment on Slurm-managed HPC clusters. Any team member can run the connect script, which checks for an existing server, starts one if none is running, and creates the SSH tunnel. No dedicated administrator is needed.
This replaces the earlier model, where every annotator launched their own Slurm job. That wasted cluster resources and was harder to coordinate.
System Architecture
flowchart LR
subgraph Local Machine
B[Browser<br/>localhost:7860]
CS[connect.sh]
end
subgraph Login Node
SSH[SSH Gateway]
end
subgraph Compute Node
P[Panel Server<br/>port 7860]
CR[credentials.json]
D[(Accelerometry<br/>Data Files)]
end
CS -- "1. SSH in &<br/>check/submit job" --> SSH
SSH -- "2. sbatch<br/>(if no job running)" --> P
CS -- "3. SSH tunnel<br/>localhost:7860 → node:7860" --> SSH
SSH -- "internal network" --> P
B -- "4. HTTP over tunnel" --> SSH
P --> CR
P --> D
Connection Flow
The connect.sh script handles everything automatically:
flowchart TD
A[User runs<br/>bash hpc_utils/connect.sh] --> B[SSH into login node]
B --> C{Panel server<br/>job running?}
C -- Yes --> E[Read server_info.txt<br/>get NODE & PORT]
C -- No --> D[Submit sbatch job]
D --> F{Wait for job<br/>to start}
F -- "Pending<br/>(poll every 5s)" --> F
F -- Running --> E
F -- "Failed / Timeout<br/>(120s)" --> ERR[Exit with error]
E --> G[Return to<br/>local machine]
G --> H{Local port<br/>available?}
H -- Yes --> J[Create SSH tunnel]
H -- "No (stale tunnel)" --> I[Kill stale SSH tunnel<br/>and reclaim port]
I --> H
J --> K{Tunnel<br/>connected?}
K -- Yes --> L["Open browser at<br/>localhost:PORT/app"]
K -- "No (10s timeout)" --> ERR2[Exit with error<br/>& show manual command]
L --> M[Wait for Ctrl+C]
M --> N[Close tunnel<br/>& exit]
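
The "wait for job to start" branch of the flow above could be sketched in shell roughly as follows. This is a minimal sketch, not the actual contents of `connect.sh`: the function names `get_job_state` and `wait_for_job` are hypothetical, and it assumes the job state is queried with `squeue`. The 120-second timeout and 5-second poll interval are taken from the flowchart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Slurm state of a job (e.g. PENDING, RUNNING).
# Assumes squeue is on PATH; prints nothing if the job has left the queue.
get_job_state() {
    squeue -h -j "$1" -o %T 2>/dev/null
}

# wait_for_job JOB_ID [TIMEOUT_SECONDS] [POLL_INTERVAL_SECONDS]
# Returns 0 once the job reports RUNNING, 1 on failure or timeout.
wait_for_job() {
    local job_id="$1"
    local timeout="${2:-120}"
    local interval="${3:-5}"
    local waited=0
    local state

    while [ "$waited" -lt "$timeout" ]; do
        state=$(get_job_state "$job_id")
        case "$state" in
            RUNNING)
                return 0 ;;
            PENDING|CONFIGURING)
                ;;  # still queued or starting: keep polling
            *)
                echo "Job $job_id is in state '${state:-unknown}'" >&2
                return 1 ;;
        esac
        sleep "$interval"
        waited=$(( waited + interval ))
    done

    echo "Timed out after ${timeout}s waiting for job $job_id" >&2
    return 1
}
```

A caller would run something like `wait_for_job "$JOB_ID" || exit 1` after submitting the job, where `JOB_ID` is whatever identifier the submission step returned.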
Prerequisites
| Requirement | Details |
|---|---|
| HPC cluster with Slurm | The compute nodes must be reachable from a login/gateway node. |
| SSH access | Every user needs SSH access to the cluster login node. |
| Python environment | A Python environment with the project dependencies must be available on the cluster (see …). |
| Credentials file | A JSON file mapping usernames to passwords, used by Panel’s … |
Connecting
1. One-time setup
Each user edits the variables at the top of hpc_utils/connect.sh:
| Variable | Default | Description |
|---|---|---|
| | Current local username | Your username on the HPC cluster. |
| | | SSH gateway / login node hostname. |
| | (project path on cluster) | Path to the project directory on the cluster. |
| | | Preferred local port; auto-increments if busy. |
2. Run
bash hpc_utils/connect.sh
The script will:
1. SSH into the login node
2. Check if a `py_accel_viewer` job is already running
3. If not, submit one via `sbatch` and wait for it to start
4. Retrieve the compute node and port from `server_info.txt`
5. Find a free local port
6. Create an SSH tunnel
7. Verify the tunnel and open the browser
8. Keep running until you press Ctrl+C
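
The "find a free local port" step can be illustrated with a short sketch. It is an assumption about how such a search might work, not the code in `connect.sh`: the helper names `port_in_use` and `find_free_port` are hypothetical, and the port check relies on bash's `/dev/tcp` pseudo-device, which requires bash rather than a plain POSIX shell.

```shell
#!/usr/bin/env bash
# port_in_use PORT: succeeds if something accepts TCP connections on
# 127.0.0.1:PORT. Uses bash's /dev/tcp redirection inside a subshell so the
# test connection is closed again immediately.
port_in_use() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# find_free_port START [MAX_TRIES]: prints the first port at or above START
# that is not in use, checking at most MAX_TRIES candidates.
find_free_port() {
    local port="$1"
    local max_tries="${2:-20}"
    local i=0

    while [ "$i" -lt "$max_tries" ]; do
        if ! port_in_use "$port"; then
            echo "$port"
            return 0
        fi
        port=$(( port + 1 ))
        i=$(( i + 1 ))
    done

    echo "No free port found starting at $1" >&2
    return 1
}
```

For example, `LOCAL_PORT=$(find_free_port 7860)` would yield 7860 when nothing is listening there and the next unused port otherwise.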
Manual alternative
If the automated script does not work in your environment, first find the server info:
ssh youruser@randi.cri.uchicago.edu "cat /path/to/project/hpc_utils/server_info.txt"
Then create the tunnel:
ssh -N -L 7860:<compute_node>:7860 youruser@randi.cri.uchicago.edu
Open http://localhost:7860/app in your browser.
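
The two manual steps can be combined in a small helper that builds the tunnel command from a copy of `server_info.txt`. This is only a sketch: the layout of `server_info.txt` is not documented here, so the `NODE=` and `PORT=` lines are an assumed format, and `read_info_value` and `tunnel_command` are hypothetical names.

```shell
#!/usr/bin/env bash
# read_info_value FILE KEY: prints the value of the first "KEY=value" line.
# The KEY=value layout is an assumption about server_info.txt.
read_info_value() {
    sed -n "s/^$2=//p" "$1" | head -n 1
}

# tunnel_command FILE USER HOST [LOCAL_PORT]
# Prints the ssh command that forwards LOCAL_PORT (default: the remote port)
# to the compute node and port recorded in FILE.
tunnel_command() {
    local node port local_port
    node=$(read_info_value "$1" NODE)
    port=$(read_info_value "$1" PORT)
    if [ -z "$node" ] || [ -z "$port" ]; then
        echo "Could not read NODE/PORT from $1" >&2
        return 1
    fi
    local_port="${4:-$port}"
    echo "ssh -N -L ${local_port}:${node}:${port} $2@$3"
}
```

After saving the output of the `cat` command above to a local file, `tunnel_command server_info.txt youruser randi.cri.uchicago.edu` prints the command to run. Passing a fourth argument selects a different local port, in which case the browser URL changes accordingly.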
Server Configuration
The Slurm job is configured in hpc_utils/start_server.sh:
| Variable | Default | Description |
|---|---|---|
| | | Port the Panel server listens on. |
| | | Path to the credentials JSON file. |
| | | Path to the Panel application. |
| | | Where connection details are written. |
Slurm resource directives (editable in the script):
- Time limit: `--time=36:00:00` (36 hours)
- Memory: `--mem-per-cpu=1500` (1500 MB per CPU)
- CPUs: `--ntasks-per-node=16`
- Partition: not set by default (uses cluster default)
Stopping the Server
bash hpc_utils/stop_server.sh
The script reads hpc_utils/server_info.txt, cancels the Slurm job with scancel, and removes the status file.
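
The behavior described above can be sketched as a single function. This is a hedged illustration, not the contents of `stop_server.sh`: it assumes the status file stores the job ID on a `JOB_ID=` line, and both the function name `stop_server` and the `SCANCEL` override variable are hypothetical.

```shell
#!/usr/bin/env bash
# stop_server INFO_FILE
# Cancels the job recorded in INFO_FILE and removes the file.
# Set SCANCEL to another command (e.g. SCANCEL=echo) for a dry run.
stop_server() {
    local info_file="$1"
    local job_id

    if [ ! -f "$info_file" ]; then
        echo "No status file at $info_file; nothing to stop" >&2
        return 1
    fi

    # Assumed format: a line such as "JOB_ID=123456".
    job_id=$(sed -n 's/^JOB_ID=//p' "$info_file" | head -n 1)
    if [ -z "$job_id" ]; then
        echo "No JOB_ID entry in $info_file" >&2
        return 1
    fi

    # Remove the status file only if the cancel command succeeded.
    "${SCANCEL:-scancel}" "$job_id" && rm -f "$info_file"
}
```

Removing the file only after a successful cancel means a failed `scancel` leaves the connection details in place for a retry.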
Troubleshooting
| Problem | Solution |
|---|---|
| Job times out waiting to start | The cluster may be busy. Check queue status or adjust the partition in … |
| Local port already in use | The script auto-increments. To force a specific port: … |
| SSH tunnel did not come up | Verify you can SSH to the login node manually. The script prints the manual … |
| Connection refused in browser | The tunnel may have dropped. Re-run … |
| Blank page after login | The server may still be starting up; wait 10–15 seconds. Check … |