Shared Server Deployment (HPC / Slurm)¶
Overview¶
The application supports a shared server deployment model on HPC clusters managed by Slurm. Any team member can run the connect script — it automatically checks for an existing server, starts one if needed, and creates an SSH tunnel. No dedicated admin is required.
This replaces the earlier model where every annotator launched a separate Slurm job, which was wasteful of cluster resources and harder to coordinate.
System Architecture¶
```mermaid
flowchart LR
    subgraph Local Machine
        B[Browser<br/>localhost:7860]
        CS[connect.sh]
    end
    subgraph Login Node
        SSH[SSH Gateway]
    end
    subgraph Compute Node
        P[Panel Server<br/>port 7860]
        CR[credentials.json]
        D[(Accelerometry<br/>Data Files)]
    end
    CS -- "1. SSH in &<br/>check/submit job" --> SSH
    SSH -- "2. sbatch<br/>(if no job running)" --> P
    CS -- "3. SSH tunnel<br/>localhost:7860 → node:7860" --> SSH
    SSH -- "internal network" --> P
    B -- "4. HTTP over tunnel" --> SSH
    P --> CR
    P --> D
```
Connection Flow¶
The connect.sh script handles everything automatically:
```mermaid
flowchart TD
    A[User runs<br/>bash hpc_utils/connect.sh] --> B[SSH into login node]
    B --> C{Panel server<br/>job running?}
    C -- Yes --> E[Read server_info.txt<br/>get NODE & PORT]
    C -- No --> D[Submit sbatch job]
    D --> F{Wait for job<br/>to start}
    F -- "Pending<br/>(poll every 5s)" --> F
    F -- Running --> E
    F -- "Failed / Timeout<br/>(120s)" --> ERR[Exit with error]
    E --> G[Return to<br/>local machine]
    G --> H{Local port<br/>available?}
    H -- Yes --> J[Create SSH tunnel]
    H -- "No (stale tunnel)" --> I[Kill stale SSH tunnel<br/>and reclaim port]
    I --> H
    J --> K{Tunnel<br/>connected?}
    K -- Yes --> L["Open browser at<br/>localhost:PORT/app"]
    K -- "No (10s timeout)" --> ERR2[Exit with error<br/>& show manual command]
    L --> M[Wait for Ctrl+C]
    M --> N[Close tunnel<br/>& exit]
```
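The check/submit/read sequence above can be sketched in shell. Everything here is illustrative: `remote` wraps the SSH calls, the variable names (`HPC_USER`, `HPC_HOST`, `PROJECT`) are hypothetical, and the `server_info.txt` format is assumed to be a single `NODE PORT` line — the real `connect.sh` differs in detail.

```shell
# Illustrative sketch of connect.sh's server-discovery step.
HPC_USER="${HPC_USER:-$USER}"
HPC_HOST="${HPC_HOST:-login.example.edu}"   # assumption: your login node
PROJECT="${PROJECT:-/path/to/project}"

# Run a command on the login node.
remote() {
  ssh "$HPC_USER@$HPC_HOST" "$@"
}

# Ensure a py_accel_viewer job exists, then print "NODE PORT"
# (assumed server_info.txt format) for the tunnel step.
ensure_server() {
  local job
  job=$(remote "squeue -u \$USER -n py_accel_viewer -h -o %i")
  if [ -z "$job" ]; then
    remote "cd $PROJECT && sbatch hpc_utils/start_server.sh" >/dev/null
  fi
  remote "cat $PROJECT/hpc_utils/server_info.txt"
}
```

The real script additionally polls `squeue` while the job is pending (120 s timeout) before reading the file.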
Prerequisites¶
| Requirement | Details |
|---|---|
| HPC cluster with Slurm | The compute nodes must be reachable from a login/gateway node. |
| SSH access | Every user needs SSH access to the cluster login node. |
| Python environment | A Python environment with the project dependencies must be available on the cluster. |
| Credentials file | A JSON file mapping usernames to passwords, used by Panel's basic authentication. |
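A minimal credentials file is a flat JSON object of username/password pairs — for example (hypothetical users, placeholder passwords):

```json
{
  "alice": "change-me-1",
  "bob": "change-me-2"
}
```

Panel's `--basic-auth` option can be pointed at such a file. Keep it readable only by the account running the server (e.g. `chmod 600 credentials.json`).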
Connecting¶
1. One-time setup¶
Each user edits the variables at the top of `hpc_utils/connect.sh`:
| Variable | Default | Description |
|---|---|---|
|  | Current local username | Your username on the HPC cluster. |
|  |  | SSH gateway / login node hostname. |
|  | (project path on cluster) | Path to the project directory on the cluster. |
|  |  | Preferred local port; auto-increments if busy. |
2. Run¶
```bash
bash hpc_utils/connect.sh
```
The script will:
1. SSH into the login node
2. Check if a `py_accel_viewer` job is already running
3. If not, submit one via `sbatch` and wait for it to start
4. Retrieve the compute node and port from `server_info.txt`
5. Find a free local port
6. Create an SSH tunnel
7. Verify the tunnel and open the browser
8. Keep running until you press Ctrl+C
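The "find a free local port" step can be sketched with bash's built-in `/dev/tcp` probe (illustrative; the real script may test ports differently):

```shell
# Return success if something is listening on 127.0.0.1:$1.
# The fd is opened in a subshell, so it closes automatically.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Walk upward from the preferred port until a free one is found.
find_free_port() {
  local port=$1
  while port_in_use "$port"; do
    port=$((port + 1))
  done
  echo "$port"
}
```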
Manual alternative¶
If the automated script does not work in your environment, first find the server info:
```bash
ssh youruser@randi.cri.uchicago.edu "cat /path/to/project/hpc_utils/server_info.txt"
```
Then create the tunnel:
```bash
ssh -N -L 7860:<compute_node>:7860 youruser@randi.cri.uchicago.edu
```
Open http://localhost:7860/app in your browser.
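To mirror the automated script's 10-second tunnel check, you can poll the local port until it answers. A sketch (the `port_open` probe and one-second interval are illustrative, not the script's exact implementation):

```shell
# Return success once something answers on 127.0.0.1:$1.
port_open() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Poll once per second, giving up after ~10 s as the script does.
wait_for_tunnel() {
  local port=$1 tries=0
  until port_open "$port"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 10 ]; then
      return 1
    fi
    sleep 1
  done
}
```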
Server Configuration¶
The Slurm job is configured in `hpc_utils/start_server.sh`:
| Variable | Default | Description |
|---|---|---|
|  | `7860` | Port the Panel server listens on. |
|  |  | Path to the credentials JSON file. |
|  |  | Path to the Panel application. |
|  | `hpc_utils/server_info.txt` | Where connection details are written. |
Slurm resource directives (editable in the script):
- Time limit: `--time=36:00:00` (36 hours)
- Memory: `--mem-per-cpu=1500` (1500 MB per CPU)
- CPUs: `--ntasks-per-node=16`
- Partition: not set by default (uses the cluster default)
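Expressed as `#SBATCH` directives, the defaults above would appear at the top of the job script roughly as follows (layout is illustrative; the job name matches the `py_accel_viewer` job the connect script looks for):

```bash
#!/bin/bash
#SBATCH --job-name=py_accel_viewer
#SBATCH --time=36:00:00        # 36-hour wall-clock limit
#SBATCH --mem-per-cpu=1500     # 1500 MB per CPU
#SBATCH --ntasks-per-node=16
# No --partition directive: the cluster's default partition is used.
# The script then records the node and port in server_info.txt and
# launches the Panel server.
```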
Stopping the Server¶
```bash
bash hpc_utils/stop_server.sh
```
Reads `hpc_utils/server_info.txt`, cancels the Slurm job with `scancel`, and removes the status file.
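The same effect can be sketched by cancelling on job name rather than the recorded job id (illustrative; the real script reads the id from `server_info.txt`, and `PROJECT` here is a hypothetical path variable):

```shell
PROJECT="${PROJECT:-/path/to/project}"

stop_server() {
  # Cancel any viewer job owned by the current user, by name.
  scancel --user="$USER" --name=py_accel_viewer
  # Remove the stale status file so connect.sh starts fresh next time.
  rm -f "$PROJECT/hpc_utils/server_info.txt"
}
```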
Troubleshooting¶
| Problem | Solution |
|---|---|
| Job times out waiting to start | The cluster may be busy. Check the queue status or adjust the partition in `hpc_utils/start_server.sh`. |
| Local port already in use | The script auto-increments to the next free port. To force a specific port, edit the preferred-port variable at the top of `hpc_utils/connect.sh`. |
| SSH tunnel did not come up | Verify that you can SSH to the login node manually. On failure, the script prints the manual tunnel command. |
| Connection refused in browser | The tunnel may have dropped. Re-run `bash hpc_utils/connect.sh`. |
| Blank page after login | The server may still be starting up; wait 10–15 seconds, then check the Slurm job's output log. |
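For the first two rows, it helps to confirm the job state directly on the cluster. A small helper (hypothetical; reuses the login node shown in the manual-alternative section):

```shell
# Print the queue entry for the viewer job, if any.
# $1 = user@login-node, e.g. youruser@randi.cri.uchicago.edu
check_viewer_job() {
  ssh "$1" "squeue -u \$USER -n py_accel_viewer -o '%i %T %R'"
}
```

An empty result means no job is queued or running; a `PENDING` state with a reason code tells you why the job has not started.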