Shared Server Deployment (HPC / Slurm)¶

Overview¶

The app supports a shared-server deployment on Slurm-managed HPC clusters. Any team member can run the connect script. It checks for an existing server, starts one if needed, and creates the SSH tunnel. No dedicated admin needed.

This replaces the earlier model, where every annotator launched their own Slurm job. That wasted cluster resources and was harder to coordinate.

System Architecture¶

        flowchart LR
    subgraph Local Machine
        B[Browser<br/>localhost:7860]
        CS[connect.sh]
    end

    subgraph Login Node
        SSH[SSH Gateway]
    end

    subgraph Compute Node
        P[Panel Server<br/>port 7860]
        CR[credentials.json]
        D[(Accelerometry<br/>Data Files)]
    end

    CS -- "1. SSH in &<br/>check/submit job" --> SSH
    SSH -- "2. sbatch<br/>(if no job running)" --> P
    CS -- "3. SSH tunnel<br/>localhost:7860 → node:7860" --> SSH
    SSH -- "internal network" --> P
    B -- "4. HTTP over tunnel" --> SSH
    P --> CR
    P --> D

Connection Flow¶

The connect.sh script handles everything automatically:

        flowchart TD
    A[User runs<br/>bash hpc_utils/connect.sh] --> B[SSH into login node]
    B --> C{Panel server<br/>job running?}
    C -- Yes --> E[Read server_info.txt<br/>get NODE & PORT]
    C -- No --> D[Submit sbatch job]
    D --> F{Wait for job<br/>to start}
    F -- "Pending<br/>(poll every 5s)" --> F
    F -- Running --> E
    F -- "Failed / Timeout<br/>(120s)" --> ERR[Exit with error]
    E --> G[Return to<br/>local machine]
    G --> H{Local port<br/>available?}
    H -- Yes --> J[Create SSH tunnel]
    H -- "No (stale tunnel)" --> I[Kill stale SSH tunnel<br/>and reclaim port]
    I --> H
    J --> K{Tunnel<br/>connected?}
    K -- Yes --> L["Open browser at<br/>localhost:PORT/app"]
    K -- "No (10s timeout)" --> ERR2[Exit with error<br/>& show manual command]
    L --> M[Wait for Ctrl+C]
    M --> N[Close tunnel<br/>& exit]

Prerequisites¶

Requirement	Details
HPC cluster with Slurm	The compute nodes must be reachable from a login/gateway node.
SSH access	Every user needs SSH access to the cluster login node.
Python environment	A Python environment with the project dependencies must be available on the cluster (see `requirements.txt`).
`credentials.json`	A JSON file mapping usernames to passwords, used by Panel’s `--basic-auth` flag.

Connecting¶

1. One-time setup¶

Each user edits the variables at the top of hpc_utils/connect.sh:

Variable	Default	Description
`SSH_USER`	Current local username	Your username on the HPC cluster.
`LOGIN_NODE`	`randi.cri.uchicago.edu`	SSH gateway / login node hostname.
`REMOTE_DIR`	(project path on cluster)	Path to the project directory on the cluster.
`LOCAL_PORT`	`7860`	Preferred local port; auto-increments if busy.

2. Run¶

bash hpc_utils/connect.sh

The script will:

SSH into the login node
Check if a py_accel_viewer job is already running
If not, submit one via sbatch and wait for it to start
Retrieve the compute node and port from server_info.txt
Find a free local port
Create an SSH tunnel
Verify the tunnel and open the browser
Keep running until you press Ctrl+C

Manual alternative¶

If the automated script does not work in your environment, first find the server info:

ssh youruser@randi.cri.uchicago.edu "cat /path/to/project/hpc_utils/server_info.txt"

Then create the tunnel:

ssh -N -L 7860:<compute_node>:7860 youruser@randi.cri.uchicago.edu

Open http://localhost:7860/app in your browser.

Server Configuration¶

The Slurm job is configured in hpc_utils/start_server.sh:

Variable	Default	Description
`PORT`	`7860`	Port the Panel server listens on.
`CREDENTIALS`	`credentials.json`	Path to the credentials JSON file.
`APP_PATH`	`visualize_accelerometry/app.py`	Path to the Panel application.
`STATUS_FILE`	`hpc_utils/server_info.txt`	Where connection details are written.

Slurm resource directives (editable in the script):

Time limit — --time=36:00:00 (36 hours)
Memory — --mem-per-cpu=1500 (1500 MB per CPU)
CPUs — --ntasks-per-node=16
Partition — not set by default (uses cluster default)

Stopping the Server¶

bash hpc_utils/stop_server.sh

Reads hpc_utils/server_info.txt, cancels the Slurm job with scancel, and removes the status file.

Troubleshooting¶

Problem	Solution
Job times out waiting to start	The cluster may be busy. Check queue status or adjust the partition in `start_server.sh`.
Local port already in use	The script auto-increments. To force a specific port: `LOCAL_PORT=8080 bash hpc_utils/connect.sh`
SSH tunnel did not come up	Verify you can SSH to the login node manually. The script prints the manual `ssh -N -L ...` command to try.
Connection refused in browser	The tunnel may have dropped. Re-run `connect.sh`. Confirm the Slurm job is still running.
Blank page after login	Server may be starting up — wait 10–15 seconds. Check `hpc_utils/logs/panel-server-<job_id>.log`.