Tinkergpu:Newuser


How to use lab cluster

Preparation

If you are new to Linux: https://ryanstutorials.net/linuxtutorial/

Lab cluster

login

You need an account to log in to the lab cluster to use various software and perform calculations. Remote password login is no longer allowed; an SSH key is required. You have to come to the lab to do the steps below in person. If you cannot, ask the admin to do it for you. Once you have received an account and temporary password, log in at any workstation in the lab and set up your SSH key:

cd ~
ssh-keygen -t rsa -b 4096    # then hit Enter through all the prompts
cd .ssh 
cat id_rsa.pub >> authorized_keys
chmod 600  authorized_keys

Then run "ssh bme-nova". Use "passwd" to update your password to something very secure; the bare command is sufficient, and prompts will open for the old and new passwords. Your change will be propagated to all nodes automatically.

After you have logged into nova, obtain a list of all known nodes (hostnames: https://github.com/bdw2292/Ren-Lab-Daemon/blob/main/nodes.txt) and ssh from nova to each node. A prompt will ask whether you would like to add that hostname to the list of known hosts; type yes for each prompt.

login from your laptop on and off campus

You will need an SSH client to log into our computer cluster remotely. If you use Mac or Linux, an SSH client is built in (via the terminal). If you use Windows, you can install an SSH client such as MobaXterm (https://mobaxterm.mobatek.net/), Windows Subsystem for Linux (Ubuntu/Debian etc.), or Cygwin. After finishing the "login" step above, create an SSH key pair (private and public) on your laptop and append the public key from your laptop to "authorized_keys" in your .ssh folder on a lab workstation. Again, if you cannot come to the lab in person, send us the public key (via a link to a cloud drive, not pasted directly into an email) and we will append it for you.
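A minimal sketch of this step is below; the key type, file names, and the "you@your_laptop" comment are placeholders, and the long "AAAA..." string stands for your actual public key.

# On your laptop (Mac/Linux terminal, or MobaXterm/WSL shell on Windows):
ssh-keygen -t rsa -b 4096     # press Enter through the prompts; creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub
cat ~/.ssh/id_rsa.pub         # this single line is the public key to be appended
# Then, at a lab workstation (or via the admin), append that line to your authorized_keys:
echo "ssh-rsa AAAA...rest_of_your_public_key... you@your_laptop" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys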

Remotely, you can log into bme-black.bme.utexas.edu, bme-water.bme.utexas.edu, or bme-jupiter.bme.utexas.edu; all other nodes/servers are blocked from outside. From there, you can ssh into nova and the nodes behind nova (see the cluster structure at the bottom of this page). If the username on your device does not match your cluster username, preface the address with your username (e.g. longhorn42@bme-water.bme.utexas.edu).
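If you use OpenSSH (Mac/Linux/WSL), an entry in ~/.ssh/config like the sketch below lets you reach nova in one hop through an entry node. The username and key path are placeholders, and ProxyJump requires OpenSSH 7.3 or newer.

# ~/.ssh/config on your laptop
Host bme-water
    HostName bme-water.bme.utexas.edu
    User your_username
    IdentityFile ~/.ssh/id_rsa

Host nova
    HostName nova.bme.utexas.edu
    User your_username
    ProxyJump bme-water

# now "ssh nova" from your laptop tunnels through bme-water automatically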

your home dir is on nova

bme-nova.bme.utexas.edu (or nova.bme.utexas.edu) is the head node of our computer cluster, where your home directory and files are stored. Your home directory (echo $HOME) is located in either /home or /users, which is physically located on nova. It can be accessed from any node or workstation in the lab. From nova, you can start a job with "ssh node120 xxx", where xxx is your simulation script. You can write your own script to submit a batch of jobs to these nodes (checking availability and skipping a node if it is busy), as sketched below. Your home directory is shared among nova and all nodes via NFS.
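A minimal sketch of such a script follows; the node names, the load threshold, and "run.sh" are placeholders to adapt to your own jobs.

#!/bin/bash
# For each candidate node, check the 1-minute load average and launch on the first idle one.
for node in node120 node121 node122; do      # example node names
    load=$(ssh "$node" "cut -d' ' -f1 /proc/loadavg")
    # treat a load under 1.0 as idle; raise the threshold if you are willing to share a node
    if awk -v l="$load" 'BEGIN {exit !(l < 1.0)}'; then
        echo "Submitting to $node"
        ssh "$node" "cd $PWD && nohup ./run.sh > run.log 2>&1 < /dev/null &"
        break
    fi
done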

(More about the lab cluster structure at the end of the page)

Your home directory has a limit (quota) on both space and number of files; you won't be able to create new files once it is reached.

Even though you can see your home folder on all workstations/nodes in the lab, they are actually accessing it remotely over the network (hence use /scratch, which is local to each node, for serious jobs).


where to run simulations & analysis

The home dir on nova is accessible on all nodes, but when you perform calculations on the nodes/workstations in your home directory, the files on nova are actually being accessed over the network. It is convenient to run relatively short simulations, store the output files, and perform quick analysis directly in your home directory. But because all members of the lab share home directories on nova, it can stress the system if several demanding simulations or trajectory analyses run at the same time on files in home directories on nova.

So for demanding calculations, please use /work (mounted from sun.bme.utexas.edu) and /work2 (from uranus.bme.utexas.edu) as working directories. sun and uranus are shared among all nodes, just like nova, but this way we spread the burden across three servers. If you have an active project and expect to produce a substantial number of files, let me know and I will create a folder for you on /work and /work2. Keep nova (your home directory) clean and efficient for everyone, including yourself.

For QM and other highly disk-intensive tasks, we also use the local "/scratch" disk. This space is large (100s of GB) but is "local", meaning it is only accessible on that specific node (after you ssh into it).
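A typical pattern (the node name, paths, and "your_command" are placeholders) is to stage inputs onto /scratch, run there, then copy results back to a shared location:

ssh node150                                  # log into the node you will run on
mkdir -p /scratch/$USER/myjob
cp ~/myjob/input.* /scratch/$USER/myjob/     # stage inputs onto the local disk
cd /scratch/$USER/myjob
nohup your_command > job.log 2>&1 &          # run on the fast local disk
# when the job finishes, copy what you need back to /home, /work, or /work2
cp job.log output.* ~/myjob/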

When submitting a command, use the following form. It runs the command in the background and redirects output that would otherwise print to the screen into a file ("nohup.out" by default). This prevents the job from terminating when you log out or close the terminal.

nohup your_command &


Use "less +F file_name" to keep a constantly updated view of the end of a file on screen.

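Putting the two tips together (file names are just examples):

nohup your_command > nohup.out 2>&1 &
less +F nohup.out      # Ctrl-C stops following, then q quits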
 

Lab cluster usage monitoring and signup

The node activities can be monitored here:

CPU nodes monitor: http://biomol.bme.utexas.edu/ganglia/?c=NOVA&m=load_one&r=hour&s=by%20name&hc=4 
GPU monitor: https://biomol.bme.utexas.edu/~rq875/checkgpu.log 

Complete CPU/GPU usage & sign up spreadsheet (email pren to request permission to edit): https://docs.google.com/spreadsheets/d/1EOlUwFpdNU2uBZ5XrHSvZnRYCUCOw5tCSkFisTm3big/edit?usp=sharing


Utilities

We don’t have a job queuing system (yet) so you just log into a node to run your job.

Use "top" to check the load on a node; 600% means 6 cores are fully loaded.
"less /proc/cpuinfo" to check how many CPU cores/threads a node has
"free -g" to check free memory (especially for QM jobs); the "-/+ buffers/cache:" line shows the real amount of free memory
"echo > large_file_name" to empty a large file quickly; rm can be slow
"df -h" to check available storage (disk space)

 

More tips about Linux commands and utilities:

https://docs.google.com/document/d/1cnmSItdRXDBcpVBGhwDahJVJ2jeh4l2oDVpMBLtlySE/edit?usp=sharing 

Backups

All home directories and shared work areas (/home, /users, /opt, /work, /work2) are backed up twice a week and kept for about 2 weeks on bigdata.bme.utexas.edu. If you need to recover any files from the last couple of weeks, you should be able to find them there: log into bigdata and cd /bigdata/renlab/.
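For example (the layout under /bigdata/renlab mirrors the backed-up filesystems, but the exact path to your copy may differ, so browse first):

ssh bigdata
cd /bigdata/renlab/
ls                                            # locate the backup copies of /home, /users, /work, etc.
cp home/your_username/path/to/lost_file ~/    # illustrative path; adjust to what you find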


Cluster structure

nova.bme.utexas.edu is the head node for all computing nodes (nodexxx), which can only be accessed from nova. The /home, /users, and /opt folders are all part of the nova file system. The computing nodes have private IP addresses, so they cannot be seen from outside, including from your workstation or laptop. Each node has a large local "/scratch" folder that is writable by everyone. sun and uranus are two more large servers, mounted as /work and /work2 on each workstation and computing node; you can use them just like your home directory. bigdata is the backup server.

Read the PDF below for an illustration.

Node venus is no longer an entry node into the cluster (see above for the nodes that can be used to enter the cluster).

Lab Cluster.pdf