Economics Research Computing Cluster Resources

Econ Research Computing Cluster Documentation

The systems described in this document constitute a set of Linux-based resources whose goal is to support the computational requirements of faculty members in the Economics Department or affiliated Social Sciences Research groups.  This document describes how to access and begin to use this collection of systems.  The systems consist of three major components: 1) storage, 2) interactive nodes, and 3) batch nodes.

 

These systems are the NFS/SLURM backed research systems. If you are trying to utilize the older AFS/SGE backed systems, please visit the documentation here.

 

Super Short Version

  • Use ssh to connect to login.econ.duke.edu
  • Jobs can be run interactively with the following executables:
    • matlab
    • stata
    • R
    • Any other generally available RHEL application.
  • Jobs can be submitted to the cluster with the following submission scripts:
    • matbatch
    • statabatch
    • Rbatch
    • Custom scripts can be written to be used with sbatch
  • Our batch queue management system is SLURM
  • A snapshot of cluster stats and performance can be found here.

 

Table of Contents

How to use the Economics Cluster

Notable Software

Requesting Access

How to log in

How to log in from OS X – More detail

How to log in from Windows – More detail

Managing Files on the Cluster

Transferring Files to the cluster

Transferring Files From an OS X Client

Transferring Files From a Windows Client

Managing Jobs

Submitting Jobs

Checking on Jobs

Deleting Jobs

Some Useful Linux Commands

Other useful bits

Terms and Definitions

 

Notable Software

The operating system is Scientific Linux version 6.  It is binary compatible with, and contains the same packages as, Red Hat Enterprise Linux 6.  Any software generally available from RHEL can be installed easily upon request.

Additionally, the following software packages are available:

  • Matlab version 2016-a
  • Stata version 14 (SE and MP)
  • R version 3.3.0

Older versions of some software may be available under /econ/sw.  For example, Matlab R2014b is available at /econ/sw/matlab/R2014b.

Other software not available through RPM package management will be installed under /econ/sw into a directory named after the package.

Requesting Access

The username for the Economics Research Computing Cluster is the same as your University NetID username.  The password is separate from your NetID password.  If you do not know your password, please contact ECS: helpdesk@econ.duke.edu.  All faculty have accounts on this system.  Master's and PhD students should request access by sending email to helpdesk@econ.duke.edu.  Guest access may be provided for others outside of the department; contact helpdesk@econ.duke.edu for more information.

How to log in

The Economics Research Computing Cluster currently has one front-end node, which may also be referred to as a login node or interactive node.  This system performs two main functions: running jobs interactively and submitting jobs to batch nodes.

The front-end node on the Economics Research Computing Cluster can be accessed using the host name faculty.econ.duke.edu, or, after 7/4/2016, login.econ.duke.edu.  

To access the front-end node, use an ssh client to connect to faculty.econ.duke.edu.  Further details for OS X and Windows are available below.

Please note that in the examples provided, the hostname login-01.econ.duke.edu is used in the screenshots.  That will also work, however, faculty.econ.duke.edu is the preferred hostname to use, as additional login nodes may be available.  Using faculty.econ.duke.edu as the host name will allow the system to distribute logins across the nodes in a somewhat balanced fashion.

How to log in from OS X

All versions of OS X include an application called Terminal, which gives you command line access to the interactive node.  It is possible to run applications like matlab and stata with their graphical interfaces, but later versions of OS X require the installation of XQuartz to enable the graphical user interface.

XQuartz can be found here

Assuming X11/XQuartz is installed, the following steps are required to access the interactive node.

1.     Open Terminal (found in the Utilities folder inside Applications).

2.     At the command line prompt, type ‘ssh -X -Y username@faculty.econ.duke.edu’, where ‘username’ is your Economics Department user ID (not your Duke user ID, unless the two are the same).
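As a convenience, the login command from step 2 can be wrapped in a small shell function. This is only a sketch: the function name econ_login is hypothetical, and the default host is the faculty.econ.duke.edu name used in this document.

```shell
# Hypothetical helper (not installed on the cluster): open an ssh session
# with X11 forwarding on the front-end node.
#   $1 = your Economics Department username
#   $2 = optional host name; defaults to the name used in this document
econ_login() {
    user="$1"
    host="${2:-faculty.econ.duke.edu}"
    # Same flags as in step 2 above.
    ssh -X -Y "${user}@${host}"
}
```

After adding the function to your shell startup file, running ‘econ_login username’ should start the session. Once logged in, ‘echo $DISPLAY’ printing a non-empty value is a quick sign that X11 forwarding is active.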

 

 

How to log in from Windows

For Windows, at minimum, you will need a terminal application.  To use the graphical interfaces of applications such as Matlab, Stata and SAS, you will also need to install an X11 emulator.

 

Accessing the interactive linux servers from a computer running Windows requires X-Win32, available for free on the Duke OIT Website.  Assuming X-Win32 is installed, following are the required steps to access the interactive cluster:

1.     Open X-Config by navigating through ‘Start > All Programs > X-Win32 > X-Config’.

2.     In the main X-Config window pane, select ‘Autostart’, and then click the ‘Manual’ button in the right-hand column with the heading ‘New Connection’.

3.     Select ‘ssh’ in the ‘Connection Method’ window that pops up and click ‘Next’.

4.     A window titled ‘X-Win32 – Edit Connection (ssh)’ will pop up. Fill in the fields as follows:

    • Session Name: Econ Interactive (or whatever name you choose)
    • Host: faculty.econ.duke.edu
    • Port: 22
    • Login: username (your Economics Department user ID, not your Duke user ID, unless the two are the same)
    • Command: /usr/bin/xterm -ls
    • Password: (type your Economics Department password)
    • Confirm Password: (type your password again)

 

 
   

If you need further help, please submit a ticket by emailing helpdesk@econ.duke.edu.

Managing Files on the Cluster

One way to think about the cluster is to divide it into two parts: storage and computational resources.  Storage is where the files go; computational resources are the CPUs and system memory (distinct from storage) that are used to process data.

There are two main places to store your files on the cluster: your home directory and a research directory.  Home directories have stricter size limits; quotas are currently set to 10G.  Keeping home directory sizes controlled eases the administrative burden of the cluster.  Research directories can be allotted more generous sizes.

The paths for home directories follow this pattern:

/econ/home/$first_letter_of_username/$username

For example, if your username is tsefhg123, then your home directory would be located at the following path:

/econ/home/t/tsefhg123
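The path pattern above can be expressed as a short shell function. This is a minimal sketch; the function name econ_home_dir is made up for illustration.

```shell
# Build the home directory path for a username, following the
# /econ/home/$first_letter_of_username/$username pattern.
econ_home_dir() {
    username="$1"
    # cut -c1 keeps only the first character of the username.
    first_letter=$(printf '%s' "$username" | cut -c1)
    printf '/econ/home/%s/%s\n' "$first_letter" "$username"
}

# Prints /econ/home/t/tsefhg123
econ_home_dir tsefhg123
```

To compare your usage against the home directory quota, ‘du -sh ~’ reports the total size of your home directory.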

Research directories are located as subdirectories of /econ/research.  The more specific path will vary according to function, but generally the next level of the directory uses the NetID of the primary researcher for that space.  For example, if the NetID of the primary researcher were ‘j_rt32’, the research space would be:

/econ/research/j_rt32

For broader projects, the space would take on a directory name reflective of the project.

Transferring Files to the cluster

The best way to transfer files to the cluster is to use an scp or sftp client.  If you are familiar with command line usage, on OS X you can open the Terminal application and use scp directly from there.  Otherwise, you will need a client such as FileZilla that supports SFTP.
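For command line users, the scp usage mentioned above might be wrapped as follows. This is a sketch under assumptions: the function names and the ECON_HOST variable are hypothetical, and the host defaults to login.econ.duke.edu as named in this document.

```shell
# Host to transfer to; override by setting ECON_HOST before sourcing.
ECON_HOST="${ECON_HOST:-login.econ.duke.edu}"

# Copy a local file or directory into a directory on the cluster.
#   $1 = cluster username, $2 = local path, $3 = remote directory
econ_upload() {
    # -r lets scp copy directories as well as single files.
    scp -r "$2" "$1@${ECON_HOST}:$3"
}

# Copy a file or directory from the cluster to the local machine.
#   $1 = cluster username, $2 = remote path, $3 = local directory
econ_download() {
    scp -r "$1@${ECON_HOST}:$2" "$3"
}
```

For example, ‘econ_upload tsefhg123 data.csv /econ/home/t/tsefhg123’ would copy data.csv into that user's home directory, and scp will prompt for the cluster password.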

 

Transferring Files From an OS X Client

Cyberduck is the recommended GUI client for transferring files from an OS X client.  It can be found at the following link: https://cyberduck.io/

1.    Double-click the Cyberduck icon. When Cyberduck opens, at the upper left, click Open Connection...; alternatively, from the File menu, select Open Connection....

2.    At the top of the sheet that appears, from the drop-down menu, select SFTP (SSH File Transfer Protocol).

3.    In the "Server:" field, type the address of the remote host to which you wish to connect (e.g., login-01.econ.duke.edu)

4.    In the "Username:" and "Password:" fields, type your NetID and Econ password.

5.    To save your password to the Keychain, check Add to Keychain.

 

Note: The first time you connect to a host, Cyberduck will display a warning such as "Unknown host key for login-01.econ.duke.edu". Click Allow to continue.

6.    To save this connection for future use, click the Bookmark button in the upper left section of the page.

7.    Then click the add button (the plus sign) in the bottom left corner of the window.

8.    Once you are logged in you will notice that your Home Directory is your default location. To change this, click the arrow to choose another directory.

9.    A window will open displaying the list of files on the remote host. To download files or folders, drag them from Cyberduck into your desired folder in the Finder window.

 

Transferring Files From a Windows Client

The SSH Secure Shell client can be found at the OIT website.

1. Open the SSH Secure Shell File Transfer Client. (Skip this step if the program is already open.)

2.  To log on to login-01.econ.duke.edu, click on Quick Connect.

 

3.  From the Operation Menu, select Upload. (You can also start the upload process by clicking on the white upward pointing arrow on the toolbar.) An upload dialog box will appear.

4. Select the file(s) you want to transfer

5. Note: You can select multiple files by holding down the CONTROL key and clicking the name of each file.

6. Click Upload.

7. The file(s) should now be transferred to the appropriate folder on login-01.econ.duke.edu.

8. To close the connection to login-01.econ.duke.edu, from the File Menu, select Disconnect. A Confirm Disconnect dialog box will appear.

9. In the Confirm Disconnect dialog box, click Yes.

10. To exit the application, from the File Menu, select Exit. 

Managing Jobs

Submitting Jobs

There are two ways to submit jobs: build your own submit script and run it via ‘sbatch’, or use one of the prebuilt scripts for matlab, stata, or R.
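A custom submit script for sbatch might look roughly like the one written out below. This is a sketch, not the cluster's recommended template: the file name myjob.sh, the job name, and the memory and time values are placeholders, and the Matlab program is assumed to be a file called simple.m in the current directory.

```shell
# Write a minimal SLURM submit script to myjob.sh (illustrative file name).
cat > myjob.sh <<'EOF'
#!/bin/bash
# Name shown in the squeue listing.
#SBATCH --job-name=simple
# File for the job's output; %j expands to the job ID.
#SBATCH --output=simple-%j.out
# One task using one CPU core.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Memory and wall-clock requests (placeholder values).
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# Run simple.m without the Matlab GUI, then quit.
matlab -nosplash -nodesktop -r "simple;quit"
EOF
```

The script would then be submitted with ‘sbatch myjob.sh’, which prints the job ID of the submitted job.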

To submit matlab jobs, type the following at the linux command line, from the directory where your matlab program resides:

  • matbatch $matlab_program

$matlab_program should be changed to the name of your .m file.

 

To submit stata jobs, type the following at the linux command line, from the directory where your stata program resides:

  • statabatch $stata_program

$stata_program should be changed to the name of your .do file.

 

To submit R jobs, type the following at the linux command line, from the directory where your R program resides:

  • Rbatch $R_program

$R_program should be changed to the name of your R program.
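When a directory holds several programs, the prebuilt scripts can be called in a loop. The sketch below assumes statabatch takes the .do file name as its only argument, as described above; the function name submit_all_do_files is hypothetical.

```shell
# Submit every .do file in the current directory with statabatch.
submit_all_do_files() {
    for do_file in *.do; do
        # If no .do files exist, the pattern stays literal; skip it.
        [ -e "$do_file" ] || continue
        echo "Submitting $do_file"
        statabatch "$do_file"
    done
}
```

The same pattern should work for matbatch with *.m files, under the same one-argument assumption.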

 

Checking on Jobs

To view a list of your running jobs, type the following command at the linux command prompt:

·       squeue -u $USER

For a more general view of what is going on with the cluster at large, and what nodes are in what state:

·       sinfo

For more details on a specific job, either running or completed, use the sacct command:

·       sacct -j $JOBID

The job ID can be found in the output of squeue or in the information about your completed job.  A more verbose output for sacct can be invoked as follows:

·       sacct_ec -j $JOBID
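The monitoring commands above can be combined into small helpers. This is a sketch with hypothetical function names; the squeue and sacct options used (-h, -u, -j, --format) are standard SLURM options, but the choice of columns is only a suggestion.

```shell
# Count how many of your jobs are currently in the queue.
# -h suppresses the squeue header so only job rows are counted.
# Falls back to `id -un` if $USER is not set.
my_job_count() {
    squeue -h -u "${USER:-$(id -un)}" | awk 'END { print NR }'
}

# Show a compact accounting summary for one job.
#   $1 = job ID
job_summary() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
}
```

Here the State and ExitCode columns show whether a job completed or failed, and MaxRSS reports peak memory use, which can help when choosing a memory request for later jobs.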

Deleting Jobs

To delete a job from the queue, once you know the job ID number, issue the following command at the linux command prompt:

·       scancel $JOBID
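scancel can also select jobs by user or by job name rather than by ID. The helpers below are a sketch with hypothetical function names; -u and --name are standard scancel options.

```shell
# Cancel all of your own queued and running jobs. Use with care.
cancel_all_my_jobs() {
    scancel -u "${USER:-$(id -un)}"
}

# Cancel only your jobs that have a given job name.
#   $1 = job name, for example the value given to sbatch --job-name
cancel_jobs_named() {
    scancel -u "${USER:-$(id -un)}" --name="$1"
}
```

Restricting by user in both functions is a safeguard so that a name match never targets another person's jobs.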

Some Useful Linux Commands

·      man $foo – For most Unix commands, you can type ‘man command’ and get more information on the command.

·      ls - list directory

·      cat - output file contents

·      nano - simple text editor that can be used through shell

·      matbatch - matlab submission script

·      statabatch – stata submission script

·      pwd - what directory am I in

·      squeue - list what is going on with running jobs

·      sinfo - status of the cluster

·      cd - change directory

·      cd ~ - change directory to my home directory

~ is shorthand notation for “my home directory location”

 

·       cd - = change directory to the previous directory I was in

Other useful bits

Here is how to start matlab up so it runs without the gui:

·       matlab -nosplash -nodesktop

Here is how to have matlab run a script you’ve written.  Assuming the file name is simple.m:

·      matlab -nosplash -nodesktop -r "simple;quit"
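One caveat with the command above: if simple.m raises an error, the quit is never reached and matlab stays at its prompt. A possible workaround, sketched here with a hypothetical function name, is to wrap the script call in a Matlab try/catch so that quit always runs.

```shell
# Run a Matlab script without the GUI and always quit afterwards,
# printing the error report if the script fails.
#   $1 = script name without the .m extension (for example: simple)
run_matlab_script() {
    script="$1"
    matlab -nosplash -nodesktop \
        -r "try, ${script}; catch err, disp(getReport(err)); end; quit"
}
```

With this in place, ‘run_matlab_script simple’ runs simple.m from the current directory and returns to the shell whether or not the script succeeds.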

 

Sometimes it is useful to start up screen or tmux.  These applications let you disconnect from a running command line session on the remote system and then reconnect at a later time or from another system.  After you log into an interactive node, you invoke them with either the ‘screen’ or ‘tmux’ command.  The full scope of using these applications is beyond this documentation, but googling ‘screen linux tutorial’ or ‘tmux linux tutorial’ should get you started.
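As a starting point for tmux, the sketch below reattaches to a named session if one exists and creates it otherwise. The function name econ_session and the default session name ‘work’ are arbitrary examples.

```shell
# Attach to a named tmux session if it exists, otherwise create it.
#   $1 = optional session name; defaults to "work"
econ_session() {
    name="${1:-work}"
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach-session -t "$name"
    else
        tmux new-session -s "$name"
    fi
}
```

Inside the session, pressing Ctrl-b and then d detaches and leaves your programs running; calling econ_session again after the next login reconnects to them.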

 

 

Terms and Definitions

batch node

A computer that is not accessed directly, but rather runs jobs that are distributed by SLURM from interactive nodes.

cluster

A collection of computers logically grouped to meet a goal.  In the context of the faculty cluster, this collection is intended to support faculty computational needs.

command line / command prompt

The command line is a user interface typically used on UNIX and UNIX-like operating systems.  It consists of a window in which you type commands to execute programs and perform tasks.  This is in contrast to Graphical User Interfaces (GUI), which are better known and typical of desktop operating systems.  A GUI typically uses a combination of mouse and keyboard to input instructions to the computer or program the computer is using.  The command prompt or just prompt is where you input commands for the linux system.

GUI

Graphical User Interface – when a program presents an interface that uses not just typed commands but also mouse and menu driven methods of interacting with the program, it is referred to as having a GUI or graphical user interface.

interactive node / login node

A system that is configured to allow direct logins and has the ability to run jobs directly on its own resources.  Typically, in Econ, interactive nodes are also where jobs are submitted to batch nodes.

job

Specifically, the code needing to be run on computational systems.  Loosely, the term can also include the data that such a program needs in order to perform its calculation and (even more loosely) the results/output the program generates.

memory, or system memory, or RAM

On each individual computer, a relatively small amount of very fast storage is used directly by the CPUs to hold data and intermediate computations while a program runs.  This can be referred to as memory, system memory, or RAM.  It is not to be confused with storage, or with any discussion of home directory or research directory storage.

node

An individual computer that is a part of a set of systems.  In our case, any individual computer in the cluster is referred to as a node.  It may be a head node/login node, computational node, management node, etc.

storage

Storage refers to where files are saved for long-term reference.  Within the economics cluster, storage is a networked resource with redundancy built in so that a single disk failure does not result in a loss of data.  Storage is distinct from system memory and can also be referred to as disk space.  If you are speaking about quota, you are speaking about storage, not memory.