Economics Research Computing Cluster

Econ Research Computing Cluster Documentation

The systems described in this document are a set of Linux-based resources intended to support the computational needs of faculty members in the Economics Department or affiliated Social Sciences Research groups.  This document describes how to access and begin using these systems.  They consist of three major components: 1) storage, 2) interactive nodes, and 3) batch nodes.

 

These systems are the NFS/SLURM backed research systems. If you are trying to utilize the older AFS/SGE backed systems, please visit the documentation here.

 

Super Short Version

  • Use ssh to connect to login.econ.duke.edu
  • Jobs can be run interactively with the following executables:
    • matlab
    • stata
    • R
    • Any other generally available RHEL application.
  • Jobs can be submitted to the cluster with the following submission scripts:
    • matbatch
    • statabatch
    • Rbatch
    • Custom scripts can be written to be used with sbatch
  • Our batch queue management system is SLURM
  • A snapshot of cluster stats and performance can be found here.

 

Table of Contents

How to use the Economics Cluster

Notable Software

Requesting Access

How to log in

How to log in from OS X – More detail

How to log in from Windows – More detail

Managing Files on the Cluster

Transferring Files to the cluster

Transferring Files From an OS X Client

Transferring Files From a Windows Client

Managing Jobs

Submitting Jobs

Checking on Jobs

Deleting Jobs

Some Useful Linux Commands

Other useful bits

Terms and Definitions

 

Notable Software

The operating system is Scientific Linux version 6.  It is binary compatible with, and contains the same packages as, Red Hat Enterprise Linux 6.  Any software generally available from RHEL can be installed easily upon request.

Additionally, the following software packages are available:

  • Matlab version R2016a
  • Stata version 14 (SE and MP)
  • R version 3.3.0

Older versions of some software may be available under /econ/sw.  For example, Matlab R2014b is installed at /econ/sw/matlab/R2014b.

Other software not available through RPM package management will be installed under /econ/sw into a directory named after the package.

Requesting Access

Your username for the Economics Research Computing Cluster is the same as your University NetID username.  The password is separate from your NetID password.  If you do not know your password, please contact ECS: helpdesk@econ.duke.edu.  All faculty have accounts on this system.  Master's and PhD students should request access by sending email to helpdesk@econ.duke.edu.  Guest access may be provided for others outside of the department; contact helpdesk@econ.duke.edu for more information.

How to log in

The Economics Research Computing Cluster currently has one front-end node.  This may be referred to as a login node, front-end node, or interactive node.  This system performs two main functions: running jobs interactively and submitting jobs to batch nodes.

The front-end node on the Economics Research Computing Cluster can be accessed using the host name faculty.econ.duke.edu, or, after 7/4/2016, login.econ.duke.edu.  

To log in, use an ssh client to connect to faculty.econ.duke.edu.  Further details for OS X and Windows are available below.

Please note that in the examples provided, the hostname login-01.econ.duke.edu is used in the screenshots.  That will also work, however, faculty.econ.duke.edu is the preferred hostname to use, as additional login nodes may be available.  Using faculty.econ.duke.edu as the host name will allow the system to distribute logins across the nodes in a somewhat balanced fashion.
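As a minimal sketch of the command-line login described above, the snippet below writes a small ssh client configuration to a local file and shows how it might be used. The alias econ, the file name econ_ssh_config, and the NetID netid123 are placeholders chosen for illustration; none of them are defined by the cluster.

```shell
# Write a stand-alone ssh configuration file (hypothetical file name) so the
# cluster can be reached through a short alias.
# "econ" is an arbitrary alias and netid123 is a placeholder NetID.
# ForwardX11 and ForwardX11Trusted correspond to the -X and -Y options.
cat > econ_ssh_config <<'EOF'
Host econ
    HostName login.econ.duke.edu
    User netid123
    ForwardX11 yes
    ForwardX11Trusted yes
EOF

# To connect using this file (you will be prompted for your Econ password):
#   ssh -F econ_ssh_config econ
```

Keeping the settings in a separate file and passing it with -F leaves any existing ~/.ssh/config untouched; the same Host block could instead be added to ~/.ssh/config if that is preferred.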

How to log in from OS X – More detail

All versions of OS X include an application called Terminal.  This will get you command line access to the interactive node.  It is possible to run applications like matlab and stata with their graphical interfaces, but later versions of OS X require the installation of XQuartz to enable the graphical user interface.

XQuartz can be found here

Assuming X11/XQuartz is installed, the following steps are required to access the interactive node:

  1. Run Applications > Utilities > XQuartz.app
  2. Right click on the XQuartz icon in the dock and select Applications > Terminal.
  3. In this xterm window, ssh into the Linux system of your choice using the -X and -Y arguments (X11 forwarding and trusted X11 forwarding).
    1. Log into the Econ Cluster using login.econ.duke.edu or faculty.econ.duke.edu
    2. Example:  ssh -X -Y (NetID)@login.econ.duke.edu  or  ssh -X -Y (NetID)@faculty.econ.duke.edu

Once you are logged into the Econ Cluster, you can run the GUI program of your choice, such as matlab.

                           Example:   [NetID@login-01 ~]$ matlab

How to log in from Windows – More detail

For Windows, at minimum, you will need a terminal application.  To use the GUI interfaces of applications such as Matlab, Stata and SAS, you will also need to install an X11 emulator such as Xming.

Once Xming is installed, run the application called 'XLaunch', set the server to login.econ.duke.edu or faculty.econ.duke.edu, then save the configuration and close XLaunch.

 

Accessing the interactive Linux servers from a computer running Windows requires PuTTY, available for free on the Duke OIT website.  Assuming PuTTY is installed, the following steps are required to access the interactive cluster:

  1. Double click on PuTTY to launch it.
  2. Enter the following connection settings:
    1. Host Name: login.econ.duke.edu  or  faculty.econ.duke.edu
    2. Port: 22
    3. Connection Type: SSH (default)
  3. On the left side, under Category:
    1. Click the “+” next to SSH
    2. Select X11
      1. Check box: Enable X11 Forwarding
  4. Save Settings
    1. Go back to Session on the left pane
    2. Under Saved Sessions, Type “Econ” or any name of your choosing
    3. Click Save

 

The most important setting is X11 forwarding. Without it, the X Window System cannot find your PC for display. Save the configuration by typing a name (e.g., econ) in the box under 'Saved Sessions' on the Session screen, then press the Save button. Click Open to open the terminal window or Cancel to close PuTTY.

Managing Files on the Cluster

One way to think about the cluster is to divide it into two parts: storage and computational resources.  Storage is where the files go; computational resources are the CPUs and system memory (distinct from storage) that are used to process data.

By and large, there are two main places you will store your files on the cluster: your home directory and a research directory.  Home directories have tighter size limits; we currently limit home directory quotas to 10 GB.  Keeping home directory sizes controlled eases the administrative burden of the cluster.  Research directories can be more generous in size.

The paths for home directories follow this pattern:

/econ/home/$first_letter_of_username/$username

For example, if your username is tsefhg123, then your home directory would be located at the following path:

/econ/home/t/tsefhg123

Research directories are located as subdirectories of /econ/research.  The more specific path will vary according to function, but generally, the next level of the directory uses the NetID of the primary researcher for that space.  For example, if the NetID of the primary researcher were ‘j_rt32’, the research space would be:

/econ/research/j_rt32

For broader projects, the space would take on a directory name reflective of the project.
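The path patterns above can be sketched as two small shell helpers. The function names econ_home_dir and econ_research_dir are hypothetical and exist only for this illustration; the paths they print simply follow the patterns described in this section.

```shell
# Print the home directory path for a username, following the
# /econ/home/$first_letter_of_username/$username pattern.
econ_home_dir() {
  user="$1"
  first=$(printf '%.1s' "$user")   # first character of the username
  printf '/econ/home/%s/%s\n' "$first" "$user"
}

# Print the research directory for a primary researcher's NetID
# (or a project name, for broader projects).
econ_research_dir() {
  printf '/econ/research/%s\n' "$1"
}

econ_home_dir tsefhg123
econ_research_dir j_rt32
```

Run with the usernames from the examples above, the two calls print /econ/home/t/tsefhg123 and /econ/research/j_rt32.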

Transferring Files to the cluster

The best way to transfer files to the cluster is to use an scp or sftp client.  If you are familiar with command line usage, on OS X you can open the Terminal application and use scp directly from there.  Otherwise, you will need to obtain a client such as FileZilla, which supports SFTP.
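For the command-line route, a minimal sketch of an scp upload is shown below. The helper name econ_upload, the file results.csv, and the NetID netid123 are placeholders for illustration. The destination ends in a bare colon, which scp interprets as your home directory on the remote host.

```shell
# Copy one local file into your home directory on the cluster.
# Usage: econ_upload LOCAL_FILE NETID
econ_upload() {
  file="$1"
  netid="$2"
  # "user@host:" with nothing after the colon means the remote home directory.
  scp "$file" "${netid}@login.econ.duke.edu:"
}

# Example (you will be prompted for your Econ password):
#   econ_upload results.csv netid123
```

To copy in the other direction, swap the two arguments to scp so that the remote path comes first and a local directory comes second.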

 

Transferring Files From an OS X Client

Cyberduck is the recommended GUI client for transferring files from an OS X client.  It can be found at the following link: https://cyberduck.io/

1.    Double-click the Cyberduck icon. When Cyberduck opens, click Open Connection... at the upper left; alternatively, from the File menu, select Open Connection....

2.    At the top of the sheet that appears, from the drop-down menu, select SFTP (SSH File Transfer Protocol).

3.    In the "Server:" field, type the address of the remote host to which you wish to connect (e.g., login-01.econ.duke.edu).

4.    In the "Username:" and "Password:" fields, type your NetID and Econ password.

5.    To save your password to the Keychain, check Add to Keychain.

Note: The first time you connect to a host, Cyberduck will display a warning such as "Unknown host key for login-01.econ.duke.edu". Click Allow to continue.

6.    To save this connection for future use, click the Bookmark button in the upper left section of the window.

7.    Then click the add button (the plus sign) in the bottom left corner of the window.

8.    Once you are logged in, you will notice that your home directory is your default location. To change this, click the arrow to choose another directory.

9.    A window will open displaying the list of files on the remote host. To download files or folders, drag them from Cyberduck into your desired folder in the Finder window.

 

Transferring Files From a Windows Client

FileZilla can be found at the OIT website

Example:

  1. Enter the correct Host/Domain/IP address into the 'Host:' field.
    1. sftp://login.econ.duke.edu
    2. sftp://faculty.econ.duke.edu
  2. Enter Your NetID into the 'Username:' field.
  3. Enter the password for the NetID you entered.
  4. Enter port 22 for SFTP.
  5. Click the 'Quickconnect' button.
  6. You can connect to your Econ Home or Econ Research Directory
    1. Remote site:  /net/storage-01/econ/home/t/ttj5
    2. Remote site: /net/storage-01/econ/research

Managing Jobs

Submitting Jobs

There are two ways to submit jobs: build your own submit script and run it via ‘sbatch’, or use one of the prebuilt scripts for matlab, stata, or R.

To submit matlab jobs, type the following at the linux command line, from the directory where your matlab program resides:

  • matbatch $matlab_program

$matlab_program should be changed to the name of your .m file.

 

To submit stata jobs, type the following at the linux command line, from the directory where your stata program resides:

  • statabatch $stata_program

$stata_program should be changed to the name of your .do file.

 

To submit R jobs, type the following at the linux command line, from the directory where your R program resides:

  • Rbatch $R_program

$R_program should be changed to the name of your R program.
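For the custom-script route mentioned at the start of this section, the sketch below writes a small submit script that could be passed to sbatch. The file name simple.sbatch, the job name, and the resource requests (one task, one CPU, 4G of memory, one hour) are illustrative assumptions and not cluster defaults; appropriate values depend on the job and on the limits configured on this cluster. The matlab line reuses the simple.m invocation shown elsewhere in this document.

```shell
# Write a minimal SLURM submit script (hypothetical file name).
# Lines beginning with #SBATCH are read by sbatch as options.
# %j in the output file name is replaced by the job ID.
cat > simple.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=simple
#SBATCH --output=simple-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# Run simple.m without the GUI, then exit.
matlab -nosplash -nodesktop -r "simple;quit"
EOF

# To submit it from the login node, in the directory containing simple.m:
#   sbatch simple.sbatch
```

A custom script is mainly useful when a job needs options that the prebuilt scripts do not set, such as a specific memory or time request.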

 

Checking on Jobs

To view a list of your running jobs, type the following command at the linux command prompt:

·       squeue -u $USER

For a more general view of what is going on with the cluster at large, and what nodes are in what state:

·       sinfo

For more details on a specific job, either running or completed, use the sacct command:

·       sacct -j $JOBID

The job ID can be found in the output of squeue or in the information on your completed job.  A more verbose output for sacct can be invoked as follows:

·       sacct_ec -j $JOBID

Deleting Jobs

To delete a job from the queue, once you know the job ID number, issue the following command at the linux command prompt:

·       scancel $JOBID
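The commands above can be combined into a small helper that waits for a job to leave the queue and then prints its accounting record. This is only a sketch: the function name wait_for_job and the 30-second default polling interval are assumptions made for illustration. It relies on squeue's -h option (suppress the header line) and -j option (select a job by ID).

```shell
# Poll squeue until the given job no longer appears, then show its sacct record.
# Usage: wait_for_job JOBID [SECONDS_BETWEEN_CHECKS]
wait_for_job() {
  jobid="$1"
  interval="${2:-30}"
  # squeue -h prints nothing once the job is no longer listed in the queue.
  while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
    sleep "$interval"
  done
  sacct -j "$jobid"
}

# Example:
#   wait_for_job 12345
# To give up on the job instead of waiting for it:
#   scancel 12345
```

Polling at a modest interval keeps the load on the scheduler low; running something like this inside screen or tmux lets it continue after you disconnect.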

Some Useful Linux Commands

·      man $foo – For most unix commands, you can type ‘man command’ and get more information on the command.

·      ls - list directory

·      cat - output file contents

·      nano - simple text editor that can be used through shell

·      matbatch - matlab submission script

·      statabatch - stata submission script

·      pwd - what directory am I in

·      squeue - list what is going on with running jobs

·      sinfo - status of the cluster

·      cd - change directory

·      cd ~ - change directory to my home directory

~ is shorthand notation for “my home directory location”

·      cd - - change directory to the previous directory I was in

Other useful bits

Here is how to start matlab up so it runs without the GUI:

·       matlab -nosplash -nodesktop

Here is how to have matlab run a script you’ve written.  Assuming the file name is simple.m:

·      matlab -nosplash -nodesktop -r "simple;quit"

 

Sometimes it is useful to start up screen or tmux.  These applications let you disconnect from a running command line session on the remote system and then reconnect at a later time or from another system.  After you log into an interactive node, you invoke them with either the ‘screen’ or ‘tmux’ command.  The full scope of using these applications is beyond this documentation, but googling ‘screen linux tutorial’ or ‘tmux linux tutorial’ should get you started.
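As a starting point for tmux, the sketch below reattaches to a named session if it already exists and creates it otherwise. The function name econ_tmux and the default session name econ are placeholders for illustration. It uses only the has-session, attach-session, and new-session subcommands.

```shell
# Attach to a named tmux session, creating it first if it does not exist.
# Usage: econ_tmux [SESSION_NAME]
econ_tmux() {
  name="${1:-econ}"
  if tmux has-session -t "$name" 2>/dev/null; then
    tmux attach-session -t "$name"
  else
    tmux new-session -s "$name"
  fi
}

# Example, after logging into the interactive node:
#   econ_tmux analysis
# Inside the session, press Ctrl-b then d to detach and leave it running.
```

After detaching, logging back in later and running the same command reconnects to the session with any running programs still active.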

 

 

Terms and Definitions

batch node

A computer that is not accessed directly, but rather runs jobs that are distributed by SLURM from interactive nodes.

cluster

A collection of computers logically grouped to meet a goal.  In the context of the faculty cluster, this collection exists to support faculty computational needs.

command line / command prompt

The command line is a user interface typically used on UNIX and UNIX-like operating systems.  It consists of a window in which you type commands to execute programs and perform tasks.  This is in contrast to Graphical User Interfaces (GUI), which are better known and typical of desktop operating systems.  A GUI typically uses a combination of mouse and keyboard to input instructions to the computer or program the computer is using.  The command prompt or just prompt is where you input commands for the linux system.

GUI

Graphical User Interface – when a program presents an interface that uses not just typed commands but also mouse- and menu-driven methods of interacting with the program, it is referred to as having a GUI or graphical user interface.

interactive node / login node

A system that is configured to allow direct logins and has the ability to run jobs directly on its own resources.  Typically, in Econ, interactive nodes are also where jobs are submitted to batch nodes.

job

Specifically, the code needing to be run on computational systems.  Loosely, the term can also include the data that such a program needs in order to perform its calculation and (even more loosely) the results/output the program generates.

memory, or system memory, or RAM

On each individual computer, a small amount of very fast storage is used directly by the CPUs to cache results and hold intermediate computations until the results are ready to be displayed.  This can be referred to as memory, system memory, or RAM.  It is not to be confused with storage, or with any discussion of home directory or research directory storage.

node

An individual computer that is a part of a set of systems.  In our case, any individual computer in the cluster is referred to as a node.  It may be a head node/login node, computational node, management node, etc.

storage

Storage refers to where files are saved for long-term reference.  Within the economics cluster, storage is a networked resource with certain redundancies built in so that a single disk failure does not result in a loss of data.  Storage is distinct from system memory and can also be referred to as disk space.  If you are speaking about quota, you are speaking about storage, not memory.