Channel: Clusters and HPC Technology

Omnipath support in Intel MPI 4.1


The CAE software we develop uses Intel MPI version 4.1 and is typically deployed on clusters with InfiniBand.
One of our customers is considering the purchase of a cluster that uses Omni-Path.
Can anyone comment on whether I should expect the software to work as-is with Omni-Path? Or would we need to upgrade our Intel MPI, or perhaps make other code changes?
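
For context, a rough sketch of what I suspect an upgrade path would look like (an assumption on my part, not verified: Intel MPI 4.1 predates Omni-Path, so I expect a newer Intel MPI would be required, selecting the TMI/PSM2 or OFI fabric rather than DAPL; the install path and binary name below are placeholders):

~~~
# Assumption: a newer Intel MPI (5.1+ / 2017+) is installed on the Omni-Path cluster.
# The application itself should not need source changes, only a relink/redeploy
# against the newer runtime plus a fabric selection such as:
source /opt/intel/impi/2017.x/bin64/mpivars.sh   # hypothetical install path
export I_MPI_FABRICS=shm:tmi   # TMI/PSM2 is the usual Omni-Path path on 2017-era Intel MPI
export I_MPI_DEBUG=5           # prints which fabric was actually selected at startup
mpirun -np 64 ./our_cae_solver # placeholder binary name
~~~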

Thank you,

Eric Marttila
ThermoAnalytics, Inc.
eam@thermoanalytics.com


HPC Orchestrator integration tests


 

I am trying to run the HPC Orchestrator integration tests. The Fortran tests fail because I do not have a license for the Intel Fortran compiler - I only have a C++ compiler license. Does anybody know how I could disable the Fortran compiler tests? I did not find anything in the documentation or the test scripts.
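
In the meantime, a crude workaround I am considering (a sketch only; the test directory layout and script names here are hypothetical, since I could not find a documented switch):

~~~
# Hypothetical layout: skip any integration test script that invokes ifort,
# since only the C/C++ compiler is licensed on this system.
for t in tests/*.sh; do            # placeholder test directory
    if grep -q ifort "$t"; then
        echo "skipping Fortran test $t"
        continue
    fi
    bash "$t"
done
~~~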
 

IMPI and DAPL fabrics on Infiniband cluster


Hello, I have been trying to submit a job on our cluster for an Intel 17-compiled, Intel MPI-enabled code. I keep running into trouble at startup when launching through PBS.

This is the submission script:

#!/bin/bash
#PBS -N propane_XO2_ramp_dx_p3125cm(IMPI)
#PBS -W umask=0022
#PBS -e /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.err
#PBS -o /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind/propane_XO2_ramp_dx_p3125cm.log
#PBS -l nodes=16:ppn=12
#PBS -l walltime=999:0:0
module purge
module load null modules torque-maui intel/17
export OMP_NUM_THREADS=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=OpenIB-cma
export I_MPI_FALLBACK_DEVICE=0
export I_MPI_DEBUG=100
cd /home4/mnv/FIREMODELS_ISSUES/fds/Validation/UMD_Line_Burner/Test_Valgrind
echo
echo $PBS_O_HOME
echo `date`
echo "Input file: propane_XO2_ramp_dx_p3125cm.fds"
echo " Directory: `pwd`"
echo "      Host: `hostname`"
/opt/intel17/compilers_and_libraries/linux/mpi/bin64/mpiexec   -np 184 /home4/mnv/FIREMODELS_ISSUES/fds/Build/impi_intel_linux_64/fds_impi_intel_linux_64 propane_XO2_ramp_dx_p3125cm.fds

As you can see, I'm requesting DAPL with OpenIB-cma as the DAPL provider. This is what I see in /etc/dat.conf on my login node:

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib0 0"""
OpenIB-cma-1 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib1 0"""
OpenIB-cma-2 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib2 0"""
OpenIB-cma-3 u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "ib3 0"""
OpenIB-bond u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "bond0 0"""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0"""
ofa-v2-ib1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib1 0"""
ofa-v2-ib2 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib2 0"""
ofa-v2-ib3 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib3 0"""
ofa-v2-bond u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "bond0 0"""

Logging in to the actual compute nodes, I don't see an /etc/dat.conf on them. I don't know whether this is normal or there is an issue there.

Anyway, when I submit the job I get the attached stdout file, where it seems some of the nodes fail to load OpenIB-cma (with no fallback fabric).

For what it's worth, some nodes on the cluster use QLogic InfiniBand cards and others use Mellanox.

At this point I've tried several combinations, either specifying IB fabrics or not, without success. I'd really appreciate your help troubleshooting this.
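
For reference, the checks I am planning to run next (a sketch, assuming password-less ssh to the nodes listed in the PBS nodefile, run from inside a job):

~~~
# Verify that every compute node actually has a DAT configuration and the DAPL
# libraries; a missing /etc/dat.conf on a node would explain the provider failure.
for n in $(sort -u "$PBS_NODEFILE"); do
    echo "== $n =="
    ssh "$n" 'ls -l /etc/dat.conf 2>&1; ldconfig -p | grep -E "libdaplcma|libdaplofa"'
done

# Alternative provider to try, since the uDAPL 2.0 entries are also listed in dat.conf:
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0
~~~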

Thank you,

Marcos

 

 

version incompatibility inside a cluster


Hey, good morning. My name is Eliomar and I'm from Venezuela. I'm fairly new to working with MPI; I'm building a master's project with this technology, but I'm facing a problem with my implementation that I don't know how to approach.

The problem is this: I have a cluster composed of 4 Skylake nodes and 1 KNL node, and I'm trying to run a program on the KNL from a Skylake. On the KNL I have the 2017 version installed and on the Skylakes the 2015 version. At first it crashed with a bash error saying that the file or directory does not exist; that made sense, since the versions on the KNL don't match, so MPI_Comm_spawn to the KNL was bound to produce that error. I thought I had solved it by setting the root environment variables, but now, when I run my program on the cluster, at the moment of spawning a new process on the KNL the program just hangs there. The errors it prints are (after I press Ctrl+C):

HYDU_sock_write (../../utils/sock/sock.c:417): write error (Bad file descriptor)

HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:246): unable to write data to proxy

ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:172): unable to send signal downstream

HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status

HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:480): error waiting for event

main (../../ui/mpich/mpiexec.c:945): process manager error waiting for completion
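
One thing I plan to verify next (a sketch; the path below is the default Intel MPI install location and may differ on this cluster) is that both machine types end up with the same Intel MPI runtime before the spawn:

~~~
# On every node (Skylake and KNL alike), source the same MPI environment so the
# spawned processes find matching libraries; mixing the 2015 and 2017 runtimes
# across MPI_Comm_spawn is what I suspect causes the hang.
source /opt/intel/compilers_and_libraries_2017/linux/mpi/bin64/mpivars.sh  # assumed path
which mpirun        # should point into the same 2017 tree on all nodes
echo "$I_MPI_ROOT"
~~~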

Any help would be great, especially if you have faced a problem similar to mine.

Regards from Venezuela.

MPI_Mprobe() makes no progress for internode communicator


Hi all,

My understanding (correct me if I'm wrong) is that MPI_Mprobe() has to guarantee progress if a matching send has been posted. The minimal working example below, however, runs to completion on a single Phi node of Stampede2, while deadlocking on more than one node.

Thanks,
Toby

impi version:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)

mwe.c (attached)

slurm-mwe-stampede2-two-nodes.sh
~~~
#!/bin/sh
#SBATCH -J mwe # Job name
#SBATCH -p development # Queue (development or normal)
#SBATCH -N 2 # Number of nodes
#SBATCH --tasks-per-node 1 # Number of tasks per node
#SBATCH -t 00:01:00 # Time limit hrs:min:sec
#SBATCH -o mwe-%j.out # Standard output and error log
~~~

mwe-341107.out
~~~
TACC: Starting up job 341107
TACC: Starting parallel tasks...
[0]: post Isend
[1]: post Isend
slurmstepd: error: *** JOB 341107 ON c455-084 CANCELLED AT 2017-10-16T10:59:26 DUE TO TIME LIMIT ***
[mpiexec@c455-084.stampede2.tacc.utexas.edu] control_cb (../../pm/pmiserv/pmiserv_cb.c:857): assert (!closed) failed
[mpiexec@c455-084.stampede2.tacc.utexas.edu] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c455-084.stampede2.tacc.utexas.edu] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@c455-084.stampede2.tacc.utexas.edu] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion
~~~

Attachment: mwe.c (text/x-csrc), 1.03 KB
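
For what it is worth, the extra experiment I intend to run (a sketch; on Stampede2 the site launcher may differ, and the TCP fallback below is only meant to isolate the fabric):

~~~
# Re-run the attached mwe.c over plain TCP to see whether the lack of progress in
# MPI_Mprobe() is specific to the default Omni-Path path on Stampede2.
export I_MPI_FABRICS=shm:tcp
export I_MPI_DEBUG=5
mpiexec.hydra -n 2 -ppn 1 ./mwe   # assumes mwe.c was compiled to ./mwe with mpiicc
~~~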

Obtaining the total throughput & latency using the IMB benchmark


Hello, I am new to using the IMB benchmark and wanted to make sure whether it is possible to obtain the total throughput and the latency from IMB.

Currently the IMB benchmark reports the throughput per second for a fixed message size over a fixed number of repetitions.

For example, like the run below.

 

$ mpirun -np 64 -machinefile hosts_infin ./IMB-MPI1 -map 32x2 Sendrecv

#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         0.76         0.76         0.76         0.00
            1         1000         0.85         0.85         0.85         2.35
            2         1000         0.79         0.79         0.79         5.06
            4         1000         0.80         0.80         0.80        10.00
            8         1000         0.78         0.78         0.78        20.45
           16         1000         0.79         0.80         0.80        40.16
           32         1000         0.79         0.79         0.79        80.61
           64         1000         0.79         0.79         0.79       162.59
          128         1000         0.82         0.82         0.82       311.41
          256         1000         0.91         0.91         0.91       565.42
          512         1000         0.95         0.95         0.95      1082.13
         1024         1000         0.99         0.99         0.99      2076.87
         2048         1000         1.27         1.27         1.27      3229.91
         4096         1000         1.71         1.71         1.71      4802.87
         8192         1000         2.49         2.50         2.50      6565.97
        16384         1000         4.01         4.01         4.01      8167.28
        32768         1000         7.08         7.09         7.08      9249.23
        65536          640        22.89        22.89        22.89      5725.50
       131072          320        37.45        37.45        37.45      6999.22
       262144          160        65.74        65.76        65.75      7972.53
       524288           80       120.10       120.15       120.12      8727.37
      1048576           40       228.63       228.73       228.68      9168.57
      2097152           20       445.38       445.69       445.53      9410.86
      4194304           10       903.77       905.97       904.87      9259.29

#-----------------------------------------------------------------------------

However, what I want to know is the total throughput or the latency when I use a varying number of cores.

Would this be possible with the IMB benchmark, or do I need to use a traditional benchmark like FIO to do this?

If obtaining the total throughput from the IMB benchmark is possible, there is probably a way to fix the time limit while allowing an unlimited number of iterations of the benchmark. But since I am not familiar with MPI benchmarks, I cannot find a clean way of doing this.
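
To make the question concrete, this is roughly what I have in mind (a sketch; the -npmin/-time/-iter options are used here as I understand them from the IMB documentation, so please correct me if they behave differently):

~~~
# Sweep the number of ranks, cap each sample at a fixed wall time instead of a
# fixed repetition count, then compare the reported t_avg and Mbytes/sec columns.
for np in 2 4 8 16 32 64; do
    mpirun -np $np -machinefile hosts_infin \
        ./IMB-MPI1 -npmin $np -time 10 -iter 100000 Sendrecv \
        > sendrecv_np${np}.log
done
~~~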

Register for the Intel® HPC Developer Conference


The Intel® HPC Developer Conference, taking place November 11–12, 2017 in Denver, Colorado, is a great opportunity for developers who want to gain technical knowledge and hands-on experience with HPC luminaries to help take full advantage of today’s and tomorrow’s technology. Topics this year will include parallel programming, high productivity languages, artificial intelligence, systems, enterprise, visualization development... and much more.

We invite you to come gain hands-on experience with Intel platforms, network with Intel and industry experts, and gain insights on recent technology advances to maximize software efficiency and accelerate your path to discovery. Register now for this free conference.*

Encourage your colleagues to attend too. When sharing on social media, use the hashtag #HPCDevCon.

Intel multinode Run Problem


Hi There,

I have a system with 6 compute nodes; the /opt folder is NFS-shared and Intel Parallel Studio Cluster Edition is installed on the NFS server.

I am using Slurm as the workload manager. When I run a VASP job on 1 node there is no problem, but when I start to run the job on 2 or more nodes I get the following errors:

rank = 28, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0

 

I tested the ssh connectivity between compute nodes with sshconnectivity.exp /nodefile.

The user information is shared over an LDAP server, which is the head node.

I couldn't find a working solution on the net. Has anyone ever had this error?
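
In case it helps narrow things down, these are the checks I have been running on the compute nodes (a sketch; node names and the rank count are placeholders, and the firewall service name assumes a RHEL/CentOS-style setup):

~~~
# The socksm.c POLLERR assertion usually means a TCP connection between ranks was
# reset, so check name resolution and firewalls between the compute nodes.
for n in node01 node02; do        # placeholder node names
    ssh "$n" 'hostname; getent hosts $(hostname); systemctl is-active firewalld'
done

# Re-run with more MPI debug output to see which connection fails:
mpirun -genv I_MPI_DEBUG 6 -np 48 vasp_std   # placeholder rank count and binary
~~~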

Thanks.

 

 


Intel Cluster Checker collection issue


Dear All,

I'm using Intel(R) Cluster Checker 2017 Update 2 (build 20170117), installed locally on the master node in /opt/intel as part of Intel Parallel Studio XE.

However, when running clck-collect I get the following error for all connected compute nodes.

[root@master ~]# clck-collect -a -f nodelist
computenode02: bash: /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh: No such file or directory
pdsh@master: computenode02: ssh exited with exit code 127

computenode01: bash: /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh: No such file or directory
pdsh@master: computenode01: ssh exited with exit code 127

Please advise whether this is an issue with the Parallel Studio installation, or whether I'm missing something.
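
For reference, this is how I checked whether the compute nodes can actually see the clck installation (a sketch, using the same nodelist file I pass to clck-collect):

~~~
# clck-collect runs clck_run_provider.sh on every node, so the /opt/intel tree must
# be visible there; exit code 127 ("No such file or directory") suggests it is only
# installed or mounted on the master node.
pdsh -w ^nodelist 'ls -ld /opt/intel/clck/2017.2.019/libexec/clck_run_provider.sh'
pdsh -w ^nodelist 'mount | grep /opt/intel'   # is /opt/intel NFS-mounted on the nodes?
~~~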

How to disable intra-node communication


I would like to test the network latency/bandwidth of each node that I am running on in parallel. I think the simplest way to do this would be to have each node test itself. 

 

My question is: how can I force all Intel MPI TCP communication to go through the network adapter, rather than the optimized node-local path?
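
One approach I am considering (a sketch; it assumes Intel MPI accepts a single-fabric value, which as far as I know disables the shared-memory path):

~~~
# Drop "shm" from the fabric list so even ranks on the same node talk over TCP
# through the network adapter, and bind the TCP traffic to a specific interface.
export I_MPI_FABRICS=tcp
export I_MPI_TCP_NETMASK=eth0         # or whichever interface should be measured
mpirun -np 2 -ppn 2 ./latency_bench   # both ranks on one node, TCP only (placeholder binary)
~~~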
 

Any advice would be greatly appreciated.

 

Best Regards,

John

MPI ISend/IRecv deadlock on AWS EC2


Hi, I'm encountering an unexpected deadlock in this Fortran test program, compiled using Parallel Studio XE 2017 Update 4 on an Amazon EC2 cluster (Linux system).

$ mpiifort -traceback nbtest.f90 -o test.x

 

On one node, the program runs just fine, but with any more it deadlocks, leading me to suspect an internode communication failure, though my knowledge in this area is lacking. FYI, the test code is hardcoded to be run on 16 cores.

Any help or insight is appreciated!

Danny

Code

program nbtest

  use mpi
  implicit none

  !***____________________ Definitions _______________
  integer, parameter :: r4 = SELECTED_REAL_KIND(6,37)
  integer :: irank

  integer, allocatable :: gstart1(:)
  integer, allocatable :: gend1(:)
  integer, allocatable :: gstartz(:)
  integer, allocatable :: gendz(:)
  integer, allocatable :: ind_fl(:)
  integer, allocatable :: blen(:),disp(:)

  integer, allocatable :: ddt_recv(:),ddt_send(:)

  real(kind=r4), allocatable :: tmp_array(:,:,:)
  real(kind=r4), allocatable :: tmp_in(:,:,:)

  integer :: cnt, i, j
  integer :: count_send, count_recv

  integer :: ssend
  integer :: srecv
  integer :: esend
  integer :: erecv
  integer :: erecv2, srecv2

  integer :: mpierr, ierr, old, typesize, typesize2,typesize3
  integer :: mpi_requests(2*16)
  integer :: mpi_status_arr(MPI_STATUS_SIZE,2*16)

  character(MPI_MAX_ERROR_STRING) :: string
  integer :: resultlen
  integer :: errorcode
!***________Code___________________________
  !*_________initialize MPI__________________
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)
  call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN,ierr)

  allocate(gstart1(0:15), &
       gend1(0:15), &
       gstartz(0:15), &
       gendz(0:15))


  gstart1(0) = 1
  gend1(0) = 40
  gstartz(0) = 1
  gendz(0) = 27

  do i = 2, 16
     gstart1(i-1) = gend1(i-2) + 1
     gend1(i-1)   = gend1(i-2) + 40
     gstartz(i-1) = gendz(i-2) + 1
     gendz(i-1)   = gendz(i-2) + 27
  end do

  allocate(ind_fl(15))
  cnt = 1
  do i = 1, 16
     if ( (i-1) == irank ) cycle
     ind_fl(cnt) = (i - 1)
     cnt = cnt + 1
  end do
  cnt = 1
  do i = 1, 16
     if ( (i-1) == irank ) cycle
     ind_fl(cnt) = (i - 1)
     cnt = cnt + 1
  end do

  !*_________new datatype__________________
  allocate(ddt_recv(16),ddt_send(16))
  allocate(blen(60), disp(60))
  call mpi_type_size(MPI_REAL,typesize,ierr)

  do i = 1, 15
     call mpi_type_contiguous(3240,MPI_REAL, &
          ddt_send(i),ierr)
     call mpi_type_commit(ddt_send(i),ierr)

     srecv2 = (gstartz(ind_fl(i))-1)*2+1
     erecv2 = gendz(ind_fl(i))*2
     blen(:) = erecv2 - srecv2 + 1
     do j = 1, 60
        disp(j) = (j-1)*(852) + srecv2 - 1
     end do

     call mpi_type_indexed(60,blen,disp,MPI_REAL, &
          ddt_recv(i),ierr)
     call mpi_type_commit(ddt_recv(i),ierr)
     old = ddt_recv(i)
     call mpi_type_create_resized(old,int(0,kind=MPI_ADDRESS_KIND),&
          int(51120*typesize,kind=MPI_ADDRESS_KIND),&
          ddt_recv(i),ierr)
     call mpi_type_free(old,ierr)
     call mpi_type_commit(ddt_recv(i),ierr)

  end do


  allocate(tmp_array(852,60,40))
  allocate(tmp_in(54,60,640))
  tmp_array = 0.0_r4
  tmp_in = 0.0_r4

  ssend = gstart1(irank)
  esend = gend1(irank)
  cnt = 0

  do i = 1, 15
     srecv = gstart1(ind_fl(i))
     erecv = gend1(ind_fl(i))

     ! Element counts for the mpi_irecv/mpi_isend calls (in datatype units, not bytes)
     count_send = erecv - srecv + 1
     count_recv = esend - ssend + 1
     cnt = cnt + 1

     call mpi_irecv(tmp_array,count_recv,ddt_recv(i), &
          ind_fl(i),ind_fl(i),MPI_COMM_WORLD,mpi_requests(cnt),ierr)

     cnt = cnt + 1
     call mpi_isend(tmp_in(:,:,srecv:erecv), &
          count_send,ddt_send(i),ind_fl(i), &
          irank,MPI_COMM_WORLD,mpi_requests(cnt),ierr)

  end do

  call mpi_waitall(cnt,mpi_requests(1:cnt),mpi_status_arr(:,1:cnt),ierr)

  if (ierr /=  MPI_SUCCESS) then
     do i = 1,cnt
        errorcode = mpi_status_arr(MPI_ERROR,i)
        if (errorcode /= 0 .AND. errorcode /= MPI_ERR_PENDING) then
           call MPI_Error_string(errorcode,string,resultlen,mpierr)
           print *, "rank: ",irank, string
           !call MPI_Abort(MPI_COMM_WORLD,errorcode,ierr)
        end if
     end do
  end if

  deallocate(tmp_array)
  deallocate(tmp_in)

  print *, "great success"

  call MPI_FINALIZE(ierr)

end program nbtest

 

Running gdb on one of the processes during the deadlock:

 

(gdb) bt

#0  0x00002acb4c6bf733 in __select_nocancel () from /lib64/libc.so.6

#1  0x00002acb4b496a2e in MPID_nem_tcp_connpoll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#2  0x00002acb4b496048 in MPID_nem_tcp_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#3  0x00002acb4b350020 in MPID_nem_network_poll () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#4  0x00002acb4b0cc5f2 in PMPIDI_CH3I_Progress () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#5  0x00002acb4b50328f in PMPI_Waitall () from /opt/intel/psxe_runtime_2017.4.196/linux/mpi/intel64/lib/libmpi.so.12

#6  0x00002acb4ad1d53f in pmpi_waitall_ (v1=0x1e, v2=0xb0c320, v3=0x0, ierr=0x2acb4c6bf733 <__select_nocancel+10>) at ../../src/binding/fortran/mpif_h/waitallf.c:275

#7  0x00000000004064b0 in MAIN__ ()

#8  0x000000000040331e in main ()

 

Output log after I kill the job:

$ mpirun -n 16 ./test.x

forrtl: error (78): process killed (SIGTERM)

Image              PC                Routine            Line        Source

test.x             000000000040C12A  Unknown               Unknown  Unknown

libpthread-2.17.s  00002BA8B42F95A0  Unknown               Unknown  Unknown

libmpi.so.12       00002BA8B3303EBF  PMPIDI_CH3I_Progr     Unknown  Unknown

libmpi.so.12       00002BA8B373B28F  PMPI_Waitall          Unknown  Unknown

libmpifort.so.12.  00002BA8B2F5553F  pmpi_waitall          Unknown  Unknown

test.x             00000000004064B0  MAIN__                    129  nbtest.f90

test.x             000000000040331E  Unknown               Unknown  Unknown

libc-2.17.so       00002BA8B4829C05  __libc_start_main     Unknown  Unknown

test.x             0000000000403229  Unknown               Unknown  Unknown

(repeated 15 times, once for each of the other processes)

Output with I_MPI_DEBUG = 6

[0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 3  Build 20170405 (id: 17193)

[0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.

[0] MPI startup(): Multi-threaded optimized library

[12] MPI startup(): cannot open dynamic library libdat2.so.2

[7] MPI startup(): cannot open dynamic library libdat2.so.2

[10] MPI startup(): cannot open dynamic library libdat2.so.2

[13] MPI startup(): cannot open dynamic library libdat2.so.2

[4] MPI startup(): cannot open dynamic library libdat2.so.2

[9] MPI startup(): cannot open dynamic library libdat2.so.2

[14] MPI startup(): cannot open dynamic library libdat2.so.2

[5] MPI startup(): cannot open dynamic library libdat2.so.2

[11] MPI startup(): cannot open dynamic library libdat2.so.2

[15] MPI startup(): cannot open dynamic library libdat2.so.2

[6] MPI startup(): cannot open dynamic library libdat2.so.2

[8] MPI startup(): cannot open dynamic library libdat2.so.2

[0] MPI startup(): cannot open dynamic library libdat2.so.2

[3] MPI startup(): cannot open dynamic library libdat2.so.2

[2] MPI startup(): cannot open dynamic library libdat2.so.2

[4] MPI startup(): cannot open dynamic library libdat2.so

[7] MPI startup(): cannot open dynamic library libdat2.so

[8] MPI startup(): cannot open dynamic library libdat2.so

[9] MPI startup(): cannot open dynamic library libdat2.so

[6] MPI startup(): cannot open dynamic library libdat2.so

[10] MPI startup(): cannot open dynamic library libdat2.so

[13] MPI startup(): cannot open dynamic library libdat2.so

[0] MPI startup(): cannot open dynamic library libdat2.so

[15] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat2.so

[12] MPI startup(): cannot open dynamic library libdat2.so

[4] MPI startup(): cannot open dynamic library libdat.so.1

[14] MPI startup(): cannot open dynamic library libdat2.so

[7] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat2.so

[8] MPI startup(): cannot open dynamic library libdat.so.1

[1] MPI startup(): cannot open dynamic library libdat2.so.2

[6] MPI startup(): cannot open dynamic library libdat.so.1

[9] MPI startup(): cannot open dynamic library libdat.so.1

[10] MPI startup(): cannot open dynamic library libdat.so.1

[0] MPI startup(): cannot open dynamic library libdat.so.1

[12] MPI startup(): cannot open dynamic library libdat.so.1

[4] MPI startup(): cannot open dynamic library libdat.so

[11] MPI startup(): cannot open dynamic library libdat2.so

[3] MPI startup(): cannot open dynamic library libdat.so.1

[13] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat.so.1

[15] MPI startup(): cannot open dynamic library libdat.so.1

[5] MPI startup(): cannot open dynamic library libdat.so

[7] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat2.so

[9] MPI startup(): cannot open dynamic library libdat.so

[8] MPI startup(): cannot open dynamic library libdat.so

[11] MPI startup(): cannot open dynamic library libdat.so.1

[6] MPI startup(): cannot open dynamic library libdat.so

[10] MPI startup(): cannot open dynamic library libdat.so

[14] MPI startup(): cannot open dynamic library libdat.so.1

[11] MPI startup(): cannot open dynamic library libdat.so

[13] MPI startup(): cannot open dynamic library libdat.so

[15] MPI startup(): cannot open dynamic library libdat.so

[12] MPI startup(): cannot open dynamic library libdat.so

[0] MPI startup(): cannot open dynamic library libdat.so

[14] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat.so.1

[3] MPI startup(): cannot open dynamic library libdat.so

[1] MPI startup(): cannot open dynamic library libdat.so

[2] MPI startup(): cannot open dynamic library libdat2.so

[2] MPI startup(): cannot open dynamic library libdat.so.1

[2] MPI startup(): cannot open dynamic library libdat.so

[4] MPI startup(): cannot load default tmi provider

[7] MPI startup(): cannot load default tmi provider

[5] MPI startup(): cannot load default tmi provider

[9] MPI startup(): cannot load default tmi provider

[0] MPI startup(): cannot load default tmi provider

[6] MPI startup(): cannot load default tmi provider

[10] MPI startup(): cannot load default tmi provider

[3] MPI startup(): cannot load default tmi provider

[15] MPI startup(): cannot load default tmi provider

[8] MPI startup(): cannot load default tmi provider

[1] MPI startup(): cannot load default tmi provider

[14] MPI startup(): cannot load default tmi provider

[11] MPI startup(): cannot load default tmi provider

[2] MPI startup(): cannot load default tmi provider

[12] MPI startup(): cannot load default tmi provider

[13] MPI startup(): cannot load default tmi provider

[12] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[4] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

[9] ERROR - load_iblibrary(): [15] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[5] ERROR - load_iblibrary(): [0] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[10] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

[1] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

 

[3] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[13] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[7] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[2] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[6] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[8] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[11] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[14] ERROR - load_iblibrary(): Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

Can't open IB verbs library: libibverbs.so.1: cannot open shared object file: No such file or directory

 

[0] MPI startup(): shm and tcp data transfer modes

[1] MPI startup(): shm and tcp data transfer modes

[2] MPI startup(): shm and tcp data transfer modes

[3] MPI startup(): shm and tcp data transfer modes

[4] MPI startup(): shm and tcp data transfer modes

[5] MPI startup(): shm and tcp data transfer modes

[7] MPI startup(): shm and tcp data transfer modes

[9] MPI startup(): shm and tcp data transfer modes

[8] MPI startup(): shm and tcp data transfer modes

[6] MPI startup(): shm and tcp data transfer modes

[10] MPI startup(): shm and tcp data transfer modes

[11] MPI startup(): shm and tcp data transfer modes

[12] MPI startup(): shm and tcp data transfer modes

[13] MPI startup(): shm and tcp data transfer modes

[14] MPI startup(): shm and tcp data transfer modes

[15] MPI startup(): shm and tcp data transfer modes

[0] MPI startup(): Device_reset_idx=1

[0] MPI startup(): Allgather: 4: 1-4 & 0-4

[0] MPI startup(): Allgather: 1: 5-11 & 0-4

[0] MPI startup(): Allgather: 4: 12-28 & 0-4

[0] MPI startup(): Allgather: 1: 29-1694 & 0-4


[0] MPI startup(): Allgather: 4: 1695-3413 & 0-4

[0] MPI startup(): Allgather: 1: 3414-513494 & 0-4

[0] MPI startup(): Allgather: 3: 513495-1244544 & 0-4

[0] MPI startup(): Allgather: 4: 0-2147483647 & 0-4

[0] MPI startup(): Allgather: 4: 1-16 & 5-16

[0] MPI startup(): Allgather: 1: 17-38 & 5-16

[0] MPI startup(): Allgather: 3: 0-2147483647 & 5-16

[0] MPI startup(): Allgather: 4: 1-8 & 17-2147483647

[0] MPI startup(): Allgather: 1: 9-23 & 17-2147483647

[0] MPI startup(): Allgather: 4: 24-35 & 17-2147483647

[0] MPI startup(): Allgather: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Allgatherv: 1: 0-3669 & 0-4

[0] MPI startup(): Allgatherv: 4: 3669-4949 & 0-4

[0] MPI startup(): Allgatherv: 1: 4949-17255 & 0-4

[0] MPI startup(): Allgatherv: 4: 17255-46775 & 0-4

[0] MPI startup(): Allgatherv: 3: 46775-836844 & 0-4

[0] MPI startup(): Allgatherv: 4: 0-2147483647 & 0-4

[0] MPI startup(): Allgatherv: 4: 0-10 & 5-16

[0] MPI startup(): Allgatherv: 1: 10-38 & 5-16

[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 5-16

[0] MPI startup(): Allgatherv: 4: 0-8 & 17-2147483647

[0] MPI startup(): Allgatherv: 1: 8-21 & 17-2147483647

[0] MPI startup(): Allgatherv: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Allreduce: 5: 0-6 & 0-8

[0] MPI startup(): Allreduce: 7: 6-11 & 0-8

[0] MPI startup(): Allreduce: 5: 11-26 & 0-8

[0] MPI startup(): Allreduce: 4: 26-43 & 0-8

[0] MPI startup(): Allreduce: 5: 43-99 & 0-8

[0] MPI startup(): Allreduce: 1: 99-176 & 0-8

[0] MPI startup(): Allreduce: 6: 176-380 & 0-8

[0] MPI startup(): Allreduce: 2: 380-2967 & 0-8

[0] MPI startup(): Allreduce: 1: 2967-9460 & 0-8

[0] MPI startup(): Allreduce: 2: 0-2147483647 & 0-8

[0] MPI startup(): Allreduce: 5: 0-95 & 9-16

[0] MPI startup(): Allreduce: 1: 95-301 & 9-16

[0] MPI startup(): Allreduce: 2: 301-2577 & 9-16

[0] MPI startup(): Allreduce: 6: 2577-5427 & 9-16

[0] MPI startup(): Allreduce: 1: 5427-10288 & 9-16

[0] MPI startup(): Allreduce: 2: 0-2147483647 & 9-16

[0] MPI startup(): Allreduce: 6: 0-6 & 17-2147483647

[0] MPI startup(): Allreduce: 5: 6-11 & 17-2147483647

[0] MPI startup(): Allreduce: 6: 11-452 & 17-2147483647

[0] MPI startup(): Allreduce: 2: 452-2639 & 17-2147483647

[0] MPI startup(): Allreduce: 6: 2639-5627 & 17-2147483647

[0] MPI startup(): Allreduce: 1: 5627-9956 & 17-2147483647

[0] MPI startup(): Allreduce: 2: 9956-2587177 & 17-2147483647

[0] MPI startup(): Allreduce: 3: 0-2147483647 & 17-2147483647


[0] MPI startup(): Alltoall: 4: 1-16 & 0-8

[0] MPI startup(): Alltoall: 1: 17-69 & 0-8

[0] MPI startup(): Alltoall: 2: 70-1024 & 0-8

[0] MPI startup(): Alltoall: 2: 1024-52228 & 0-8

[0] MPI startup(): Alltoall: 4: 52229-74973 & 0-8

[0] MPI startup(): Alltoall: 2: 74974-131148 & 0-8

[0] MPI startup(): Alltoall: 3: 131149-335487 & 0-8

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 0-8

[0] MPI startup(): Alltoall: 4: 1-16 & 9-16

[0] MPI startup(): Alltoall: 1: 17-40 & 9-16

[0] MPI startup(): Alltoall: 2: 41-497 & 9-16

[0] MPI startup(): Alltoall: 1: 498-547 & 9-16

[0] MPI startup(): Alltoall: 2: 548-1024 & 9-16

[0] MPI startup(): Alltoall: 2: 1024-69348 & 9-16

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 9-16

[0] MPI startup(): Alltoall: 4: 0-1 & 17-2147483647

[0] MPI startup(): Alltoall: 1: 2-4 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 5-24 & 17-2147483647

[0] MPI startup(): Alltoall: 2: 25-1024 & 17-2147483647

[0] MPI startup(): Alltoall: 2: 1024-20700 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 20701-57414 & 17-2147483647

[0] MPI startup(): Alltoall: 3: 57415-66078 & 17-2147483647

[0] MPI startup(): Alltoall: 4: 0-2147483647 & 17-2147483647

[0] MPI startup(): Alltoallv: 2: 0-2147483647 & 0-2147483647

[0] MPI startup(): Alltoallw: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Barrier: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Bcast: 4: 1-29 & 0-8

[0] MPI startup(): Bcast: 7: 30-37 & 0-8

[0] MPI startup(): Bcast: 4: 38-543 & 0-8

[0] MPI startup(): Bcast: 6: 544-1682 & 0-8

[0] MPI startup(): Bcast: 4: 1683-2521 & 0-8

[0] MPI startup(): Bcast: 6: 2522-30075 & 0-8

[0] MPI startup(): Bcast: 7: 30076-34889 & 0-8

[0] MPI startup(): Bcast: 4: 34890-131072 & 0-8

[0] MPI startup(): Bcast: 6: 131072-409051 & 0-8

[0] MPI startup(): Bcast: 7: 0-2147483647 & 0-8

[0] MPI startup(): Bcast: 4: 1-13 & 9-2147483647

[0] MPI startup(): Bcast: 1: 14-25 & 9-2147483647

[0] MPI startup(): Bcast: 4: 26-691 & 9-2147483647

[0] MPI startup(): Bcast: 6: 692-2367 & 9-2147483647

[0] MPI startup(): Bcast: 4: 2368-7952 & 9-2147483647

[0] MPI startup(): Bcast: 6: 7953-10407 & 9-2147483647

[0] MPI startup(): Bcast: 4: 10408-17900 & 9-2147483647

[0] MPI startup(): Bcast: 6: 17901-36385 & 9-2147483647

[0] MPI startup(): Bcast: 7: 36386-131072 & 9-2147483647

[0] MPI startup(): Bcast: 7: 0-2147483647 & 9-2147483647

[0] MPI startup(): Exscan: 0: 0-2147483647 & 0-2147483647


[0] MPI startup(): Gather: 2: 1-3 & 0-8

[0] MPI startup(): Gather: 3: 4-4 & 0-8

[0] MPI startup(): Gather: 2: 5-66 & 0-8

[0] MPI startup(): Gather: 3: 67-174 & 0-8

[0] MPI startup(): Gather: 2: 175-478 & 0-8

[0] MPI startup(): Gather: 3: 479-531 & 0-8

[0] MPI startup(): Gather: 2: 532-2299 & 0-8

[0] MPI startup(): Gather: 3: 0-2147483647 & 0-8

[0] MPI startup(): Gather: 2: 1-141 & 9-16

[0] MPI startup(): Gather: 3: 142-456 & 9-16

[0] MPI startup(): Gather: 2: 457-785 & 9-16

[0] MPI startup(): Gather: 3: 786-70794 & 9-16

[0] MPI startup(): Gather: 2: 70795-254351 & 9-16

[0] MPI startup(): Gather: 3: 0-2147483647 & 9-16

[0] MPI startup(): Gather: 2: 1-89 & 17-2147483647

[0] MPI startup(): Gather: 3: 90-472 & 17-2147483647

[0] MPI startup(): Gather: 2: 473-718 & 17-2147483647

[0] MPI startup(): Gather: 3: 719-16460 & 17-2147483647

[0] MPI startup(): Gather: 2: 0-2147483647 & 17-2147483647

[0] MPI startup(): Gatherv: 2: 0-2147483647 & 0-16

[0] MPI startup(): Gatherv: 2: 0-2147483647 & 17-2147483647


[0] MPI startup(): Reduce_scatter: 5: 0-5 & 0-4

[0] MPI startup(): Reduce_scatter: 1: 5-192 & 0-4

[0] MPI startup(): Reduce_scatter: 3: 192-349 & 0-4

[0] MPI startup(): Reduce_scatter: 1: 349-3268 & 0-4

[0] MPI startup(): Reduce_scatter: 3: 3268-71356 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 71356-513868 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 513868-731452 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 731452-1746615 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 1746615-2485015 & 0-4

[0] MPI startup(): Reduce_scatter: 2: 0-2147483647 & 0-4

[0] MPI startup(): Reduce_scatter: 5: 0-5 & 5-16

[0] MPI startup(): Reduce_scatter: 1: 5-59 & 5-16

[0] MPI startup(): Reduce_scatter: 5: 59-99 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 99-198 & 5-16

[0] MPI startup(): Reduce_scatter: 1: 198-360 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 360-3606 & 5-16

[0] MPI startup(): Reduce_scatter: 2: 3606-4631 & 5-16

[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 5-16

[0] MPI startup(): Reduce_scatter: 5: 0-22 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 1: 22-44 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 5: 44-278 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 3: 278-3517 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 5: 3517-4408 & 17-2147483647

[0] MPI startup(): Reduce_scatter: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Reduce: 4: 4-5 & 0-4

[0] MPI startup(): Reduce: 1: 6-59 & 0-4

[0] MPI startup(): Reduce: 2: 60-188 & 0-4

[0] MPI startup(): Reduce: 6: 189-362 & 0-4

[0] MPI startup(): Reduce: 2: 363-7776 & 0-4

[0] MPI startup(): Reduce: 5: 7777-151371 & 0-4

[0] MPI startup(): Reduce: 1: 0-2147483647 & 0-4

[0] MPI startup(): Reduce: 4: 4-60 & 5-16

[0] MPI startup(): Reduce: 3: 61-88 & 5-16

[0] MPI startup(): Reduce: 4: 89-245 & 5-16

[0] MPI startup(): Reduce: 3: 246-256 & 5-16

[0] MPI startup(): Reduce: 4: 257-8192 & 5-16

[0] MPI startup(): Reduce: 3: 8192-1048576 & 5-16

[0] MPI startup(): Reduce: 3: 0-2147483647 & 5-16

[0] MPI startup(): Reduce: 4: 4-8192 & 17-2147483647

[0] MPI startup(): Reduce: 3: 8192-1048576 & 17-2147483647

[0] MPI startup(): Reduce: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647

[0] MPI startup(): Scatter: 2: 1-7 & 0-16

[0] MPI startup(): Scatter: 3: 8-9 & 0-16

[0] MPI startup(): Scatter: 2: 10-64 & 0-16

[0] MPI startup(): Scatter: 3: 65-372 & 0-16

[0] MPI startup(): Scatter: 2: 373-811 & 0-16

[0] MPI startup(): Scatter: 3: 812-115993 & 0-16

[0] MPI startup(): Scatter: 2: 115994-173348 & 0-16

[0] MPI startup(): Scatter: 3: 0-2147483647 & 0-16

[0] MPI startup(): Scatter: 1: 1-1 & 17-2147483647

[0] MPI startup(): Scatter: 2: 2-76 & 17-2147483647

[0] MPI startup(): Scatter: 3: 77-435 & 17-2147483647

[0] MPI startup(): Scatter: 2: 436-608 & 17-2147483647

[0] MPI startup(): Scatter: 3: 0-2147483647 & 17-2147483647

[0] MPI startup(): Scatterv: 1: 0-2147483647 & 0-2147483647

[5] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[1] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[7] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[2] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[6] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[3] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[13] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[4] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[9] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[14] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[11] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[15] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[8] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[12] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[10] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[0] MPI startup(): Rank    Pid      Node name      Pin cpu

[0] MPI startup(): 0       10691    ip-10-0-0-189  0

[0] MPI startup(): 1       10692    ip-10-0-0-189  1

[0] MPI startup(): 2       10693    ip-10-0-0-189  2

[0] MPI startup(): 3       10694    ip-10-0-0-189  3

[0] MPI startup(): 4       10320    ip-10-0-0-174  0

[0] MPI startup(): 5       10321    ip-10-0-0-174  1

[0] MPI startup(): 6       10322    ip-10-0-0-174  2

[0] MPI startup(): 7       10323    ip-10-0-0-174  3

[0] MPI startup(): 8       10273    ip-10-0-0-104  0

[0] MPI startup(): 9       10274    ip-10-0-0-104  1

[0] MPI startup(): 10      10275    ip-10-0-0-104  2

[0] MPI startup(): 11      10276    ip-10-0-0-104  3

[0] MPI startup(): 12      10312    ip-10-0-0-158  0

[0] MPI startup(): 13      10313    ip-10-0-0-158  1

[0] MPI startup(): 14      10314    ip-10-0-0-158  2

[0] MPI startup(): 15      10315    ip-10-0-0-158  3

[0] MPI startup(): Recognition=2 Platform(code=32 ippn=2 dev=5) Fabric(intra=1 inter=6 flags=0x0)

[0] MPI startup(): I_MPI_DEBUG=6

[0] MPI startup(): I_MPI_HYDRA_UUID=bb290000-2b37-e5b2-065d-050000bd0a00

[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=1

[0] MPI startup(): I_MPI_PIN_MAPPING=4:0 0,1 1,2 2,3 3
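
For completeness, the next isolation step I plan to try (a sketch; the -ppn value is illustrative):

~~~
# All 16 ranks on one node work fine, so force the TCP path even for on-node ranks
# to see whether the tcp netmod itself, rather than the EC2 network, triggers the hang.
export I_MPI_FABRICS=tcp
export I_MPI_DEBUG=6
mpirun -n 16 -ppn 4 ./test.x
~~~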

shared memory initialization failure


Hi all,

Running our MPI application on a newly setup RHEL 7.3 system using SGE, we obtain the following error:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(805): fail failed
MPID_Init(1817)......: fail failed
MPIR_Comm_commit(711): fail failed
(unknown)(): Other MPI error

With I_MPI_DEBUG=1000 the following error is reported:

[0] I_MPI_Init_shm_colls_space(): Cannot create shm object: /shm-col-space-69142-2-55D0EBDD4B46E errno=Permission denied
[0] I_MPI_Init_shm_colls_space(): Something goes wrong in shared memory initialization (Permission denied)

Usually Intel MPI creates shm objects in /dev/shm. Does anybody know why the library tries to create them in /?
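
For reference, these are the /dev/shm checks worth running on the execution hosts (a sketch; SGE may also impose its own tmpdir or limits):

~~~
# The /shm-col-space-* object names are normally mapped to /dev/shm, so a
# "Permission denied" usually points at the mount options or permissions there.
mount | grep -w /dev/shm   # should be a tmpfs mount
ls -ld /dev/shm            # expected mode: drwxrwxrwt (world-writable with sticky bit)
df -h /dev/shm             # and not full
~~~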

Cheers,
Pieter

mpirun command does not distribute jobs to compute nodes


Dear Folks,

I have the Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.1.117 Build 20121010 on my system. I am trying to submit a job using mpirun to a machine with the following hosts:

weather
compute-0-0
compute-0-1
compute-0-2
compute-0-3
compute-0-4
compute-0-5
compute-0-6
compute-0-7

After running mpdboot (as mpdboot -v -n 9 -f ~/hostfile -r ssh), I use the command: mpirun -np 72 -f ~/hostfile ./wrf.exe &

After submitting the job, it fails with some error after 10-15 minutes. I checked top on the compute nodes and did not see any wrf.exe process running in the meantime. Please suggest whether I am making a mistake, or whether something else is preventing the jobs from running on the compute nodes.
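
For reference, the sanity checks I have added before the real run (a sketch; this assumes the MPD-based startup used by this Intel MPI version):

~~~
# Verify the MPD ring actually spans all hosts, then launch a trivial command
# across it before trying wrf.exe.
mpdboot -v -n 9 -f ~/hostfile -r ssh
mpdtrace -l            # should list weather and compute-0-0 ... compute-0-7
mpirun -np 9 hostname  # expect one line per host
~~~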

Thank you in anticipation.

Dhirendra

ITAC -- Naming generated .stf file to differentiate runs


Hello, 

I am using ITAC from the 2017.05 Intel Parallel Cluster Studio. I issue a number of mpirun command lines with ITAC tracing enabled, and I am trying to assign specific names to the generated .stf files so that I can associate the .stf files of a particular run with the corresponding mpirun command.

How can I do this? 

Is there any option as we have for the statistics with the I_MPI_STATS_FILE?

Can I do something like  

mpiexec.hydra ... -stf-file-name MPIapp_$(date +%F_%T) ... ./MPIapp 
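
Something I have been meaning to test myself (a sketch; I am assuming the Trace Collector honours its VT_LOGFILE_NAME / VT_LOGFILE_PREFIX settings, which I have not yet verified):

~~~
# Give each traced run its own trace name/directory instead of the default
# <binary>.stf, so the .stf files can be matched to the mpirun command later.
RUN_TAG=$(date +%F_%T)
mkdir -p "trace_${RUN_TAG}"
export VT_LOGFILE_PREFIX="trace_${RUN_TAG}"      # per-run output directory (assumption)
export VT_LOGFILE_NAME="MPIapp_${RUN_TAG}.stf"   # per-run trace name (assumption)
mpiexec.hydra -trace -n 64 ./MPIapp
~~~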

Thank you!

Michael

Intel MPI cross-OS launch error


Env:

node1: Windows 10 (192.168.137.1)

node2: Debian 8 virtual machine (192.168.137.3)

Test app: the test.cpp included with the Intel MPI package

 

1. Launch from the Windows side (node1), 1 process (node1 only):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -n 1 -host localhost test

get output:

node1:

 

Hello world: rank 0 of 1 running on DESKTOP-J4KRVVD

2. Launch from the Windows side (node1), 1 process (node2 only):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -host 192.168.137.3 -hostos linux -n 1 -path /opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/test test

get output:

node1:

Hello world: rank 0 of 1 running on vm-build-debian8

3. Launch from the Windows side (node1), two processes (1 on node1, 1 on node2):

mpiexec -demux select -bootstrap=service -genv I_MPI_FABRICS=shm:tcp -host 192.168.137.3 -hostos linux -n 1 -path /opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/test test : -n 1 -host localhost test

get error:

node1:

rank = 1, revents = 29, state = 1
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 2988: (it_plfd->revents & POLLERR) == 0
internal ABORT - process 0

 

node2:

[hydserv@vm-build-debian8] stdio_cb (../../tools/bootstrap/persist/persist_server.c:170): assert (!closed) failed
[hydserv@vm-build-debian8] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[hydserv@vm-build-debian8] main (../../tools/bootstrap/persist/persist_server.c:339): demux engine error waiting for event

 

If I try to turn on verbose output with -v or -genv I_MPI_HYDRA_DEBUG=on, even test 2 fails with the errors below, so I don't know what's wrong, or how to find out what's wrong:

node1:

[mpiexec@DESKTOP-J4KRVVD] STDIN will be redirected to 1 fd(s): 4

[mpiexec@DESKTOP-J4KRVVD] ..\hydra\utils\sock\sock.c (420): write error (Unknown error)
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\tools\bootstrap\persist\persist_launch.c (52): assert (sent == hdr.buflen) failed
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\tools\demux\demux_select.c (103): callback returned error status
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\pm\pmiserv\pmiserv_pmci.c (501): error waiting for event
[mpiexec@DESKTOP-J4KRVVD] ..\hydra\ui\mpich\mpiexec.c (1147): process manager error waiting for completion

node2:

[hydserv@vm-build-debian8] stdio_cb (../../tools/bootstrap/persist/persist_server.c:170): assert (!closed) failed
[hydserv@vm-build-debian8] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[hydserv@vm-build-debian8] main (../../tools/bootstrap/persist/persist_server.c:339): demux engine error waiting for event
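
In case it is relevant, the raw connectivity check I still want to do in both directions (a sketch; the port number is arbitrary and only meant to prove bidirectional TCP works between 192.168.137.1 and 192.168.137.3):

~~~
# On node2 (debian8), listen on an arbitrary port, then connect to it from node1;
# afterwards repeat the other way around. The POLLERR in socksm.c suggests one of
# the rank-to-rank TCP connections is being reset (VirtualBox networking/firewall?).
nc -l -p 5555   # netcat-traditional syntax; the OpenBSD variant uses "nc -l 5555"
~~~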

 

PBS system says: 'MPI startup(): ofa fabric is not available and fallback fabric is not enabled'


I've been using a PBS system for testing my code, and I have a PBS script to run my binary. But when I run it, I get:

> [0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled

And I read this site: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technolog...

However, those methods did not solve the problem. My code runs fine on the head node and on other nodes directly, but it cannot run through the PBS system.

What can I do about this problem?

Thanks.

P.S.

This is my PBS script:

#!/bin/sh

#PBS -N job_1
#PBS -l nodes=1:ppn=12
#PBS -o example.out
#PBS -e example.err
#PBS -l walltime=3600:00:00
#PBS -q default_queue

echo -e --------- `date` ----------

echo HomeDirectory is $PWD
echo
echo Current Dir is $PBS_O_WORKDIR
echo


cd $PBS_O_WORKDIR

echo "------------This is the node file -------------"
cat $PBS_NODEFILE
echo "-----------------------------------------------"

np=$(cat $PBS_NODEFILE | wc -l)
echo The number of core is $np
echo
echo

cat $PBS_NODEFILE > $PBS_O_WORKDIR/mpd.host

mpdtrace  >/dev/null 2>&1
if [ "$?" != "0" ]
then
        echo -e
        mpdboot -n 1 -f mpd.host -r ssh
fi

mpirun -np 12 ./run_test
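
For reference, the two variants I would try inside the PBS script in place of the plain mpirun line (a sketch; the tcp variant assumes the compute node is reachable over ethernet):

~~~
# Variant 1: skip OFA entirely and run over shared memory + TCP.
export I_MPI_FABRICS=shm:tcp
mpirun -np 12 ./run_test

# Variant 2: keep the default fabric but allow falling back when OFA is unavailable,
# and print which fabric was actually selected.
export I_MPI_FALLBACK=1
export I_MPI_DEBUG=5
mpirun -np 12 ./run_test
~~~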

 

Fatal error using MPI in Linux


Hi,

I'm using a virtual Linux Ubuntu machine (Linux-VirtualBox 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux) with 8GB RAM.

For a process in Matlab, the software requires the Intel MPI runtime package v4.1.3.045 or later. I've installed version 2018.1.163 instead, not being sure about the 2018 version number.

Using 8 cores for the processing, the software failed with the following error:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(224)...................: MPI_Recv(buf=0x7f566d59c040, count=9942500, MPI_FLOAT, src=3, tag=5, MPI_COMM_WORLD, status=0x7ffc43a72b60) failed
PMPIDI_CH3I_Progress(658).......: fail failed
MPID_nem_handle_pkt(1450).......: fail failed
pkt_RTS_handler(317)............: fail failed
do_cts(662).....................: fail failed
MPID_nem_lmt_dcp_start_recv(302): fail failed
dcp_recv(165)...................: Internal MPI error!  Cannot read from remote process
 Two workarounds have been identified for this issue:
 1) Enable ptrace for non-root users with:
    echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
 2) Or, use:
    I_MPI_SHM_LMT=shm
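
For the record, this is how I would apply the two workarounds quoted in the error message above before relaunching Matlab:

~~~
# Workaround 1 (from the error text): enable ptrace for non-root users, which the
# direct-copy large-message path appears to need.
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

# Workaround 2 (from the error text): or tell Intel MPI to use plain shared memory
# for large messages.
export I_MPI_SHM_LMT=shm
~~~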

 

Reducing the number of cores to 4, the process hangs for more than 3 hours and I'm not sure it is still working.

What could be the problem?

thank you

Pietro

 

Drastic reduction in performance when a compute node runs at half load


We have compute nodes with 24 cores (48 threads) and 64 GB RAM (2x32GB). When I run a sample code (matrix multiplication) on one of the compute nodes in one thread, it takes only 4 seconds. But when I start more runs (copies of the same program) on the same compute node, the time taken increases drastically. When the number of programs running reaches 24 (I use a maximum of 24 since only 24 physical cores are present), the time taken grows to around 40 seconds (10 times slower). When I checked the temperature, it was below 40 deg Celsius.

When I searched the Internet about this issue, I found some people saying it may be due to the transfer of data from RAM to the processor slowing down when many programs run. I was not satisfied with this explanation, because compute nodes are designed to run at maximum load without much decrease in performance. Also, we are using only 1 GB of memory even with 24 programs running. Since we are seeing a performance reduction of about 10x, I suspect the problem is something else.
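
A test that might separate scheduling effects from memory effects (a sketch; ./matmul stands in for the sample program):

~~~
# Pin each copy to its own physical core (0-23) so the 24 runs do not share a core
# or migrate between cores; if the slowdown persists with clean pinning, shared
# memory bandwidth or frequency scaling is the more likely cause than the scheduler.
for c in $(seq 0 23); do
    taskset -c "$c" ./matmul &
done
wait
~~~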

HPCC benchmark HPL results degrade as more cores are used


I have a 6-node cluster consisting of 12 cores per node with a total of 72 cores.

When running the HPCC benchmark on 6 cores - 1 core per node, 6 nodes - the HPL result is 1198.87 GFLOPS. However, running HPCC on all available cores of the 6-node cluster, for a total of 72 cores, the HPL result is 847.421 GFLOPS.

MPI Library Used: Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)

Options to mpiexec.hydra:
-print-rank-map
-pmi-noaggregate
-nolocal
-genvall
-genv I_MPI_DEBUG 5
-genv I_MPI_HYDRA_IFACE ens2f0
-genv I_MPI_FABRICS shm:tcp
-n 72
-ppn 12
-ilp64
--hostname filename

Any ideas?
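
One experiment that might help locate the drop (a sketch; it assumes hpccinf.txt is re-sized appropriately for each process count, and reuses the options from the command line above, including the host file):

~~~
# Sweep processes-per-node to see where the per-core efficiency starts to fall off
# over shm:tcp, keeping one rank per core and the same interface binding.
for ppn in 1 2 4 8 12; do
    np=$((6 * ppn))
    mpiexec.hydra -nolocal -genvall -genv I_MPI_DEBUG 5 \
        -genv I_MPI_FABRICS shm:tcp -genv I_MPI_HYDRA_IFACE ens2f0 \
        -n $np -ppn $ppn ./hpcc
    mv hpccoutf.txt hpccoutf_${np}cores.txt   # hpccoutf.txt is HPCC's default output file
done
~~~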

Thanks in advance.

 
