Channel: Clusters and HPC Technology

Intel MPI installation problem


Hi,

I am trying to install Intel MPI on Windows Server 2012 R2 SERVERSTANDARDCORE, but during installation I get error 1603, together with installation error 0x80040154: wixCreateInternetShortcuts: failed to create an instance of IUniformResourceLocatorW, and failed to create an Internet shortcut. Do you have any idea how I can troubleshoot this?

Thanks for your help,

Patrycja

 


No mpiicc or mpiifort with composer_xe/2016.0.109 ?


I started a new job and our company has composer_xe/2016.0.109. When I load the module I do not get the mpiicc or mpiifort compiler wrappers. Does one need the Cluster Edition for those?

Measuring data movement from DRAM to KNL memory


Dear All,

I am implementing and testing the LOBPCG algorithm on a KNL machine for some large sparse matrices. For the performance report, I need to measure how much data is transferred from DRAM to the KNL's memory. I am wondering if there is a simple way of doing this. Any help or ideas would be appreciated.
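One approach I am considering (a sketch only; it assumes VTune Amplifier is installed, and the binary name, arguments, and result directory below are placeholders) is to let VTune's memory-access analysis report the DDR and MCDRAM traffic for the run:

amplxe-cl -collect memory-access -r lobpcg_ma -- ./lobpcg_test matrix.mtx
amplxe-cl -report summary -r lobpcg_ma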

Regards,

Fazlay

BCAST error for message size greater than 2 GB


Hello,

I'm using Intel Fortran 16.0.1 and Intel MPI 5.1.3 and I'm getting an error with bcast as follows:

Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(2231)........: MPI_Bcast(buf=0x2b460bcc0040, count=547061260, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1798)...:
MPIR_Bcast(1826)........:
I_MPIR_Bcast_intra(2007): Failure during collective
MPIR_Bcast_intra(1592)..:
MPIR_Bcast_binomial(253): message sizes do not match across processes in the collective routine: Received -32766 but expected -2106722256

I'm broadcasting an integer array (4-byte elements) of size 547,061,260. Is there an upper limit on the message size? The bcast works fine for smaller counts.
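(For what it's worth: 547,061,260 elements * 4 bytes = 2,188,245,040 bytes, just above the 2,147,483,647 bytes a signed 32-bit byte count can hold, and 2,188,245,040 - 2^32 = -2,106,722,256, which is exactly the "expected" value in the error stack. So an internal 32-bit overflow of the byte count looks plausible, and splitting the broadcast into chunks below 2 GB might be a workaround worth trying.)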

Thanks!

MPI_Alltoall error when running more than 2 cores per node


We have 6 Intel(R) Xeon(R) CPU D-1557 @ 1.50GHz nodes, each containing 12 cores. hpcc version 1.5.0 has been compiled with Intel's MPI and MKL. We are able to run hpcc successfully when configuring mpirun for 6 nodes and 2 cores per node. However, attempting to specify more than 2 cores per node (we have 12) causes the error "invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204".

Any ideas as to what could be causing this issue?

The following environment variables have been set:
I_MPI_FABRICS=tcp
I_MPI_DEBUG=5
I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11

The MPI library version is:
Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)

hosts.txt contains a list of 6 hostnames

The line below shows how mpirun is specified to execute hpcc on all 6 nodes, 3 cores per node:
mpirun -print-rank-map -n 18 -ppn 3  --hostfile hosts.txt  hpcc

INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in MPIR_Alltoall_intra:204
Fatal error in PMPI_Alltoall: Other MPI error, error stack:
PMPI_Alltoall(974)......: MPI_Alltoall(sbuf=0x7fcdb107f010, scount=2097152, dtype=USER<contig>, rbuf=0x7fcdd1080010, rcount=2097152, dtype=USER<contig>, comm=0x84000004) failed
MPIR_Alltoall_impl(772).: fail failed
MPIR_Alltoall(731)......: fail failed
MPIR_Alltoall_intra(204): fail failed
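One experiment we have not tried yet (a sketch, based on the I_MPI_ADJUST controls documented in the Intel MPI Developer Reference; value 2 should select the isend/irecv implementation, if I read the table correctly) is forcing a different Alltoall algorithm and rerunning:

export I_MPI_ADJUST_ALLTOALL=2
mpirun -print-rank-map -n 18 -ppn 3 --hostfile hosts.txt hpcc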

Thanks!

 

cluster error: /mpi/intel64/bin/pmi_proxy: No such file or directory found


Hi,

I've installed Intel Parallel Studio Cluster Edition in the single-node installation configuration on the master node of a cluster of 8 nodes with 8 processors each. I performed the prerequisite steps before installation and verified shell connectivity by running .sshconnectivity and creating the machines.LINUX file, which reported that all 8 nodes were found:

*******************************************************************************
Node count = 8
Secure shell connectivity was established on all nodes.
See the log output listing "/tmp/sshconnectivity.aditya.log" for details.
Version number: $Revision: 259 $
Version date: $Date: 2012-06-11 23:26:12 +0400 (Mon, 11 Jun 2012) $
*******************************************************************************

machines.LINUX file has the following hostnames:

octopus100.ubi.pt
compute-0-0.local 
compute-0-1.local 
compute-0-2.local 
compute-0-3.local 
compute-0-4.local 
compute-0-5.local 
compute-0-6.local 

I started the installation and installed all the modules in the /export/apps/intel directory, which can be accessed by all nodes, as suggested by the administrator of the cluster. After completing the installation I added the compiler environment scripts psxevars.sh and mpivars.sh to my bash startup script, as advised in the getting-started manual. I then prepared the hostfile with all the nodes of the cluster for running in the MPI environment and verified shell connectivity by running .sshconnectivity from the installation directory; it worked as before and detected all nodes successfully.

I wanted to check the cluster configuration, so I compiled and executed the test.c program in the mpi/test directory of the installation. It compiled fine, but when I executed myprog it returned the error /mpi/intel64/bin/pmi_proxy: No such file or directory, as follows:

[aditya@octopus100 Desktop]$ mpiicc -o myprog test.c
[aditya@octopus100 Desktop]$ mpirun -n 2 -ppn 1 -f ./hostfile ./myprog
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

Later I referred to the troubleshooting manual; it suggested running a non-MPI command such as hostname, and that returned the same error:

[aditya@octopus100 Desktop]$ mpirun -ppn 1 -n 2 -hosts compute-0-0.local, compute-0-1.local hostname
Intel(R) Parallel Studio XE 2017 Update 4 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.
bash: /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy: No such file or directory
^C[mpiexec@octopus100.ubi.pt] Sending Ctrl-C to processes as requested
[mpiexec@octopus100.ubi.pt] Press Ctrl-C again to force abort
[mpiexec@octopus100.ubi.pt] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@octopus100.ubi.pt] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@octopus100.ubi.pt] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@octopus100.ubi.pt] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@octopus100.ubi.pt] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@octopus100.ubi.pt] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

When I included the master node octopus100.ubi.pt it worked, but only for that node; the rest of the nodes do not seem able to run the MPI commands. I think it may be an environment problem, as the cluster nodes are not able to perform MPI communications with the master node.
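As a quick check (a minimal sketch; it assumes the same install prefix on every node and reuses my existing hostfile), I could verify whether pmi_proxy is actually visible from each compute node:

for h in $(cat ./hostfile); do
    echo "== $h"
    ssh "$h" ls -l /export/apps/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin/pmi_proxy
done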

Please help me resolve this issue so that I can perform some simulations on the cluster.

Thanks,

Aditya

 

ScaLAPACK raises an error under certain circumstances


Dear All,

I am using Intel MPI + ifort + MKL to compile Quantum ESPRESSO 6.1. Everything works fine except for invoking ScaLAPACK routines: calls to PDPOTRF may exit with a non-zero error code under certain circumstances. In one example, with 2 nodes * 8 processors per node the program works, but with 4 nodes * 4 processors per node it fails. If I_MPI_DEBUG is used, the failing case prints the following messages just before the call exits with code 970, while the working case prints no such messages:

[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676900, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675640, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x26742b8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676b58, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x26769c8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676c20, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675fa0, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676068, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676a90, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676e78, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2678778, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675898, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675a28, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675bb8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674f38, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676ce8, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2676130, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674768, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674448, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2674b50, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675e10, operation 2, size 12272, lkey 1879682311
[10#18754:18754@node09] MRAILI_Ext_sendq_send(): rail 0,vbuf 0x2675708, operation 2, size 2300, lkey 1879682311

Could you provide any suggestions about the possible cause here? Thanks very much.

Feng

Regarding cluster_sparse_solver


I am Mehdi and this is my first time using this forum.

I need to use cluster_sparse_solver in my Fortran finite element program. Because the number of degrees of freedom of my system is very high (around 10^6), the number of nonzero entries in the stiffness matrix (A in Ax=B) will also be very high, to the point that I cannot store the number of non-zeros in a default integer(4) and must use integer(8). Therefore, the parameter ia (the row index array of the sparse matrix) must be integer(8).

In this situation, how should I compile my program? I have tried linking against both the 4-byte and the 8-byte integer (LP64 and ILP64) libraries, and neither works. Should I declare all of the integers in my program as integer(8)? If ia is integer(8) and ja is integer(4), is it possible to compile the program?
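For reference, the kind of build line I have been attempting for the 8-byte-integer case looks roughly like this (a sketch only; fem_solver.f90 is a placeholder name, and the exact library list should really come from the MKL Link Line Advisor):

mpiifort -i8 fem_solver.f90 -I${MKLROOT}/include \
    -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core \
    -lmkl_blacs_intelmpi_ilp64 -liomp5 -lpthread -lm -ldl

My understanding (possibly wrong) is that the ILP64 interface expects every integer argument, ia and ja included, to be 8 bytes, so a mixed integer(8)/integer(4) call would not match either interface.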

Please help me. I can provide any more information you may need.

Bests

Mehdi


install Intel Studio Cluster Edition after installing Composer Edition


Hi all,

I am a student and have been using Intel Parallel Studio XE Composer Edition for the past year. Recently I realized that Intel® Trace Analyzer and Collector is also available to students with the Cluster Edition. I wish to install only this tool without having to uninstall my previous installation of the Composer Edition. When I attempt to customize my installation I get the following error message:

Product files will be installed here:

/opt/intel/

"The install directory path cannot be changed because at least one software product component was detected as having already been installed on the system".

Any help on how I can solve this issue and install only the Trace Analyzer would be highly appreciated.

Note that I have many jobs running at the moment, which I assume will be killed if I uninstall the current Studio version, so I would much prefer not to kill my jobs.

 

 

Notification of a failed (dead) node when using PSM2


Hello everyone,

I am writing because I am currently implementing a failure-recovery system for a cluster with Intel Omni-Path that will be used for handling computations in a physics experiment. What I want to implement is a mechanism to detect a node that has failed and to notify the rest of the nodes. I tried to check for node failure by invoking psm2_poll. Unfortunately, as I saw in the Intel® Performance Scaled Messaging 2 (PSM2) Programmer's Guide, this function does not return values other than OK or OK_NO_PROGRESS (at least that is what I have observed in my application: polling a dead node behaves as if the node had not failed/disconnected and simply had not sent any messages).

So the question is: what are the methods of notifying other nodes after a node failure? Is there a lightweight function that I can invoke along with poll to check whether the node from which I am trying to get messages still exists?

In the worst case I can implement this using a counter and a timeout, but if there is a mechanism supported by the API, I am open to it.

Best Regards

Compatibility of Intel Parallel Studio XE 2017 and Rocks Cluster 6.2


Hi, I would like to know whether Intel Parallel Studio XE 2017 (Cluster Edition) and Rocks Cluster 6.2 are compatible (i.e., can be installed without problems). I have a cluster with 1 head node and 6 compute nodes. Also, is one license enough for the whole cluster (to run things with Intel MPI)?

Thanks!

Pablo

how to run coarray programs with SLURM


I'm trying to port old (working) PBS scripts to SLURM, but not succeeding.

Can anybody share a working SLURM submission script for coarray distributed memory, please?

The best I can do is:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --tasks-per-node=28
#SBATCH --time=00:30:00
#SBATCH -p test

EXE=testABW.x
export I_MPI_COLL_INTRANODE=pt2pt
module load intel/2017.01
cd $SLURM_SUBMIT_DIR
echo -genvall -genv -np ${SLURM_NTASKS} ./$EXE > xx14.conf
srun -N ${SLURM_NNODES} --nodelist=${SLURM_NODELIST} /bin/hostname > nodes
./$EXE

 

which gives me:

Lmod has detected the following error: The following module(s) are unknown:"languages/intel/2017.01"

Please check the spelling or version number. Also try "module spider ..."

Error in system call pthread_mutex_unlock: Operation not permitted
    ../../src/mpid/ch3/channels/nemesis/netmod/tmi/tmi_poll.c:629
Error in system call pthread_mutex_unlock: Operation not permitted
    ../../src/mpid/ch3/channels/nemesis/netmod/tmi/tmi_poll.c:629
Error in system call pthread_mutex_unlock: Operation not permitted
    ../../src/mpid/ch3/channels/nemesis/netmod/tmi/tmi_poll.c:629
Error in system call pthread_mutex_unlock: Operation not permitted
    ../../src/mpid/ch3/channels/nemesis/netmod/tmi/tmi_poll.c:629

 

Thanks!

Intel MPI code fails with more than 1 node


For a simple MPI program compiled with Intel Parallel Studio 2016 and 2017, with the Intel compiler and Intel MPI, the jobs fail with the debug errors in the attached log. The code runs extremely slowly and gets stuck for about 30 seconds at one stage when run on 2 or more nodes. It runs smoothly without any problems on a single node. The same code compiled with gcc and OpenMPI runs smoothly without any problem on any number of nodes.

Do you know what might be the problem? Thanks.

Attachment: error.log (35.2 KB)

ssh console output stops working


This particular problem is likely not an Intel problem, but it may be one that someone here has experienced and has some advice on resolving.

I have a compute-intensive application that is written as MPI distributed, OpenMP threaded. I can run this program in my office directly (no MPI) or distributed (mpirun), on 1 or 2 nodes using MPI. The systems locally are a Xeon host (CentOS) and a KNL host (CentOS). I've also run this successfully by ssh-ing into the Colfax Cluster using 1 to 8 KNLs (I couldn't get 16 KNLs to schedule).

I am now running (attempting to run) test on a hardware vendor's setup.

After resolving configuration issues and installation issues I can

ssh into their login server (Xeon host)
su to super user
ssh to KNL node (Xeon KNL)
mpirun ... using two KNL's

When I mpirun'd the application, it started up as expected (periodically emitting progress information to the console). Several minutes into the run it hung. I thought this was a programming error resulting in deadlock, or maybe a watchdog timer had killed a thread or process without killing the MPI process manager.

To eliminate possible causes I started the application as stand alone (without mpirun).

Several minutes into this, the program hung as well. So it is not an MPI messaging issue.

Pressing Ctrl-C on the keyboard (through the two ssh connections) yielded no response (the application was not killed). I thought one of the systems in the ssh connection chain had gone down. Before doing anything with those ssh connections, I wrote an email to my client explaining the hang issue. Several minutes passed.

Now for the interesting part.

After this several-minute "hang", 100 to 150 lines of progress output from the application came out on the console window, and then the "program terminated by Ctrl-C" message appeared. What appears to have happened is that the application was running fine during the console hang, but the terminal output was suspended (as if flow control had instructed it to stop). And no, I did not press Ctrl-S or the Pause key.

Does anyone have information on this and how to avoid the hang, or at least how to resume the console output without killing the application?

Jim Dempsey

ERR: corrupt dat.conf entry field


 

We recently installed a cluster using Rocks 6.2, OFED-3.18-3, and Intel Parallel Studio XE Cluster Edition 2016.4.258.

My parallel code runs and the performance is good, but the code output says:

 ERR: corrupt dat.conf entry field: EOR, EOF, file offset=5885
 ERR: corrupt dat.conf entry field: EOR, EOF, file offset=5885
 ERR: corrupt dat.conf entry field: api_ver, file offset=5974
 ERR: corrupt dat.conf entry field: api_ver, file offset=6061
 ERR: corrupt dat.conf entry field: api_ver, file offset=6148
 ERR: corrupt dat.conf entry field: api_ver, file offset=5974
 ERR: corrupt dat.conf entry field: api_ver, file offset=6061
 ERR: corrupt dat.conf entry field: api_ver, file offset=6148
 ERR: corrupt dat.conf entry field: EOR, EOF, file offset=5885

...

I tried to google it, but cannot find any useful information. Thanks in advance for any help.
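A quick check I plan to try is dumping what actually sits at the reported offsets in dat.conf (a sketch; it assumes the default /etc/dat.conf location, and if I recall correctly the dapl library can be pointed at a different file with DAT_OVERRIDE):

# dump the bytes around the offsets the errors report
dd if=/etc/dat.conf bs=1 skip=5885 count=200 2>/dev/null
# list the provider entries that would actually be parsed
grep -v '^#' /etc/dat.conf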

Best wishes,
Ding


Using Intel MPI with PBSPro and Kerberos


Hello,

We are having some trouble on our cluster using Intel MPI with PBSPro in a Kerberized environment.

The problem is that PBSPro doesn't forward Kerberos tickets, which prevents us from having password-less ssh. Security officers reject ssh keys without a passphrase; besides, we are expected to rely on Kerberos in order to connect through ssh.

As you can expect, a simple

mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh" # that simply calls bash builtin echo

fails because pmi_proxy hangs; in the end the walltime is exceeded, and we observe:

[...]
[mpiexec@node028.sis.cnes.fr] Launch arguments: /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 0
[mpiexec@node028.sis.cnes.fr] Launch arguments: /bin/ssh -x -q node029.sis.cnes.fr /work/logiciels/rhall/intel/parallel_studio_xe_2017_u2/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin/pmi_proxy --control-port node028.sis.cnes.fr:41735 --debug --pmi-connect alltoall --pmi-aggregate -s 0 --rmk pbs --launcher ssh --demux poll --pgid 0 --enable-stdin 1 --retries 10 --control-code 1939201911 --usize -2 --proxy-id 1
[proxy:0:0@node028.sis.cnes.fr] Start PMI_proxy 0
[proxy:0:0@node028.sis.cnes.fr] STDIN will be redirected to 1 fd(s): 17
[0] node: 0 /  /
=>> PBS: job killed: walltime 23 exceeded limit 15
[mpiexec@node028.sis.cnes.fr] HYDU_sock_write (../../utils/sock/sock.c:418): write error (Bad file descriptor)
[mpiexec@node028.sis.cnes.fr] HYD_pmcd_pmiserv_send_signal (../../pm/pmiserv/pmiserv_cb.c:252): unable to write data to proxy
[mpiexec@node028.sis.cnes.fr] ui_cmd_cb (../../pm/pmiserv/pmiserv_pmci.c:174): unable to send signal downstream
[mpiexec@node028.sis.cnes.fr] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node028.sis.cnes.fr] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node028.sis.cnes.fr] main (../../ui/mpich/mpiexec.c:1147): process manager error waiting for completion

If instead we log onto the master node, execute kinit, and then run mpirun, everything works fine. Except this isn't exactly an acceptable workaround.

I've tried to play with the fabrics, since the nodes are also connected with InfiniBand, but I had no luck there. If I'm not mistaken, pmi_proxy requires password-less ssh whatever fabric we use. Am I right?

BTW, I've also tried to play with Altair PBSPro's pbsdsh. I've observed that the parameters it expects are not compatible with the ones fed by mpirun. Besides, even if I wrap pbsdsh, pmi_proxy still fails with:

[proxy:0:0@node028.sis.cnes.fr] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "node028.sis.cnes.fr" to "node028.sis.cnes.fr" (Connection refused)
[proxy:0:0@node028.sis.cnes.fr] main (../../pm/pmiserv/pmip.c:461): unable to connect to server node028.sis.cnes.fr at port 49813 (check for firewalls!)

So, my question: is there a workaround? Something that I've missed? Every clue I can gather from googling and experimenting points me towards "password-less ssh". So far the only workaround we've found consists of using another MPI framework :(
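One direction I have not fully explored yet (a sketch only; it assumes pbs_tmrsh is installed on the execution hosts and that Hydra will accept it as the bootstrap executable) is to have Hydra launch pmi_proxy through PBSPro's TM interface instead of ssh:

# let the Hydra process manager spawn pmi_proxy via PBSPro's TM-based remote shell
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh
mpirun -l -v -n $nb_procs "${PBS_O_WORKDIR}/echo-node.sh"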

Regards,

help run Fortran coarray programs with SLURM


I'm using ifort and mpiifort 17.0.1. Previously I was able to run Fortran coarray programs with PBS. Now I need to use SLURM, and I cannot adjust my old PBS scripts to make this work. I get lots of errors like this:

Error in system call pthread_mutex_unlock: Operation not permitted
    ../../src/mpid/ch3/channels/nemesis/netmod/tmi/tmi_poll.c:629

My coarray SLURM script is:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --tasks-per-node=28
#SBATCH --time=00:30:00
#SBATCH -p test

EXE=testABW.x
export I_MPI_COLL_INTRANODE=pt2pt
module load intel/2017.01
cd $SLURM_SUBMIT_DIR
echo -genvall -genv -np ${SLURM_NTASKS} ./$EXE > xx14.conf
srun -N ${SLURM_NNODES} --nodelist=${SLURM_NODELIST} /bin/hostname > nodes
./$EXE

The executable and the Intel coarray distributed memory config file are created according to Intel's instructions:

https://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential

Again, all that's changed is that I need to use SLURM instead of PBS. The executable hasn't changed.
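For reference, this is the direction I am experimenting with (a sketch only, not verified; it assumes the coarray config file accepts mpirun-style options and that a machinefile built from the SLURM allocation is acceptable):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=28
#SBATCH --time=00:30:00
#SBATCH -p test

module load intel/2017.01
cd "$SLURM_SUBMIT_DIR"
# build a machinefile from the nodes SLURM actually allocated
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodes.txt
# the executable was built with -coarray=distributed -coarray-config-file=xx14.conf
echo "-genvall -machinefile nodes.txt -n ${SLURM_NTASKS} ./testABW.x" > xx14.conf
./testABW.x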

Thanks

Unexpected DAPL event 0x4003


Hello,

I am trying to start an MPI job with the following settings.

I have two nodes, workstation1 and workstation2.
I can ssh from workstation1 (10.0.0.1) to workstation2 (10.0.0.2) without a password; I've already set up RSA keys.
I can ssh from both workstation1 and workstation2 to themselves without a password.
I can ping from 10.0.0.1 to 10.0.0.2 and from 10.0.0.2 to 10.0.0.1.

workstation1 & workstation2 are connected via Mellanox InfiniBand.
I'm running Intel(R) MPI Library, Version 2017 Update 2 Build 20170125.
I've installed MLNX_OFED_LINUX-4.1-1.0.2.0-ubuntu16.04-x86_64.

workstation1 /etc/hosts :

127.0.0.1    localhost
10.0.0.1    workstation1

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

# mpi nodes
10.0.0.2 workstation2

-------------------------------------------------------------
workstation2 /etc/hosts :

127.0.0.1    localhost
10.0.0.2    workstation2

# The following lines are desirable for IPv6 capable hosts
#::1     ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

#mpi nodes
10.0.0.1 workstation1

--------------------------------------------------------------
Here's my application start command (app names and params simplified):

#!/bin/bash
export PATH=$PATH:$PWD:/opt/intel/compilers_and_libraries_2017.2.174/linux/mpi/intel64/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$I_MPI_ROOT/intel64/lib:../program1/bin:../program2/bin
export I_MPI_FABRICS=dapl:dapl
export I_MPI_DEBUG=6
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1

# Due to a bug in Intel MPI, the -genv I_MPI_ADJUST_BCAST "9" flag has been added.
# More detailed information is available: https://software.intel.com/en-us/articles/intel-mpi-library-2017-known-issue-mpi-bcast-hang-on-large-user-defined-datatypes

mpirun -l -genv I_MPI_ADJUST_BCAST "9" -genv I_MPI_PIN_DOMAIN=omp \
: -n 1 -host 10.0.0.1 ../program1/bin/program1 master stitching stitching \
: -n 1 -host 10.0.0.2 ../program1/bin/program1 slave dissemination \
: -n 1 -host 10.0.0.1 ../program1/bin/program2 param1 param2

-------------------------------------------

I can start my application in dual-node mode with export I_MPI_FABRICS=tcp:tcp, but when I start it with dapl:dapl it gives the following error:

OUTPUT :

0] [0] MPI startup(): Intel(R) MPI Library, Version 2017 Update 2  Build 20170125 (id: 16752)
[0] [0] MPI startup(): Copyright (C) 2003-2017 Intel Corporation.  All rights reserved.
[0] [0] MPI startup(): Multi-threaded optimized library
[0] [0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] [1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[2] [2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] [0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] [0] MPI startup(): dapl data transfer mode
[1] [1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[2] [2] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] [1] MPI startup(): dapl data transfer mode
[2] [2] MPI startup(): dapl data transfer mode
[0] [0:10.0.0.1] unexpected DAPL event 0x4003
[0] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[0] MPIR_Init_thread(805): fail failed
[0] MPID_Init(1831)......: channel initialization failed
[0] MPIDI_CH3_Init(147)..: fail failed
[0] (unknown)(): Internal MPI error!
[1] [1:10.0.0.2] unexpected DAPL event 0x4003
[1] Fatal error in PMPI_Init_thread: Internal MPI error!, error stack:
[1] MPIR_Init_thread(805): fail failed
[1] MPID_Init(1831)......: channel initialization failed
[1] MPIDI_CH3_Init(147)..: fail failed
[1] (unknown)(): Internal MPI error!

Do you have any idea what the cause could be? By the way, on a single node with dapl I can start my application on both computers separately (meaning -host 10.0.0.1 for all applications on workstation1, never attaching the 10.0.0.2-related apps).
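For completeness, these are the sanity checks I can run on both hosts (nothing Intel MPI specific; they just confirm the provider name is present in dat.conf and that raw verbs traffic works between the machines; the tools come from the standard libibverbs/dapl packages):

grep ofa-v2-mlx4_0-1 /etc/dat.conf      # the provider must be listed on both hosts
ibv_devinfo | grep -E 'hca_id|state'    # ports should report PORT_ACTIVE
# on workstation2:  ibv_rc_pingpong
# on workstation1:  ibv_rc_pingpong 10.0.0.2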

IMB-MPI1 pingpong job hangs for minutes when using dapl


Hello,

I was running the IMB-MPI1 pingpong benchmark using the following command:

mpirun -hosts node1,node2 -ppn 1 -n 2 -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-mlx5_0-1s -env I_MPI_DEBUG=100 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong

The debug info is in the attached "job error message" log.

Could you please give some suggestions on how to fix this problem? Thanks in advance.

Intel MPI license confusion


Hi,

I observed that Intel MPI is now part of the Intel® Performance Libraries, which I can freely download via https://software.intel.com/en-us/performance-libraries. However, following https://software.intel.com/en-us/articles/end-user-license-agreement, unlike the other Performance Libraries,  it is not licensed under the https://software.intel.com/en-us/license/intel-simplified-software-license. I assume in most cases it would be a "Named User" license then, for the user who downloads the software. Furthermore under https://software.intel.com/en-us/articles/free-mkl Intel MPI is not listed under "Community Licensing for Everyone", but it *is* listed under "Use as an Academic Researcher".

However, my setting is a little different (I am an analyst at an academic HPC centre). My questions are then as follows:

1. Can an academic supercomputer center with a license to Intel Parallel Studio XE Professional/Composer edition (so not the Cluster edition!) still download Intel MPI via https://software.intel.com/en-us/performance-libraries and make it available to its users (without Premier Support obviously).

2. Can an academic supercomputer center without any Intel Parallel Studio XE license still download Intel MPI via https://software.intel.com/en-us/performance-libraries and make it available to its users (using GCC)

3. Can individual academic researchers (so they are registered themselves, instead of the cluster admins) download Intel MPI via https://software.intel.com/en-us/performance-libraries and use it, using Intel Parallel Studio XE Professional/Composer edition installed by cluster admins?

4. Can individual academic researchers (so they are registered themselves, instead of the cluster admins) download Intel MPI via https://software.intel.com/en-us/performance-libraries and use it, using GCC on a cluster?
