[Unix-admins] Fwd: [AstroIT #17096] presto:/data4

Patrick Shopbell pls at astro.caltech.edu
Wed Dec 2 10:28:44 PST 2020



-------- Forwarded Message --------
Subject: 	Re: [AstroIT #17096] presto:/data4
Date: 	Tue, 1 Dec 2020 12:06:35 -0800
From: 	Tim Pearson via RT <help at astro.caltech.edu>
Reply-To: 	help at astro.caltech.edu
To: 	carlsmay at astro.caltech.edu, pls at astro.caltech.edu, 
vam at astro.caltech.edu



Hi Patrick,

I was expecting it might be slow; this is a test to find out how slow. 
But I wasn't expecting the NFS mount to crash!

Tim

> On Dec 1, 2020, at 11:21 AM, Patrick Shopbell via RT 
> <help at astro.caltech.edu> wrote:
>
>
> Hi Tim,
> Have you ever tried that before?
>
> The NFS mount will not be fast, even over 10 Gbit. And it could be
> that the I/O speeds on presto would be a limit too... I am not sure
> if data4 and data8 use different RAID controllers, for example.
> --
> Patrick
>
>
> On 12/1/20 11:17 AM, Tim Pearson via RT wrote:
>> I guess I was pounding on the link. Usually I run our COMAP pipeline 
>> on the machine that is directly connected to the disk, but today I 
>> have been running it on the other machine.
>>
>> This is to see if we can run two pipelines in parallel, one on 
>> allegro using /data4 and the other on presto using /data8. If this is 
>> giving problems, then we will need to rethink our strategy.
>>
>> Thanks
>>
>> Tim
>>
>>> On Dec 1, 2020, at 11:04 AM, Patrick Shopbell via RT 
>>> <help at astro.caltech.edu> wrote:
>>>
>>>
>>> Well, there are a lot of network timeouts this morning on the
>>> 10 gbit link between allegro and presto:
>>>
>>> Dec 1 09:45:18 allegro kernel: nfs: server presto-fast not responding, still trying
>>>
>>> Mostly these were between 9:45 and 10:25 or so. Is it possible
>>> that multiple users were pounding on the link heavily during that
>>> time?
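One quick way to see how those timeouts cluster is to tally the syslog messages per minute. A minimal sketch — the sample lines are inlined here as stand-ins; on allegro the input would come from grepping /var/log/messages for the "nfs: server presto-fast not responding" messages:

```shell
# Count NFS "not responding" events per minute from syslog-format lines.
# The inlined samples stand in for the real log; on allegro you would
# pipe in: grep 'nfs: server presto-fast not responding' /var/log/messages
printf '%s\n' \
  'Dec 1 09:45:18 allegro kernel: nfs: server presto-fast not responding, still trying' \
  'Dec 1 09:45:52 allegro kernel: nfs: server presto-fast not responding, still trying' \
  'Dec 1 10:24:07 allegro kernel: nfs: server presto-fast not responding, still trying' |
awk '{ split($3, t, ":"); n[t[1] ":" t[2]]++ }
     END { for (m in n) print m, n[m] }' | sort
# prints:
# 09:45 2
# 10:24 1
```

A burst of events confined to one window (as here, 9:45-10:25) points at transient load on the link rather than a persistent interface fault.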
>>>
>>> I don't see any errors in the interfaces, and they are synced at
>>> 10 gbit speeds.
>>>
>>> It seems to be a very transient thing; there are no such messages
>>> in the log files for the entire month of November.
>>> --
>>> Patrick
>>>
>>>
>>>
>>> On 12/1/20 10:51 AM, Tim Pearson via RT wrote:
>>>> Hi Patrick
>>>>
>>>> Thanks! Do you know why it went away in the middle of my job?
>>>>
>>>> Tim
>>>>
>>>>> On Dec 1, 2020, at 10:30 AM, Patrick Shopbell via RT 
>>>>> <help at astro.caltech.edu> wrote:
>>>>>
>>>>> Hi all,
>>>>> I have reset the presto mount on allegro, so I think this
>>>>> should be working now.
>>>>> --
>>>>> Patrick
>>>>>
>>>>>
>>>>> On Tue Dec 01 10:21:05 2020, rh at ovro.caltech.edu wrote:
>>>>>> Hi Anu and Patrick,
>>>>>>
>>>>>> Related to this are the URL links for data4, which the comap account and
>>>>>> I (/home/rh) can no longer access:
>>>>>>
>>>>>> The URL is used by COMAP's data viewer. All other similar URLs work
>>>>>> just fine.
>>>>>>
>>>>>> (base) [comap at presto backupScripts]$ curl http://presto.caltech.edu:88/static_pd4/
>>>>>> <html>
>>>>>> <head><title>403 Forbidden</title></head>
>>>>>> <body>
>>>>>> <center><h1>403 Forbidden</h1></center>
>>>>>> <hr><center>nginx/1.18.0</center>
>>>>>> </body>
>>>>>> </html>
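For context, a 403 on a bare directory URL from nginx usually means one of two things: directory listing (autoindex) is off for that location, or the nginx worker cannot read/traverse the path. A sketch of the relevant config — the location name matches the URL above, but the filesystem path and everything else here are guesses, not the actual presto configuration:

```nginx
# Hypothetical location block for the data4 listing; paths are illustrative.
location /static_pd4/ {
    alias /data4/static_pd4/;   # directory must be readable/traversable by the nginx user
    autoindex on;               # without this, requesting the bare directory returns 403
}
```

If autoindex was already on and working before, the next things to check would be the permissions on the underlying directory and whether nginx is holding a stale NFS handle from before the remount.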
>>>>>>
>>>>>>
>>>>>> -- rick
>>>>>>
>>>>>> On Tue, Dec 01, 2020 at 10:17:08AM -0800, Tim Pearson wrote:
>>>>>>> Dear Anu and Patrick,
>>>>>>>
>>>>>>> I just ran into a problem on /data4:
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "run_level1.py", line 102, in <module>
>>>>>>>     run_level1(platform=args.platform, disk=args.disk, month=args.month, outdisk=args.outdisk)
>>>>>>>   File "run_level1.py", line 90, in run_level1
>>>>>>>     create_level1(dada, level1_dir, arc_dir, 'reglist.txt', plotdir=plot_dir, database=True)
>>>>>>>   File "/home/comap/tjp/level1/create_level1.py", line 103, in create_level1
>>>>>>>     (status, level1) = dada_to_level1(dada_files, attrib, output=output, check=check, verbose=verbose, logfile=logfile)
>>>>>>>   File "/home/comap/tjp/level1/dada_tools.py", line 585, in dada_to_level1
>>>>>>>     hdf.close()
>>>>>>>   File "/home/comap/tjp/level1p2/lib/python2.7/site-packages/h5py/_hl/files.py", line 443, in close
>>>>>>>     h5i.dec_ref(id_)
>>>>>>>   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
>>>>>>>   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
>>>>>>>   File "h5py/h5i.pyx", line 150, in h5py.h5i.dec_ref
>>>>>>> RuntimeError: Can't decrement id ref count (unable to close file, errno = 5, error message = 'Input/output error')
>>>>>>> ./run_level1_data4.sh: line 15: 10496 Segmentation fault (core dumped) python run_level1.py --disk /comapdata4 --outdisk /comapdata4 --month 2019-01
>>>>>>>
>>>>>>> I think this is an NFS problem, as I am running the code on allegro but /data4 is on presto. I can no longer see the /data4 files from allegro.
>>>>>>>
>>>>>>> [comap at allegro level1]$ ls /comapdata4/pathfinder/Backend/2019-01/*_0000000000000000*.dada
>>>>>>> ls: cannot access /comapdata4/pathfinder/Backend/2019-01/*_0000000000000000*.dada: No such file or directory
>>>>>>>
>>>>>>> The same command works on presto.
>>>>>>>
>>>>>>> Tim
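As an aside on the crash itself: when an NFS mount dies underneath it, h5py's File.close() raises RuntimeError (errno 5, Input/output error) rather than closing cleanly, and that exception can be caught so the pipeline logs the failure instead of dying. A minimal sketch — safe_close is an illustrative helper, not part of the COMAP code:

```python
def safe_close(hdf):
    """Close an HDF5-like file handle, tolerating I/O errors.

    On a dead NFS mount, h5py's File.close() can raise RuntimeError
    ("Can't decrement id ref count ... errno = 5") instead of closing
    cleanly. Catching it lets a pipeline record the failure and move on
    to the next file rather than crash. `hdf` is anything with a
    close() method; the name and behavior here are illustrative.
    """
    try:
        hdf.close()
        return True
    except (RuntimeError, OSError) as exc:
        print("warning: could not close file: %s" % exc)
        return False
```

This only softens the symptom, of course; the segfault and the I/O error both come from the mount going away, which is the real problem.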
>>>
>>> -- 
>>> *--------------------------------------------------------------------*
>>> | Patrick Shopbell Department of Astronomy |
>>> | pls at astro.caltech.edu Mail Code 249-17 |
>>> | (626) 395-4097 California Institute of Technology |
>>> | (626) 568-9352 (FAX) Pasadena, CA 91125 |
>>> | WWW: http://www.astro.caltech.edu/~pls/ |
>>> *--------------------------------------------------------------------*
>>>
>>>
>>
>
>
>
>

