Maintenance

Failure of Home Disk rshare1 (July – October 2022)

by support · Published 2022-07-21 · Updated 2022-11-30

Since Wednesday, July 13th, 2022, rshare1, the disk that holds the home directories, has experienced a number of failures.

Fault log

July 13th 10:38 to 20:10
July 16th 8:42 to 9:04
July 17th 0:18 to July 19th 10:09
July 19th 18:28 to 18:42
July 20th 9:50 to 12:16
July 20th 18:26 to 18:42
July 20th 20:12 to 20:24
July 21st 7:00 to 10:20
July 23rd 7:24 to 7:30
July 26th 5:30 to 10:00
August 15th 0:00 to 9:38
August 17th 6:45 to 10:02
September 25th 1:53 to 10:03
September 26th 9:00 to 11:36
October 6th 13:00 to 14:06

The latest faults are listed on the ‘Fault reports’ page.

While the home directory was inaccessible or slow to respond, many SHIROKANE services were affected.

Examples of problems occurring

Cannot log in to login nodes such as slogin.hgc.jp
Cannot register public key
Cannot qlogin
A command gives no response after execution, or its output differs from usual
Cannot execute jobs

Actions taken to recover from the failures

July 13th 10:38 to 20:10
- This was due to a problem with the Lustre file system, which delayed I/O processing to the file system. As a temporary countermeasure, we performed maintenance on some of the servers that make up the file system and disabled the file system's page cache function.

July 16th 8:42 to 9:04, July 17th 0:18 to July 19th 10:09
- Some of the servers that make up the file system stopped due to a problem with the Lustre file system. Afterwards, I/O processing to the file system stalled on the server that had taken over under the redundant configuration. Service was restored by restarting some of the servers that make up the file system.

July 19th 18:28 to 18:42
- Some of the HDDs that make up the file system malfunctioned, and I/O processing to the file system stalled because high-load jobs were running against it continuously. Service was restored by removing the faulty HDDs and restarting some of the servers that make up the file system.

July 20th 9:50 to 12:16
- I/O processing to the file system stalled because high-load jobs had been running against it continuously since July 19th. Service was restored by restarting some of the servers that make up the file system.

July 20th 18:26 to 18:42, July 20th 20:12 to 20:24
- Some of the servers that make up the file system stopped due to a problem with the Lustre file system, and I/O processing to the file system then stalled on the server that had taken over under the redundant configuration. The administrator temporarily stopped the jobs that were placing a heavy load on the file system and contacted the users who had run them. Once countermeasures were in place, the administrator restored the server that had been switched over.

July 21st 7:00 to 10:20
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

July 23rd 7:24 to 7:30
- Some of the servers that make up the file system stopped due to a problem with the Lustre file system, and I/O processing to the file system then stalled on the server that had taken over under the redundant configuration. The problem has been resolved by restarting the servers in question.

July 26th 5:30 to 10:00
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

August 15th 0:00 to 9:38
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

August 17th 6:45 to 10:02
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

September 25th 1:53 to 10:03
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

September 26th 9:00 to 11:36
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

October 6th 13:00 to 14:06
- The cause was that I/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.

Cause

We believe the main cause was a combination of the following three factors.

Issues in the Lustre file system
Overload on rshare1 from jobs and other activity
Capacity exhaustion of rshare1

Measures

Update the Lustre file system during the periodic inspection in October 2022
Suspend jobs that place a high I/O load on rshare1 whenever necessary
Organize the data stored in rshare1

We would appreciate your cooperation on the following two points.

Reduce the I/O load on rshare1
Organize data stored in rshare1
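
As a starting point for organizing data, the following is a minimal sketch (not an official SHIROKANE tool) that reports which entries directly under a directory occupy the most space. The function names `dir_size` and `largest_entries` are hypothetical and chosen only for this illustration. Note that walking a large directory tree itself generates file system I/O, so it is assumed that you run this sparingly and on one directory at a time.

```python
import os


def dir_size(path):
    """Sum the sizes in bytes of all files under path.

    Symbolic links are not followed; a link is counted as its own
    (small) size rather than the size of its target.
    """
    total = 0
    for root, _dirs, files in os.walk(path, followlinks=False):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def largest_entries(base, top=10):
    """Return up to `top` (size_in_bytes, name) pairs for the entries
    directly under `base`, largest first."""
    sizes = []
    with os.scandir(base) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue
            sizes.append((size, entry.name))
    sizes.sort(reverse=True)
    return sizes[:top]
```

For example, `largest_entries(os.path.expanduser("~"))` returns the ten largest files or directories directly under your home directory, which can then be reviewed for deletion or relocation.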

The above measures have been implemented and the problem has been resolved. We sincerely apologize for any inconvenience caused.

SHIROKANE SC
