Failure of Home Disk rshare1 (July – October 2022)

Since Wednesday, July 13th, 2022, rshare1, the storage system that hosts the home directories, has experienced a series of failures.

Fault log

  • July 13th 10:38 to 20:10
  • July 16th 8:42 to 9:04
  • July 17th 0:18 to 19th 10:09
  • July 19th 18:28 to 18:42
  • July 20th 9:50 to 12:16
  • July 20th 18:26 to 18:42
  • July 20th 20:12 to 20:24
  • July 21st 7:00 to 10:20
  • July 23rd 7:24 to 7:30
  • July 26th 5:30 to 10:00
  • August 15th 0:00 to 9:38
  • August 17th 6:45 to 10:02
  • September 25th 1:53 to 10:03
  • September 26th 9:00 to 11:36
  • October 6th 13:00 to 14:06

The latest faults are listed on the 'Fault reports' page.

When the home directory was inaccessible or slow to respond, many SHIROKANE services were affected.

Examples of observed problems

  • Cannot log in to login nodes such as slogin.hgc.jp
  • Cannot register public key
  • Cannot qlogin
  • A command hangs with no response, or its output differs from usual
  • Cannot execute jobs
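When symptoms like these appear, one quick way to tell whether the home directory itself is stalled (rather than, say, the job scheduler) is to wrap a simple access in a timeout. This is only an illustrative check, not an official SHIROKANE diagnostic; the 5-second limit is an arbitrary choice:

```shell
#!/bin/sh
# Probe the home directory with a 5-second timeout, so a stalled
# file system produces a clear message instead of a hung shell.
if timeout 5 ls -d "$HOME" > /dev/null 2>&1; then
    echo "home directory responded"
else
    echo "home directory unresponsive or slow"
fi
```

On a healthy system this returns almost instantly; during the outages above, commands touching the home directory would instead hit the timeout.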

Recovery actions for each outage

  • July 13th 10:38 to 20:10
    • The outage was caused by a problem with the Lustre file system that delayed I/O processing to the file system. As a temporary countermeasure, we performed maintenance on some of the servers that make up the file system and disabled the file system's page cache function.
  • July 16th 8:42 to 9:04, July 17th 0:18 to 19th 10:09
    • Some of the servers that make up the file system stopped because of a problem with the Lustre file system, after which I/O processing stalled on the server that had taken over in the redundant configuration. Service was restored by restarting some of the servers that make up the file system.
  • July 19th 18:28 to 18:42
    • The causes were a malfunction of some of the HDDs that make up the file system, and I/O processing that stalled under continuously running high-load jobs against the file system. Recovery was achieved by removing the faulty HDDs and restarting some of the servers that make up the file system.
  • July 20th 9:50 to 12:16
    • I/O processing to the file system stalled because high-load jobs had been running against the file system continuously since July 19th. We restored the system by restarting some of the servers that make up the file system.
  • July 20th 18:26 to 18:42, July 20th 20:12 to 20:24
    • Some of the servers that make up the file system stopped because of a problem with the Lustre file system, and I/O processing stalled on the server that had taken over in the redundant configuration. The administrator temporarily stopped the jobs placing a heavy load on the file system and contacted the users who had submitted them. After these countermeasures were in place, the administrator restored the failed-over server.
  • July 21st 7:00 to 10:20
    • I/O processing stalled on some of the servers that make up the file system; the problem was resolved by restarting the servers in question.
  • July 23rd 7:24 to 7:30
    • Some of the servers that make up the file system stopped because of a problem with the Lustre file system, and I/O processing stalled on the server that had taken over in the redundant configuration. The problem was resolved by restarting the servers in question.
  • July 26th 5:30 to 10:00; August 15th 0:00 to 9:38; August 17th 6:45 to 10:02; September 25th 1:53 to 10:03; September 26th 9:00 to 11:36; October 6th 13:00 to 14:06
    • In each of these outages, I/O processing stalled on some of the servers that make up the file system; the problem was resolved by restarting the servers in question.

Cause

We believe the failures were caused by a combination of the following three factors.

  • Problems with the Lustre file system
  • Overload of rshare1 by jobs and other workloads
  • Capacity exhaustion of rshare1

Measures

We would appreciate your cooperation on the following two points.

  • Reduce the I/O load on rshare1
  • Organize (clean up) the data stored in rshare1
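For the second point, a short command can show which subdirectories take up the most space, which helps decide what to clean up or move. This is a minimal sketch, not an official tool; it scans the directory given as its first argument (here a placeholder `"$1"`, e.g. your home directory), and `du` itself generates I/O, so it is best run outside peak hours:

```shell
#!/bin/sh
# List the immediate subdirectories of the given directory,
# largest first, with human-unfriendly but sortable KiB sizes.
target="${1:-.}"
du -k -s "$target"/*/ 2>/dev/null | sort -rn
```

The trailing `/*/` restricts the report to directories; individual large files at the top level can be found separately with `find "$target" -maxdepth 1 -type f -size +1G`.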

With the above measures in place, the problem has been corrected. We sincerely apologize for any inconvenience caused.
