Failure of Home Disk rshare1 (July 2022)

Since Wednesday, July 13th, 2022, rshare1, the disk on which the home directories reside, has experienced repeated failures. As of noon on Friday, July 22nd, the system was operating normally.

Fault log

  • July 13th 10:38 to 20:10
  • July 16th 8:42 to 9:04
  • July 17th 0:18 to 19th 10:09
  • July 19th 18:28 to 18:42
  • July 20th 9:50 to 12:16
  • July 20th 18:26 to 18:42
  • July 20th 20:12 to 20:24
  • July 21st 7:00 to 10:20
  • July 23rd 7:24 to 7:30
  • July 26th 5:30 to 10:00

The latest faults are listed on the ‘Fault reports’ page.

Because the home directory was inaccessible or slow to respond, many SHIROKANE services were affected.

Examples of problems that occurred

  • Cannot log in to login nodes such as slogin.hgc.jp
  • Cannot register public key
  • Cannot qlogin
  • No response after executing a command, or the command's output differs from usual
  • Cannot execute jobs

Actions taken to recover from the failures

  • July 13th 10:38 to 20:10
    • This was due to a problem with the Lustre file system that delayed I/O processing to the file system. As a temporary countermeasure, we performed maintenance on some of the servers that make up the file system and disabled the file system's page cache function.
  • July 16th 8:42 to 9:04, July 17th 0:18 to 19th 10:09
    • Some of the servers that make up the file system stopped due to a problem with the Lustre file system. I/O processing to the file system then stalled on the server that took over under the redundant configuration. The cause of the stalled I/O processing is currently under investigation. Service was restored by restarting some of the servers that make up the file system.
  • July 19th 18:28 to 18:42
    • This was caused by a malfunction of some of the HDDs that make up the file system, combined with I/O processing to the file system stalling under the continuous execution of high-load jobs. Service was restored by removing the affected HDDs and restarting some of the servers that make up the file system.
  • July 20th 9:50 to 12:16
    • I/O processing to the file system stalled because high-load jobs had been running against the file system continuously since July 19. We restored the system by restarting some of the servers that make up the file system.
  • July 20th 18:26 to 18:42, July 20th 20:12 to 20:24
    • Some of the servers that make up the file system stopped due to a problem with the Lustre file system, and I/O processing to the file system stalled on the server that took over under the redundant configuration. The administrator temporarily stopped the jobs that were placing a heavy load on the file system and contacted the users who had submitted them. After implementing countermeasures, the administrator restored the switched server.
  • July 21st 7:00 to 10:20
    • I/O processing to the file system stalled on some of the servers that make up the file system. The problem was resolved by restarting the affected servers.

Policy for future measures

  • We will update the Lustre file system as appropriate. The schedule will be announced on the web page once it has been planned so as to minimize the service outage period.

We sincerely apologize for any inconvenience caused.
