{"id":2508,"date":"2022-07-21T13:06:51","date_gmt":"2022-07-21T04:06:51","guid":{"rendered":"https:\/\/gc.hgc.jp\/?p=2508"},"modified":"2022-11-30T13:53:40","modified_gmt":"2022-11-30T04:53:40","slug":"system-failure-rshare1","status":"publish","type":"post","link":"https:\/\/gc.hgc.jp\/en\/2022\/07\/system-failure-rshare1\/","title":{"rendered":"Failure of Home Disk rshare1 (July &#8211; October 2022)"},"content":{"rendered":"\n<p>Since Wednesday 13th July 2022, rshare1, where the home directory resides, has been experiencing a number of failures. <\/p>\n\n\n\n<h2>Fault log<\/h2>\n\n\n\n<ul><li>July 13th 10:38 to 20:10<\/li><li>July 16th 8:42 to 9:04<\/li><li>July 17th 0:18 to 19th 10:09<\/li><li>July 19th 18:28 to 18:42<\/li><li>July 20th 9:50 to 12:16<\/li><li>July 20th 18:26 to 18:42<\/li><li>July 20th 20:12 to 20:24<\/li><li>July 21st 7:00 to 10:20<\/li><li>July 23rd 7:24 to 7:30<\/li><li>July 26th 5:30 to 10:00<\/li><li>August 15th 0:00 to 9:38<\/li><li>August 17th 6:45 to 10:02<\/li><li>September 25th 1:53 to 10:03<\/li><li>September 26th 9:00 to 11:36<\/li><li>October 6th 13:00 to 14:06<\/li><\/ul>\n\n\n\n<p>The latest faults are listed in the &#8216;<a href=\"https:\/\/gc.hgc.jp\/en\/util-info\/fault-report\/\">Fault reports<\/a>&#8216; page.<\/p>\n\n\n\n<p>Inaccessibility of the home directory or long access times have caused problems for many SHIROKANE services.<\/p>\n\n\n\n<h2>Examples of problems occurring<\/h2>\n\n\n\n<ul><li>Cannot log in to login nodes such as slogin.hgc.jp<\/li><li>Cannot register public key<\/li><li>Cannot qlogin<\/li><li>No response after executing <em>xx<\/em> command. 
Display of <em>xx<\/em> command is different from usual<\/li><li>Cannot execute jobs<\/li><\/ul>\n\n\n\n<h2>Actions taken to recover from the failures<\/h2>\n\n\n\n<ul><li>July 13th 10:38 to 20:10<ul><li>This was due to a problem with the Lustre file system, which caused a delay in I\/O processing to the file system. As a temporary countermeasure, we performed maintenance on some of the servers that make up the file system and disabled the page cache function of the file system.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul id=\"block-2b7347e2-7037-4a42-9209-f8aa5a47ddf7\"><li>July 16th 8:42 to 9:04, July 17th 0:18 to 19th 10:09<ul><li>Some of the servers that make up the file system stopped due to a problem with the Lustre file system. After that, I\/O processing to the file system stalled on the server that had been switched to a redundant configuration. Service was restored by restarting some of the servers that make up the file system.<\/li><\/ul><\/li><li>July 19th 18:28 to 18:42<ul><li>This was due to a malfunction of some HDDs that make up the file system, combined with I\/O processing to the file system stalling under the continuous execution of high-load jobs. Service was restored by removing the HDDs in question and restarting some of the servers that make up the file system.<\/li><\/ul><\/li><li>July 20th 9:50 to 12:16<ul><li>The cause was that I\/O processing to the file system stalled because high-load jobs had been running against the file system continuously since July 19. We restored the system by restarting some of the servers that make up the file system.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul id=\"block-2b7347e2-7037-4a42-9209-f8aa5a47ddf7\"><li>July 20th 18:26 to 18:42, July 20th 20:12 to 20:24<ul><li>Some of the servers that make up the file system stopped due to a problem with the Lustre file system. 
The cause was that input\/output processing to the file system was stalled on the server that was switched to a redundant configuration. The administrator temporarily stopped the jobs with a heavy load on the file system and contacted the users who had executed the jobs. After implementing countermeasures, the administrator restored the switched server.<\/li><\/ul><\/li><li>July 21st 7:00 to 10:20<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>July 23rd 7:24 to 7:30<ul><li>Some of the servers that make up the file system stopped due to a problem with the Lustre file system. The cause was that input\/output processing to the file system was stalled on the server that was switched to a redundant configuration. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>July 26th 5:30 to 10:00<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>August 15th 0:00 to 9:38<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>August 17th 6:45 to 10:02<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>September 25th 1:53 to 10:03<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. 
The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>September 26th 9:00 to 11:36<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul><li>October 6th 13:00 to 14:06<ul><li>The cause was that I\/O processing to the file system was stalled on some of the servers that make up the file system. The problem has been resolved by restarting the servers in question.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h2>Cause<\/h2>\n\n\n\n<p>We assume that the main reason was a combination of the following three factors.<\/p>\n\n\n\n<ul><li>Lustre File System Issues<\/li><li>Overload on rshare1 by jobs, etc.<\/li><li>Capacity exhaustion of rshare1<\/li><\/ul>\n\n\n\n<h2>Measures<\/h2>\n\n\n\n<ul><li>Lustre file system updates at <a href=\"https:\/\/supcom.hgc.jp\/internal\/mediawiki\/id\/1671\">Periodical inspection of October 2022<\/a><\/li><li>Suspend jobs with high I\/O load to rshare1 at any time<\/li><li>Organize data stored in rshare1<\/li><\/ul>\n\n\n\n<p>We would appreciate your cooperation on the following two points.<\/p>\n\n\n\n<ul><li>I\/O load reduction to rshare1<\/li><li>Organize data stored in rshare1<\/li><\/ul>\n\n\n\n<p>The above measures have been implemented and the problem has been corrected. We sincerely apologize for any inconvenience caused.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since Wednesday 13th July 2022, rshare1, where the home directory resides, has been experiencing a number of failures. 
Fault log July 13th 10:38 to 20:10 July 16th 8:42 to 9:04 July 17th 0:18 to&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":1677,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_locale":"en_US","_original_post":"https:\/\/gc.hgc.jp\/?p=2507"},"categories":[36],"tags":[],"_links":{"self":[{"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/posts\/2508"}],"collection":[{"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/comments?post=2508"}],"version-history":[{"count":18,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/posts\/2508\/revisions"}],"predecessor-version":[{"id":2607,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/posts\/2508\/revisions\/2607"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/media\/1677"}],"wp:attachment":[{"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/media?parent=2508"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/categories?post=2508"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gc.hgc.jp\/wp-json\/wp\/v2\/tags?post=2508"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}