Clustered file services failure causes all nodes down? [closed]
I am planning a cluster of three active-active nodes running Red Hat Linux 6.7, with both local and shared file systems, so that the three servers can read and write some shared files concurrently. However, my system administrator told me that if the clustered file service goes down, all three nodes will go down with it. Is there a clustering approach that overcomes this?
filesystems cluster
closed as too broad by maulinglawns, Christopher, Thomas, Jeff Schaller, roaima Mar 23 at 23:38
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
It might depend on the clustering implementation you're using. But you've forgotten to tell us what that is.
– roaima
Mar 23 at 23:38
One approach would be to have the shared filesystems be NFS filesystems, and the NFS server be a NetApp or some other NAS device with sufficient redundancy built-in. Or you might build your own NFS server cluster using DRBD, and have the worker nodes rely on that for shared filesystems. But still, if your workers need shared storage and it's down, then the workers will be unable to work too. The only way around it is replicating the data instead of sharing it, and that brings with it the problem of keeping the replicas in sync with each other. [...]
– telcoM
Mar 24 at 7:42
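As a rough illustration of the DRBD-backed NFS server idea in the comment above, a two-node DRBD resource might be declared like this (the hostnames, disks, and addresses are hypothetical, not anything from the thread):

```
# /etc/drbd.d/r0.res -- hypothetical two-node resource backing an
# HA NFS server pair; hostnames, devices, and IPs are made up
resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb1;
  meta-disk internal;
  on nfs-a {
    address 192.168.10.11:7789;
  }
  on nfs-b {
    address 192.168.10.12:7789;
  }
}
```

A cluster manager (Pacemaker, for example) would then promote one node to DRBD primary, mount the replicated device, and move the NFS export and its virtual IP on failover, so the worker nodes only ever see a single NFS server address.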
[...] If there is a chance that one of the workers is using a version of shared data that is out-of-sync with the others, then that worker must stop immediately and not pass on the stale data. Essentially, you must be able to treat the worker systems as expendable.
– telcoM
Mar 24 at 7:43
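The "stop rather than serve stale data" rule from the comment above can be sketched as a simple freshness check. The epoch-based protocol and the function name here are hypothetical illustrations, not part of any real clustering API:

```python
# Sketch: a worker refuses to serve if its replica lags behind its
# peers. Each replica carries a monotonically increasing epoch number
# (a hypothetical convention, not a standard protocol).

def should_serve(local_epoch, peer_epochs):
    """Return True only if the local replica is at least as new as
    every reachable peer; a lagging or isolated worker must stop
    rather than pass on stale data."""
    if not peer_epochs:
        # No peers reachable: freshness cannot be proven, so treat
        # this worker as expendable and stop serving.
        return False
    return local_epoch >= max(peer_epochs)
```

For example, a worker at epoch 4 whose peers report epoch 5 would stop serving until it resynchronizes.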
edited Mar 24 at 2:38
asked Mar 23 at 16:40
Ken Sin
63
1 Answer
Your admin is correct: if the nodes share a single OS on the clustered filesystem and the clustering fails, all nodes go down, because none of them can reach the files they need in order to operate.
You are also right to question that: if each node has its own OS, a clustering failure should only cost you access to the files within the cluster, since the operating system files each node needs are local. You do not mention which clustering software you are using, so some further mechanism may be needed to ensure a failed cluster does not otherwise hang the system.
Your clustering would still be a single point of failure for the shared files, but if a failure does not take out entire hosts, that may reassure your admin.
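One such mechanism, assuming the shared data is mounted over NFS, is to mark the shared mount as non-essential so a node can boot and keep running from its local OS even when the cluster storage is unavailable. The server name and paths below are hypothetical, and option support varies by distribution and version:

```
# /etc/fstab fragment (hypothetical names): 'soft' makes I/O on the
# shared mount fail with an error instead of hanging the node, and
# 'nofail' lets the node finish booting without the mount present
nas:/shared   /mnt/shared   nfs   soft,timeo=100,retrans=3,nofail,_netdev   0 0
```

The trade-off: soft NFS mounts turn indefinite hangs into possible I/O errors, so applications using the shared paths must tolerate failed reads and writes during an outage.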
answered Mar 23 at 19:15
GracefulRestart
74917