gluster rebalance failure
Gluster rebalance fails, and a brick also goes down after the rebalance is run. The output and logs are as follows:
$ gluster --mode=script --wignore volume status
Status of volume: patchy
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick myhost:/d/backends/patchy1            49152     0          Y       64813
Brick myhost:/d/backends/patchy2            49153     0          Y       64834
Task Status of Volume patchy
------------------------------------------------------------------------------
There are no active volume tasks
$ gluster --mode=script --wignore volume set patchy cluster.weighted-rebalance off
$ gluster --mode=script --wignore volume rebalance patchy start force
volume rebalance: patchy: success: Rebalance on patchy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 5c573761-314d-4294-99ba-c6a518675e26
$ gluster --mode=script --wignore volume rebalance patchy status
Node       Rebalanced-files  size    scanned  failures  skipped  status  run time in h:m:s
---------  ----------------  ------  -------  --------  -------  ------  -----------------
localhost  0                 0Bytes  0        3         0        failed  0:00:00
volume rebalance: patchy: success
$ gluster --mode=script --wignore volume status
Status of volume: patchy
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick myhost:/d/backends/patchy1            49152     0          Y       64813
Brick myhost:/d/backends/patchy2            N/A       N/A        N       N/A
Task Status of Volume patchy
------------------------------------------------------------------------------
Task : Rebalance
ID : 5c573761-314d-4294-99ba-c6a518675e26
Status : failed
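To make the before/after comparison easier to repeat, the status output can be filtered for bricks whose Online column reads N. This is only a minimal sketch: the `offline_bricks` helper is hypothetical, and it assumes the whitespace-separated layout shown above, with each brick on a single line and the Online flag as the second-to-last field.

```shell
# Hypothetical helper: print the brick path of every brick reported offline.
# Assumes each "Brick ..." line is not wrapped and that the Online flag
# (Y or N) is the second-to-last whitespace-separated field.
offline_bricks() {
    awk '$1 == "Brick" && $(NF-1) == "N" { print $2 }'
}

# Usage on a live system (not run here):
#   gluster --mode=script --wignore volume status | offline_bricks
```

Against the second status output above, this would print only the patchy2 brick.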
The volume rebalance log is as follows:
$ cat /var/log/glusterfs/patchy-rebalance.log
[2018-03-23 08:06:34.303638] I [MSGID: 100030] [glusterfsd.c:2625:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 4.0.1 (args: /usr/local/sbin/glusterfs -s localhost --volfile-id rebalance/patchy --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --process-name rebalance --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=88559a30-a606-4af0-beb6-458cfafa8df6 --xlator-option *dht.commit-hash=3584404562 --socket-file /var/run/gluster/gluster-rebalance-2f34d12e-1e62-4737-8eec-b14b75ae3500.sock --pid-file /var/lib/glusterd/vols/patchy/rebalance/88559a30-a606-4af0-beb6-458cfafa8df6.pid -l /var/log/glusterfs/patchy-rebalance.log)
[2018-03-23 08:06:34.316882] I [MSGID: 101190] [event-epoll.c:609:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-03-23 08:06:39.304868] I [MSGID: 109104] [dht-shared.c:710:dht_init] 0-patchy-dht: dht_init using commit hash 3584404562
[2018-03-23 08:06:39.305817] I [MSGID: 101190] [event-epoll.c:609:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-03-23 08:06:39.307050] I [MSGID: 114020] [client.c:2300:notify] 0-patchy-client-0: parent translators are ready, attempting connect on transport
[2018-03-23 08:06:39.307634] I [MSGID: 114020] [client.c:2300:notify] 0-patchy-client-1: parent translators are ready, attempting connect on transport
Final graph:
+------------------------------------------------------------------------------+
1: volume patchy-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host myhost
5: option remote-subvolume /d/backends/patchy1
6: option transport-type socket
7: option transport.address-family inet
8: option username 288e00ca-26be-4a99-9e33-ea1b174ef347
9: option password 626b1311-2be8-4c6c-97f3-76a33c4a48e5
10: option transport.tcp-user-timeout 0
11: option transport.socket.keepalive-time 20
12: option transport.socket.keepalive-interval 2
13: option transport.socket.keepalive-count 9
14: end-volume
15:
16: volume patchy-client-1
17: type protocol/client
18: option ping-timeout 42
19: option remote-host myhost
20: option remote-subvolume /d/backends/patchy2
21: option transport-type socket
22: option transport.address-family inet
23: option username 288e00ca-26be-4a99-9e33-ea1b174ef347
24: option password 626b1311-2be8-4c6c-97f3-76a33c4a48e5
25: option transport.tcp-user-timeout 0
26: option transport.socket.keepalive-time 20
27: option transport.socket.keepalive-interval 2
28: option transport.socket.keepalive-count 9
29: end-volume
30:
31: volume patchy-dht
32: type cluster/distribute
33: option use-readdirp yes
34: option lookup-unhashed yes
35: option assert-no-child-down yes
36: option readdir-optimize on
37: option rebalance-cmd 5
38: option node-uuid 88559a30-a606-4af0-beb6-458cfafa8df6
39: option commit-hash 3584404562
40: option lock-migration off
41: option force-migration off
42: option weighted-rebalance off
43: subvolumes patchy-client-0 patchy-client-1
44: end-volume
45:
46: volume patchy
47: type debug/io-stats
48: option log-level INFO
49: option latency-measurement off
50: option count-fop-hits off
51: subvolumes patchy-dht
52: end-volume
53:
+------------------------------------------------------------------------------+
[2018-03-23 08:06:39.308435] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308538] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308607] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308838] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308867] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49153 (from 0)
[2018-03-23 08:06:39.309138] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309222] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-0: changing port to 49152 (from 0)
[2018-03-23 08:06:39.309264] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309531] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309747] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.310115] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy2'.
[2018-03-23 08:06:39.310264] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-0: Connected to patchy-client-0, attached to remote volume '/d/backends/patchy1'.
[2018-03-23 08:06:39.315151] I [MSGID: 109005] [dht-selfheal.c:2328:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-03-23 08:06:39.315302] I [dht-rebalance.c:4513:gf_defrag_start_crawl] 0-patchy-dht: gf_defrag_start_crawl using commit hash 3584404562
[2018-03-23 08:06:39.315689] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.317123] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317201] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317434] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-1
[2018-03-23 08:06:39.317455] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317462] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-0
[2018-03-23 08:06:39.317469] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317601] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-1,cnt = 6119424
[2018-03-23 08:06:39.317730] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-0,cnt = 3149824
[2018-03-23 08:06:39.317739] I [MSGID: 0] [dht-rebalance.c:4275:gf_defrag_total_file_size] 0-patchy-dht: Total size files = 9269248
[2018-03-23 08:06:39.317866] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-1,cnt = 1570
[2018-03-23 08:06:39.318020] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-0,cnt = 897
[2018-03-23 08:06:39.318029] I [MSGID: 0] [dht-rebalance.c:4311:gf_defrag_total_file_cnt] 0-patchy-dht: Total number of files = 1233
[2018-03-23 08:06:39.318148] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2018-03-23 08:06:39.318323] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2018-03-23 08:06:39.318360] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
[2018-03-23 08:06:39.318436] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful
[2018-03-23 08:06:39.377769] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /dir
[2018-03-23 08:06:39.378756] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /dir
[2018-03-23 08:06:39.689322] W [socket.c:592:__socket_rwv] 0-patchy-client-1: readv on 0.0.0.0:49153 failed (No data available)
[2018-03-23 08:06:39.689360] I [MSGID: 114018] [client.c:2227:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-23 08:06:39.689381] W [MSGID: 109073] [dht-common.c:10557:dht_notify] 0-patchy-dht: Received CHILD_DOWN. Exiting
[2018-03-23 08:06:39.689391] I [MSGID: 109029] [dht-rebalance.c:5327:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-03-23 08:06:39.689573] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15a)[0x3ff7e1294a2] (--> /usr/local/lib/libgfrpc.so.0(+0xdb1e)[0x3ff7e08db1e] (--> /usr/local/lib/libgfrpc.so.0(+0xdc8c)[0x3ff7e08dc8c] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x98)[0x3ff7e08f408] (--> /usr/local/lib/libgfrpc.so.0(+0xfffc)[0x3ff7e08fffc] ))))) 0-patchy-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(READDIRP(40)) called at 2018-03-23 08:06:39.379300 (xid=0x22)
[2018-03-23 08:06:39.689588] W [MSGID: 114031] [client-rpc-fops_v2.c:2264:client4_0_readdirp_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.689635] W [MSGID: 109021] [dht-rebalance.c:3106:gf_defrag_get_entry] 0-patchy-dht: Readdirp failed. Aborting data migration for directory: /dir [Transport endpoint is not connected]
[2018-03-23 08:06:39.689655] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.689714] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /dir
[2018-03-23 08:06:39.690982] W [MSGID: 114061] [client-common.c:3375:client_pre_readdirp_v2] 0-patchy-client-1: (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-03-23 08:06:39.691056] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.691419] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.691436] E [MSGID: 109119] [dht-lock.c:1051:dht_blocking_inodelk_cbk] 0-patchy-dht: inodelk failed on subvol patchy-client-1, gfid:00000000-0000-0000-0000-000000000001 [Transport endpoint is not connected]
[2018-03-23 08:06:39.691623] E [MSGID: 109016] [dht-rebalance.c:3934:gf_defrag_fix_layout] 0-patchy-dht: Setxattr failed for / [Transport endpoint is not connected]
[2018-03-23 08:06:39.691636] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /
[2018-03-23 08:06:39.691653] E [MSGID: 114031] [client-rpc-fops_v2.c:2451:client4_0_opendir_cbk] 0-patchy-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2018-03-23 08:06:39.691812] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.691844] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /
[2018-03-23 08:06:39.691862] I [dht-rebalance.c:4716:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2018-03-23 08:06:39.692135] I [MSGID: 109028] [dht-rebalance.c:5141:gf_defrag_status_get] 0-patchy-dht: Rebalance is failed. Time taken is 0.00 secs
[2018-03-23 08:06:39.692144] I [MSGID: 109028] [dht-rebalance.c:5145:gf_defrag_status_get] 0-patchy-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0
[2018-03-23 08:06:39.692230] W [glusterfsd.c:1424:cleanup_and_exit] (-->/lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff7de87934] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x110) [0x12e00b6b0] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x74) [0x12e00b494] ) 0-: received signum (15), shutting down
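Since most of the log is informational, it may help to pull out only the error-level entries. The following is a rough sketch, not a GlusterFS tool: the `log_errors` helper is hypothetical, and it assumes the line format shown above, where the severity letter (I, W or E) is the third whitespace-separated field.

```shell
# Hypothetical helper: print only error-level (E) entries from a GlusterFS log.
# Assumes lines look like "[date time] LEVEL [...] message", so the severity
# letter is field 3. Reads the named files, or stdin when no file is given.
log_errors() {
    awk '$3 == "E"' "$@"
}

# Usage on the log quoted above (not run here):
#   log_errors /var/log/glusterfs/patchy-rebalance.log
```

In the log above, the first entries this selects are the two `getxattr err for dir` messages. Every later error follows the disconnect from patchy-client-1 at 08:06:39.689.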
mount volume glusterfs
Lots of statements. What's the question?
– roaima
Mar 23 at 8:17
Sorry for the lengthy logs. I am not able to figure out why the rebalance is not working.
– rss
Mar 23 at 8:21
The volume rebalance fails when the number of files in the bricks is more than 750. This is reproducible, and strange. Any idea?
– rss
Mar 30 at 3:56
[2018-03-23 08:06:39.308538] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308607] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308838] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308867] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49153 (from 0)
[2018-03-23 08:06:39.309138] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309222] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-0: changing port to 49152 (from 0)
[2018-03-23 08:06:39.309264] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309531] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309747] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.310115] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy2'.
[2018-03-23 08:06:39.310264] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-0: Connected to patchy-client-0, attached to remote volume '/d/backends/patchy1'.
[2018-03-23 08:06:39.315151] I [MSGID: 109005] [dht-selfheal.c:2328:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-03-23 08:06:39.315302] I [dht-rebalance.c:4513:gf_defrag_start_crawl] 0-patchy-dht: gf_defrag_start_crawl using commit hash 3584404562
[2018-03-23 08:06:39.315689] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.317123] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317201] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317434] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-1
[2018-03-23 08:06:39.317455] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317462] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-0
[2018-03-23 08:06:39.317469] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317601] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-1,cnt = 6119424
[2018-03-23 08:06:39.317730] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-0,cnt = 3149824
[2018-03-23 08:06:39.317739] I [MSGID: 0] [dht-rebalance.c:4275:gf_defrag_total_file_size] 0-patchy-dht: Total size files = 9269248
[2018-03-23 08:06:39.317866] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-1,cnt = 1570
[2018-03-23 08:06:39.318020] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-0,cnt = 897
[2018-03-23 08:06:39.318029] I [MSGID: 0] [dht-rebalance.c:4311:gf_defrag_total_file_cnt] 0-patchy-dht: Total number of files = 1233
[2018-03-23 08:06:39.318148] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2018-03-23 08:06:39.318323] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2018-03-23 08:06:39.318360] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
[2018-03-23 08:06:39.318436] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful
[2018-03-23 08:06:39.377769] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /dir
[2018-03-23 08:06:39.378756] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /dir
[2018-03-23 08:06:39.689322] W [socket.c:592:__socket_rwv] 0-patchy-client-1: readv on 0.0.0.0:49153 failed (No data available)
[2018-03-23 08:06:39.689360] I [MSGID: 114018] [client.c:2227:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-23 08:06:39.689381] W [MSGID: 109073] [dht-common.c:10557:dht_notify] 0-patchy-dht: Received CHILD_DOWN. Exiting
[2018-03-23 08:06:39.689391] I [MSGID: 109029] [dht-rebalance.c:5327:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-03-23 08:06:39.689573] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15a)[0x3ff7e1294a2] (--> /usr/local/lib/libgfrpc.so.0(+0xdb1e)[0x3ff7e08db1e] (--> /usr/local/lib/libgfrpc.so.0(+0xdc8c)[0x3ff7e08dc8c] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x98)[0x3ff7e08f408] (--> /usr/local/lib/libgfrpc.so.0(+0xfffc)[0x3ff7e08fffc] ))))) 0-patchy-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(READDIRP(40)) called at 2018-03-23 08:06:39.379300 (xid=0x22)
[2018-03-23 08:06:39.689588] W [MSGID: 114031] [client-rpc-fops_v2.c:2264:client4_0_readdirp_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.689635] W [MSGID: 109021] [dht-rebalance.c:3106:gf_defrag_get_entry] 0-patchy-dht: Readdirp failed. Aborting data migration for directory: /dir [Transport endpoint is not connected]
[2018-03-23 08:06:39.689655] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.689714] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /dir
[2018-03-23 08:06:39.690982] W [MSGID: 114061] [client-common.c:3375:client_pre_readdirp_v2] 0-patchy-client-1: (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-03-23 08:06:39.691056] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.691419] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.691436] E [MSGID: 109119] [dht-lock.c:1051:dht_blocking_inodelk_cbk] 0-patchy-dht: inodelk failed on subvol patchy-client-1, gfid:00000000-0000-0000-0000-000000000001 [Transport endpoint is not connected]
[2018-03-23 08:06:39.691623] E [MSGID: 109016] [dht-rebalance.c:3934:gf_defrag_fix_layout] 0-patchy-dht: Setxattr failed for / [Transport endpoint is not connected]
[2018-03-23 08:06:39.691636] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /
[2018-03-23 08:06:39.691653] E [MSGID: 114031] [client-rpc-fops_v2.c:2451:client4_0_opendir_cbk] 0-patchy-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2018-03-23 08:06:39.691812] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.691844] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /
[2018-03-23 08:06:39.691862] I [dht-rebalance.c:4716:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2018-03-23 08:06:39.692135] I [MSGID: 109028] [dht-rebalance.c:5141:gf_defrag_status_get] 0-patchy-dht: Rebalance is failed. Time taken is 0.00 secs
[2018-03-23 08:06:39.692144] I [MSGID: 109028] [dht-rebalance.c:5145:gf_defrag_status_get] 0-patchy-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0
[2018-03-23 08:06:39.692230] W [glusterfsd.c:1424:cleanup_and_exit] (-->/lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff7de87934] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x110) [0x12e00b6b0] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x74) [0x12e00b494] ) 0-: received signum (15), shutting down
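Because the log is long, a small filter helps narrow it down to the point where the brick connection drops. The following is a minimal sketch, assuming only the line layout visible in the log above (`[timestamp] LEVEL [optional MSGID] [file:line:function] message`); the names `parse`, `problems`, and `first_disconnect` are hypothetical helpers, not part of GlusterFS.

```python
import re

# Header of a log line as it appears in the rebalance log above:
# "[timestamp] LEVEL [MSGID: n] [file.c:line:function] message"
# The MSGID group is optional because some lines omit it.
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] "
    r"(?P<rest>.*)$"
)


def parse(lines):
    """Yield a dict for every line that matches the log-line header.

    Lines that do not match (for example the volfile graph dump)
    are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            yield match.groupdict()


def problems(lines):
    """Return the entries logged at warning (W) or error (E) level."""
    return [e for e in parse(lines) if e["level"] in ("W", "E")]


def first_disconnect(lines):
    """Return the first entry whose message contains 'disconnected from'.

    This matches the wording used in the log above; returns None if no
    such entry exists.
    """
    for entry in parse(lines):
        if "disconnected from" in entry["rest"]:
            return entry
    return None
```

Calling `first_disconnect(open("/var/log/glusterfs/patchy-rebalance.log"))` on the log above would point at the 08:06:39.689360 entry for `patchy-client-1`, which is where the failures start; everything after it reports `Transport endpoint is not connected`.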
mount volume glusterfs
edited Mar 29 at 5:03
asked Mar 23 at 8:15
rss
Lots of statements. What's the question?
– roaima
Mar 23 at 8:17
Sorry for the lots of logs. I am not able to figure out why the rebalance is not working.
– rss
Mar 23 at 8:21
Volume rebalance fails when the number of files in the bricks is more than 750. It is true, and strange. Any idea?
– rss
Mar 30 at 3:56