gluster rebalance failure

A gluster rebalance fails almost immediately after it is started, and one of the bricks also goes down once the rebalance has run. The output and logs are as follows:



$ gluster --mode=script --wignore volume status



Status of volume: patchy
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick myhost:/d/backends/patchy1            49152     0          Y       64813
Brick myhost:/d/backends/patchy2            49153     0          Y       64834

Task Status of Volume patchy
------------------------------------------------------------------------------
There are no active volume tasks


$ gluster --mode=script --wignore volume set patchy cluster.weighted-rebalance off



$ gluster --mode=script --wignore volume rebalance patchy start force



volume rebalance: patchy: success: Rebalance on patchy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 5c573761-314d-4294-99ba-c6a518675e26


$ gluster --mode=script --wignore volume rebalance patchy status



     Node  Rebalanced-files         size      scanned     failures      skipped        status  run time in h:m:s
---------       -----------  -----------  -----------  -----------  -----------  ------------     --------------
localhost                 0       0Bytes            0            3            0        failed            0:00:00
volume rebalance: patchy: success
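
Note that the trailing "success" line appears to report only that the status query itself worked; the per-node status column is what says "failed". As a minimal sketch, assuming the column layout shown above (node name first, status seventh), a small helper can list the nodes whose rebalance failed. The function name `failed_rebalance_nodes` is hypothetical, and other gluster versions may lay the table out differently.

```shell
# List nodes whose rebalance status is "failed".
# Reads `gluster volume rebalance <vol> status` output from a file ($1)
# or from standard input when no file is given.
failed_rebalance_nodes() {
    # Assumption: column 1 is the node name and column 7 is the status,
    # as in the table shown in the question. Header, separator and
    # summary lines do not have "failed" in column 7, so they are skipped.
    awk '$7 == "failed" { print $1 }' "$@"
}
```

For example: `gluster volume rebalance patchy status | failed_rebalance_nodes`.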


$ gluster --mode=script --wignore volume status



Status of volume: patchy
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick myhost:/d/backends/patchy1            49152     0          Y       64813
Brick myhost:/d/backends/patchy2            N/A       N/A        N       N/A

Task Status of Volume patchy
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 5c573761-314d-4294-99ba-c6a518675e26
Status               : failed
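
Since the patchy2 brick is reported offline here, its own log is probably the next place to look. The sketch below derives the brick log file name under the assumption that the default layout is in use: a `bricks` directory under the log directory, with the brick path's slashes replaced by dashes. The helper name `brick_log_path` is hypothetical, and the default log directory is taken from the rebalance log path shown in this question.

```shell
# Print the log file a brick process is expected to write to.
#   $1: brick directory, e.g. /d/backends/patchy2
#   $2: log directory (optional; defaults to /var/log/glusterfs, an assumption)
brick_log_path() {
    brick_dir=$1
    log_dir=${2:-/var/log/glusterfs}
    # Drop the leading "/" and turn the remaining "/" into "-".
    name=$(printf '%s' "${brick_dir#/}" | tr '/' '-')
    printf '%s/bricks/%s.log\n' "$log_dir" "$name"
}
```

For example: `tail -n 50 "$(brick_log_path /d/backends/patchy2)"`.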


The volume rebalance log is as follows:



$ cat /var/log/glusterfs/patchy-rebalance.log



[2018-03-23 08:06:34.303638] I [MSGID: 100030] [glusterfsd.c:2625:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 4.0.1 (args: /usr/local/sbin/glusterfs -s localhost --volfile-id rebalance/patchy --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off --xlator-option *dht.readdir-optimize=on --process-name rebalance --xlator-option *dht.rebalance-cmd=5 --xlator-option *dht.node-uuid=88559a30-a606-4af0-beb6-458cfafa8df6 --xlator-option *dht.commit-hash=3584404562 --socket-file /var/run/gluster/gluster-rebalance-2f34d12e-1e62-4737-8eec-b14b75ae3500.sock --pid-file /var/lib/glusterd/vols/patchy/rebalance/88559a30-a606-4af0-beb6-458cfafa8df6.pid -l /var/log/glusterfs/patchy-rebalance.log)
[2018-03-23 08:06:34.316882] I [MSGID: 101190] [event-epoll.c:609:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2018-03-23 08:06:39.304868] I [MSGID: 109104] [dht-shared.c:710:dht_init] 0-patchy-dht: dht_init using commit hash 3584404562
[2018-03-23 08:06:39.305817] I [MSGID: 101190] [event-epoll.c:609:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
[2018-03-23 08:06:39.307050] I [MSGID: 114020] [client.c:2300:notify] 0-patchy-client-0: parent translators are ready, attempting connect on transport
[2018-03-23 08:06:39.307634] I [MSGID: 114020] [client.c:2300:notify] 0-patchy-client-1: parent translators are ready, attempting connect on transport
Final graph:
+------------------------------------------------------------------------------+
1: volume patchy-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host myhost
5: option remote-subvolume /d/backends/patchy1
6: option transport-type socket
7: option transport.address-family inet
8: option username 288e00ca-26be-4a99-9e33-ea1b174ef347
9: option password 626b1311-2be8-4c6c-97f3-76a33c4a48e5
10: option transport.tcp-user-timeout 0
11: option transport.socket.keepalive-time 20
12: option transport.socket.keepalive-interval 2
13: option transport.socket.keepalive-count 9
14: end-volume
15:
16: volume patchy-client-1
17: type protocol/client
18: option ping-timeout 42
19: option remote-host myhost
20: option remote-subvolume /d/backends/patchy2
21: option transport-type socket
22: option transport.address-family inet
23: option username 288e00ca-26be-4a99-9e33-ea1b174ef347
24: option password 626b1311-2be8-4c6c-97f3-76a33c4a48e5
25: option transport.tcp-user-timeout 0
26: option transport.socket.keepalive-time 20
27: option transport.socket.keepalive-interval 2
28: option transport.socket.keepalive-count 9
29: end-volume
30:
31: volume patchy-dht
32: type cluster/distribute
33: option use-readdirp yes
34: option lookup-unhashed yes
35: option assert-no-child-down yes
36: option readdir-optimize on
37: option rebalance-cmd 5
38: option node-uuid 88559a30-a606-4af0-beb6-458cfafa8df6
39: option commit-hash 3584404562
40: option lock-migration off
41: option force-migration off
42: option weighted-rebalance off
43: subvolumes patchy-client-0 patchy-client-1
44: end-volume
45:
46: volume patchy
47: type debug/io-stats
48: option log-level INFO
49: option latency-measurement off
50: option count-fop-hits off
51: subvolumes patchy-dht
52: end-volume
53:
+------------------------------------------------------------------------------+
[2018-03-23 08:06:39.308435] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308538] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308607] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308838] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308867] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49153 (from 0)
[2018-03-23 08:06:39.309138] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309222] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-0: changing port to 49152 (from 0)
[2018-03-23 08:06:39.309264] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309531] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309747] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.310115] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy2'.
[2018-03-23 08:06:39.310264] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-0: Connected to patchy-client-0, attached to remote volume '/d/backends/patchy1'.
[2018-03-23 08:06:39.315151] I [MSGID: 109005] [dht-selfheal.c:2328:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-03-23 08:06:39.315302] I [dht-rebalance.c:4513:gf_defrag_start_crawl] 0-patchy-dht: gf_defrag_start_crawl using commit hash 3584404562
[2018-03-23 08:06:39.315689] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.317123] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317201] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317434] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-1
[2018-03-23 08:06:39.317455] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317462] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-0
[2018-03-23 08:06:39.317469] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317601] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-1,cnt = 6119424
[2018-03-23 08:06:39.317730] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-0,cnt = 3149824
[2018-03-23 08:06:39.317739] I [MSGID: 0] [dht-rebalance.c:4275:gf_defrag_total_file_size] 0-patchy-dht: Total size files = 9269248
[2018-03-23 08:06:39.317866] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-1,cnt = 1570
[2018-03-23 08:06:39.318020] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-0,cnt = 897
[2018-03-23 08:06:39.318029] I [MSGID: 0] [dht-rebalance.c:4311:gf_defrag_total_file_cnt] 0-patchy-dht: Total number of files = 1233
[2018-03-23 08:06:39.318148] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2018-03-23 08:06:39.318323] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2018-03-23 08:06:39.318360] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
[2018-03-23 08:06:39.318436] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful
[2018-03-23 08:06:39.377769] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /dir
[2018-03-23 08:06:39.378756] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /dir
[2018-03-23 08:06:39.689322] W [socket.c:592:__socket_rwv] 0-patchy-client-1: readv on 0.0.0.0:49153 failed (No data available)
[2018-03-23 08:06:39.689360] I [MSGID: 114018] [client.c:2227:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-23 08:06:39.689381] W [MSGID: 109073] [dht-common.c:10557:dht_notify] 0-patchy-dht: Received CHILD_DOWN. Exiting
[2018-03-23 08:06:39.689391] I [MSGID: 109029] [dht-rebalance.c:5327:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-03-23 08:06:39.689573] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15a)[0x3ff7e1294a2] (--> /usr/local/lib/libgfrpc.so.0(+0xdb1e)[0x3ff7e08db1e] (--> /usr/local/lib/libgfrpc.so.0(+0xdc8c)[0x3ff7e08dc8c] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x98)[0x3ff7e08f408] (--> /usr/local/lib/libgfrpc.so.0(+0xfffc)[0x3ff7e08fffc] ))))) 0-patchy-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(READDIRP(40)) called at 2018-03-23 08:06:39.379300 (xid=0x22)
[2018-03-23 08:06:39.689588] W [MSGID: 114031] [client-rpc-fops_v2.c:2264:client4_0_readdirp_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.689635] W [MSGID: 109021] [dht-rebalance.c:3106:gf_defrag_get_entry] 0-patchy-dht: Readdirp failed. Aborting data migration for directory: /dir [Transport endpoint is not connected]
[2018-03-23 08:06:39.689655] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.689714] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /dir
[2018-03-23 08:06:39.690982] W [MSGID: 114061] [client-common.c:3375:client_pre_readdirp_v2] 0-patchy-client-1: (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-03-23 08:06:39.691056] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.691419] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.691436] E [MSGID: 109119] [dht-lock.c:1051:dht_blocking_inodelk_cbk] 0-patchy-dht: inodelk failed on subvol patchy-client-1, gfid:00000000-0000-0000-0000-000000000001 [Transport endpoint is not connected]
[2018-03-23 08:06:39.691623] E [MSGID: 109016] [dht-rebalance.c:3934:gf_defrag_fix_layout] 0-patchy-dht: Setxattr failed for / [Transport endpoint is not connected]
[2018-03-23 08:06:39.691636] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /
[2018-03-23 08:06:39.691653] E [MSGID: 114031] [client-rpc-fops_v2.c:2451:client4_0_opendir_cbk] 0-patchy-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2018-03-23 08:06:39.691812] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.691844] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /
[2018-03-23 08:06:39.691862] I [dht-rebalance.c:4716:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2018-03-23 08:06:39.692135] I [MSGID: 109028] [dht-rebalance.c:5141:gf_defrag_status_get] 0-patchy-dht: Rebalance is failed. Time taken is 0.00 secs
[2018-03-23 08:06:39.692144] I [MSGID: 109028] [dht-rebalance.c:5145:gf_defrag_status_get] 0-patchy-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0
[2018-03-23 08:06:39.692230] W [glusterfsd.c:1424:cleanup_and_exit] (-->/lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff7de87934] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x110) [0x12e00b6b0] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x74) [0x12e00b494] ) 0-: received signum (15), shutting down
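
Most of the log above is informational. To narrow it down, the following is a minimal sketch that keeps only the warning and error entries, assuming the entry format seen in this log (timestamp in brackets, then a one-letter severity). The function name `gluster_log_problems` is hypothetical.

```shell
# Print only warning (W) and error (E) entries from a GlusterFS log.
# Reads the file given as $1, or standard input when no file is given.
gluster_log_problems() {
    # Each entry looks like: [date time] LEVEL [MSGID: n] [file:line:func] ...
    # so the severity letter is the third whitespace-separated field.
    # Lines of the "Final graph" dump never have E or W there.
    awk '$3 == "E" || $3 == "W"' "$@"
}
```

For example: `gluster_log_problems /var/log/glusterfs/patchy-rebalance.log | head -n 20`. In the log above, the first entry after the rebalance threads start is the `readv ... failed` warning for patchy-client-1, which is followed by `Received CHILD_DOWN`; that ordering suggests the brick went away first and the rebalance aborted as a consequence.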






  • Lots of statements. What's the question?
    – roaima
    Mar 23 at 8:17










  • Sorry for all the logs. I am not able to figure out why the rebalance is not working.
    – rss
    Mar 23 at 8:21










  • The volume rebalance fails when the number of files in the bricks is more than 750. Strange, but true. Any idea?
    – rss
    Mar 30 at 3:56
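
To check the file-count observation against each brick, one option is to count the regular files stored directly on the brick. This is only a sketch: the helper name `count_brick_files` is hypothetical, and it assumes the brick keeps its internal metadata in a `.glusterfs` directory at the brick root, which is skipped so that only user-visible files are counted.

```shell
# Count regular files under a brick directory, ignoring the internal
# .glusterfs tree at the brick root.
#   $1: brick directory, e.g. /d/backends/patchy1
count_brick_files() {
    root=${1%/}   # tolerate a trailing slash
    # Prune <root>/.glusterfs, then print every remaining regular file.
    find "$root" -path "$root/.glusterfs" -prune -o -type f -print \
        | wc -l | tr -d ' '
}
```

For example: `count_brick_files /d/backends/patchy1` and `count_brick_files /d/backends/patchy2`, run before starting the rebalance.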















36: option readdir-optimize on
37: option rebalance-cmd 5
38: option node-uuid 88559a30-a606-4af0-beb6-458cfafa8df6
39: option commit-hash 3584404562
40: option lock-migration off
41: option force-migration off
42: option weighted-rebalance off
43: subvolumes patchy-client-0 patchy-client-1
44: end-volume
45:
46: volume patchy
47: type debug/io-stats
48: option log-level INFO
49: option latency-measurement off
50: option count-fop-hits off
51: subvolumes patchy-dht
52: end-volume
53:
+------------------------------------------------------------------------------+
[2018-03-23 08:06:39.308435] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308538] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308607] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308838] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.308867] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-1: changing port to 49153 (from 0)
[2018-03-23 08:06:39.309138] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309222] I [rpc-clnt.c:2071:rpc_clnt_reconfig] 0-patchy-client-0: changing port to 49152 (from 0)
[2018-03-23 08:06:39.309264] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-1: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309531] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.309747] W [rpc-clnt.c:1739:rpc_clnt_submit] 0-patchy-client-0: error returned while attempting to connect to host:(null), port:0
[2018-03-23 08:06:39.310115] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-1: Connected to patchy-client-1, attached to remote volume '/d/backends/patchy2'.
[2018-03-23 08:06:39.310264] I [MSGID: 114046] [client-handshake.c:1176:client_setvolume_cbk] 0-patchy-client-0: Connected to patchy-client-0, attached to remote volume '/d/backends/patchy1'.
[2018-03-23 08:06:39.315151] I [MSGID: 109005] [dht-selfheal.c:2328:dht_selfheal_directory] 0-patchy-dht: Directory selfheal failed: Unable to form layout for directory /
[2018-03-23 08:06:39.315302] I [dht-rebalance.c:4513:gf_defrag_start_crawl] 0-patchy-dht: gf_defrag_start_crawl using commit hash 3584404562
[2018-03-23 08:06:39.315689] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.317123] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317201] E [MSGID: 109039] [dht-common.c:4057:dht_find_local_subvol_cbk] 0-patchy-dht: getxattr err for dir [No data available]
[2018-03-23 08:06:39.317434] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-1
[2018-03-23 08:06:39.317455] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317462] I [MSGID: 0] [dht-rebalance.c:4585:gf_defrag_start_crawl] 0-patchy-dht: local subvols are patchy-client-0
[2018-03-23 08:06:39.317469] I [MSGID: 0] [dht-rebalance.c:4591:gf_defrag_start_crawl] 0-patchy-dht: node uuids are 88559a30-a606-4af0-beb6-458cfafa8df6
[2018-03-23 08:06:39.317601] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-1,cnt = 6119424
[2018-03-23 08:06:39.317730] I [MSGID: 0] [dht-rebalance.c:4271:gf_defrag_total_file_size] 0-patchy-dht: local subvol: patchy-client-0,cnt = 3149824
[2018-03-23 08:06:39.317739] I [MSGID: 0] [dht-rebalance.c:4275:gf_defrag_total_file_size] 0-patchy-dht: Total size files = 9269248
[2018-03-23 08:06:39.317866] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-1,cnt = 1570
[2018-03-23 08:06:39.318020] I [MSGID: 0] [dht-rebalance.c:4300:gf_defrag_total_file_cnt] 0-patchy-dht: local subvol: patchy-client-0,cnt = 897
[2018-03-23 08:06:39.318029] I [MSGID: 0] [dht-rebalance.c:4311:gf_defrag_total_file_cnt] 0-patchy-dht: Total number of files = 1233
[2018-03-23 08:06:39.318148] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[0] creation successful
[2018-03-23 08:06:39.318323] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[1] creation successful
[2018-03-23 08:06:39.318360] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[2] creation successful
[2018-03-23 08:06:39.318436] I [dht-rebalance.c:4667:gf_defrag_start_crawl] 0-DHT: Thread[3] creation successful
[2018-03-23 08:06:39.377769] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /dir
[2018-03-23 08:06:39.378756] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /dir
[2018-03-23 08:06:39.689322] W [socket.c:592:__socket_rwv] 0-patchy-client-1: readv on 0.0.0.0:49153 failed (No data available)
[2018-03-23 08:06:39.689360] I [MSGID: 114018] [client.c:2227:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-23 08:06:39.689381] W [MSGID: 109073] [dht-common.c:10557:dht_notify] 0-patchy-dht: Received CHILD_DOWN. Exiting
[2018-03-23 08:06:39.689391] I [MSGID: 109029] [dht-rebalance.c:5327:gf_defrag_stop] 0-: Received stop command on rebalance
[2018-03-23 08:06:39.689573] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x15a)[0x3ff7e1294a2] (--> /usr/local/lib/libgfrpc.so.0(+0xdb1e)[0x3ff7e08db1e] (--> /usr/local/lib/libgfrpc.so.0(+0xdc8c)[0x3ff7e08dc8c] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x98)[0x3ff7e08f408] (--> /usr/local/lib/libgfrpc.so.0(+0xfffc)[0x3ff7e08fffc] ))))) 0-patchy-client-1: forced unwinding frame type(GlusterFS 4.x v1) op(READDIRP(40)) called at 2018-03-23 08:06:39.379300 (xid=0x22)
[2018-03-23 08:06:39.689588] W [MSGID: 114031] [client-rpc-fops_v2.c:2264:client4_0_readdirp_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.689635] W [MSGID: 109021] [dht-rebalance.c:3106:gf_defrag_get_entry] 0-patchy-dht: Readdirp failed. Aborting data migration for directory: /dir [Transport endpoint is not connected]
[2018-03-23 08:06:39.689655] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.689714] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /dir
[2018-03-23 08:06:39.690982] W [MSGID: 114061] [client-common.c:3375:client_pre_readdirp_v2] 0-patchy-client-1: (00000000-0000-0000-0000-000000000001) remote_fd is -1. EBADFD [File descriptor in bad state]
[2018-03-23 08:06:39.691056] I [MSGID: 109081] [dht-common.c:5602:dht_setxattr] 0-patchy-dht: fixing the layout of /
[2018-03-23 08:06:39.691419] E [MSGID: 114031] [client-rpc-fops_v2.c:1306:client4_0_inodelk_cbk] 0-patchy-client-1: remote operation failed [Transport endpoint is not connected]
[2018-03-23 08:06:39.691436] E [MSGID: 109119] [dht-lock.c:1051:dht_blocking_inodelk_cbk] 0-patchy-dht: inodelk failed on subvol patchy-client-1, gfid:00000000-0000-0000-0000-000000000001 [Transport endpoint is not connected]
[2018-03-23 08:06:39.691623] E [MSGID: 109016] [dht-rebalance.c:3934:gf_defrag_fix_layout] 0-patchy-dht: Setxattr failed for / [Transport endpoint is not connected]
[2018-03-23 08:06:39.691636] I [dht-rebalance.c:3274:gf_defrag_process_dir] 0-patchy-dht: migrate data called on /
[2018-03-23 08:06:39.691653] E [MSGID: 114031] [client-rpc-fops_v2.c:2451:client4_0_opendir_cbk] 0-patchy-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
[2018-03-23 08:06:39.691812] W [dht-rebalance.c:3448:gf_defrag_process_dir] 0-patchy-dht: Found error from gf_defrag_get_entry
[2018-03-23 08:06:39.691844] E [MSGID: 109111] [dht-rebalance.c:3962:gf_defrag_fix_layout] 0-patchy-dht: gf_defrag_process_dir failed for directory: /
[2018-03-23 08:06:39.691862] I [dht-rebalance.c:4716:gf_defrag_start_crawl] 0-DHT: crawling file-system completed
[2018-03-23 08:06:39.692135] I [MSGID: 109028] [dht-rebalance.c:5141:gf_defrag_status_get] 0-patchy-dht: Rebalance is failed. Time taken is 0.00 secs
[2018-03-23 08:06:39.692144] I [MSGID: 109028] [dht-rebalance.c:5145:gf_defrag_status_get] 0-patchy-dht: Files migrated: 0, size: 0, lookups: 0, failures: 3, skipped: 0
[2018-03-23 08:06:39.692230] W [glusterfsd.c:1424:cleanup_and_exit] (-->/lib/s390x-linux-gnu/libpthread.so.0(+0x7934) [0x3ff7de87934] -->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0x110) [0x12e00b6b0] -->/usr/local/sbin/glusterfs(cleanup_and_exit+0x74) [0x12e00b494] ) 0-: received signum (15), shutting down
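Most of the log above is informational. To see only the warnings and errors (the brick disconnect and what follows it), a small filter can help. This is a minimal sketch that assumes the line format shown above, a bracketed timestamp followed by a one-letter severity; `filter_gluster_log` is a hypothetical helper name, not a gluster command.

```shell
# Hypothetical helper: print only warning (W) and error (E) lines
# from a log in the format shown above.
# Usage: filter_gluster_log /var/log/glusterfs/patchy-rebalance.log
#        (reads standard input when no file is given)
filter_gluster_log() {
    # Lines look like: [2018-03-23 08:06:39.689381] W [MSGID: 109073] ...
    grep -E '^\[[0-9-]+ [0-9:.]+\] [EW] ' "$@"
}
```

Lines that do not start with a timestamp, such as the "Final graph" dump, are dropped as well.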






edited Mar 29 at 5:03

























asked Mar 23 at 8:15









rss

  • Lots of statements. What's the question?
    – roaima
    Mar 23 at 8:17










  • Sorry for the volume of logs. I am not able to figure out why the rebalance is not working.
    – rss
    Mar 23 at 8:21










  • Volume rebalance fails when the number of files in the bricks is more than 750. Strange, but true. Any idea?
    – rss
    Mar 30 at 3:56
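To check the file-count observation, the files on each brick can be counted directly. This is a rough sketch, assuming the brick directories from the status output (/d/backends/patchy1 and /d/backends/patchy2) and skipping the brick's internal .glusterfs directory; `count_brick_files` is a hypothetical helper name.

```shell
# Hypothetical helper: count regular files on a brick, skipping
# GlusterFS's internal .glusterfs tree.
# Usage: count_brick_files /d/backends/patchy1
count_brick_files() {
    brick=${1%/}   # drop a trailing slash so the -path match below lines up
    find "$brick" -path "$brick/.glusterfs" -prune -o -type f -print \
        | wc -l | tr -d ' '
}
```

Running it on both bricks before starting the rebalance would show whether the failure really tracks the number of files.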
















