05 Ceph Experiments
Each of the experiments that follow uses one of the environments below. Rather than repeating the details every time, the environments are referred to as env-X.
env-1: a single-node cluster, with a mon deployed on the node and three 2T disks attached:
[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom
[root@ceph-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@ceph-1 ~]# ceph -v
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
[root@ceph-1 cluster]# cat /etc/ceph/ceph.conf
[global]
fsid = 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
public_network = 192.168.57.0/24
mon_initial_members = ceph-1
mon_host = 192.168.57.241
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_crush_chooseleaf_type = 0 # from OSD
env-2: a three-node cluster; each node runs a Mon and has three 2T disks attached. The three nodes are named ceph-1, ceph-2, and ceph-3:
[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom
[root@ceph-1 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@ceph-1 ~]# ceph -v
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
[root@ceph-1 ~]# cat /etc/ceph/ceph.conf
[global]
fsid = 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
public_network = 192.168.57.0/24
mon_initial_members = ceph-1, ceph-2, ceph-3
mon_host = 192.168.57.241,192.168.57.242,192.168.57.243
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
In an earlier setup the OSD journal was symlinked to a 5G file on an SSD, which ignored SSD partition alignment. If SSD partitions are not properly aligned, read/write performance drops considerably, so this experiment shows how to align them correctly.
env-1
- Assume /dev/sdb is the SSD, and /dev/sdc and /dev/sdd are SATA disks.
- Assume the journal uses 20G of space, which requires adding osd_journal_size = 20480 to ceph.conf (see the sketch after this list). /dev/sdb1 will serve as the journal for /dev/sdc, and /dev/sdb2 as the journal for /dev/sdd.
- Partition /dev/sdb so that the first two partitions are 20G each.
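A minimal sketch of that setting, assuming it is added to the [global] section of /etc/ceph/ceph.conf before the OSDs are prepared:

```
[global]
# journal size is given in MB: 20480 MB = 20 GB
osd_journal_size = 20480
```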
[root@ceph-1 ~]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):m
命令操作
d delete a partition
g create a new empty GPT partition table
G create an IRIX (SGI) partition table
l list known partition types
m print this menu
n add a new partition
o create a new empty DOS partition table
q quit without saving changes
s create a new empty Sun disklabel
w write table to disk and exit
命令(输入 m 获取帮助):g
Building a new GPT disklabel (GUID: 9A5D6B86-5C14-462E-8350-3B95BDDF312F)
命令(输入 m 获取帮助):n
分区号 (1-128,默认 1):1
第一个扇区 (2048-4294965214,默认 2048):2048
Last sector, +sectors or +size{K,M,G,T,P} (2048-4294965214,默认 4294965214):+20G
已创建分区 1
命令(输入 m 获取帮助):w
The partition table has been altered!
Calling ioctl() to re-read partition table.
正在同步磁盘。
[root@ceph-1 ~]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):n
分区号 (2-128,默认 2):2
第一个扇区 (41945088-4294965214,默认 41945088):
Last sector, +sectors or +size{K,M,G,T,P} (41945088-4294965214,默认 4294965214):+20G
已创建分区 2
命令(输入 m 获取帮助):w
The partition table has been altered!
Calling ioctl() to re-read partition table.
正在同步磁盘。
Check the new partition table:
[root@ceph-1 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 128G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 127.5G 0 part
├─centos-root 253:0 0 50G 0 lvm /
├─centos-swap 253:1 0 2G 0 lvm [SWAP]
└─centos-home 253:2 0 75.5G 0 lvm /home
sdb 8:16 0 2T 0 disk
├─sdb1 8:17 0 20G 0 part
└─sdb2 8:18 0 20G 0 part
sdc 8:32 0 2T 0 disk
sdd 8:48 0 2T 0 disk
sr0 11:0 1 1024M 0 rom
Create the new OSDs, pointing their journals at the sdb partitions:
[root@ceph-1 cluster]# ceph --show-config|grep osd_journal_size
osd_journal_size = 20480
[root@ceph-1 cluster]# ceph-deploy osd prepare ceph-1:/dev/sdc:/dev/sdb1 ceph-1:/dev/sdd:/dev/sdb2 --zap-disk
[root@ceph-1 cluster]# ceph-deploy osd activate ceph-1:/dev/sdc1 ceph-1:/dev/sdd1
Check that the journal link is in effect:
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/
总用量 40
-rw-r--r-- 1 root root 192 8月 4 15:07 activate.monmap
-rw-r--r-- 1 root root 3 8月 4 15:07 active
-rw-r--r-- 1 root root 37 8月 4 15:06 ceph_fsid
drwxr-xr-x 36 root root 565 8月 4 15:08 current
-rw-r--r-- 1 root root 37 8月 4 15:06 fsid
lrwxrwxrwx 1 root root 9 8月 4 15:06 journal -> /dev/sdb1
-rw------- 1 root root 56 8月 4 15:07 keyring
-rw-r--r-- 1 root root 21 8月 4 15:06 magic
-rw-r--r-- 1 root root 6 8月 4 15:07 ready
-rw-r--r-- 1 root root 4 8月 4 15:07 store_version
-rw-r--r-- 1 root root 53 8月 4 15:07 superblock
-rw-r--r-- 1 root root 0 8月 4 15:07 sysvinit
-rw-r--r-- 1 root root 2 8月 4 15:07 whoami
Now create a misaligned partition by choosing a first sector one higher than the default:
[root@ceph-1 cluster]# fdisk /dev/sdb
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
欢迎使用 fdisk (util-linux 2.23.2)。
更改将停留在内存中,直到您决定将更改写入磁盘。
使用写入命令前请三思。
命令(输入 m 获取帮助):n
分区号 (3-128,默认 3):
第一个扇区 (83888128-4294965214,默认 83888128):83888129
Last sector, +sectors or +size{K,M,G,T,P} (83888129-4294965214,默认 4294965214):+20G
已创建分区 3
命令(输入 m 获取帮助):w
The partition table has been altered!
Sometimes the following warning appears; running partprobe tells the kernel to reload the partition table:
WARNING: Re-reading the partition table failed with error 16: 设备或资源忙.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
[root@ceph-1 ~]# partprobe
Check whether /dev/sdb[1,2,3] are aligned:
[root@ceph-1 cluster]# parted /dev/sdb
GNU Parted 3.1
使用 /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) align-check optimal 1
1 aligned
(parted) align-check optimal 2
2 aligned
(parted) align-check optimal 3
3 not aligned
As shown, /dev/sdb3 is not aligned. The only fix I know of so far is to delete it and partition again.
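A possible way to rebuild the partition with correct alignment, assuming the gdisk package (which provides sgdisk) is installed; sgdisk aligns new partitions to 2048-sector (1 MiB) boundaries by default:

```
sgdisk -d 3 /dev/sdb          # delete the misaligned partition 3
sgdisk -n 3:0:+20G /dev/sdb   # recreate it; a start of 0 means "use the default, aligned, start sector"
partprobe                     # ask the kernel to re-read the partition table
parted /dev/sdb align-check optimal 3
```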
After partitioning an SSD, always verify that the partitions are aligned; otherwise the SSD's read/write speed will suffer.
In everyday operation the SSD used for journals can fail. After swapping in a new SSD, the existing journals have to be re-pointed at the new disk; the same steps apply when simply replacing the journal disk.
env-1 & exp-1
- Delete the misaligned /dev/sdb3 from exp-1 and create two new partitions, /dev/sdb3 and /dev/sdb4.
The goal is to change osd.0's journal to /dev/sdb3 and osd.1's journal to /dev/sdb4 (osd.1 is handled analogously; a condensed sketch appears after the osd.0 walk-through). The journal disks are currently used as follows:
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/journal
lrwxrwxrwx 1 root root 9 8月 4 15:06 /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-1/journal
lrwxrwxrwx 1 root root 9 8月 4 15:07 /var/lib/ceph/osd/ceph-1/journal -> /dev/sdb2
Set the noout flag to keep the cluster from starting data recovery:
[root@ceph-1 cluster]# ceph osd set noout
set noout
[root@ceph-1 cluster]# ceph -s
cluster 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
health HEALTH_WARN
noout flag(s) set
monmap e1: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 2, quorum 0 ceph-1
osdmap e10: 2 osds: 2 up, 2 in
flags noout
pgmap v13: 64 pgs, 1 pools, 0 bytes data, 0 objects
68152 kB used, 4093 GB / 4093 GB avail
64 active+clean
Stop the OSD process:
[root@ceph-1 cluster]# service ceph stop osd.0
Flush the journal down into the OSD:
[root@ceph-1 cluster]# ceph-osd -i 0 --flush-journal
2016-08-04 16:27:07.531321 7ffbc0876880 -1 flushed journal /var/lib/ceph/osd/ceph-0/journal for object store /var/lib/ceph/osd/ceph-0
A link of the form /var/lib/ceph/osd/ceph-0/journal -> /dev/sdb1 carries some risk: if the disk is unplugged and re-inserted (unlikely, but possible), the device name can change, while the partition uuid is unique and survives re-plugging. It is therefore better to link the journal to the uuid.
Look up the uuid of /dev/sdb3:
[root@ceph-1 cluster]# ll /dev/disk/by-partuuid/ |grep sdb3
lrwxrwxrwx 1 root root 10 8月 4 14:34 4472e58f-37ae-4277-bd24-1cc759cd5a51 -> ../../sdb3
Remove the old journal link and point a new link at the original location:
[root@ceph-1 cluster]# rm -rf /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# ln -s /dev/disk/by-partuuid/4472e58f-37ae-4277-bd24-1cc759cd5a51 /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# chown ceph:ceph /var/lib/ceph/osd/ceph-0/journal
[root@ceph-1 cluster]# echo 4472e58f-37ae-4277-bd24-1cc759cd5a51 > /var/lib/ceph/osd/ceph-0/journal_uuid
With the link in place, create the journal and start the OSD:
[root@ceph-1 cluster]# ceph-osd -i 0 --mkjournal
2016-08-04 16:55:06.381554 7f07012aa880 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected fae5d972-cb4c-46e1-a620-9407002556ba, invalid (someone else's?) journal
2016-08-04 16:55:06.387479 7f07012aa880 -1 created new journal /var/lib/ceph/osd/ceph-0/journal for object store /var/lib/ceph/osd/ceph-0
[root@ceph-1 cluster]# service ceph start osd.0
=== osd.0 ===
=== osd.0 ===
create-or-move updated item name 'osd.0' weight 2 at location {host=ceph-1,root=default} to crush map
Starting Ceph osd.0 on ceph-1...
Running as unit ceph-osd.0.1470301052.328264448.service.
Clear the noout flag and check the journal state:
[root@ceph-1 cluster]# ceph osd unset noout
unset noout
[root@ceph-1 cluster]# ll /var/lib/ceph/osd/ceph-0/
总用量 44
-rw-r--r-- 1 root root 192 8月 4 16:24 activate.monmap
-rw-r--r-- 1 root root 3 8月 4 16:24 active
-rw-r--r-- 1 root root 37 8月 4 16:23 ceph_fsid
drwxr-xr-x 36 root root 565 8月 4 16:24 current
-rw-r--r-- 1 root root 37 8月 4 16:23 fsid
lrwxrwxrwx 1 root root 58 8月 4 16:53 journal -> /dev/disk/by-partuuid/4472e58f-37ae-4277-bd24-1cc759cd5a51
-rw-r--r-- 1 root root 37 8月 4 16:54 journal_uuid
..
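osd.1 is handled the same way with /dev/sdb4; a condensed sketch, where <sdb4-partuuid> stands for the partuuid looked up under /dev/disk/by-partuuid/ as above:

```
service ceph stop osd.1
ceph-osd -i 1 --flush-journal
rm -f /var/lib/ceph/osd/ceph-1/journal
ln -s /dev/disk/by-partuuid/<sdb4-partuuid> /var/lib/ceph/osd/ceph-1/journal
echo <sdb4-partuuid> > /var/lib/ceph/osd/ceph-1/journal_uuid
ceph-osd -i 1 --mkjournal
service ceph start osd.1
```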
As described above, linking by device name is not safe. Linking by uuid is recommended: even if the disk is unplugged and re-inserted, the journal stays resolvable and the OSD can still start.
Prerequisite: the files inside store.db under the monitor directory are neither corrupted nor deleted.
Scenario: every key in the cluster (mon.keyring, client.admin.keyring, bootstrap-[osd/mds/rgw].keyring) has been lost, and we still need to get into the cluster and operate on it.
A ceph client accesses the cluster through rados. When we run ceph -s, several defaults are filled in behind the scenes; the command is effectively ceph --conf=/etc/ceph/ceph.conf --name=client.admin --keyring=/etc/ceph/ceph.client.admin.keyring -s, i.e. the client.admin user and its keyring are used to authenticate against rados. At this point that keyring has been lost and cannot be fetched from the cluster. The most privileged user, mon., keeps its keyring at /var/lib/ceph/mon/ceph-ceph-1/keyring; even if that keyring is lost as well, we can still get into the cluster by creating a new mon. keyring.
env-1
If there are multiple monitors and the remaining one is stuck in the probing state, refer to the other experiment (exp-4): modify the monmap so that the cluster believes there is only one mon.
Create a keyring (the key content can be anything) and save it under the monitor directory:
[root@ceph-1 ceph-ceph-1]# cat /var/lib/ceph/mon/ceph-ceph-1/keyring
[mon.]
key = AQDPraFXlH/4MBAA0ozKh9l9jKKp/5ofE/Xjsw==
caps mon = "allow *"
Start the monitor:
[root@ceph-1 ceph-ceph-1]# service ceph start mon
=== mon.ceph-1 ===
Starting Ceph mon.ceph-1 on ceph-1...
Running as unit ceph-mon.ceph-1.1470305528.606084264.service.
Starting ceph-create-keys on ceph-1...
There are two ways to recover all of the keyrings:
- Disable cephx authentication and simply run ceph auth list (see the sketch after this list).
- For a deeper understanding of auth, access the cluster as the mon. user; this method is described below.
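A sketch of the first approach, assuming cephx is temporarily switched off in /etc/ceph/ceph.conf and turned back on afterwards:

```
# in the [global] section of /etc/ceph/ceph.conf set, temporarily:
#   auth_cluster_required = none
#   auth_service_required = none
#   auth_client_required = none
service ceph restart mon
ceph auth list        # all keyrings are printed without authenticating
# restore the three cephx settings and restart the monitor again afterwards
```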
At this point the only user we know of in the cluster is mon., together with its keyring. Use these two parameters to access the cluster:
[root@ceph-1 ceph-ceph-1]# ceph auth list --name=mon. --keyring=/var/lib/ceph/mon/ceph-ceph-1/keyring --conf=/etc/ceph/ceph.conf
installed auth entries:
osd.0
key: AQC5+6JXR07eCBAASy1ZFWT0LT5/gguBGqyjVw==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.1
key: AQDC+6JXL7uGCBAArNnHZCuBb3FsxlIGJzwdwg==
caps: [mon] allow profile osd
caps: [osd] allow *
client.admin
key: AQBx+6JXp5KhCBAAb77YZqcnK23qx3p8J9OD6A==
caps: [mds] allow
caps: [mon] allow *
caps: [osd] allow *
client.bootstrap-mds
key: AQBx+6JXrpSHIRAA8wMo9pDF+jGCkWtN6z4znA==
caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
key: AQBx+6JX2a+PFBAAbpCCWi5vtA0UNgdWEv9PPg==
caps: [mon] allow profile bootstrap-osd
client.bootstrap-rgw
key: AQB0+6JX2/fqHRAArBHMhgW0exYiv1oAp4yEQg==
caps: [mon] allow profile bootstrap-rgw
This shows the client.admin keyring, which can be copied out by hand or exported with the command below. get-or-create is used here: it creates the admin user if it does not exist, and otherwise just fetches the existing keyring:
[root@ceph-1 ceph-ceph-1]# ceph --cluster=ceph --name=mon. --keyring=/var/lib/ceph/mon/ceph-ceph-1/keyring auth get-or-create client.admin mon 'allow *' osd 'allow *' mds 'allow' -o /etc/ceph/ceph.client.admin.keyring
On jewel, mds 'allow *' may be required.
With ceph.client.admin.keyring in hand, you can do whatever you like in the cluster.
An additional test:
Copy ceph.conf and /var/lib/ceph/mon/ceph-ceph-1/keyring to another node, for example to ceph-2:/root/ceph-1/; the following command then accesses ceph-1's cluster:
[root@ceph-2 ceph-1]# ceph --conf=/root/ceph-1/ceph.conf --name=mon. --keyring=/root/ceph-1/keyring -s
cluster 99fcd5bc-f4ec-4419-88b5-0a1921b90e77
health HEALTH_WARN
64 pgs stale
64 pgs stuck stale
2/2 in osds are down
monmap e1: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 1, quorum 0 ceph-1
osdmap e15: 2 osds: 0 up, 2 in
pgmap v23: 64 pgs, 1 pools, 0 bytes data, 0 objects
67568 kB used, 4093 GB / 4093 GB avail
64 stale+active+clean
[root@ceph-2 ceph-1]# ls
ceph.conf keyring
The point of this experiment: the mon. user, whose keyring we can define ourselves, lets us start the monitor, access the cluster, and perform any operation. It follows that every keyring in the cluster is dispensable (it can always be regenerated), and since the mon. keyring itself is not encrypted or otherwise protected, any keyring can in practice be created, deleted, modified, or read.
Suppose a failure where, in the worst case, two of the three monitors are destroyed and their disks cannot be recovered. The single remaining monitor cannot start on its own; it stays stuck in the probing state waiting for at least one other monitor to come online. This experiment simulates starting the cluster with the sole surviving mon, assuming of course that this node's monitor store.db is undamaged.
The advice you usually hear is to modify the monmap, removing the two broken mons so that the cluster believes there is only one. But modifying the monmap means exporting it first, and with the cluster down the export command cannot run. This experiment instead creates a brand-new monmap, injects it into the cluster, and then accesses the cluster.
env-2
Before the experiment the cluster has three mons and three OSDs, in the following state:
[root@ceph-1 cluster]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e1: 3 mons at {ceph-1=192.168.57.241:6789/0,ceph-2=192.168.57.242:6789/0,ceph-3=192.168.57.243:6789/0}
election epoch 28, quorum 0,1,2 ceph-1,ceph-2,ceph-3
osdmap e13: 3 osds: 3 up, 3 in
pgmap v19: 64 pgs, 1 pools, 0 bytes data, 0 objects
100 MB used, 6125 GB / 6126 GB avail
64 active+clean
Now shut down the two MONs on ceph-2 and ceph-3. Watching /var/log/ceph/ceph-mon.ceph-1.log shows the remaining mon stuck in the probing state, and ceph commands no longer complete:
[root@ceph-1 ~]# tail -f /var/log/ceph/ceph-mon.ceph-1.log
2016-08-05 13:20:35.491102 7f22a7e94700 0 -- 192.168.57.241:6789/0 >> 192.168.57.243:6789/0 pipe(0x4d4e000 sd=22 :6789 s=1 pgs=28 cs=1 l=0 c=0x46cb080).fault
2016-08-05 13:20:45.490826 7f22aa99c700 1 mon.ceph-1@0(leader).paxos(paxos active c 1..162) lease_ack_timeout -- calling new election
2016-08-05 13:21:16.856613 7f22aa99c700 0 mon.ceph-1@0(probing).data_health(30) update_stats avail 95% total 51175 MB, used 2295 MB, avail 48879 MB
2016-08-05 13:22:16.857007 7f22aa99c700 0 mon.ceph-1@0(probing).data_health(30) update_stats avail 95% total 51175 MB, used 2295 MB, avail 48879 MB
[root@ceph-1 cluster]# ceph mon getmap -o map
2016-08-05 13:25:13.135617 7f67f832d700 0 -- :/2409307395 >> 192.168.57.243:6789/0 pipe(0x7f67f4062550 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67f405b450).fault
2016-08-05 13:25:16.136372 7f67f822c700 0 -- :/2409307395 >> 192.168.57.242:6789/0 pipe(0x7f67e8000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67e8004ef0).fault
Now stop the monitor on ceph-1 and create a new monmap; this needs the cluster's fsid. Print the newly created monmap to check it. The epoch value is nothing to worry about: the code sets it to 0 and it has no effect here:
[root@ceph-1 cluster]# monmaptool --create --fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc --add ceph-1 192.168.57.241 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: set fsid to 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
[root@ceph-1 cluster]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 0
fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
last_changed 2016-08-05 13:30:40.717762
created 2016-08-05 13:30:40.717762
0: 192.168.57.241:6789/0 mon.ceph-1
Inject it into the cluster:
ceph-mon -i ceph-1 --inject-monmap /tmp/monmap
Edit ceph.conf and remove the ceph-2 and ceph-3 hosts and their corresponding IPs.
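The resulting [global] section might look like this (a sketch; only the monitor entries change):

```
[global]
fsid = 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
public_network = 192.168.57.0/24
mon_initial_members = ceph-1
mon_host = 192.168.57.241
```

Then start the monitor on ceph-1: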
[root@ceph-1 cluster]# service ceph start mon
=== mon.ceph-1 ===
Starting Ceph mon.ceph-1 on ceph-1...
Running as unit ceph-mon.ceph-1.1470375746.702089733.service.
Starting ceph-create-keys on ceph-1...
[root@ceph-1 cluster]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e2: 1 mons at {ceph-1=192.168.57.241:6789/0}
election epoch 1, quorum 0 ceph-1
osdmap e17: 3 osds: 3 up, 3 in
pgmap v32: 64 pgs, 1 pools, 0 bytes data, 0 objects
100 MB used, 6125 GB / 6126 GB avail
64 active+clean
As can be seen, the cluster is accessible again, now with a single monitor left. As long as the cluster is reachable, adding mons back is straightforward; just remember to clean out the mon directory on each new node before adding it.
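One possible way to add monitors back, assuming ceph-deploy is available and the target node's mon directory has been cleaned first:

```
# on the node being re-added as a monitor, e.g. ceph-2
rm -rf /var/lib/ceph/mon/ceph-ceph-2/*
# from the deploy/admin node
ceph-deploy mon add ceph-2
```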
If the goal is simply to restore all three MONs, copy the surviving node's /var/lib/ceph/mon/ceph-ceph-1/store.db over the same directory on the other two nodes and then start all three MONs; the cluster becomes accessible again, provided the other two nodes keep the same IPs and other settings as in the old cluster. This has been verified experimentally.
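A sketch of that recovery path, assuming the other two nodes keep their original hostnames and IPs and their mon directories follow the usual naming:

```
# on ceph-1: stop the surviving mon so store.db is quiescent
service ceph stop mon
# replace store.db on the other two nodes (stale copies removed first)
ssh ceph-2 "rm -rf /var/lib/ceph/mon/ceph-ceph-2/store.db"
ssh ceph-3 "rm -rf /var/lib/ceph/mon/ceph-ceph-3/store.db"
scp -r /var/lib/ceph/mon/ceph-ceph-1/store.db ceph-2:/var/lib/ceph/mon/ceph-ceph-2/
scp -r /var/lib/ceph/mon/ceph-ceph-1/store.db ceph-3:/var/lib/ceph/mon/ceph-ceph-3/
# then start the mon on all three nodes
service ceph start mon
```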
From the behaviour seen in this experiment, a further workable approach can be inferred:
- If the disks of all three mons are damaged (typically the SSD system disks are written off), try to copy out a complete set of store.db files, place them on a healthy node, build a new monmap containing only that node's mon, inject it into the mon, and start that node. This is experiment exp-5.
Scenarios where this may be needed:
- None of the three Monitor hosts can be brought up again, and a Mon has to be built on a new machine to access the original cluster.
- The Monitors are being relocated wholesale and their IPs need to change.
env-2
- The new node is ceph-admin (192.168.57.227), a clean node.
Copy the old node's /etc/ceph/ and /var/lib/ceph/mon/ceph-ceph-1/ directories to /etc/ceph/ and /var/lib/ceph/mon/ceph-ceph-admin/ on the new node:
[root@ceph-1 ~]# scp /etc/ceph/* ceph-admin:/etc/ceph/
root@ceph-admin's password:
ceph.client.admin.keyring 100% 63 0.1KB/s 00:00
ceph.conf 100% 231 0.2KB/s 00:00
[root@ceph-1 ~]# scp -r /var/lib/ceph/mon/ceph-ceph-1/* ceph-admin:/var/lib/ceph/mon/ceph-ceph-admin/
root@ceph-admin's password:
done 100% 0 0.0KB/s 00:00
keyring 100% 77 0.1KB/s 00:00
CURRENT 100% 16 0.0KB/s 00:00
LOG.old 100% 317 0.3KB/s 00:00
LOCK 100% 0 0.0KB/s 00:00
LOG 100% 1142 1.1KB/s 00:00
000037.log 100% 1366KB 1.3MB/s 00:00
MANIFEST-000035 100% 527 0.5KB/s 00:00
000038.sst 100% 2112KB 2.1MB/s 00:00
000039.sst 100% 2058KB 2.0MB/s 00:00
000040.sst 100% 17KB 17.1KB/s 00:00
sysvinit 100% 0 0.0KB/s 00:00
Create a monmap containing only the ceph-admin node and inject it:
[root@ceph-admin store.db]# monmaptool --create --fsid 58f3771b-0fd0-4042-a174-a5a2c36c4dbc --add ceph-admin 192.168.57.227 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: set fsid to 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
monmaptool: writing epoch 0 to /tmp/monmap (1 monitors)
[root@ceph-admin store.db]# ceph-mon --inject-monmap /tmp/monmap -i ceph-admin
Modify /etc/ceph/ceph.conf:
[root@ceph-admin store.db]# cat /etc/ceph/ceph.conf
[global]
fsid = 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
public_network = 192.168.57.0/24
mon_initial_members = ceph-admin
mon_host = 192.168.57.227
..
Start the Mon and check the cluster state:
[root@ceph-admin store.db]# service ceph start mon
=== mon.ceph-admin ===
Starting Ceph mon.ceph-admin on ceph-admin...
Running as unit ceph-mon.ceph-admin.1470381040.185790242.service.
Starting ceph-create-keys on ceph-admin...
[root@ceph-admin store.db]# ceph -s
cluster 58f3771b-0fd0-4042-a174-a5a2c36c4dbc
health HEALTH_OK
monmap e4: 1 mons at {ceph-admin=192.168.57.227:6789/0}
election epoch 1, quorum 0 ceph-admin
osdmap e25: 3 osds: 3 up, 3 in
pgmap v48: 64 pgs, 1 pools, 0 bytes data, 0 objects
101 MB used, 6125 GB / 6126 GB avail
64 active+clean
As can be seen, the cluster is accessible again and the monmap now carries the new node's IP. The conf file on every other node still has to be updated and all services restarted. Without that restart the cluster may report HEALTH_OK, but it is an illusion: the existing OSDs are still reporting to the old cluster address. So update the conf and restart.
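A sketch of pushing the updated conf out and restarting the daemons, assuming passwordless ssh from ceph-admin and the sysvinit ceph script used throughout this post:

```
for node in ceph-1 ceph-2 ceph-3; do
    scp /etc/ceph/ceph.conf $node:/etc/ceph/ceph.conf
    ssh $node "service ceph restart osd"   # restart all OSDs on that node
done
```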
Of course, these experiments (exp-4 and exp-5 included) all share the premise that the monitor database store.db is undamaged; as long as the database files are intact, a monitor can be started in this way.
The reason, of course, is to deploy Ceph quickly.
env-1; the repo here is built on the ceph-admin node.
On the local repo node, install httpd and the createrepo tool:
yum install httpd createrepo -y
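httpd must also be running so the packages can be served over HTTP; a sketch, assuming systemd on CentOS 7:

```
systemctl enable httpd
systemctl start httpd
```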
Create the ceph repo directory and download all the packages:
For 0.94.7:
mkdir -p /var/www/html/ceph/0.94.7
cd /var/www/html/ceph/0.94.7
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-common-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-devel-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-fuse-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-libs-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-radosgw-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/ceph-test-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/cephfs-java-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs_jni1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libcephfs_jni1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librados2-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librados2-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libradosstriper1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/libradosstriper1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librbd1-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/librbd1-devel-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-ceph-compat-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-cephfs-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-rados-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/python-rbd-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/rbd-fuse-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/x86_64/rest-bench-0.94.7-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-hammer/el7/noarch/ceph-deploy-1.5.34-0.noarch.rpm
For 10.2.2 (use a separate directory, and point createrepo and the repo baseurl at it):
mkdir -p /var/www/html/ceph/10.2.2
cd /var/www/html/ceph/10.2.2
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-base-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-common-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-devel-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-libs-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mds-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-mon-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-osd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-radosgw-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-selinux-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/ceph-test-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/cephfs-java-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libcephfs_jni1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librados2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/libradosstriper1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librbd1-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/librgw2-devel-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-ceph-compat-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-cephfs-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rados-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/python-rbd-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-fuse-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-mirror-10.2.2-0.el7.x86_64.rpm
wget http://mirrors.aliyun.com/ceph/rpm-jewel/el7/x86_64/rbd-nbd-10.2.2-0.el7.x86_64.rpm
Generate the repo metadata:
createrepo /var/www/html/ceph/0.94.7
Spawning worker 0 with 3 pkgs
Spawning worker 1 with 2 pkgs
Spawning worker 2 with 2 pkgs
Spawning worker 3 with 2 pkgs
Spawning worker 4 with 2 pkgs
Spawning worker 5 with 2 pkgs
Spawning worker 6 with 2 pkgs
Spawning worker 7 with 2 pkgs
Spawning worker 8 with 2 pkgs
Spawning worker 9 with 2 pkgs
Spawning worker 10 with 2 pkgs
Spawning worker 11 with 2 pkgs
Workers Finished
Saving Primary metadata
Saving file lists metadata
Saving other metadata
Generating sqlite DBs
Sqlite DBs complete
Create ceph.repo:
[root@ceph-admin ~]# cat /etc/yum.repos.d/ceph.repo
[ceph_local]
name=ceph
baseurl=http://192.168.57.227/ceph/0.94.7
gpgcheck=0
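With the repo file in place, any node on the 192.168.57.0/24 network can install from the local mirror; one possible invocation (a sketch, assuming the base CentOS repos cover the remaining dependencies):

```
yum clean all
yum makecache
yum install -y ceph ceph-common ceph-radosgw ceph-deploy
```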
To install Ceph on aarch64 nodes, first download the dependency packages (taken here from the Fedora secondary-arch mirrors):
wget ftp://195.220.108.108/linux/fedora-secondary/releases/23/Everything/aarch64/os/Packages/g/glibc-2.22-3.fc23.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/23/Everything/aarch64/os/Packages/g/glibc-common-2.22-3.fc23.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/22/Everything/aarch64/os/Packages/l/leveldb-1.12.0-6.fc21.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/22/Everything/aarch64/os/Packages/l/leveldb-devel-1.12.0-6.fc21.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/libbabeltrace-1.2.4-4.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/lttng-ust-2.6.2-3.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/l/lttng-ust-devel-2.6.2-3.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/u/userspace-rcu-0.8.6-2.fc24.aarch64.rpm
wget ftp://195.220.108.108/linux/fedora-secondary/releases/24/Everything/aarch64/os/Packages/u/userspace-rcu-devel-0.8.6-2.fc24.aarch64.rpm
Install the rpm packages:
rpm -ivh glibc-* --replacefiles
rpm -ivh userspace-rcu-*
rpm -ivh lttng-ust-*
rpm -ivh libbabeltrace-*
rpm -ivh leveldb-*
Add ceph.repo with the following content:
[ceph]
name=ceph
baseurl=http://mirrors.aliyun.com/ceph/rpm-hammer/el7/aarch64/
gpgcheck=0
[ceph-noarch]
name=cephnoarch
baseurl=http://mirrors.aliyun.com/ceph/rpm-hammer/el7/noarch/
gpgcheck=0
Install ceph:
yum install -y ceph ceph-common ceph-radosgw
## bluestore configuration
keyvaluestore backend = rocksdb
filestore_omap_backend = rocksdb
enable experimental unrecoverable data corrupting features = rocksdb,bluestore
osd_objectstore = bluestore