
07 Extracting the CrushMap from an OSDMap


Published 2016-10-18

Objectives

  • To verify the conjecture about laterally shifting the crushmap made in the "大话RBD" article.
  • To find out whether all PG and object mappings can be reproduced from a dead cluster.
  • To extract the entire cluster's CrushMap from any one of the osdmap files held by a single OSD, and from it reproduce every object mapping of the original cluster.
  • To demonstrate, along the way, a simple method of exporting the crushmap.

Environment

To show that the procedure is generally applicable, you can carry out these steps in the OSD directory of any cluster. Here I use data from a production cluster, since it is more representative. The cluster looks like this:

[root@yd1st003 ~]# ceph -s
    cluster 3727c106-0ac9-420d-99a9-4218ea4e099f
     health HEALTH_OK
     monmap e3: 3 mons at {ceph-1=233.233.233.231:6789/0,ceph-2=233.233.233.232:6789/0,ceph-3=233.233.233.233:6789/0}
           election epoch 150, quorum 0,1,2 ceph-1,ceph-2,ceph-3
     osdmap e5166: 20 osds: 20 up, 20 in
            flags sortbitwise
      pgmap v8337878: 1036 pgs, 4 pools, 3259 GB data, 549 kobjects
            9721 GB used, 64663 GB / 74384 GB avail
                1036 active+clean

The cluster's OSDMap epoch is 5166, i.e. 5166 versions of the OSDMap have been generated so far.

Procedure

First, a quick note on how OSDMaps are generated. When the cluster is first created, the map is osdmap e1, i.e. version 1. After that, every time an OSD is added or removed, or any OSD changes state among [in | up | down | out], the epoch is incremented (usually by a small number, at least 1) to record the change, and a file holding the new map is written under current/meta in the OSD directory. Since the current epoch is 5166, let's look in the osd.0 directory for an osdmap file that is not too recent; here we pick epoch 5000:

[root@yd1st003 ~]# cd /var/lib/ceph/osd/ceph-0/current/meta/
[root@yd1st003 meta]# find . |grep 5000
./DIR_7/DIR_D/inc\uosdmap.5000__0_A66D13D7__none
./DIR_8/DIR_6/osdmap.5000__0_0A038C68__none

Of course, if you run find . in the meta directory you will see a pile of similarly named files: osdmap.NUM__..., where NUM is the osdmap epoch. I chose 5000 because I am pretending that this cluster is already dead and all I have left is the osd.0 disk; if epoch 5000 works, the other epochs near 5000 will work too.
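
As a quick aside, the epochs can be listed straight from the meta filenames. The snippet below is a hypothetical illustration: it builds a fake meta/ tree (with made-up hash suffixes) so it runs anywhere; on a real OSD you would walk /var/lib/ceph/osd/ceph-0/current/meta instead.

```python
import os, re, tempfile

# Fake meta/ tree with hypothetical filenames modeled on the ones above,
# so this example is runnable without a Ceph OSD.
d = tempfile.mkdtemp()
os.makedirs(os.path.join(d, 'DIR_8', 'DIR_6'))
for name in ('DIR_8/DIR_6/osdmap.5000__0_0A038C68__none',
             'DIR_8/DIR_6/osdmap.4999__0_0A038C00__none'):
    open(os.path.join(d, name), 'w').close()

# Collect every full-map epoch by parsing the filenames. Anchoring the
# match at 'osdmap.' also skips the inc\uosdmap.* incremental maps.
epochs = sorted(
    int(m.group(1))
    for _, _, files in os.walk(d)
    for f in files
    if (m := re.match(r'osdmap\.(\d+)__', f)))
print(epochs)        # [4999, 5000]
```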

Copy this file out for further processing, and take a look at its contents:

[root@yd1st003 meta]# cp ./DIR_8/DIR_6/osdmap.5000__0_0A038C68__none /root/osdmap
[root@yd1st003 ~]# hexdump -Cv /root/osdmap 
00000000  08 07 62 71 00 00 03 01  ac 47 00 00 37 27 c1 06  |..bq.....G..7'..|
00000010  0a c9 42 0d 99 a9 42 18  ea 4e 09 9f 88 13 00 00  |..B...B..N......|
00000020  4b 03 78 57 a1 97 c6 2c  fe 66 fd 57 73 cb 99 28  |K.xW...,.f.Ws..(|
00000030  04 00 00 00 01 00 00 00  00 00 00 00 18 05 e5 00  |................|
...
000003f0  76 6f 6c 75 6d 65 73 02  00 00 00 00 00 00 00 06  |volumes.........|
00000400  00 00 00 69 6d 61 67 65  73 05 00 00 00 00 00 00  |...images.......|
00000410  00 0b 00 00 00 76 6f 6c  75 6d 65 73 5f 73 73 64  |.....volumes_ssd|
00000420  06 00 00 00 00 00 00 00  06 00 00 00 64 6f 63 6b  |............dock|
00000430  65 72 06 00 00 00 00 80  01 00 16 00 00 00 16 00  |er..............|
...
00004340  00 00 00 01 00 15 04 00  00 00 00 01 00 10 00 00  |................|
00004350  00 01 00 00 00 16 00 00  00 04 00 00 00 ff ff ff  |................|
00004360  ff 0a 00 04 00 fc ff 49  00 04 00 00 00 fb ff ff  |.......I........|
00004370  ff fe ff ff ff fd ff ff  ff fc ff ff ff ff 7f 12  |................|
00004380  00 00 00 01 00 ff 7f 12  00 00 00 01 00 ff 7f 12  |................|
00004390  00 00 00 01 00 ff 7f 12  00 00 00 01 00 04 00 00  |................|
000043a0  00 fe ff ff ff 01 00 04  00 ff 7f 12 00 05 00 00  |................|
000043b0  00 07 00 00 00 08 00 00  00 09 00 00 00 0a 00 00  |................|
000043c0  00 0b 00 00 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
000043d0  00 00 00 01 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
000043e0  00 00 00 01 00 33 b3 03  00 00 00 01 00 04 00 00  |.....3..........|
000043f0  00 fd ff ff ff 01 00 04  00 ff 7f 12 00 05 00 00  |................|
00004400  00 0c 00 00 00 0d 00 00  00 0e 00 00 00 00 00 00  |................|
00004410  00 01 00 00 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
00004420  00 00 00 01 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
00004430  00 00 00 01 00 33 b3 03  00 00 00 01 00 04 00 00  |.....3..........|
00004440  00 fc ff ff ff 01 00 04  00 ff 7f 12 00 05 00 00  |................|
00004450  00 11 00 00 00 12 00 00  00 13 00 00 00 14 00 00  |................|
00004460  00 15 00 00 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
00004470  00 00 00 01 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
00004480  00 00 00 01 00 33 b3 03  00 00 00 01 00 04 00 00  |.....3..........|
00004490  00 fb ff ff ff 01 00 04  00 ff 7f 12 00 05 00 00  |................|
000044a0  00 04 00 00 00 05 00 00  00 06 00 00 00 02 00 00  |................|
000044b0  00 03 00 00 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
000044c0  00 00 00 01 00 33 b3 03  00 00 00 01 00 33 b3 03  |.....3.......3..|
000044d0  00 00 00 01 00 33 b3 03  00 00 00 01 00 00 00 00  |.....3..........|
000044e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000044f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00004500  00 00 00 00 00 00 00 00  00 01 00 00 00 03 00 00  |................|
00004510  00 00 01 01 0a 01 00 00  00 ff ff ff ff 00 00 00  |................|
00004520  00 06 00 00 00 00 00 00  00 01 00 00 00 04 00 00  |................|
00004530  00 00 00 00 00 00 00 00  00 0b 00 00 00 00 00 00  |................|
00004540  00 03 00 00 00 6f 73 64  01 00 00 00 04 00 00 00  |.....osd........|
00004550  68 6f 73 74 02 00 00 00  07 00 00 00 63 68 61 73  |host........chas|
00004560  73 69 73 03 00 00 00 04  00 00 00 72 61 63 6b 04  |sis........rack.|
00004570  00 00 00 03 00 00 00 72  6f 77 05 00 00 00 03 00  |.......row......|
00004580  00 00 70 64 75 06 00 00  00 03 00 00 00 70 6f 64  |..pdu........pod|
00004590  07 00 00 00 04 00 00 00  72 6f 6f 6d 08 00 00 00  |........room....|
000045a0  0a 00 00 00 64 61 74 61  63 65 6e 74 65 72 09 00  |....datacenter..|
000045b0  00 00 06 00 00 00 72 65  67 69 6f 6e 0a 00 00 00  |......region....|
000045c0  04 00 00 00 72 6f 6f 74  19 00 00 00 fb ff ff ff  |....root........|
000045d0  08 00 00 00 79 64 31 73  74 30 30 31 fc ff ff ff  |....yd1st001....|
000045e0  08 00 00 00 79 64 31 73  74 30 30 34 fd ff ff ff  |....yd1st004....|
000045f0  08 00 00 00 79 64 31 73  74 30 30 33 fe ff ff ff  |....yd1st003....|
00004600  08 00 00 00 79 64 31 73  74 30 30 32 ff ff ff ff  |....yd1st002....|
00004610  07 00 00 00 64 65 66 61  75 6c 74 00 00 00 00 05  |....default.....|
00004620  00 00 00 6f 73 64 2e 30  01 00 00 00 05 00 00 00  |...osd.0........|
00004630  6f 73 64 2e 31 02 00 00  00 05 00 00 00 6f 73 64  |osd.1........osd|
00004640  2e 32 03 00 00 00 05 00  00 00 6f 73 64 2e 33 04  |.2........osd.3.|
00004650  00 00 00 05 00 00 00 6f  73 64 2e 34 05 00 00 00  |.......osd.4....|
00004660  05 00 00 00 6f 73 64 2e  35 06 00 00 00 05 00 00  |....osd.5.......|
00004670  00 6f 73 64 2e 36 07 00  00 00 05 00 00 00 6f 73  |.osd.6........os|
...
00004720  00 00 00 6f 73 64 2e 32  31 01 00 00 00 00 00 00  |...osd.21.......|
00004730  00 12 00 00 00 72 65 70  6c 69 63 61 74 65 64 5f  |.....replicated_|
00004740  72 75 6c 65 73 65 74 00  00 00 00 00 00 00 00 32  |ruleset........2|
00004750  00 00 00 01 00 00 00 01  01 16 00 00 00 00 01 00  |................|
00004760  00 00 07 00 00 00 64 65  66 61 75 6c 74 04 00 00  |......default...|
00004770  00 01 00 00 00 6b 01 00  00 00 32 01 00 00 00 6d  |.....k....2....m|

The dump above contains some useful information. Here is my own (empirical) summary of where an osdmap stores its crushmap:

  • The CrushMap begins with 0x00 0x00 0x01, followed by many runs that read as 3....3..3 in the ASCII column. In practice, locate the OSD entries first, then search upward: the 0x00 0x00 0x01 just before those 3..3 runs marks the start of the CrushMap, e.g. line 00004360 above.
  • The CrushMap ends with 0x01 0x01 0x16 0x00 0x00 0x00 0x00, after the last OSD entry and usually a little after the ruleset name, e.g. just below line 00004740 above.
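
The end signature above is distinctive enough to locate programmatically. This is only a sketch of the heuristic from this post, not an official format parser; a small synthetic buffer stands in for the file so the example is self-contained, and on a real system you would read /root/osdmap instead.

```python
# Heuristic sketch: find the crushmap end signature described above.
# 'data' is a stand-in buffer; on a real system you would use
#   data = open('/root/osdmap', 'rb').read()
END_SIG = bytes([0x01, 0x01, 0x16, 0x00, 0x00, 0x00, 0x00])

data = b'\xaa' * 16 + b'\x00\x00\x01' + b'\xbb' * 8 + END_SIG

end = data.find(END_SIG)
if end != -1:
    end += len(END_SIG)     # include the signature itself
    print(hex(end))         # offset just past the crushmap
# The start offset still has to be found by eye (the 0x00 0x00 0x01 run
# before the OSD entries): 00 00 01 is far too common to search for blindly.
```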

Now we can carve this segment out: it starts at 0x4349 = 17225 and ends at 0x475e = 18270, for a length of 18270 - 17225 = 1045 bytes:

[root@yd1st003 ~]# dd if=/root/osdmap skip=17225 bs=1 count=1045 of=/root/crushmap iflag=skip_bytes
1045+0 records in
1045+0 records out
1045 bytes (1.0 kB) copied, 0.00290987 s, 359 kB/s
[root@yd1st003 ~]# hexdump -Cv crushmap 
00000000  00 00 01 00 10 00 00 00  01 00 00 00 16 00 00 00  |................|
00000010  04 00 00 00 ff ff ff ff  0a 00 04 00 fc ff 49 00  |..............I.|
00000020  04 00 00 00 fb ff ff ff  fe ff ff ff fd ff ff ff  |................|
00000030  fc ff ff ff ff 7f 12 00  00 00 01 00 ff 7f 12 00  |................|
...
000003f0  69 63 61 74 65 64 5f 72  75 6c 65 73 65 74 00 00  |icated_ruleset..|
00000400  00 00 00 00 00 00 32 00  00 00 01 00 00 00 01 01  |......2.........|
00000410  16 00 00 00 00                                    |.....|
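
For the record, the offset arithmetic above can be checked (and the carve done) in a couple of lines of Python; the file-handling lines are commented out because they assume the /root/osdmap file from the steps above.

```python
# Verify the dd arithmetic: start/end offsets read off the hexdump.
start, end = 0x4349, 0x475e
print(start, end, end - start)   # 17225 18270 1045 -- matches dd's count
# Python equivalent of the dd carve:
# crush = open('/root/osdmap', 'rb').read()[start:end]
# open('/root/crushmap', 'wb').write(crush)
```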

Now for the magic:

[root@yd1st003 ~]# crushtool -d crushmap -o crushmap.txt
[root@yd1st003 ~]# cat crushmap.txt 
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
...
# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map

We decompile the crushmap file we just carved out with crushtool and inspect the result: it is, remarkably, exactly identical to what ceph osd getcrushmap returns. So we have extracted the cluster's CrushMap from an arbitrary osdmap. One caveat: as I understand it, an osdmap stores the crushmap that was current at that epoch. Since the crushmap is rarely modified in normal operation, a contiguous range of osdmap epochs should all contain the same crushmap. That is the assumption underlying this experiment: the crushmap has not been modified during the period in question.

So what can we do once we have the crushmap? My original plan was to read all the OSD weights and buckets out of the crushmap, build a brand-new cluster whose crushmap is identical to the old one, and then use commands like ceph osd map to look up the placement of any object or PG; I verified that the output matches the original cluster exactly. But along the way I discovered osdmaptool, which produces the same information directly:

[root@yd1st003 ~]# osdmaptool --print osdmap 
osdmaptool: osdmap file 'osdmap'
epoch 5000
fsid 3727c106-0ac9-420d-99a9-4218ea4e099f
created 2016-07-03 02:09:15.751212
modified 2016-10-12 06:26:06.681167
flags sortbitwise

pool 1 'volumes' replicated size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 700 pgp_num 700 last_change 4483 flags hashpspool stripe_width 0
	removed_snaps [1~1]
...
max_osd 22
osd.0 up   in  weight 1 up_from 4935 up_thru 4996 down_at 4930 last_clean_interval [4760,4929) 172.19.48.203:6800/2537077 3.3.4.3:6800/2537077 3.3.4.3:6801/2537077 172.19.48.203:6801/2537077 exists,up ecb520b4-58c7-4ff9-9cb4-d715ce458be4
osd.1 up   in  weight 1 up_from 4934 up_thru 4999 down_at 4932 last_clean_interval [4758,4929) 172.19.48.203:6802/2537081 3.3.4.3:6802/2537081 3.3.4.3:6803/2537081 172.19.48.203:6803/2537081 exists,up d623aef1-6297-4e19-8a77-b2afccc9a6e3
...

With the same tool we can also immediately obtain the crushmap we extracted by hand earlier:

[root@yd1st003 ~]# osdmaptool  osdmap --export-crush i-am-crush
osdmaptool: osdmap file 'osdmap'
osdmaptool: exported crush map to i-am-crush
[root@yd1st003 ~]# crushtool -d i-am-crush -o /tmp/i-am-crush.txt

The experiment above showed that the osdmap contains the crushmap, and the osdmaptool --print output gives us every OSD's weight plus the full pool definitions; that is enough information to map every PG and every object:

[root@yd1st003 ~]# osdmaptool   --test-map-pg 1.0 osdmap 
osdmaptool: osdmap file 'osdmap'
 parsed '1.0' -> 1.0
1.0 raw ([6,10,21], p6) up ([6,10], p6) acting ([6,10], p6)
[root@yd1st003 ~]# osdmaptool   --test-map-pg 1.0 osdmap  --mark-up-in
osdmaptool: osdmap file 'osdmap'
marking all OSDs up and in
 parsed '1.0' -> 1.0
1.0 raw ([6,10,21], p6) up ([6,10,21], p6) acting ([6,10,21], p6)
[root@yd1st003 ~]# ceph pg map 1.0
osdmap e5166 pg 1.0 (1.0) -> up [6,10,21] acting [6,10,21]
[root@yd1st003 ~]# osdmaptool --test-map-object rbd_data.f7ae7f1458600.0000000000000000 --pool 1  osdmap --mark-up-in
osdmaptool: osdmap file 'osdmap'
marking all OSDs up and in
 object 'rbd_data.f7ae7f1458600.0000000000000000' -> 1.1d6 -> [0,5,18]
[root@yd1st003 ~]# ceph osd map volumes rbd_data.f7ae7f1458600.0000000000000000
osdmap e5166 pool 'volumes' (1) object 'rbd_data.f7ae7f1458600.0000000000000000' -> pg 1.cea4bfd6 (1.1d6) -> up ([0,5,18], p0) acting ([0,5,18], p0)

Note the --mark-up-in flag: the output of --test-map-pg 1.0 differs with and without it. In osdmap epoch 5000, osd.21 was down, so the computed acting set does not contain 21; this flag marks all OSDs up & in before mapping.
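
Incidentally, the pg id in the last ceph osd map output, pg 1.cea4bfd6 (1.1d6), can be reproduced by hand: Ceph folds the object's rjenkins hash onto pg_num with its "stable mod" (ceph_stable_mod in the Ceph source tree). A minimal sketch, assuming pg_num = 700 for the volumes pool as shown by osdmaptool --print:

```python
def ceph_stable_mod(x, b, bmask):
    # b = pg_num; bmask = (next power of two >= b) - 1
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)

pg_num = 700                              # pool 'volumes'
bmask = (1 << pg_num.bit_length()) - 1    # 1023 for pg_num = 700
raw_hash = 0xcea4bfd6                     # from the ceph osd map output above
print(hex(ceph_stable_mod(raw_hash, pg_num, bmask)))   # 0x1d6 -> pg 1.1d6
```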

With a single osdmap, then, you can look up every piece of placement information you might want.

Conclusion

Even though this was a rather roundabout way to export a crushmap, I still think the experiment is worthwhile. At minimum it establishes one fact: the osdmap contains the crushmap, and from a single osdmap we can recover the placement of every PG and object. The benefit may not be obvious now, but if an earthquake levels the machine room and all that survives are the OSD disks, this is the information you will use to rebuild the cluster. In other words, even for a dead cluster we can recover a great deal of useful placement information without running any cluster commands.