PostgreSQL Page页结构解析(6)- B-Tree索引存储结构#2

本文简单介绍了在PG数据库B-Tree索引的物理存储结构中Special space部分,包括根节点、左右兄弟节点等相关索引结构信息,以及初步探讨了PG在物理存储上如何保证索引的有序性。

一、测试数据

测试数据同上一节,索引文件raw data:

[xdb@localhost utf8db]$ hexdump -C base/16477/26637
00000000  01 00 00 00 20 5d 0e db  00 00 00 00 40 00 f0 1f  |.... ]......@...|
00000010  f0 1f 04 20 00 00 00 00  62 31 05 00 03 00 00 00  |... ....b1......|
00000020  01 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 f0 bf  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001ff0  00 00 00 00 00 00 00 00  00 00 00 00 08 00 00 00  |................|
00002000  01 00 00 00 98 5c 0e db  00 00 00 00 28 00 b0 1f  |.....\......(...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  b0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

二、索引结构信息

索引物理存储结构在上一节已大体介绍,这里主要介绍索引的结构信息。通过pageinspect插件的bt_page_stats函数可以获得索引结构信息,包括root/leaf page,next & previous page:

testdb=# select * from bt_page_stats('pk_t_index',1);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags 
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
     1 | l    |          4 |          0 |            16 |      8192 |      8068 |         0 |         0 |    0 |          3
(1 row)

相关的数据结构如下:

//---------------------- src/include/access/nbtree.h
/*
 *  BTPageOpaqueData -- At the end of every page, we store a pointer
 *  to both siblings in the tree.  This is used to do forward/backward
 *  index scans.  The next-page link is also critical for recovery when
 *  a search has navigated to the wrong page due to concurrent page splits
 *  or deletions; see src/backend/access/nbtree/README for more info.
 *
 *  In addition, we store the page's btree level (counting upwards from
 *  zero at a leaf page) as well as some flag bits indicating the page type
 *  and status.  If the page is deleted, we replace the level with the
 *  next-transaction-ID value indicating when it is safe to reclaim the page.
 *
 *  We also store a "vacuum cycle ID".  When a page is split while VACUUM is
 *  processing the index, a nonzero value associated with the VACUUM run is
 *  stored into both halves of the split page.  (If VACUUM is not running,
 *  both pages receive zero cycleids.)  This allows VACUUM to detect whether
 *  a page was split since it started, with a small probability of false match
 *  if the page was last split some exact multiple of MAX_BT_CYCLE_ID VACUUMs
 *  ago.  Also, during a split, the BTP_SPLIT_END flag is cleared in the left
 *  (original) page, and set in the right page, but only if the next page
 *  to its right has a different cycleid.
 *
 *  NOTE: the BTP_LEAF flag bit is redundant since level==0 could be tested
 *  instead.
 */

typedef struct BTPageOpaqueData
{
    BlockNumber btpo_prev;      /* left sibling, or P_NONE if leftmost */
    BlockNumber btpo_next;      /* right sibling, or P_NONE if rightmost */
    union
    {
        uint32      level;      /* tree level --- zero for leaf pages */
        TransactionId xact;     /* next transaction ID, if deleted */
    }           btpo;
    uint16      btpo_flags;     /* flag bits, see below */
    BTCycleId   btpo_cycleid;   /* vacuum cycle ID of latest split */
} BTPageOpaqueData;

typedef BTPageOpaqueData *BTPageOpaque;

/* Bits defined in btpo_flags */
#define BTP_LEAF        (1 << 0)    /* leaf page, i.e. not internal page */
#define BTP_ROOT        (1 << 1)    /* root page (has no parent) */
#define BTP_DELETED     (1 << 2)    /* page has been deleted from tree */
#define BTP_META        (1 << 3)    /* meta-page */
#define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
#define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DEAD tuples */
#define BTP_INCOMPLETE_SPLIT (1 << 7)   /* right sibling's downlink is missing */

查询结果中,type=l(字母l),表示这个block(page)是leaf block,在这个block中有4个item(live_items=4),没有废弃的items(dead_items=0),没有left sibling(btpo_prev =0)也没有right sibling(btpo_next =0),也就是左右两边都没有同级节点。btpo是一个union,值为0,表示该page为叶子page,btpo_flags值为3即BTP_LEAF | BTP_ROOT,既是叶子page也是根page。
这些信息物理存储在先前介绍过的PageHeader中的special space中,共占用16个字节:

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# select 8192+8176;
 ?column? 
----------
    16368
(1 row)

[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 16368 -n 16
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

三、索引有序性

我们都知道,B-Tree索引是有序的,下面我们看看在物理存储结构上如何保证有序性。
插入数据,id=18

testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E5C98 |        0 |     0 |    40 |  8112 |    8176 |     8192 |       4 |         0
(1 row)

testdb=# -- 插入数据,id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6498 |        0 |     0 |    44 |  8096 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000  01 00 00 00 98 64 0e db  00 00 00 00 2c 00 a0 1f  |.....d......,...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  a0 9f 20 00 00 00 00 00  |.. ... ... .....|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

插入数据,id=17

testdb=# -- 插入数据,id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 查看索引数据页头数据
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
    lsn     | checksum | flags | lower | upper | special | pagesize | version | prune_xid 
------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1/DB0E6808 |        0 |     0 |    48 |  8080 |    8176 |     8192 |       4 |         0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump  -C base/16477/26637 -s 8192
00002000  01 00 00 00 08 68 0e db  00 00 00 00 30 00 90 1f  |.....h......0...|
00002010  f0 1f 04 20 00 00 00 00  e0 9f 20 00 d0 9f 20 00  |... ...... ... .|
00002020  c0 9f 20 00 b0 9f 20 00  90 9f 20 00 a0 9f 20 00  |.. ... ... ... .|
00002030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003f90  00 00 00 00 06 00 10 00  11 00 00 00 00 00 00 00  |................|
00003fa0  00 00 00 00 05 00 10 00  12 00 00 00 00 00 00 00  |................|
00003fb0  00 00 00 00 04 00 10 00  10 00 00 00 00 00 00 00  |................|
00003fc0  00 00 00 00 03 00 10 00  08 00 00 00 00 00 00 00  |................|
00003fd0  00 00 00 00 02 00 10 00  04 00 00 00 00 00 00 00  |................|
00003fe0  00 00 00 00 01 00 10 00  02 00 00 00 00 00 00 00  |................|
00003ff0  00 00 00 00 00 00 00 00  00 00 00 00 03 00 00 00  |................|
00004000

索引的数据区,并没有按照大小顺序排序,\x11(17)在\x12(18)的后面(从尾部开始往前),但在索引页的头部ItemId区域是有序的,第5个ItemId(\x00209f90)指向的是17,而第6个ItemId(\x00209fa0)指向的是18,通过索引数据指针的有序保证索引有序性。

四、小结

小结一下,知识要点:
1、Special Space:介绍了索引PageHeaderData的Special Space的存储结构,通过Special Space可以获得B-Tree的root、左右sibling等信息;
2、有序性:索引Page通过索引项指针保证有序性。

最后:

Don’t Be Afraid To Explore Beneath The Surface

Captain Nemo at the helm

Professor Aronnax risked his life and career to find the elusive Nautilus and to join Captain Nemo on a long series of amazing underwater adventures. We should do the same: Don’t be afraid to dive underwater – inside and underneath the tools, languages and technologies that you use every day. You may know all about how to use Postgres, but do you really know how Postgres itself works internally? Take a look inside; before you know it, you’ll be on an underwater adventure of your own.
Studying the Computer Science at work behind the scenes of our applications isn’t just a matter of having fun, it’s part of being a good developer. As software development tools improve year after year, and as building web sites and mobile apps becomes easier and easier, we shouldn’t loose sight of the Computer Science we depend on. We’re all standing on the shoulders of giants – people like Lehman and Yao, and the open source developers who used their theories to build Postgres. Don’t take the tools you use everyday for granted – take a look inside them! You’ll become a wiser developer and you’ll find insights and knowledge you could never have imagined before.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 159,117评论 4 362
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 67,328评论 1 293
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 108,839评论 0 243
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 44,007评论 0 206
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 52,384评论 3 287
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 40,629评论 1 219
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 31,880评论 2 313
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 30,593评论 0 198
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 34,313评论 1 243
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 30,575评论 2 246
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 32,066评论 1 260
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 28,392评论 2 253
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 33,052评论 3 236
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 26,082评论 0 8
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 26,844评论 0 195
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 35,662评论 2 274
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 35,575评论 2 270

推荐阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 7,106评论 0 10
  • 1.前言 对于非常庞大的数据量,经常需要分片分区在分片分区的模式下:每一个数据(一行记录)只属于一个分区,每个分区...
    赤子心_d709阅读 895评论 0 4
  • 昨天的天下足球专题之《德罗巴,蓝桥遗梦》,突然想起魔兽已经离开蓝军两年了,两年了,魔兽始终没淡出我们的视野,不管是...
    lampard_xu阅读 1,192评论 7 8
  • 哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈哈
    召一阅读 186评论 0 0
  • 选择排序的思想是有 n 个数,先假定第一个数最小,然后从剩下 n-1 个数中选择一个最小的与第一个数交换(如果当前...
    lesliefang阅读 153评论 0 0