Generating a StreamGraph from Flink SQL

Flink builds Transformations from the dependencies between operators and then derives, in order, the StreamGraph, the JobGraph, the ExecutionGraph, and the physical execution graph.

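  • StreamGraph: the initial graph generated from the code the user writes with the Stream API.
  • JobGraph: produced by optimizing the StreamGraph, then submitted to the remote JobManager for scheduling and execution.
  • ExecutionGraph: generated by the JobManager from the JobGraph. It is the parallelized version of the JobGraph and the core data structure of the scheduling layer.
  • Physical execution graph: the "graph" formed after the JobManager schedules the Job according to the ExecutionGraph and deploys Tasks on the TaskManagers. It is not a concrete data structure.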

Following this conversion flow, the rest of this article looks at how Flink turns SQL into the underlying Transformations.

Generating a StreamGraph from Flink SQL

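Flink SQL provides a pluggable SQL planner module, and since Flink 1.11 the Blink planner is the default. The Blink planner extends the DDL syntax and the dimension-table join syntax, adds many optimization rules, and aims to handle both stream and batch processing with a single SQL planner. It uses Calcite's SQL parser to parse the submitted SQL into an AST represented as SqlNode, and converts different kinds of statements into different Operations (for example, DDL becomes a CreateTableOperation and a query becomes a QueryOperation). After validation, the SqlNode is converted into a relational expression, which Calcite represents as RelNode. The RelNode tree is optimized by rule matching and then converted into ExecNodes, each of which knows how to translate itself into the corresponding Transformation.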

Metadata management

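Before Flink 1.9, Flink SQL relied entirely on Calcite's Schema interface to manage registered tables, which made it awkward to integrate with existing metadata stores such as Hive. Flink 1.9 refactored this and introduced the Catalog interface, which supports operations on databases, tables, functions, and even partitions. In a Catalog, each table corresponds to a CatalogBaseTable object. Implementations of the Catalog interface include GenericInMemoryCatalog, HiveCatalog, and JdbcCatalog, with GenericInMemoryCatalog as the default. Through CatalogManager, several Catalogs can be mounted in a single session, giving access to multiple external systems at once.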

/**
 *  This interface is responsible for reading and writing metadata such as 
 *  database/table/views/UDFs from a registered catalog. 
 *  It connects a registered catalog and Flink's Table API.
 */
public interface Catalog {
    // ------ databases ------
    String getDefaultDatabase();
    List<String> listDatabases();
    CatalogDatabase getDatabase(String databaseName);
    boolean databaseExists(String databaseName); 
    void createDatabase(String name, CatalogDatabase database, boolean ignoreIfExists);
    void dropDatabase(String name, boolean ignoreIfNotExists, boolean cascade);
    void alterDatabase(String name, CatalogDatabase newDatabase, boolean ignoreIfNotExists);
    // ------ tables and views ------
    List<String> listTables(String databaseName);
    List<String> listViews(String databaseName);

    CatalogBaseTable getTable(ObjectPath tablePath);
    boolean tableExists(ObjectPath tablePath);
    void dropTable(ObjectPath tablePath, boolean ignoreIfNotExists);
    void renameTable(ObjectPath tablePath, String newTableName, boolean ignoreIfNotExists);
    void createTable(ObjectPath tablePath, CatalogBaseTable table, boolean ignoreIfExists);
    void alterTable(ObjectPath tablePath, CatalogBaseTable newTable, boolean ignoreIfNotExists);
    // ------ partitions ------
    // ------ functions ------
    // ------ statistics ------
}
/**
 * CatalogBaseTable is the common parent of table and view. It has a map of
 * key-value pairs defining the properties of the table.
 */
public interface CatalogBaseTable {
    Map<String, String> getProperties();
    TableSchema getSchema();
    String getComment();
    CatalogBaseTable copy();
    Optional<String> getDescription();
    Optional<String> getDetailedDescription();
}


/**
 * A manager for dealing with catalog objects such as tables, views, functions, and types. It encapsulates
 * all available catalogs and stores temporary objects.
 */
@Internal
public final class CatalogManager {
    // A map between names and catalogs.
    private Map<String, Catalog> catalogs;

    private String currentCatalogName;
    private String currentDatabaseName;

    public void registerCatalog(String catalogName, Catalog catalog) {}
    public void unregisterCatalog(String catalogName, boolean ignoreIfNotExists) {}
    public Optional<Catalog> getCatalog(String catalogName) {}
    
    public Optional<TableLookupResult> getTable(ObjectIdentifier objectIdentifier) {}
}

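When Flink SQL executes, it parses the extended CREATE TABLE and CREATE CATALOG syntax to create or load external CatalogBaseTables and stores them in the matching Catalog within CatalogManager. Executing the SQL below creates a PostgresCatalog and initializes a JDBC connection; SHOW TABLES then queries the PostgreSQL schema information over that connection.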

CREATE CATALOG pgcatalog WITH(
    'type' = 'jdbc',
    'default-database' = 'test_01',
    'username' = 'test_01',
    'password' = 'test_01',
    'base-url' = 'jdbc:postgresql://test_01:8432/'
);

use catalog pgcatalog;
show tables;
show databases;

select id, org_id from org_user;

DDL statement execution

The following two-stream JOIN statements are used to show how Flink extends the SQL syntax:


CREATE TABLE user_info (
    userId BIGINT,
    userName VARCHAR,
    userAge BIGINT,
    proctime AS PROCTIME()
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = '172.16.100.241:9092',
    'topic' = 'tp01',  
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE user_address (
    userId BIGINT,
    addressInfo VARCHAR,
    proctime AS PROCTIME()
) WITH (
    'connector' = 'kafka', 
    'properties.bootstrap.servers' = '172.16.100.241:9092',
    'topic' = 'tp02', 
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

CREATE TABLE user_sink (
    userId BIGINT,
    userName VARCHAR,
    userAddress VARCHAR,
    PRIMARY KEY (userId) NOT ENFORCED
) WITH (
    'connector' = 'print'
);

INSERT INTO user_sink
SELECT
  u.userId,
  u.userName,
  a.addressInfo
FROM
  user_info u
LEFT JOIN user_address  a
ON u.userId = a.userId
where u.userAge >18;

DDL syntax extensions

Flink SQL currently supports two SqlDialects, DEFAULT and HIVE, which are switched with the table.sql-dialect option (set table.sql-dialect).

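  • DEFAULT: Flink's own extended dialect, parsed by the flink-sql-parser module.
  • HIVE: used to integrate with Hive; supports HQL syntax and is parsed by the flink-sql-parser-hive module.

  • The CREATE TABLE syntax is extended in flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl.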
SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) :
{
    final SqlParserPos startPos = s.pos();
    boolean ifNotExists = false;
    SqlIdentifier tableName;
    List<SqlTableConstraint> constraints = new ArrayList<SqlTableConstraint>();
    SqlWatermark watermark = null;
    SqlNodeList columnList = SqlNodeList.EMPTY;
    SqlCharStringLiteral comment = null;
    SqlTableLike tableLike = null;

    SqlNodeList propertyList = SqlNodeList.EMPTY;
    SqlNodeList partitionColumns = SqlNodeList.EMPTY;
    SqlParserPos pos = startPos;
}
{
    <TABLE>
    ifNotExists = IfNotExistsOpt()
    tableName = CompoundIdentifier()
    [
        <LPAREN> { pos = getPos(); TableCreationContext ctx = new TableCreationContext();}
        TableColumn(ctx)
        (
            <COMMA> TableColumn(ctx)
        )*
        {
            pos = pos.plus(getPos());
            columnList = new SqlNodeList(ctx.columnList, pos);
            constraints = ctx.constraints;
            watermark = ctx.watermark;
        }
        <RPAREN>
    ]
    [ <COMMENT> <QUOTED_STRING> {
        String p = SqlParserUtil.parseString(token.image);
        comment = SqlLiteral.createCharString(p, getPos());
    }]
    [
        <PARTITIONED> <BY>
        partitionColumns = ParenthesizedSimpleIdentifierList()
    ]
    [
        <WITH>
        propertyList = TableProperties()
    ]
    [
        <LIKE>
        tableLike = SqlTableLike(getPos())
    ]
    {
        return new SqlCreateTable(startPos.plus(getPos()),
                tableName,
                columnList,
                constraints,
                propertyList,
                partitionColumns,
                watermark,
                comment,
                tableLike,
                isTemporary,
                ifNotExists);
    }
}
  • Create the SqlCreateTable class, which extends SqlCreate and overrides the validate and unparse methods.
public class SqlCreateTable extends SqlCreate implements ExtendedSqlNode {
    public static final SqlSpecialOperator OPERATOR = new SqlSpecialOperator("CREATE TABLE", SqlKind.CREATE_TABLE);
    .....
    public SqlCreateTable(
            SqlParserPos pos,
            SqlIdentifier tableName,
            SqlNodeList columnList,
            List<SqlTableConstraint> tableConstraints,
            SqlNodeList propertyList,
            SqlNodeList partitionKeyList,
            @Nullable SqlWatermark watermark,
            @Nullable SqlCharStringLiteral comment,
            @Nullable SqlTableLike tableLike,
            boolean isTemporary,
            boolean ifNotExists) {
        super(OPERATOR, pos, false, ifNotExists);
        this.tableName = requireNonNull(tableName, "tableName should not be null");
        this.columnList = requireNonNull(columnList, "columnList should not be null");
        this.tableConstraints = requireNonNull(tableConstraints, "table constraints should not be null");
        this.propertyList = requireNonNull(propertyList, "propertyList should not be null");
        this.partitionKeyList = requireNonNull(partitionKeyList, "partitionKeyList should not be null");
        this.watermark = watermark;
        this.comment = comment;
        this.tableLike = tableLike;
        this.isTemporary = isTemporary;
    }

    @Override
    public void validate() {
        ...
        // validate the primary key columns and mark their types NOT NULL
        // when LIKE is used, validate the tableLike clause
    }

    @Override
    public void unparse(
            SqlWriter writer,
            int leftPrec,
            int rightPrec) {
        .....
        // rebuilds the SQL statement from the parse tree
    }
}

DDL statement execution

Execution flow of a CREATE TABLE statement:

org.apache.flink.table.api.internal.TableEnvironmentImpl#executeSql
    org.apache.flink.table.planner.delegation.ParserImpl#parse
        org.apache.flink.table.planner.operations.SqlCreateTableConverter#convertCreateTable: returns a CreateTableOperation that contains the CatalogTable
    org.apache.flink.table.api.internal.TableEnvironmentImpl#executeOperation  
        org.apache.flink.table.catalog.CatalogManager#createTable
            org.apache.flink.table.catalog.GenericInMemoryCatalog#createTable     
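  1. Extract and create the CatalogBaseTable, which represents a table stored in a Catalog, from the extended parse node SqlCreateTable. CatalogTableImpl represents a table created with a DDL statement; ConnectorCatalogTable represents a table defined through registerTableSource. For CREATE TABLE ... LIKE, the relevant attributes are taken from the source table according to the given LIKE options.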
SqlCreateTableConverter#createCatalogTable:

    private CatalogTable createCatalogTable(SqlCreateTable sqlCreateTable) {

        final TableSchema sourceTableSchema;
        final List<String> sourcePartitionKeys;
        final List<SqlTableLike.SqlTableLikeOption> likeOptions;
        final Map<String, String> sourceProperties;
        // handle the LIKE clause
        if (sqlCreateTable.getTableLike().isPresent()) {
            SqlTableLike sqlTableLike = sqlCreateTable.getTableLike().get();
            CatalogTable table = lookupLikeSourceTable(sqlTableLike);
            sourceTableSchema = table.getSchema();
            sourcePartitionKeys = table.getPartitionKeys();
            likeOptions = sqlTableLike.getOptions();
            sourceProperties = table.getProperties();
        } else {
            sourceTableSchema = TableSchema.builder().build();
            sourcePartitionKeys = Collections.emptyList();
            likeOptions = Collections.emptyList();
            sourceProperties = Collections.emptyMap();
        }
        // resolve the SqlTableLike options: INCLUDING ALL, OVERWRITING OPTIONS, EXCLUDING PARTITIONS, etc.
        Map<SqlTableLike.FeatureOption, SqlTableLike.MergingStrategy> mergingStrategies =
            mergeTableLikeUtil.computeMergingStrategies(likeOptions);

        Map<String, String> mergedOptions = mergeOptions(sqlCreateTable, sourceProperties, mergingStrategies);
        // primary key
        Optional<SqlTableConstraint> primaryKey = sqlCreateTable.getFullConstraints()
            .stream()
            .filter(SqlTableConstraint::isPrimaryKey)
            .findAny();
        // build the merged TableSchema
        TableSchema mergedSchema = mergeTableLikeUtil.mergeTables(
            mergingStrategies,
            sourceTableSchema,
            sqlCreateTable.getColumnList().getList(),
            sqlCreateTable.getWatermark().map(Collections::singletonList).orElseGet(Collections::emptyList),
            primaryKey.orElse(null)
        );
        // partition keys
        List<String> partitionKeys = mergePartitions(
            sourcePartitionKeys,
            sqlCreateTable.getPartitionKeyList(),
            mergingStrategies
        );
        verifyPartitioningColumnsExist(mergedSchema, partitionKeys);
        // table comment
        String tableComment = sqlCreateTable.getComment()
            .map(comment -> comment.getNlsString().getValue())
            .orElse(null);
        return new CatalogTableImpl(mergedSchema,
            partitionKeys,
            mergedOptions,
            tableComment);
    }
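  2. Return a CreateTableOperation, which represents the create-table operation and holds the fully qualified table name, the CatalogTable, the IF NOT EXISTS flag, and whether the table is temporary.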
SqlCreateTableConverter#convertCreateTable: 
Operation convertCreateTable(SqlCreateTable sqlCreateTable) {
    // UNIQUE constraints are not supported
    sqlCreateTable.getTableConstraints().forEach(validateTableConstraint);
    CatalogTable catalogTable = createCatalogTable(sqlCreateTable);
    // the table path as written in the statement
    UnresolvedIdentifier unresolvedIdentifier = UnresolvedIdentifier.of(sqlCreateTable.fullTableName());
    // qualify it to the standard table path: catalog.database.tablename
    ObjectIdentifier identifier = catalogManager.qualifyIdentifier(unresolvedIdentifier);

    return new CreateTableOperation(
        identifier,
        catalogTable,
        sqlCreateTable.isIfNotExists(),
        sqlCreateTable.isTemporary());
}
  3. Store the table metadata in CatalogManager, by default in the GenericInMemoryCatalog.
org.apache.flink.table.catalog.CatalogManager#createTable:
public void createTable(CatalogBaseTable table, ObjectIdentifier objectIdentifier, boolean ignoreIfExists) {
    execute(
        // the catalog defaults to GenericInMemoryCatalog, which keeps table information in memory
        (catalog, path) -> catalog.createTable(path, table, ignoreIfExists),
        objectIdentifier,
        false,
        "CreateTable");
}
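
The identifier qualification and the ignoreIfExists handling in the steps above can be sketched with a toy model. This is a minimal illustration that uses only the JDK: ToyCatalogManager, ToyCatalog, and the string-based identifiers are hypothetical stand-ins, not Flink's CatalogManager, Catalog, or ObjectIdentifier, and the catalog and database names in main are illustrative only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model only: none of these classes are Flink's real implementations. */
final class ToyCatalogManager {

    /** Stand-in for an in-memory catalog: "database.table" -> table properties. */
    static final class ToyCatalog {
        private final Map<String, Map<String, String>> tables = new HashMap<>();

        void createTable(String path, Map<String, String> props, boolean ignoreIfExists) {
            if (tables.containsKey(path)) {
                if (ignoreIfExists) {
                    return; // keep the existing definition untouched
                }
                throw new IllegalStateException("Table already exists: " + path);
            }
            tables.put(path, props);
        }

        Optional<Map<String, String>> getTable(String path) {
            return Optional.ofNullable(tables.get(path));
        }
    }

    private final Map<String, ToyCatalog> catalogs = new LinkedHashMap<>();
    private final String currentCatalog;
    private final String currentDatabase;

    ToyCatalogManager(String defaultCatalog, String defaultDatabase) {
        this.currentCatalog = defaultCatalog;
        this.currentDatabase = defaultDatabase;
        catalogs.put(defaultCatalog, new ToyCatalog());
    }

    /** Mounts an additional catalog next to the default one. */
    void registerCatalog(String name, ToyCatalog catalog) {
        if (catalogs.putIfAbsent(name, catalog) != null) {
            throw new IllegalStateException("Catalog already registered: " + name);
        }
    }

    /** Expands "t", "db.t" or "cat.db.t" to the full "cat.db.t" form. */
    String qualify(String identifier) {
        String[] parts = identifier.split("\\.");
        switch (parts.length) {
            case 1:
                return currentCatalog + "." + currentDatabase + "." + parts[0];
            case 2:
                return currentCatalog + "." + parts[0] + "." + parts[1];
            case 3:
                return identifier;
            default:
                throw new IllegalArgumentException("Bad identifier: " + identifier);
        }
    }

    void createTable(String identifier, Map<String, String> props, boolean ignoreIfExists) {
        String[] parts = qualify(identifier).split("\\.");
        ToyCatalog catalog = catalogs.get(parts[0]);
        if (catalog == null) {
            throw new IllegalStateException("Unknown catalog: " + parts[0]);
        }
        catalog.createTable(parts[1] + "." + parts[2], props, ignoreIfExists);
    }

    Optional<Map<String, String>> getTable(String identifier) {
        String[] parts = qualify(identifier).split("\\.");
        ToyCatalog catalog = catalogs.get(parts[0]);
        return catalog == null ? Optional.empty() : catalog.getTable(parts[1] + "." + parts[2]);
    }

    public static void main(String[] args) {
        ToyCatalogManager manager = new ToyCatalogManager("default_catalog", "default_database");
        manager.createTable("user_info", Map.of("connector", "kafka"), false);
        System.out.println(manager.qualify("user_info"));
        System.out.println(manager.getTable("default_catalog.default_database.user_info").isPresent());
    }
}
```

Resolving every name to a three-part path before touching a catalog is what lets several catalogs coexist in one session: the first path segment selects the catalog, and the rest is handed to that catalog unchanged.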

INSERT INTO statement execution

The call chain for executing an INSERT INTO statement:

org.apache.flink.table.api.internal.TableEnvironmentImpl#executeSql
    org.apache.flink.table.planner.delegation.ParserImpl#parse: RichSqlInsert
        org.apache.flink.table.planner.operations.SqlToOperationConverter#convertSqlInsert: eventually returns a CatalogSinkModifyOperation
            org.apache.flink.table.planner.operations.SqlToOperationConverter#toQueryOperation: handles the SELECT statement and returns a PlannerQueryOperation(relational.project())
        org.apache.flink.table.api.internal.TableEnvironmentImpl#executeOperation
            org.apache.flink.table.api.internal.TableEnvironmentImpl#executeInternal(java.util.List<org.apache.flink.table.operations.ModifyOperation>)
                org.apache.flink.table.api.internal.TableEnvironmentImpl#translate
                    org.apache.flink.table.planner.delegation.PlannerBase#translate
                        org.apache.flink.table.planner.delegation.PlannerBase#translateToRel   validates the selected fields and converts the ModifyOperation into a Calcite relational expression: a LogicalSink, or a LogicalLegacySink for legacy TableSinks
                        org.apache.flink.table.planner.delegation.PlannerBase#optimize         optimizes the relational expression; StreamCommonSubGraphBasedOptimizer is used by default
                        org.apache.flink.table.planner.delegation.PlannerBase#translateToExecNodePlan    converts the optimized FlinkPhysicalRel DAG into an ExecNode DAG
                        org.apache.flink.table.planner.delegation.StreamPlanner#translateToPlan
                            org.apache.flink.table.planner.plan.nodes.exec.ExecNode#translateToPlan   converts each ExecNode into its Transformation
                                org.apache.flink.table.planner.plan.nodes.common.CommonPhysicalTableSourceScan#createSourceTransformation
                                    org.apache.flink.table.planner.plan.nodes.physical.stream.StreamExecTableSourceScan#createInputFormatTransformation
                                    org.apache.flink.table.planner.plan.nodes.common.CommonPhysicalSink#createSinkTransformation

Validation

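When an INSERT statement is executed, Flink first processes the query. Calcite validates the SqlNode, checking that the referenced tables and fields exist in CatalogManager, then converts it into a relational expression (RelNode), which is returned as a PlannerQueryOperation. Through Flink's extension of CalciteSchema, validation converts Flink's CatalogBaseTable into a Calcite Table, bridging Flink catalog databases and Calcite schemas. The SqlNode validation flow is: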

org.apache.flink.table.planner.operations.SqlToOperationConverter#convert
org.apache.flink.table.planner.calcite.FlinkPlannerImpl#validate
    org.apache.calcite.sql.validate.SqlValidatorImpl#validate
        org.apache.calcite.sql.validate.SqlValidatorImpl#validateScopedExpression
            org.apache.calcite.sql.validate.SqlValidatorImpl#validateQuery
                org.apache.calcite.sql.validate.SqlValidatorImpl#validateNamespace
                    org.apache.calcite.sql.validate.SelectNamespace#validateImpl    includes validateFrom, validateWhereClause, etc.
                        ...
                        org.apache.flink.table.planner.catalog.DatabaseCalciteSchema#getTable
                            org.apache.flink.table.catalog.CatalogManager#getTable   fetches the CatalogBaseTable from CatalogManager

The returned CatalogSchemaTable implements org.apache.calcite.schema.Table.

org.apache.flink.table.planner.catalog.DatabaseCalciteSchema#getTable:
public Table getTable(String tableName) {
        ObjectIdentifier identifier = ObjectIdentifier.of(catalogName, databaseName, tableName);
        return catalogManager.getTable(identifier)
            .map(result -> {
                CatalogBaseTable table = result.getTable();
                FlinkStatistic statistic = getStatistic(result.isTemporary(), table, identifier);
                return new CatalogSchemaTable(
                    identifier,
                    result,
                    statistic,
                    catalogManager.getCatalog(catalogName).orElseThrow(IllegalStateException::new),
                    isStreamingMode);
            })
            .orElse(null);
    }

Once validation passes, the SqlNode of the query is converted into a RelNode. The rel call flow is:

org.apache.flink.table.planner.operations.SqlToOperationConverter#toQueryOperation
    org.apache.flink.table.planner.calcite.FlinkPlannerImpl#rel
        org.apache.calcite.sql2rel.SqlToRelConverter#convertQuery
            org.apache.calcite.sql2rel.SqlToRelConverter#convertQueryRecursive
                org.apache.calcite.sql2rel.SqlToRelConverter#convertFrom   
                    org.apache.calcite.sql2rel.SqlToRelConverter#convertIdentifier   the source table name
                        org.apache.calcite.sql.validate.SqlValidatorUtil#getRelOptTable 
                            org.apache.flink.table.planner.plan.FlinkCalciteCatalogReader#getTable
                                org.apache.flink.table.planner.plan.FlinkCalciteCatalogReader#toPreparingTable   returns a CatalogSourceTable, a RelOptTable implementation
                                    org.apache.flink.table.planner.plan.FlinkCalciteCatalogReader#convertCatalogTable  creates the CatalogSourceTable
                        org.apache.calcite.sql2rel.SqlToRelConverter#toRel
                            org.apache.flink.table.planner.plan.schema.CatalogSourceTable#toRel   

FlinkCalciteCatalogReader#getTable reads the table's schema information from DatabaseCalciteSchema and finally returns a CatalogSourceTable as the RelOptTable implementation.

    public Prepare.PreparingTable getTable(List<String> names) {
        Prepare.PreparingTable originRelOptTable = super.getTable(names);
        if (originRelOptTable == null) {
            return null;
        } else {
            // Wrap as FlinkPreparingTableBase to use in query optimization.
            CatalogSchemaTable table = originRelOptTable.unwrap(CatalogSchemaTable.class);
            if (table != null) {
                return toPreparingTable(originRelOptTable.getRelOptSchema(),
                    originRelOptTable.getQualifiedName(),
                    originRelOptTable.getRowType(),
                    table);
            } else {
                return originRelOptTable;
            }
        }
    }

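The toRel method of this RelOptTable (CatalogSourceTable) builds Flink's LogicalTableScan, adding a projection for computed columns and a watermark assigner where the table defines them.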

  override def toRel(context: RelOptTable.ToRelContext): RelNode = {
    val cluster = context.getCluster
    val flinkContext = cluster
        .getPlanner
        .getContext
        .unwrap(classOf[FlinkContext])
    val typeFactory = cluster.getTypeFactory.asInstanceOf[FlinkTypeFactory]
    val conf = flinkContext.getTableConfig.getConfiguration
    val relBuilder = FlinkRelBuilder.of(cluster, getRelOptSchema)
    val toRexFactory = flinkContext.getSqlExprToRexConverterFactory

    // 1. push table scan
    val scan = buildTableScan(cluster, context.getTableHints, conf, typeFactory)
    relBuilder.push(scan)

    // 2. push computed column project
    if (containsGeneratedColumns(catalogTable)) {
      val fieldExprs = catalogTable.getSchema.getTableColumns
        .map(c => if (c.isGenerated) c.getExpr.get() else s"`${c.getName}`")
        .toArray
      val fieldNames = util.Arrays.asList(catalogTable.getSchema.getFieldNames: _*)
      val rexNodes = toRexFactory.create(scan.getRowType).convertToRexNodes(fieldExprs)
      relBuilder.projectNamed(rexNodes.toList, fieldNames, true)
    }

    // 3. push watermark assigner
    val watermarkSpec = catalogTable
      .getSchema
      // we only support single watermark currently
      .getWatermarkSpecs.asScala.headOption
    if (schemaTable.isStreamingMode && watermarkSpec.nonEmpty) {
      val rowtime = watermarkSpec.get.getRowtimeAttribute
      if (rowtime.contains(".")) {
        throw new TableException(
          s"Nested field '$rowtime' as rowtime attribute is not supported right now.")
      }
      val inputRowType = relBuilder.peek().getRowType
      val rowtimeIndex = inputRowType.getFieldNames.indexOf(rowtime)
      val watermarkRexNode = toRexFactory
          .create(inputRowType)
          .convertToRexNode(watermarkSpec.get.getWatermarkExpr)
      relBuilder.watermark(rowtimeIndex, watermarkRexNode)
    }

    // 4. returns the final RelNode
    relBuilder.build()
  }

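Next, TableEnvironmentImpl#translate turns the CatalogSinkModifyOperation into a list of Transformations. Within translate, translateToRel builds a LogicalSink that holds the DynamicTableSink and the CatalogTable, and checks the names and types of the sink fields against those of the query (source) fields.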

  override def translate(
      modifyOperations: util.List[ModifyOperation]): util.List[Transformation[_]] = {
    if (modifyOperations.isEmpty) {
      return List.empty[Transformation[_]]
    }
    // prepare the execEnv before translating
    getExecEnv.configure(
      getTableConfig.getConfiguration,
      Thread.currentThread().getContextClassLoader)
    overrideEnvParallelism()

    // convert each ModifyOperation into a RelNode
    val relNodes = modifyOperations.map(translateToRel)
    // optimize the RelNodes
    val optimizedRelNodes = optimize(relNodes)
    // convert the [[FlinkPhysicalRel]] DAG into an [[ExecNode]] DAG
    val execNodes = translateToExecNodePlan(optimizedRelNodes)
    translateToPlan(execNodes)
  }

Optimization

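Depending on the runtime mode, Flink uses StreamPlanner or BatchPlanner to optimize the initial RelNode and to translate SQL into specific Operations; StreamPlanner is the default. The call chain for optimizing a RelNode is: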

org.apache.flink.table.planner.delegation.PlannerBase#optimize
    org.apache.flink.table.planner.plan.optimize.CommonSubGraphBasedOptimizer#optimize
        org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizer#doOptimize
            org.apache.flink.table.planner.plan.optimize.StreamCommonSubGraphBasedOptimizer#optimizeTree
                org.apache.flink.table.planner.plan.optimize.program.FlinkChainedProgram#optimize

doOptimize first extracts RelNodeBlocks from the given roots, reusing common sub-plans across blocks. The example SQL has only one sink table, so there is only one RelNodeBlock. It then calls optimizeTree to optimize it.

/**
  * A [[RelNodeBlock]] is a sub-tree in the [[RelNode]] DAG, and represents common sub-graph
  * in [[CommonSubGraphBasedOptimizer]]. All [[RelNode]]s in each block have
  * only one [[LegacySink]] output.
  *
  * The algorithm works as follows:
  * 1. If there is only one tree, the whole tree is in one block. (the next steps is needless.)
  * 2. reuse common sub-plan in different RelNode tree, generate a RelNode DAG,
  * 3. traverse each tree from root to leaf, and mark the sink RelNode of each RelNode
  * 4. traverse each tree from root to leaf again, if meet a RelNode which has multiple sink
  * RelNode (the Join in the example below), the RelNode is the output node of a new block (or named break-point).
  * There are several special cases that a RelNode can not be a break-point.
  * (1). UnionAll is not a break-point
  * when [[RelNodeBlockPlanBuilder.TABLE_OPTIMIZER_UNIONALL_AS_BREAKPOINT_DISABLED]] is true
  * (2). [[TableFunctionScan]], [[Snapshot]] or window aggregate ([[Aggregate]] on a [[Project]]
  * with window attribute) are not a break-point because their physical RelNodes are a composite
  * RelNode, each of them cannot be optimized individually. e.g. FlinkLogicalTableFunctionScan and
  * FlinkLogicalCorrelate will be combined into a BatchExecCorrelate or a StreamExecCorrelate.
  *
  * For example: (Table API)
  *     (Open question: does Flink SQL need views for this multi-sink optimization to apply?)
  * {{{-
  *  val sourceTable = tEnv.scan("test_table").select('a, 'b, 'c)
  *  val leftTable = sourceTable.filter('a > 0).select('a as 'a1, 'b as 'b1)
  *  val rightTable = sourceTable.filter('c.isNotNull).select('b as 'b2, 'c as 'c2)
  *  val joinTable = leftTable.join(rightTable, 'a1 === 'b2)
  *  joinTable.where('a1 >= 70).select('a1, 'b1).writeToSink(sink1)
  *  joinTable.where('a1 < 70 ).select('a1, 'c2).writeToSink(sink2)
  * }}}
  *
  * the RelNode DAG is:
  *
  * {{{-
  * Sink(sink1)     Sink(sink2)
  *    |               |
  * Project(a1,b1)  Project(a1,c2)
  *    |               |
  * Filter(a1>=70)  Filter(a1<70)
  *       \          /
  *        Join(a1=b2)  (multi-sink break-point)
  *       /           \
  * Project(a1,b1)  Project(b2,c2)
  *      |             |
  * Filter(a>0)     Filter(c is not null)
  *      \           /
  *      Project(a,b,c)
  *          |
  *       TableScan (leaf)
  * }}}
  *
  * This [[RelNode]] DAG will be decomposed into three [[RelNodeBlock]]s, the break-point
  * is the [[RelNode]](`Join(a1=b2)`) which data outputs to multiple [[LegacySink]]s.
  * <p>Notes: Although `Project(a,b,c)` has two parents (outputs),
  * they eventually merged at `Join(a1=b2)`. So `Project(a,b,c)` is not a break-point.
  * <p>the first [[RelNodeBlock]] includes TableScan, Project(a,b,c), Filter(a>0),
  * Filter(c is not null), Project(a1,b1), Project(b2,c2) and Join(a1=b2)
  * <p>the second one includes Filter(a1>=70), Project(a1,b1) and Sink(sink1)
  * <p>the third one includes Filter(a1<70), Project(a1,c2) and Sink(sink2)
  * <p>And the first [[RelNodeBlock]] is the child of another two.
  *
  * The [[RelNodeBlock]] plan is:
  * {{{-
  * RelNodeBlock2  RelNodeBlock3
  *        \            /
  *        RelNodeBlock1
  * }}}
  *
  * The optimizing order is from child block to parent. The optimized result (RelNode)
  * will be wrapped as an IntermediateRelTable first, and then be converted to a new TableScan
  * which is the new output node of current block and is also the input of its parent blocks.
  *
  * @param outputNode A RelNode of the output in the block, which could be a [[LegacySink]] or
  *                   other RelNode which data outputs to multiple [[LegacySink]]s.
  */
class RelNodeBlock(val outputNode: RelNode) {
    .....
}
  override protected def doOptimize(roots: Seq[RelNode]): Seq[RelNodeBlock] = {
    val config = planner.getTableConfig
    // Build the RelNodeBlock plan; with multiple sinks, the shared source sub-plan is extracted into its own block.
    // The example has a single sink, hence a single RelNodeBlock.
    val sinkBlocks = RelNodeBlockPlanBuilder.buildRelNodeBlockPlan(roots, config)
    sinkBlocks.foreach { sinkBlock =>
      // sink blocks do not require UPDATE_BEFORE messages by default
      sinkBlock.setUpdateBeforeRequired(false)
      // derive the mini-batch interval from the configuration
       ....
      sinkBlock.setMiniBatchInterval(miniBatchInterval)
    }
    // single sink block
    if (sinkBlocks.size == 1) {
      val block = sinkBlocks.head
      val optimizedTree = optimizeTree(
        block.getPlan, // with a single sink the block has no children, so no cross-block handling is needed
        block.isUpdateBeforeRequired,
        block.getMiniBatchInterval,
        isSinkBlock = true)
      block.setOptimizedPlan(optimizedTree)
      return sinkBlocks
    }
    // multiple sink blocks
    ....
  }
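
The break-point idea from the RelNodeBlock comment can be illustrated with a small, JDK-only sketch. It is a simplification and not Flink's RelNodeBlockPlanBuilder: the Node class, the method names, and the rule "a node reached by several sinks whose parent is reached by fewer sinks" are assumptions made for illustration, and the special cases for UnionAll, TableFunctionScan, Snapshot, and window aggregates are ignored.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified break-point detection on a toy plan DAG; not Flink's implementation. */
final class BreakPointSketch {

    /** A plan node that points at its inputs (children). Sinks are the roots. */
    static final class Node {
        final String name;
        final List<Node> inputs;

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    /** Pass 1: walk from every sink towards the leaves and record which sinks reach each node. */
    static Map<Node, Set<Node>> markSinks(List<Node> sinks) {
        Map<Node, Set<Node>> reached = new LinkedHashMap<>();
        for (Node sink : sinks) {
            Deque<Node> stack = new ArrayDeque<>();
            stack.push(sink);
            while (!stack.isEmpty()) {
                Node node = stack.pop();
                // add() returns false when this sink already visited the node
                if (reached.computeIfAbsent(node, k -> new LinkedHashSet<>()).add(sink)) {
                    node.inputs.forEach(stack::push);
                }
            }
        }
        return reached;
    }

    /** Pass 2: an input is a break-point when it serves more sinks than the parent we came from. */
    static Set<String> findBreakPoints(List<Node> sinks) {
        Map<Node, Set<Node>> reached = markSinks(sinks);
        Set<String> breakPoints = new LinkedHashSet<>();
        Set<Node> visited = new HashSet<>();
        Deque<Node> stack = new ArrayDeque<>(sinks);
        while (!stack.isEmpty()) {
            Node parent = stack.pop();
            if (!visited.add(parent)) {
                continue;
            }
            for (Node input : parent.inputs) {
                int inputSinks = reached.get(input).size();
                if (inputSinks > 1 && reached.get(parent).size() < inputSinks) {
                    breakPoints.add(input.name);
                }
                stack.push(input);
            }
        }
        return breakPoints;
    }

    /** The two-sink DAG from the RelNodeBlock comment. */
    static List<Node> exampleSinks() {
        Node scan = new Node("TableScan");
        Node project = new Node("Project(a,b,c)", scan);
        Node left = new Node("Project(a as a1,b as b1)", new Node("Filter(a>0)", project));
        Node right = new Node("Project(b2,c2)", new Node("Filter(c is not null)", project));
        Node join = new Node("Join(a1=b2)", left, right);
        Node sink1 = new Node("Sink(sink1)", new Node("Project(a1,b1)", new Node("Filter(a1>=70)", join)));
        Node sink2 = new Node("Sink(sink2)", new Node("Project(a1,c2)", new Node("Filter(a1<70)", join)));
        return Arrays.asList(sink1, sink2);
    }

    public static void main(String[] args) {
        System.out.println(findBreakPoints(exampleSinks())); // prints [Join(a1=b2)]
    }
}
```

Project(a,b,c) also has two parents, but both of them already serve both sinks, so it is not reported. This mirrors the note in the comment that its two outputs merge again at Join(a1=b2).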

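By default, optimizeTree uses the set of programs provided by FlinkStreamProgram, chained together by FlinkChainedProgram. Each program consists of one or more FlinkOptimizePrograms, each bound to a set of optimization rules, and ultimately runs as a FlinkHepProgram or a FlinkVolcanoProgram.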

// Optimization programs for streaming tables
org.apache.flink.table.planner.plan.optimize.program.FlinkStreamProgram:

val SUBQUERY_REWRITE = "subquery_rewrite"
val TEMPORAL_JOIN_REWRITE = "temporal_join_rewrite"
val DECORRELATE = "decorrelate"
val TIME_INDICATOR = "time_indicator"
val DEFAULT_REWRITE = "default_rewrite"
val PREDICATE_PUSHDOWN = "predicate_pushdown"
val JOIN_REORDER = "join_reorder"
val LOGICAL = "logical"
val LOGICAL_REWRITE = "logical_rewrite"
val PHYSICAL = "physical"
val PHYSICAL_REWRITE = "physical_rewrite"

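FlinkHepProgram uses Calcite's HepPlanner, a rule-based optimizer (RBO). It transforms relational expressions according to the optimization rules: applying a rule turns one relational expression into another and discards the original, and after a series of such transformations the final plan is produced. HepPlanner walks every node and tries to match and transform it, stopping when matchLimit is exceeded or when a full pass over the tree produces no further change.

FlinkVolcanoProgram uses Calcite's VolcanoPlanner, a cost-based optimizer (CBO). It also transforms relational expressions by rules, but here applying a rule produces a new expression while keeping the original, so a series of transformations yields multiple candidate plans. The CBO then computes the cost of each plan from statistics and a cost model and picks the cheapest one. Flink uses VolcanoPlanner only in the "logical" and "physical" phases.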
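
The rule-based loop described above (match, replace, repeat until nothing changes or the match limit is hit) can be sketched as follows. This is a deliberately simplified model that uses only the JDK: Rel, Rule, MERGE_FILTERS, and optimize are hypothetical names, the plan is a single chain rather than a tree, and Calcite's real HepPlanner is considerably more involved.

```java
import java.util.List;

/** Toy rule-based rewriting to a fixpoint; not Calcite's HepPlanner. */
final class HepStyleFixpoint {

    /** A plan node with at most one input, e.g. Filter <- Filter <- Scan. */
    static final class Rel {
        final String op;
        final String arg;
        final Rel input;

        Rel(String op, String arg, Rel input) {
            this.op = op;
            this.arg = arg;
            this.input = input;
        }

        @Override
        public String toString() {
            String self = op + "(" + arg + ")";
            return input == null ? self : self + " <- " + input;
        }
    }

    /** A rule returns a replacement node, or the same instance when it does not match. */
    interface Rule {
        Rel apply(Rel node);
    }

    /** Merges a Filter sitting directly on top of another Filter into one Filter. */
    static final Rule MERGE_FILTERS = node -> {
        if (node.op.equals("Filter") && node.input != null && node.input.op.equals("Filter")) {
            return new Rel("Filter", node.arg + " AND " + node.input.arg, node.input.input);
        }
        return node;
    };

    /** Repeats bottom-up passes until a pass changes nothing or matchLimit matches were used. */
    static Rel optimize(Rel root, List<Rule> rules, int matchLimit) {
        int[] matches = {0};
        Rel current = root;
        while (true) {
            int before = matches[0];
            current = rewriteBottomUp(current, rules, matchLimit, matches);
            if (matches[0] == before || matches[0] >= matchLimit) {
                return current;
            }
        }
    }

    /** Rewrites the input first, then tries every rule on the (possibly rebuilt) node. */
    private static Rel rewriteBottomUp(Rel node, List<Rule> rules, int matchLimit, int[] matches) {
        if (node == null) {
            return null;
        }
        Rel input = rewriteBottomUp(node.input, rules, matchLimit, matches);
        Rel rebuilt = input == node.input ? node : new Rel(node.op, node.arg, input);
        for (Rule rule : rules) {
            if (matches[0] >= matchLimit) {
                break;
            }
            Rel replaced = rule.apply(rebuilt);
            if (replaced != rebuilt) {
                matches[0]++;
                rebuilt = replaced; // the old expression is discarded, as in an RBO
            }
        }
        return rebuilt;
    }

    static Rel example() {
        Rel scan = new Rel("Scan", "t", null);
        return new Rel("Filter", "a>0", new Rel("Filter", "b>0", new Rel("Filter", "c>0", scan)));
    }

    public static void main(String[] args) {
        // prints Filter(a>0 AND b>0 AND c>0) <- Scan(t)
        System.out.println(optimize(example(), List.of(MERGE_FILTERS), 100));
        // with a match limit of 1 only the lowest pair is merged
        System.out.println(optimize(example(), List.of(MERGE_FILTERS), 1));
    }
}
```

A cost-based planner differs at the marked line: instead of discarding the old expression it would keep both alternatives and choose between them by cost at the end.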

  private def optimizeTree(
      relNode: RelNode,
      updateBeforeRequired: Boolean,
      miniBatchInterval: MiniBatchInterval,
      isSinkBlock: Boolean): RelNode = {
    ....
    // use the set of programs provided by FlinkStreamProgram
    val programs = calciteConfig.getStreamProgram
      .getOrElse(FlinkStreamProgram.buildProgram(config.getConfiguration))
    programs.optimize(relNode, new StreamOptimizeContext() {
                ...
    })
  }

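Take the predicate_pushdown rules as an example of how Flink extends Calcite's optimization rules. The predicate_pushdown program contains three sub-programs: "filter rules", "push predicate into table scan", and "prune empty after predicate push down". Each is built with FlinkHepRuleSetProgramBuilder as a FlinkHepRuleSetProgram, which in turn creates a FlinkHepProgram that runs on Calcite's HepPlanner. Each addProgram call sets the HepPlanner match order to bottom-up (starting from child nodes) and adds a RuleSet of optimization rules.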

    chainedProgram.addLast(
      PREDICATE_PUSHDOWN,  
      FlinkGroupProgramBuilder.newBuilder[StreamOptimizeContext]
        .addProgram(
          FlinkHepRuleSetProgramBuilder.newBuilder
            // execute the rules as one collection
            .setHepRulesExecutionType(HEP_RULES_EXECUTION_TYPE.RULE_COLLECTION)
            // match bottom-up, starting from child nodes
            .setHepMatchOrder(HepMatchOrder.BOTTOM_UP)
            // the rule set; most of these rules come from Calcite itself
            .add(FlinkStreamRuleSets.FILTER_PREPARE_RULES)
            .build(), "filter rules")
        .addProgram(
          FlinkHepRuleSetProgramBuilder.newBuilder
            .setHepRulesExecutionType(HEP_RULES_EXECUTION_TYPE.RULE_SEQUENCE)
            .setHepMatchOrder(HepMatchOrder.BOTTOM_UP)
            .add(FlinkStreamRuleSets.FILTER_TABLESCAN_PUSHDOWN_RULES)
            .build(), "push predicate into table scan")
        .addProgram(
          FlinkHepRuleSetProgramBuilder.newBuilder
            .setHepRulesExecutionType(HEP_RULES_EXECUTION_TYPE.RULE_SEQUENCE)
            .setHepMatchOrder(HepMatchOrder.BOTTOM_UP)
            .add(FlinkStreamRuleSets.PRUNE_EMPTY_RULES)
            .build(), "prune empty after predicate push down")
        .build())

"push predicate into table scan" is bound to the FILTER_TABLESCAN_PUSHDOWN_RULES rule set. PushFilterIntoTableSourceScanRule shows how Flink extends the rules.

  /**
    * RuleSet to do push predicate/partition into table scan
    */
  val FILTER_TABLESCAN_PUSHDOWN_RULES: RuleSet = RuleSets.ofList(
    // push a filter down into the table scan
    PushFilterIntoTableSourceScanRule.INSTANCE,
    PushFilterIntoLegacyTableSourceScanRule.INSTANCE,
    // push partition into the table scan
    PushPartitionIntoLegacyTableSourceScanRule.INSTANCE,
    // push partition into the dynamic table scan
    PushPartitionIntoTableSourceScanRule.INSTANCE
  )

// Pushes filter conditions down into the TableSourceScan
public class PushFilterIntoTableSourceScanRule extends RelOptRule {
    // Match pattern: a Filter whose first input is a LogicalTableScan
    public PushFilterIntoTableSourceScanRule() {
        super(operand(Filter.class,
            operand(LogicalTableScan.class, none())),
            "PushFilterIntoTableSourceScanRule");
    }

    @Override
    public boolean matches(RelOptRuleCall call) {
        // Check whether the pushdown rule can be applied
        Filter filter = call.rel(0);
        if (filter.getCondition() == null) {
            return false;
        }
        LogicalTableScan scan = call.rel(1);
        TableSourceTable tableSourceTable = scan.getTable().unwrap(TableSourceTable.class);
        // we can not push filter twice
        return tableSourceTable != null
            && tableSourceTable.tableSource() instanceof SupportsFilterPushDown
            && Arrays.stream(tableSourceTable.extraDigests()).noneMatch(str -> str.startsWith("filter=["));
    }
    // When the rule matches, run the actual pushdown logic.
    @Override
    public void onMatch(RelOptRuleCall call) {
        Filter filter = call.rel(0);
        LogicalTableScan scan = call.rel(1);
        TableSourceTable table = scan.getTable().unwrap(TableSourceTable.class);
        // Perform the pushdown
        pushFilterIntoScan(call, filter, scan, table);
        .........
    }
}
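The matches/onMatch split above can be illustrated with a small, self-contained sketch. It is a toy model only: the Scan and Filter classes below are hypothetical stand-ins, not Flink or Calcite types, and it assumes the source accepts the whole predicate, whereas a real source may accept only part of it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of filter pushdown; these classes are hypothetical, not Flink or Calcite APIs.
final class FilterPushdownSketch {

    // A table scan; extraDigests records what has already been pushed into the source.
    static final class Scan {
        final String table;
        final boolean supportsFilterPushDown;
        final List<String> extraDigests = new ArrayList<>();

        Scan(String table, boolean supportsFilterPushDown) {
            this.table = table;
            this.supportsFilterPushDown = supportsFilterPushDown;
        }
    }

    // A filter sitting directly on top of a scan.
    static final class Filter {
        final String condition;
        final Scan input;

        Filter(String condition, Scan input) {
            this.condition = condition;
            this.input = input;
        }
    }

    // Mirrors matches(): the source must support pushdown and must not already hold a filter.
    static boolean matches(Filter filter) {
        return filter.condition != null
            && filter.input.supportsFilterPushDown
            && filter.input.extraDigests.stream().noneMatch(d -> d.startsWith("filter=["));
    }

    // Mirrors onMatch(): builds a new scan that carries the condition and replaces Filter -> Scan.
    static Scan onMatch(Filter filter) {
        Scan pushed = new Scan(filter.input.table, true);
        pushed.extraDigests.addAll(filter.input.extraDigests);
        pushed.extraDigests.add("filter=[" + filter.condition + "]");
        return pushed;
    }

    public static void main(String[] args) {
        Filter filter = new Filter("amount > 10", new Scan("orders", true));
        if (matches(filter)) {
            Scan pushed = onMatch(filter);
            System.out.println(pushed.table + " " + pushed.extraDigests);
            // A second pushdown attempt on the rewritten scan is rejected by matches().
            System.out.println(matches(new Filter("amount > 10", pushed)));
        }
    }
}
```

The "filter=[" digest check is what keeps the planner from pushing the same predicate twice when the rule is tried again on the rewritten scan.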

FlinkHepRuleSetProgram holds the configuration needed to create a FlinkHepProgram. Its optimize method builds a FlinkHepProgram and uses it to optimize the RelNode.

FlinkHepRuleSetProgram#optimize: 
override def optimize(input: RelNode, context: OC): RelNode = {
    // Build the HepProgram
    val builder = new HepProgramBuilder
    builder.addMatchOrder(matchOrder)
    builder.addMatchLimit(matchLimit)
    executionType match {
      case HEP_RULES_EXECUTION_TYPE.RULE_SEQUENCE =>
        rules.foreach(builder.addRuleInstance)
      case HEP_RULES_EXECUTION_TYPE.RULE_COLLECTION =>
        builder.addRuleCollection(rules)
      case _ =>
        throw new RuntimeException(s"Unsupported HEP_RULES_EXECUTION_TYPE: $executionType")
    }
    // Wrap the built HepProgram in a FlinkHepProgram
    val flinkHepProgram = FlinkHepProgram[OC](builder.build(), requestedRootTraits)
    // Run the optimization and return the optimized RelNode
    flinkHepProgram.optimize(input, context)
  }
  
program.FlinkHepProgram#apply:
  def apply[OC <: FlinkOptimizeContext](
      hepProgram: HepProgram,
      requestedRootTraits: Option[Array[RelTrait]] = None): FlinkHepProgram[OC] = {

    val flinkHepProgram = new FlinkHepProgram[OC]()
    flinkHepProgram.setHepProgram(hepProgram)
    if (requestedRootTraits.isDefined) {
      flinkHepProgram.setRequestedRootTraits(requestedRootTraits.get)
    }
    flinkHepProgram
  }
// Calls findBestExp on Calcite's HepPlanner to obtain the best plan.
optimize.program.FlinkHepProgram#optimize
  override def optimize(root: RelNode, context: OC): RelNode = {
    try {
      val planner = new HepPlanner(hepProgram.get, context)
      FlinkRelMdNonCumulativeCost.THREAD_PLANNER.set(planner)
      planner.setRoot(root)

      if (requestedRootTraits.isDefined) {
        val targetTraitSet = root.getTraitSet.plusAll(requestedRootTraits.get)
        if (!root.getTraitSet.equals(targetTraitSet)) {
          planner.changeTraits(root, targetTraitSet.simplify)
        }
      }
      planner.findBestExp
    }  finally {
      FlinkRelMdNonCumulativeCost.THREAD_PLANNER.remove()
    }
  }
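The difference between RULE_SEQUENCE (one addRuleInstance per rule) and RULE_COLLECTION (a single addRuleCollection) can be sketched with a toy model. In the sketch below a "plan" is just a string and the two rewrite rules are hypothetical; it only illustrates the general idea that a sequence runs each rule to a fixpoint before moving on, while a collection keeps firing whichever rule matches. It ignores match order and match limit, and it is not how HepPlanner is implemented.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model of the two HEP execution types; a "plan" is just a string here.
final class HepExecutionTypeSketch {

    // Hypothetical rewrite rules: each returns its input unchanged when it does not match.
    static final UnaryOperator<String> X_TO_Y = s -> s.replaceFirst("x", "y");
    static final UnaryOperator<String> YY_TO_X = s -> s.replaceFirst("yy", "x");

    // Applies one rule repeatedly until it no longer changes the plan.
    static String toFixpoint(UnaryOperator<String> rule, String plan) {
        String next = rule.apply(plan);
        while (!next.equals(plan)) {
            plan = next;
            next = rule.apply(plan);
        }
        return plan;
    }

    // RULE_SEQUENCE: every rule is its own step and runs to a fixpoint before the next starts.
    static String runSequence(List<UnaryOperator<String>> rules, String plan) {
        for (UnaryOperator<String> rule : rules) {
            plan = toFixpoint(rule, plan);
        }
        return plan;
    }

    // RULE_COLLECTION: all rules form one step; keep firing any matching rule until none fires.
    static String runCollection(List<UnaryOperator<String>> rules, String plan) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (UnaryOperator<String> rule : rules) {
                String next = rule.apply(plan);
                if (!next.equals(plan)) {
                    plan = next;
                    changed = true;
                }
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> rules = List.of(X_TO_Y, YY_TO_X);
        System.out.println(runSequence(rules, "xy"));   // prints x
        System.out.println(runCollection(rules, "xy")); // prints y
    }
}
```

With a sequence, the first rule never gets another chance after the second rule has produced a new "x", which is why the two strategies end in different plans for the same input.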

The following figure shows how the Flink programs are built and which optimization rules they use.


Flink Program构建.png

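After logical and physical optimization, the initial RelNode finally becomes the Flink physical plan, a FlinkPhysicalRel. In that plan, FlinkLogicalCalc combines the Project and Filter operations, and StreamExecExchange extends Calcite's Exchange and represents one hash distribution.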

Flink执行树构建.png
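A Calc node that combines Project and Filter can be pictured as a single pass over the rows that first evaluates an optional condition and then applies the projection. The sketch below is a minimal illustration under that assumption; the calc helper and its row representation (an int array per row) are hypothetical and unrelated to Flink's generated code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model of a Calc node: an optional filter condition plus a projection in one pass.
final class CalcSketch {

    // condition may be null, which means the Calc only projects and every row passes.
    static List<Integer> calc(
            List<int[]> rows, Predicate<int[]> condition, Function<int[], Integer> projection) {
        List<Integer> out = new ArrayList<>();
        for (int[] row : rows) {
            if (condition == null || condition.test(row)) {
                out.add(projection.apply(row));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each row is {id, amount}.
        List<int[]> rows = List.of(new int[] {1, 5}, new int[] {2, 20}, new int[] {3, 30});
        // Roughly: SELECT id FROM t WHERE amount > 10
        System.out.println(calc(rows, r -> r[1] > 10, r -> r[0]));
        // Roughly: SELECT amount FROM t
        System.out.println(calc(rows, null, r -> r[1]));
    }
}
```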

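Generating the Transformation

FlinkPhysicalRel carries the information Flink needs for execution, including the methods that convert a node into its corresponding Transformation.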

StreamExecSink#translateToPlanInternal:
  override protected def translateToPlanInternal(
      planner: StreamPlanner): Transformation[Any] = {

    val inputTransformation = getInputNodes.get(0) match {
      // Start at the StreamExecSink and translate each StreamExecNode[RowData] from the top down
      case node: StreamExecNode[RowData] =>
        node.translateToPlan(planner)
      case _ =>
        throw new TableException("Cannot generate DataStream due to an invalid logical plan. " +
          "This is a bug and should not happen. Please file an issue.")
    }
    val inputLogicalType = getInput.getRowType
    val rowtimeFields = inputLogicalType.getFieldList.zipWithIndex
      .filter { case (f, _) =>
        FlinkTypeFactory.isRowtimeIndicatorType(f.getType)
      }

    val inputChangelogMode = ChangelogPlanUtils.getChangelogMode(
      getInput.asInstanceOf[StreamPhysicalRel]).get
    // tell sink the ChangelogMode of input
    tableSink.getChangelogMode(inputChangelogMode)
    val rowtimeFieldIndex: Int = rowtimeFields.map(_._2).headOption.getOrElse(-1)

    createSinkTransformation(
      planner.getExecEnv,
      inputTransformation,
      planner.getTableConfig,
      rowtimeFieldIndex,
      isBounded = false)
  }

Each ExecNode's translateToPlanInternal method is then called in turn. For a Calc node, CalcCodeGenerator generates the corresponding operator, which is then wrapped in a Transformation.

StreamExecCalc#translateToPlanInternal
override protected def translateToPlanInternal(
      planner: StreamPlanner): Transformation[RowData] = {
    val config = planner.getTableConfig
    val inputTransform = getInputNodes.get(0).translateToPlan(planner)
        .asInstanceOf[Transformation[RowData]]
    // materialize time attributes in condition
    val condition = if (calcProgram.getCondition != null) {
      Some(calcProgram.expandLocalRef(calcProgram.getCondition))
    } else {
      None
    }

    val ctx = CodeGeneratorContext(config).setOperatorBaseClass(
      classOf[AbstractProcessStreamOperator[RowData]])
    val outputType = FlinkTypeFactory.toLogicalRowType(getRowType)
    // Generate the corresponding StreamOperator
    val substituteStreamOperator = CalcCodeGenerator.generateCalcOperator(
      ctx,
      cluster,
      inputTransform,
      outputType,
      config,
      calcProgram,
      condition,
      retainHeader = true,
      "StreamExecCalc"
    )
    val ret = new OneInputTransformation(
      inputTransform,
      getRelDetailedDescription,
      substituteStreamOperator,
      RowDataTypeInfo.of(outputType),
      inputTransform.getParallelism)

    if (inputsContainSingleton()) {
      ret.setParallelism(1)
      ret.setMaxParallelism(1)
    }
    ret
  }
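The recursive shape of this translation, where each node first translates its input and then wraps the result, can be sketched as follows. The ExecNode and Transformation classes here are hypothetical stand-ins with at most one input each, and the caching of the translated result is an assumption of this sketch rather than a statement about Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of ExecNode -> Transformation translation; hypothetical classes, not Flink's.
final class TranslateToPlanSketch {

    // Stand-in for a Transformation: a name, a parallelism, and at most one input.
    static final class Transformation {
        final String name;
        final Transformation input;
        final int parallelism;

        Transformation(String name, Transformation input, int parallelism) {
            this.name = name;
            this.input = input;
            this.parallelism = parallelism;
        }
    }

    // Stand-in for an ExecNode; input is null for a source node.
    static final class ExecNode {
        final String name;
        final ExecNode input;
        final int sourceParallelism; // only used when this node is a source
        private Transformation cached;

        ExecNode(String name, ExecNode input, int sourceParallelism) {
            this.name = name;
            this.input = input;
            this.sourceParallelism = sourceParallelism;
        }

        // Translates the input first, then wraps it, so the walk is depth-first from the sink.
        Transformation translateToPlan() {
            if (cached == null) {
                if (input == null) {
                    cached = new Transformation(name, null, sourceParallelism);
                } else {
                    Transformation in = input.translateToPlan();
                    // Like the Calc example above, reuse the input's parallelism.
                    cached = new Transformation(name, in, in.parallelism);
                }
            }
            return cached;
        }
    }

    // Collects transformation names in data-flow order, from source to sink.
    static List<String> lineage(Transformation sink) {
        List<String> names = new ArrayList<>();
        for (Transformation t = sink; t != null; t = t.input) {
            names.add(0, t.name);
        }
        return names;
    }

    public static void main(String[] args) {
        ExecNode source = new ExecNode("source", null, 4);
        ExecNode calc = new ExecNode("calc", source, 0);
        ExecNode sink = new ExecNode("sink", calc, 0);
        Transformation result = sink.translateToPlan();
        System.out.println(lineage(result) + " parallelism=" + result.parallelism);
    }
}
```

Calling translateToPlan on the sink is enough to translate the whole chain, because every node pulls its input's Transformation before creating its own.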
