Flink Source Code: Basic Operators

Flink Source Code Analysis series

For the full series, see: Flink Source Code Analysis series table of contents

Map and Filter

The map method of DataStream:

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {

    TypeInformation<R> outType = TypeExtractor.getMapReturnTypes(clean(mapper), getType(),
            Utils.getCallLocationName(), true);

    return transform("Map", outType, new StreamMap<>(clean(mapper)));
}

StreamMap.class

@Internal
public class StreamMap<IN, OUT>
        extends AbstractUdfStreamOperator<OUT, MapFunction<IN, OUT>>
        implements OneInputStreamOperator<IN, OUT> {

    private static final long serialVersionUID = 1L;

    public StreamMap(MapFunction<IN, OUT> mapper) {
        super(mapper);
        chainingStrategy = ChainingStrategy.ALWAYS;
    }

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
    // userFunction here is the user-supplied MapFunction: run map() and replace the record's value with the result
        output.collect(element.replace(userFunction.map(element.getValue())));
    }
}

The filter method of DataStream:

public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
    return transform("Filter", getType(), new StreamFilter<>(clean(filter)));
}

StreamFilter.class

@Internal
public class StreamFilter<IN> extends AbstractUdfStreamOperator<IN, FilterFunction<IN>> implements OneInputStreamOperator<IN, IN> {

    private static final long serialVersionUID = 1L;

    public StreamFilter(FilterFunction<IN> filterFunction) {
        super(filterFunction);
        chainingStrategy = ChainingStrategy.ALWAYS;
    }

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
    // If filter() returns true, collect the element; otherwise drop it
        if (userFunction.filter(element.getValue())) {
            output.collect(element);
        }
    }
}
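
For reference, a minimal usage sketch showing where these operators come from in user code (class name, data, and functions are illustrative, not from the Flink sources):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MapFilterExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // map wraps the MapFunction in a StreamMap operator and filter wraps the
        // FilterFunction in a StreamFilter operator; both go through transform()
        env.fromElements(1, 2, 3, 4)
                .map(new MapFunction<Integer, Integer>() {
                    @Override
                    public Integer map(Integer value) {
                        return value * 2;
                    }
                })
                .filter(new FilterFunction<Integer>() {
                    @Override
                    public boolean filter(Integer value) {
                        return value > 4;
                    }
                })
                .print();

        env.execute("map-filter-example");
    }
}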

The transform method, which both map and filter delegate to:

public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {

// Reading the upstream output type triggers an exception here if it is MissingTypeInfo
    // read the output type of the input Transform to coax out errors about MissingTypeInfo
    transformation.getOutputType();

// Create a OneInputTransformation wrapping the upstream transformation and the new operator
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation,
            operatorName,
            operator,
            outTypeInfo,
            environment.getParallelism());

    @SuppressWarnings({ "unchecked", "rawtypes" })
    SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);

    getExecutionEnvironment().addOperator(resultTransform);

    return returnStream;
}

Two member variables of OneInputTransformation are worth noting:

  • input — the transformation that feeds data into this one, i.e. the upstream transformation.
  • operator — the stream operator that performs this transformation's processing.

Multiple chained map and filter calls are thus transformed into a series of OneInputTransformations, each one's input pointing to the previous transformation, as the sketch below shows.
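
A minimal sketch of that chain, assuming the pre-1.9 StreamTransformation API shown in this post and a StreamExecutionEnvironment env (variable names are illustrative):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.transformations.OneInputTransformation;

DataStream<Integer> source = env.fromElements(1, 2, 3);
SingleOutputStreamOperator<Integer> mapped = source.map(v -> v * 2);
SingleOutputStreamOperator<Integer> filtered = mapped.filter(v -> v > 2);

// filtered's transformation is the Filter OneInputTransformation; its input is
// the Map OneInputTransformation, whose input in turn is the source transformation
OneInputTransformation<?, ?> filterTransform =
        (OneInputTransformation<?, ?>) filtered.getTransformation();
System.out.println(filterTransform.getInput().getName()); // "Map"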

Split

The split operator has been deprecated; side outputs are the recommended replacement (a sketch follows the code below).

public SplitStream<T> split(OutputSelector<T> outputSelector) {
    return new SplitStream<>(this, clean(outputSelector));
}

SplitStream.java

public DataStream<OUT> select(String... outputNames) {
    return selectOutput(outputNames);
}

private DataStream<OUT> selectOutput(String[] outputNames) {
    for (String outName : outputNames) {
        if (outName == null) {
            throw new RuntimeException("Selected names must not be null");
        }
    }

    SelectTransformation<OUT> selectTransform = new SelectTransformation<OUT>(this.getTransformation(), Lists.newArrayList(outputNames));
    return new DataStream<OUT>(this.getExecutionEnvironment(), selectTransform);
}
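
For reference, a minimal sketch of the recommended side-output replacement (tag name, routing logic, and class name are illustrative):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // anonymous subclass so the generic type survives erasure
        final OutputTag<Integer> oddTag = new OutputTag<Integer>("odd") {};

        SingleOutputStreamOperator<Integer> evens = env.fromElements(1, 2, 3, 4)
                .process(new ProcessFunction<Integer, Integer>() {
                    @Override
                    public void processElement(Integer value, Context ctx, Collector<Integer> out) {
                        if (value % 2 == 0) {
                            out.collect(value);        // main output
                        } else {
                            ctx.output(oddTag, value); // side output
                        }
                    }
                });

        DataStream<Integer> odds = evens.getSideOutput(oddTag);
        odds.print();

        env.execute("side-output-example");
    }
}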

Union

@SafeVarargs
public final DataStream<T> union(DataStream<T>... streams) {
    List<StreamTransformation<T>> unionedTransforms = new ArrayList<>();
    unionedTransforms.add(this.transformation);

// Verify that every unioned stream has the same type as this stream
    for (DataStream<T> newStream : streams) {
        if (!getType().equals(newStream.getType())) {
            throw new IllegalArgumentException("Cannot union streams of different types: "
                    + getType() + " and " + newStream.getType());
        }

        unionedTransforms.add(newStream.getTransformation());
    }
    return new DataStream<>(this.environment, new UnionTransformation<>(unionedTransforms));
}
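
A usage sketch (given a StreamExecutionEnvironment env; contents are illustrative). All streams must share one type, otherwise the check above throws IllegalArgumentException:

DataStream<String> a = env.fromElements("a1", "a2");
DataStream<String> b = env.fromElements("b1", "b2");
DataStream<String> c = env.fromElements("c1", "c2");

// a single UnionTransformation over all three underlying transformations
DataStream<String> all = a.union(b, c);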

KeyBy

public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
    Preconditions.checkNotNull(key);
    return new KeyedStream<>(this, clean(key));
}

KeyedStream

public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
    this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
}

public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    this(
        dataStream,
        new PartitionTransformation<>(
            dataStream.getTransformation(),
            new KeyGroupStreamPartitioner<>(keySelector, StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM)),
        keySelector,
        keyType);
}

KeyedStream(
    DataStream<T> stream,
    PartitionTransformation<T> partitionTransformation,
    KeySelector<T, KEY> keySelector,
    TypeInformation<KEY> keyType) {

    super(stream.getExecutionEnvironment(), partitionTransformation);
    this.keySelector = clean(keySelector);
    this.keyType = validateKeyType(keyType);
}

PartitionTransformation is the partitioning transformation; when the StreamGraph is generated, it is handled as a virtual partition node.

It holds the upstream input and a StreamPartitioner.

The StreamPartitioner passed in here is a KeyGroupStreamPartitioner.

KeyGroupStreamPartitioner

The selectChannels method decides which channel (downstream partition) each element is sent to:

public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record) {
    K key;
    try {
        key = keySelector.getKey(record.getInstance().getValue());
    } catch (Exception e) {
        throw new RuntimeException("Could not extract key from " + record.getInstance().getValue(), e);
    }
    returnArray[0] = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, numberOfChannels);
    return returnArray;
}

KeyGroupRangeAssignment's assignKeyToParallelOperator method:

public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
    return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
}

The computeOperatorIndexForKeyGroup method:

public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
    return keyGroupId * parallelism / maxParallelism;
}

The assignToKeyGroup method:

public static int assignToKeyGroup(Object key, int maxParallelism) {
    return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
}

The computeKeyGroupForKeyHash method:

public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
    return MathUtils.murmurHash(keyHash) % maxParallelism;
}
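
Putting the pieces together: a key's hashCode is murmur-hashed into one of maxParallelism key groups, and each parallel subtask owns a contiguous range of key groups. A small sketch using these utility methods (the key "hello" and the numbers are illustrative):

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupExample {
    public static void main(String[] args) {
        int maxParallelism = 128;
        int parallelism = 4;

        // murmurHash("hello".hashCode()) % 128 -> the key group id
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup("hello", maxParallelism);

        // keyGroup * parallelism / maxParallelism -> the subtask index,
        // e.g. key group 37 maps to subtask 37 * 4 / 128 = 1
        int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                "hello", maxParallelism, parallelism);

        System.out.println("key group " + keyGroup + " -> subtask " + subtask);
    }
}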

Rebalance

Rebalance follows the same execution path as keyBy, except that it uses a RebalancePartitioner.

RebalancePartitioner's setup and selectChannels methods are shown below.

setup picks a random starting channel; selectChannels then cycles through the channels round-robin, so elements are spread evenly rather than assigned at random.

public void setup(int numberOfChannels) {
    super.setup(numberOfChannels);

    returnArray[0] = ThreadLocalRandom.current().nextInt(numberOfChannels);
}

public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record) {
    returnArray[0] = (returnArray[0] + 1) % numberOfChannels;
    return returnArray;
}
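
Usage is a single call on any DataStream (stream names are illustrative):

// redistribute elements round-robin across all downstream subtasks,
// starting from the random channel chosen in setup() above
DataStream<String> balanced = input.rebalance();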

CoGroup

DataStream's coGroup method is as follows:

public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream) {
    return new CoGroupedStreams<>(this, otherStream);
}

The method returns a CoGroupedStreams:

public CoGroupedStreams(DataStream<T1> input1, DataStream<T2> input2) {
    this.input1 = requireNonNull(input1);
    this.input2 = requireNonNull(input2);
}

CoGroupedStreams holds the two input streams.

The where method sets keySelector1, and the equalTo method sets keySelector2.

The core of the operation is the apply method:

public <T> DataStream<T> apply(CoGroupFunction<T1, T2, T> function, TypeInformation<T> resultType) {
    //clean the closure
    function = input1.getExecutionEnvironment().clean(function);

    UnionTypeInfo<T1, T2> unionType = new UnionTypeInfo<>(input1.getType(), input2.getType());
    UnionKeySelector<T1, T2, KEY> unionKeySelector = new UnionKeySelector<>(keySelector1, keySelector2);

    DataStream<TaggedUnion<T1, T2>> taggedInput1 = input1
            // wrap each input1 element in a TaggedUnion, storing the value in its one field
            .map(new Input1Tagger<T1, T2>())
            .setParallelism(input1.getParallelism())
            .returns(unionType);
    DataStream<TaggedUnion<T1, T2>> taggedInput2 = input2
            // wrap each input2 element in a TaggedUnion, storing the value in its two field
            .map(new Input2Tagger<T1, T2>())
            .setParallelism(input2.getParallelism())
            .returns(unionType);

// union the two tagged streams into one
    DataStream<TaggedUnion<T1, T2>> unionStream = taggedInput1.union(taggedInput2);

// TaggedUnion elements whose extracted keys are equal are routed to the same partition.
// Note that two elements from the same input with equal keys also land together,
// i.e. each stream's elements are effectively grouped with themselves as well
    // we explicitly create the keyed stream to manually pass the key type information in
    windowedStream =
            new KeyedStream<TaggedUnion<T1, T2>, KEY>(unionStream, unionKeySelector, keyType)
            .window(windowAssigner);

    if (trigger != null) {
        windowedStream.trigger(trigger);
    }
    if (evictor != null) {
        windowedStream.evictor(evictor);
    }
    if (allowedLateness != null) {
        windowedStream.allowedLateness(allowedLateness);
    }

    return windowedStream.apply(new CoGroupWindowFunction<T1, T2, T, KEY, W>(function), resultType);
}

UnionKeySelector's getKey method:

public KEY getKey(TaggedUnion<T1, T2> value) throws Exception {
    if (value.isOne()) {
        return keySelector1.getKey(value.getOne());
    } else {
        return keySelector2.getKey(value.getTwo());
    }
}

If the TaggedUnion carries a value in one, the key is extracted from it with keySelector1; otherwise the key is extracted from two with keySelector2.

CoGroupWindowFunction.apply

@Override
public void apply(KEY key,
        W window,
        Iterable<TaggedUnion<T1, T2>> values,
        Collector<T> out) throws Exception {

    List<T1> oneValues = new ArrayList<>();
    List<T2> twoValues = new ArrayList<>();

    for (TaggedUnion<T1, T2> val: values) {
        if (val.isOne()) {
            oneValues.add(val.getOne());
        } else {
            twoValues.add(val.getTwo());
        }
    }
    wrappedFunction.coGroup(oneValues, twoValues, out);
}

For each key and window, the two groups of elements are handed to the user's CoGroupFunction as two separate collections.
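
A minimal usage sketch (class name, data, key selector, and window size are illustrative):

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class CoGroupExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> left =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
        DataStream<Tuple2<String, Integer>> right =
                env.fromElements(Tuple2.of("a", 10), Tuple2.of("c", 30));

        KeySelector<Tuple2<String, Integer>, String> byKey =
                new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) {
                        return value.f0;
                    }
                };

        left.coGroup(right)
                .where(byKey)       // keySelector1
                .equalTo(byKey)     // keySelector2
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, Integer>> first,
                                        Iterable<Tuple2<String, Integer>> second,
                                        Collector<String> out) {
                        // each side arrives as a complete collection for the key,
                        // even when the other side is empty (e.g. keys "b" and "c")
                        int firstCount = 0;
                        for (Tuple2<String, Integer> ignored : first) { firstCount++; }
                        int secondCount = 0;
                        for (Tuple2<String, Integer> ignored : second) { secondCount++; }
                        out.collect(firstCount + " element(s) from left, " + secondCount + " from right");
                    }
                })
                .print();

        env.execute("cogroup-example");
    }
}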

Join

public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream) {
    return new JoinedStreams<>(this, otherStream);
}

The apply method of JoinedStreams.WithWindow:

/**
 * Completes the join operation with the user function that is executed
 * for each combination of elements with the same key in a window.
 *
 * <p>Note: This method's return type does not support setting an operator-specific parallelism.
 * Due to binary backwards compatibility, this cannot be altered. Use the
 * {@link #with(JoinFunction, TypeInformation)}, method to set an operator-specific parallelism.
 */
public <T> DataStream<T> apply(JoinFunction<T1, T2, T> function, TypeInformation<T> resultType) {
    //clean the closure
    function = input1.getExecutionEnvironment().clean(function);

    coGroupedWindowedStream = input1.coGroup(input2)
        .where(keySelector1)
        .equalTo(keySelector2)
        .window(windowAssigner)
        .trigger(trigger)
        .evictor(evictor)
        .allowedLateness(allowedLateness);

    return coGroupedWindowedStream
            .apply(new JoinCoGroupFunction<>(function), resultType);
}

As we can see, join is implemented internally on top of coGroup.

JoinCoGroupFunction.coGroup

@Override
public void coGroup(Iterable<T1> first, Iterable<T2> second, Collector<T> out) throws Exception {
    for (T1 val1: first) {
        for (T2 val2: second) {
            out.collect(wrappedFunction.join(val1, val2));
        }
    }
}

The user's join function is invoked once for every pair in the Cartesian product of the two groups.
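
A matching sketch, reusing the left, right, and byKey definitions from the coGroup example above (window size is illustrative):

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

left.join(right)
        .where(byKey)
        .equalTo(byKey)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
            @Override
            public String join(Tuple2<String, Integer> l, Tuple2<String, Integer> r) {
                // called once per (l, r) pair with the same key in the window;
                // keys present on only one side produce no output (inner join)
                return l.f0 + ": " + l.f1 + " + " + r.f1;
            }
        })
        .print();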
