The various 2G limits in Spark

Motivation

Spark runs into 2G size limits in several places; the individual cases are listed below.

  1. When reading a data block cached on the local disk, the following code fragment is called:
  val iterToReturn: Iterator[Any] = {
    val diskBytes = diskStore.getBytes(blockId)
    if (level.deserialized) {
      val diskValues = serializerManager.dataDeserializeStream(
        blockId,
        diskBytes.toInputStream(dispose = true))(info.classTag)
      maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
    } else {
      val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskBytes)
        .map {_.toInputStream(dispose = false)}
        .getOrElse { diskBytes.toInputStream(dispose = true) }
      serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
    }
  }

  def getBytes(blockId: BlockId): ChunkedByteBuffer = {
    val file = diskManager.getFile(blockId.name)
    val channel = new RandomAccessFile(file, "r").getChannel
    Utils.tryWithSafeFinally {
      // For small files, directly read rather than memory map
      if (file.length < minMemoryMapBytes) {
        val buf = ByteBuffer.allocate(file.length.toInt)
        channel.position(0)
        while (buf.remaining() != 0) {
          if (channel.read(buf) == -1) {
            throw new IOException("Reached EOF before filling buffer\n" +
              s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
          }
        }
        buf.flip()
        new ChunkedByteBuffer(buf)
      } else {
        new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))
      }
    } {
      channel.close()
    }
  }

The above code has the following problems:
* channel.map(MapMode.READ_ONLY, 0, file.length) returns a MappedByteBuffer instance, and the size of a MappedByteBuffer cannot exceed 2G.
* Producing the Iterator[Any] requires loading all of the data into memory, which may consume a large amount of memory.
* MappedByteBuffer maps the file into memory managed by the operating system's page cache; the JVM cannot control this memory.

  2. When serializing data with Kryo, the following code fragment is called:

  override def serialize[T: ClassTag](t: T): ByteBuffer = {
    output.clear()
    val kryo = borrowKryo()
    try {
      kryo.writeClassAndObject(output, t)
    } catch {
      case e: KryoException if e.getMessage.startsWith("Buffer overflow") =>
        throw new SparkException(s"Kryo serialization failed: ${e.getMessage}. To avoid this, " +
          "increase spark.kryoserializer.buffer.max value.")
    } finally {
      releaseKryo(kryo)
    }
    ByteBuffer.wrap(output.toBytes)
  }

The above code has the following problems:
* The serialized data is stored in the byte[] internal to output, and the size of a byte[] cannot exceed 2G.

  3. When the RPC layer writes the data to be sent to a Channel, the following code fragment is called:
  public long transferTo(final WritableByteChannel target, final long position) throws IOException {
    Preconditions.checkArgument(position == totalBytesTransferred, "Invalid position.");
    // Bytes written for header in this call.
    long writtenHeader = 0;
    if (header.readableBytes() > 0) {
      writtenHeader = copyByteBuf(header, target);
      totalBytesTransferred += writtenHeader;
      if (header.readableBytes() > 0) {
        return writtenHeader;
      }
    }

    // Bytes written for body in this call.
    long writtenBody = 0;
    if (body instanceof FileRegion) {
      writtenBody = ((FileRegion) body).transferTo(target, totalBytesTransferred - headerLength);
    } else if (body instanceof ByteBuf) {
      writtenBody = copyByteBuf((ByteBuf) body, target);
    }
    totalBytesTransferred += writtenBody;
    return writtenHeader + writtenBody;
  }

The above code has the following problems:
* The size of a ByteBuf cannot exceed 2G.
* In-memory data larger than 2G cannot be transferred.

  4. When decoding a received RPC message, the following code fragment is called:
public final class MessageDecoder extends MessageToMessageDecoder<ByteBuf> {

  private static final Logger logger = LoggerFactory.getLogger(MessageDecoder.class);

  @Override
  public void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
    Message.Type msgType = Message.Type.decode(in);
    Message decoded = decode(msgType, in);
    assert decoded.type() == msgType;
    logger.trace("Received message {}: {}", msgType, decoded);
    out.add(decoded);
  }

  private Message decode(Message.Type msgType, ByteBuf in) {
    switch (msgType) {
      case ChunkFetchRequest:
        return ChunkFetchRequest.decode(in);

      case ChunkFetchSuccess:
        return ChunkFetchSuccess.decode(in);

      case ChunkFetchFailure:
        return ChunkFetchFailure.decode(in);

      default:
        throw new IllegalArgumentException("Unexpected message type: " + msgType);
    }
  }
}

The above code has the following problems:
* The size of a ByteBuf cannot exceed 2G.
* Decoding can only start after the complete message has been received.
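
All four problems share the same root cause: every size involved (a ByteBuffer capacity, Kryo's backing byte[], Netty's ByteBuf read/write indices, the region passed to FileChannel.map) is a Java int. The stand-alone Scala snippet below (not Spark code) illustrates the ceiling:

import java.nio.ByteBuffer

// Stand-alone illustration of the 2G ceiling; not Spark code.
object TwoGigLimitDemo extends App {
  val threeGb: Long = 3L * 1024 * 1024 * 1024

  // ByteBuffer capacities are Ints, so a single buffer can never hold more than
  // Integer.MAX_VALUE (~2 GB) bytes; the line below would not even compile.
  // val buf: ByteBuffer = ByteBuffer.allocate(threeGb)

  // FileChannel.map has the same ceiling at runtime: it throws
  // IllegalArgumentException("Size exceeds Integer.MAX_VALUE") for regions over 2 GB,
  // which is why DiskStore.getBytes above cannot map blocks larger than 2 GB.
  // Kryo's Output byte[] and Netty's ByteBuf are capped the same way,
  // which is what problems 2-4 describe.
  println(s"Integer.MAX_VALUE = ${Int.MaxValue} bytes; requested = $threeGb bytes")
}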

Goals

  • Eliminate the various 2G limits in Spark. (The 2G limit 1, 2, 3, 4)
  • Support back-pressure flow control for remote data reads (experimental goal). (The 2G limit 4)
  • Add a buffer pool (long-range goal).

Design

Eliminating the various 2G limits in Spark

Replace ByteBuffer with ChunkedByteBuffer. (The 2G limit 1, 2)

ChunkedByteBuffer introduction:

  • Stores data in multiple ByteBuffer instances.
  • Supports reference counting, a prerequisite for the buffer pool feature.
    Reference counted objects
  • Supports serialization for easy transport.
  • Supports slice, duplicate and copy operations similar to ByteBuffer, which makes it convenient to work with.
  • Can be efficiently converted to InputStream, ByteBuffer, byte[], ByteBuf, etc., making it easy to interoperate with other interfaces.
  • Data can be conveniently written out to an OutputStream.
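
A minimal sketch of the core idea (the real ChunkedByteBuffer in Spark has a richer API, including dispose, copy, slicing and reference counting):

import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.ByteBuffer
import scala.collection.JavaConverters._

// Minimal sketch: hold >2G of data as several ByteBuffer chunks and track the
// total size as a Long instead of an Int.
class SimpleChunkedByteBuffer(val chunks: Array[ByteBuffer]) {
  // The total size is the sum of the chunk sizes, so it can exceed Integer.MAX_VALUE.
  def size: Long = chunks.map(_.remaining().toLong).sum

  // Expose all chunks as one continuous stream. This toy version copies each chunk;
  // the real class streams directly from the underlying buffers.
  def toInputStream: InputStream = {
    val streams = chunks.iterator.map { chunk =>
      val bytes = new Array[Byte](chunk.remaining())
      chunk.duplicate().get(bytes)
      new ByteArrayInputStream(bytes): InputStream
    }
    new SequenceInputStream(streams.asJavaEnumeration)
  }
}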
  1. Move the ChunkedByteBuffer class to common/network-common/src/main/java/org/apache/spark/network/buffer/.
  2. Change the return value of ManagedBuffer.nioByteBuffer to a ChunkedByteBuffer instance. (The 2G limit 1)
  3. Further standardize the use of ManagedBuffer and ChunkedByteBuffer:
  • Data in memory, on the network, on disk and from other sources is represented by ManagedBuffer.
  • ChunkedByteBuffer only represents data in memory.
  • ManagedBuffer.nioByteBuffer is called only when there is sufficient memory to hold the data.
  4. Change the parameter of SerializerInstance.deserialize and the return value of SerializerInstance.serialize to a ChunkedByteBuffer instance. (The 2G limit 2)
def serialize[T: ClassTag](t: T): ChunkedByteBuffer = {
  output.clear()
  val out = ChunkedByteBufferOutputStream.newInstance()
  // The data is written to the OutputStream rather than to the byte[] internal to the output object.
  output.setOutputStream(out)
  val kryo = borrowKryo()
  kryo.writeClassAndObject(output, t)
  output.close()
  out.toChunkedByteBuffer
}
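
ChunkedByteBufferOutputStream.newInstance() above is part of the proposed API. A minimal sketch of the growth behaviour it is assumed to have (allocate a new fixed-size chunk instead of resizing one byte[]):

import java.io.OutputStream
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Minimal sketch: when the current chunk is full, allocate a fresh fixed-size chunk
// instead of growing a single byte[], so the serialized size is no longer capped at 2 GB.
class SimpleChunkedOutputStream(chunkSize: Int = 4 * 1024 * 1024) extends OutputStream {
  private val chunks = ArrayBuffer[ByteBuffer]()

  private def current(): ByteBuffer = {
    if (chunks.isEmpty || !chunks.last.hasRemaining) {
      chunks += ByteBuffer.allocate(chunkSize)  // new chunk; existing chunks are never resized
    }
    chunks.last
  }

  override def write(b: Int): Unit = current().put(b.toByte)

  // Flip copies of the chunks into read mode; a caller would wrap them in a ChunkedByteBuffer.
  def toChunks: Array[ByteBuffer] = chunks.map { chunk =>
    val dup = chunk.duplicate()
    dup.flip()
    dup
  }.toArray
}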
  5. Other changes.
Replace ByteBuf with InputStream.
  1. Add an InputStreamManagedBuffer class, used to convert an InputStream instance into a ManagedBuffer instance. (The 2G limit 4)
  2. Modify NioManagedBuffer.convertToNetty to return an InputStream instance when the data is larger than Integer.MAX_VALUE. (The 2G limit 3)
  3. Modify the MessageWithHeader class to support an InputStream body. (The 2G limit 3)
  • Changes 2 and 3 together make it possible to transfer more than 2G of in-memory data.
  4. Change the parameter of the Encodable.encode method to an OutputStream instance. (The 2G limit 4)
  5. Add a toInputStream method to UploadBlock so that it can handle data stored partly in memory and partly on disk. (The 2G limit 3)
public InputStream toInputStream() throws IOException {
  ChunkedByteBufferOutputStream out = ChunkedByteBufferOutputStream.newInstance();
  Encoders.Bytes.encode(out, type().id());
  encodeWithoutBlockData(out);
  // out.toChunkedByteBuffer().toInputStream(): the encoded message metadata, held in memory
  // blockData.createInputStream(): the block data, read from disk (FileInputStream)
  return new SequenceInputStream(out.toChunkedByteBuffer().toInputStream(),
      blockData.createInputStream());
}
  • Changes 2, 3, 4 and 5 combined resolve the 2G limit in the RPC message encoding and sending path.
  6. Change the parameter of the decode method of the classes implementing the Encodable interface to an InputStream instance. (The 2G limit 4)
  7. Modify the TransportFrameDecoder class to represent a frame as a LinkedList<ByteBuf>, removing the frame size limit. (The 2G limit 4)
  8. Add a ByteBufInputStream class that wraps a LinkedList<ByteBuf> as an InputStream; as soon as one ByteBuf has been fully read, ByteBuf.release is called to free it (see the sketch after this list). (The 2G limit 4)
  9. Change the parameter of RpcHandler.receive to an InputStream instance. (The 2G limit 4)
  • Changes 6, 7, 8 and 9 combined resolve the 2G limit in the RPC message receiving and decoding path.
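
A minimal sketch of the ByteBufInputStream idea from step 8 (names and details are assumptions, not the actual patch):

import java.io.InputStream
import java.util.LinkedList
import io.netty.buffer.ByteBuf

// Minimal sketch: expose a LinkedList of Netty ByteBufs (one decoded frame) as a
// single InputStream, releasing each ByteBuf as soon as it has been fully consumed.
class SimpleByteBufInputStream(buffers: LinkedList[ByteBuf]) extends InputStream {
  override def read(): Int = {
    val buf = currentBuf()
    if (buf == null) -1 else buf.readByte() & 0xFF
  }

  // Drop and release exhausted buffers until one with readable bytes is found.
  private def currentBuf(): ByteBuf = {
    while (!buffers.isEmpty && !buffers.peek().isReadable) {
      buffers.poll().release()  // free the chunk immediately instead of waiting for GC
    }
    buffers.peek()  // null once the whole frame has been read
  }
}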

Read data

Local data
  1. Only data stored in memory is represented by ChunkedByteBuffer; everything else is represented by ManagedBuffer. (The 2G limit 1)
  • Change the return type of DiskStore.getBytes to a ManagedBuffer instance, and call ManagedBuffer.nioByteBuffer only when there is enough memory to hold the buffer's data (see the sketch below).
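
A hedged sketch of that change, reusing the existing FileSegmentManagedBuffer from network-common so the block stays on disk until nioByteBuffer is explicitly requested:

import java.io.File
import org.apache.spark.network.buffer.{FileSegmentManagedBuffer, ManagedBuffer}
import org.apache.spark.network.util.TransportConf

// Sketch of the proposed DiskStore.getBytes change: return a buffer that only points
// at the file, so nothing is read or mapped until a caller that has verified there is
// enough memory asks for nioByteBuffer().
object DiskStoreSketch {
  def getBytesAsManagedBuffer(conf: TransportConf, file: File): ManagedBuffer =
    new FileSegmentManagedBuffer(conf, file, 0, file.length())
}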
Remote data (The 2G limit 4)

There are three options:

  1. Add an InputStreamInterceptor to propagate back-pressure to the shuffle server (this option has already been implemented):
  • When the number of ByteBufs held in the cache exceeds a threshold, call channel.config().setAutoRead(false) to disable AUTO_READ, so channel.read() is no longer invoked automatically.
  • When the number of ByteBufs held in the cache falls below a threshold, call channel.config().setAutoRead(true) to re-enable AUTO_READ.
  • The advantage of this option is that it propagates back-pressure; the drawback is that it changes the semantics of the existing API and, in some cases, breaks the IO retry mechanism.
  • Reference documents:
    - Netty的read事件与AUTO_READ模式
    - TCP/IP详解--举例明白发送/接收缓冲区、滑动窗口协议之间的关系
    - TCP 滑动窗口协议 详解
  • InputStreamInterceptor design (a minimal sketch follows this list of options):
    • Create a fixed-size, thread-safe buffer pool.
    • The Netty thread puts each received ByteBuf into the pool; when the cached ByteBufs exceed 90% of the pool capacity, call channel.config().setAutoRead(false) so data is no longer read automatically and the writer on the remote end blocks.
    • The data-processing thread takes ByteBufs out of the pool; when the cached ByteBufs drop below 10% of the pool capacity, call channel.config().setAutoRead(true) to resume automatic reads.
    • After a ByteBuf has been fully processed, release it and call channel.read() to receive more data.
  2. When a message is larger than a certain size, write it to disk instead of keeping it in memory.
  • The advantage of this option is that it uses very little memory; the disadvantage is the extra disk IO.
  3. Combine this with a buffer pool and keep data in memory as far as possible.
  • Write the message to the buffer pool while it has enough memory; spill to disk only when memory runs short.
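
A minimal sketch of the option 1 back-pressure mechanism (class and method names are assumptions; the real InputStreamInterceptor lives in the network layer):

import java.util.concurrent.LinkedBlockingQueue
import io.netty.buffer.ByteBuf
import io.netty.channel.Channel

// Minimal sketch of the watermark logic: a bounded queue between the Netty event-loop
// thread and the data-processing thread, toggling AUTO_READ at 90% / 10% of capacity.
class BackPressureBuffer(channel: Channel, capacity: Int) {
  private val queue = new LinkedBlockingQueue[ByteBuf](capacity)

  // Called on the Netty thread for every received chunk.
  def offer(buf: ByteBuf): Unit = {
    queue.put(buf)
    if (queue.size() > capacity * 0.9) {
      channel.config().setAutoRead(false)  // stop reading; the remote writer eventually blocks
    }
  }

  // Called on the data-processing thread.
  def take(): ByteBuf = {
    val buf = queue.take()
    if (queue.size() < capacity * 0.1) {
      channel.config().setAutoRead(true)  // resume automatic reads
    }
    buf
  }
}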

Add a buffer pool

A buffer pool reduces memory allocation and GC time, improving the performance of Spark core.

  1. It reduces the number of large objects created in the Eden space; in Twitter's experience, using buffer pools can significantly reduce GC frequency.
    Netty 4 Reduces GC Overhead by 5x at Twitter
  2. A buffer pool reduces the number of memory allocations and zero-fill operations.
    Using as a generic library
The difficulties in implementing this feature are:
  1. Spark does not release ByteBuffers explicitly; they are left to the Java GC.
  2. Adding reference counting so that buffers are released proactively reduces GC pressure, but it requires adding reference-counting and memory-leak-detection code, which is a large change.
  3. Reuse the Netty buffer code, which already supports leak detection and dynamic resizing.
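
For difficulty 3, a hedged sketch of what reusing Netty's pooled allocator looks like (Netty already provides reference counting, leak detection and pooled size classes):

import io.netty.buffer.{ByteBuf, PooledByteBufAllocator}
import io.netty.util.ResourceLeakDetector

// Hedged sketch: lean on Netty's pooled allocator rather than writing a new pool.
object PooledBufferSketch extends App {
  // PARANOID leak detection reports every buffer that is garbage-collected without release().
  ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID)

  val allocator = PooledByteBufAllocator.DEFAULT
  val buf: ByteBuf = allocator.directBuffer(64 * 1024)  // served from the pool, off the GC heap
  try {
    buf.writeBytes(Array.fill[Byte](1024)(0))
  } finally {
    buf.release()  // refCount drops to 0 and the memory goes back to the pool
  }
}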
Introductory documents:
  1. Netty Buffers
  2. 深入浅出Netty内存管理 PoolChunk
  3. jemalloc源码解析-核心架构
  4. jemalloc源码解析-内存管理