21-SparkSQL02


DataFrame

Same concept as the data frame in Python pandas

and in R

RDD MapReduce

DataFrame vs Dataset (Dataset was added in Spark 1.6)

DS: Java and Scala only

DF: 4 languages (Scala, Java, Python, R)

SchemaRDD (Spark < 1.3)

==> renamed in 1.3 to

DataFrame

A Dataset is a distributed collection of data.

Peel the definition apart layer by layer, like an onion:

A DataFrame is a distributed collection of data

organized into named columns

conceptually equivalent to a table in a relational database

DataFrame = Dataset[Row]
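The equation above can be checked directly in Scala, where `DataFrame` is a type alias for `Dataset[Row]`. The following is a minimal sketch, assuming a local Spark 2.x or later installation; the sample rows and the app name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("df-is-dataset-row")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, not taken from the original notes.
val df: DataFrame = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// No cast is needed: DataFrame is just another name for Dataset[Row].
val ds: Dataset[Row] = df

// Each element is a generic Row, so fields are looked up by name at runtime.
val names: Array[String] = ds.collect().map(row => row.getAs[String]("name"))

spark.stop()
```

Because a `Row` is untyped, a misspelled column name only fails at runtime, which is the main trade-off against a typed `Dataset[T]`.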

DataFrame vs RDD vs Dataset

Concept: all three are distributed collections

API: map, filter, flatMap, ...

Data structure

textFile(path)

RDD[Person]: Spark only sees opaque Person objects, not their fields

DataFrame: Spark knows the columns name, age, height

spark.sql("").show()
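The comparison above can be sketched with the same data held three ways. This is an illustrative example only: the field types of `Person` (String, Int, Double), the sample rows, and the view name `people` are assumptions, since the notes give only the column names.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Field names come from the notes; the types are assumed.
case class Person(name: String, age: Int, height: Double)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-vs-df-vs-ds")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows.
val rdd: RDD[Person] = spark.sparkContext.parallelize(
  Seq(Person("alice", 30, 165.0), Person("bob", 25, 180.0)))

// RDD: the filter is an arbitrary function over opaque Person objects.
val fromRdd = rdd.filter(_.age > 26).map(_.name).collect()

// DataFrame: the schema is known, so the data can be queried with SQL.
val df = rdd.toDF()
df.printSchema()
df.createOrReplaceTempView("people")
val fromDf = spark.sql("SELECT name FROM people WHERE age > 26")
  .collect().map(_.getString(0))

// Dataset: keeps the schema and restores compile-time types.
val ds = df.as[Person]
val fromDs = ds.filter(_.age > 26).map(_.name).collect()

spark.stop()
```

All three variants select the same row; what differs is how much Spark knows about the structure of the data while doing so.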

Spark SQL entry point

Spark < 2: SQLContext, HiveContext

Spark >= 2: SparkSession
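A minimal sketch of creating the Spark 2+ entry point in a standalone program (in spark-shell a session named `spark` already exists). The master URL and app name are placeholders for a local run.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")            // placeholder: run locally on all cores
  .appName("entry-point-sketch") // placeholder name
  // .enableHiveSupport()        // optional; requires Hive classes on the classpath
  .getOrCreate()

// The underlying SparkContext is still reachable from the session.
val sc = spark.sparkContext

// SQL can be run straight from the session.
val one = spark.sql("SELECT 1 AS id").collect().head.getInt(0)

spark.stop()
```

`enableHiveSupport()` is left commented out here because it is only needed when Hive tables or the Hive metastore are used.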

spark.read.format("json").load(path)

spark.read.format("text").load(path)

spark.read.format("parquet").load(path)

spark.read.format("orc").load(path)
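The `format(...).load(path)` pattern above can be tried end to end by first writing a small sample and then reading it back. This is a sketch under assumptions: the temporary directory and sample rows are invented, and ORC is left out because older 2.x releases need Hive support for it.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("read-formats-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data written to a throwaway directory.
val base = Files.createTempDirectory("spark-read-sketch").toString
val sample = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
sample.write.format("json").save(s"$base/json")
sample.write.format("parquet").save(s"$base/parquet")

// Same reader API, different format string.
val fromJson = spark.read.format("json").load(s"$base/json")
val fromParquet = spark.read.format("parquet").load(s"$base/parquet")

// "text" does no parsing: each line becomes one row in a single column named "value".
val asText = spark.read.format("text").load(s"$base/json")

val jsonCount = fromJson.count()
val parquetCount = fromParquet.count()
val textColumns = asText.columns

spark.stop()
```

JSON has its schema inferred from the data, Parquet carries its schema in the file, and text returns raw lines that still have to be parsed.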

Read the source code: in front of the source, there are no secrets

infos.txt ==> DataFrame

val students = sc.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(x=>Student(x(0),x(1),x(2),x(3))).toDF()
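The one-liner above relies on a `Student` case class and on `spark.implicits._` for `toDF()`. Here is a self-contained sketch of the same pipeline; the field names (`id`, `name`, `phone`, `email`) and the sample file contents are hypothetical, since the notes show only four string fields separated by `|`.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical field names; the notes only show four string fields.
case class Student(id: String, name: String, phone: String, email: String)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("text-to-dataframe")
  .getOrCreate()
import spark.implicits._ // provides toDF() on RDDs of case classes

// Made-up sample file standing in for student.data.
val path = Files.createTempFile("student", ".data")
Files.write(path,
  "1|alice|111|alice@example.com\n2|bob|222|bob@example.com\n"
    .getBytes(StandardCharsets.UTF_8))

val students = spark.sparkContext
  .textFile(path.toUri.toString)  // a file:// URI, as in the notes
  .map(_.split("\\|"))            // "|" is a regex metacharacter, so it is escaped
  .filter(_.length == 4)          // skip lines that do not have four fields
  .map(x => Student(x(0), x(1), x(2), x(3)))
  .toDF()

val studentCount = students.count()
val studentColumns = students.columns

spark.stop()
```

The column names of the resulting DataFrame come from the case class fields, which is why the case class has to be defined before `toDF()` is called.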

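show()

=> same as show(20, true): print the first 20 rows and truncate strings longer than 20 characters

show(5): print only the first 5 rows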
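The `show` variants can be compared on a small generated table. A sketch only: the 30 rows and the 30-character `text` column are invented so that both the row limit and the truncation are visible.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("show-sketch")
  .getOrCreate()

// 30 generated rows, each with a 30-character string column.
val df = spark.range(30).selectExpr("id", "repeat('x', 30) AS text")

df.show()                    // first 20 rows, long strings truncated
df.show(5)                   // first 5 rows, still truncated
df.show(5, truncate = false) // first 5 rows, full strings

val rowCount = df.count()

spark.stop()
```

Passing `truncate = false` is useful when checking long values such as paths or JSON strings.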
