
Spark 2.0, high level concept

Entry point and basic abstraction

For Spark base
main entry point: SparkContext
basic abstraction: RDD

For Spark SQL
main entry point: SparkSession
basic abstraction: DataFrame

For Spark Streaming
main entry point: StreamingContext
basic abstraction: DStream

For Spark ML
main entry point: SparkSession (the DataFrame-based pyspark.ml API)
basic abstraction: Pipeline (built from Transformers and Estimators)
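
A minimal sketch (assumes a local PySpark 2.x installation; app name and sample data are made up) of how the entry points and abstractions above relate:

```python
from pyspark.sql import SparkSession

# SparkSession is the unified Spark 2.0 entry point for Spark SQL (and ML)
spark = (SparkSession.builder
         .master("local[2]")
         .appName("entry-points")   # hypothetical app name
         .getOrCreate())

# The Spark-core entry point, SparkContext, is reachable from the session
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3])                        # RDD: core abstraction
df = spark.createDataFrame([(1, "a")], ["id", "tag"])  # DataFrame: SQL abstraction

spark.stop()
```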
Core Classes

    Spark base

    pyspark.SparkContext
    Main entry point for Spark functionality.

    pyspark.RDD
    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.

    Spark Streaming

    pyspark.streaming.StreamingContext
    Main entry point for Spark Streaming functionality.

    pyspark.streaming.DStream
    A Discretized Stream (DStream), the basic abstraction in Spark Streaming.

    Spark SQL and DataFrame

    pyspark.sql.SQLContext
    Main entry point for DataFrame and SQL functionality. (In Spark 2.0, SparkSession supersedes SQLContext, which is kept for backward compatibility.)

    pyspark.sql.DataFrame
    A distributed collection of data grouped into named columns.
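
The streaming classes above can be sketched together in a classic word count (a sketch, assuming a local PySpark install and a hypothetical text source on localhost:9999):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-sketch")   # hypothetical app name
ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches

# lines is a DStream: a sequence of RDDs, one per batch interval
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()   # print a few counts from each batch

ssc.start()
ssc.awaitTermination()
```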

Spark running mode
Locally
Cluster
Setup and run/submit job
Locally
Setup
Spark shell and submit job

    ./bin/spark-shell --master local[2]
    OR
    ./bin/pyspark --master local[2]

Submit job

    ./bin/spark-submit examples/src/main/python/pi.py 10
     
    OR
    ./bin/spark-submit examples/src/main/r/dataframe.R

Spark standalone cluster
Spark YARN cluster
Useful spark-shell commands

:paste  (enter paste mode for multi-line input)
:help   (list all REPL commands)

On startup the shell reports:
Spark context available as sc.
SQL context available as sqlContext.
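
Since sc and sqlContext are pre-created in the pyspark shell, they can be used directly; a short sketch (the file path is hypothetical):

```python
# In the pyspark shell: sc and sqlContext already exist, no imports needed.
rdd = sc.textFile("README.md")   # build an RDD from a text file (hypothetical path)
rdd.count()                      # number of lines in the file

df = sqlContext.range(0, 5)      # build a small single-column DataFrame
df.show()
```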

Read CSV files as a DataFrame in Apache Spark with the spark-csv package. After loading the data into a DataFrame, save it to a Parquet file.

    val df = sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("mode", "DROPMALFORMED")
          .load("/home/myuser/data/log/*.csv")
    // saveAsParquetFile is deprecated; use the DataFrameWriter API instead
    df.write.parquet("/home/myuser/data.parquet")

    // read a Parquet file back as a DataFrame and inspect it
    val df_1 = sqlContext.read.parquet("/Users/user_name/Work/tmp/sample.parquet")
    df_1.dtypes
    df_1.show()
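
The same CSV-to-Parquet flow can be sketched in PySpark; Spark 2.0 ships a built-in CSV reader, so the spark-csv package is not required there (paths and app name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("csv-to-parquet")   # hypothetical app name
         .getOrCreate())

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("mode", "DROPMALFORMED")   # drop rows that fail to parse
      .csv("/home/myuser/data/log/*.csv"))
df.write.parquet("/home/myuser/data.parquet")

# read the Parquet data back and inspect it
df_1 = spark.read.parquet("/home/myuser/data.parquet")
df_1.printSchema()
df_1.show()
spark.stop()
```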