Core and Spark SQL
API updates
SPARK-17864: Data type APIs are stable APIs.
SPARK-18351: from_json and to_json functions for parsing and generating JSON from string columns (see the sketch after this list)
SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType.
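A minimal sketch of SPARK-18351 and SPARK-16700, assuming hypothetical column and field names (payload, device, temp):

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("json-sketch").getOrCreate()

event_schema = StructType([StructField("device", StringType()),
                           StructField("temp", IntegerType())])

# SPARK-18351: parse a JSON string column into a struct, then render it back to JSON.
raw = spark.createDataFrame([Row(payload='{"device": "a", "temp": 21}')])
parsed = raw.select(from_json("payload", event_schema).alias("event"))
parsed.select(to_json("event").alias("json")).show(truncate=False)

# SPARK-16700: a plain Python dict can now be used as the value of a StructType field.
nested_schema = StructType([StructField("event", event_schema)])
spark.createDataFrame([Row(event={"device": "b", "temp": 19})], nested_schema).show()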
Performance and stability
SPARK-17861: Scalable Partition Handling. The Hive metastore now stores all table partition metadata by default, for Spark tables stored with Hive’s storage formats as well as for tables stored with Spark’s native formats. This change reduces first-query latency over partitioned tables and allows DDL commands to manipulate partitions of tables stored with Spark’s native formats. Tables stored with Spark’s native formats that were created by previous versions can be migrated with the MSCK command, as in the sketch after this list.
SPARK-16523: Speeds up group-by aggregate performance by adding a fast aggregation cache that is backed by a row-based hashmap.
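A minimal migration sketch for the partition-handling change above, assuming a hypothetical partitioned table named events with a dt partition column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register the partition directories of a table created by an earlier Spark
# version so that its partition metadata is tracked in the metastore.
spark.sql("MSCK REPAIR TABLE events")

# Partition DDL now also works for tables stored with Spark's native formats.
spark.sql("SHOW PARTITIONS events").show(truncate=False)
spark.sql("ALTER TABLE events DROP IF EXISTS PARTITION (dt='2016-01-01')")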
Other notable changes
SPARK-9876: parquet-mr upgraded to 1.8.1
Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.
Structured Streaming
API updates
SPARK-17346: Kafka 0.10 support in Structured Streaming
SPARK-17731: Metrics for Structured Streaming
SPARK-17829: Stable format for offset log
SPARK-18124: Observed delay based event-time watermarks (paired with the new Kafka source in the sketch after this list)
SPARK-18192: Support all file formats in structured streaming
SPARK-18516: Separate instantaneous state from progress performance statistics
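A minimal sketch pairing the new Kafka source (SPARK-17346) with an event-time watermark (SPARK-18124); the broker address, topic name, and window sizes are hypothetical, and the spark-sql-kafka-0-10 package must be on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read Kafka records as a streaming DataFrame; the source provides a timestamp column.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Drop aggregation state for windows more than 10 minutes behind the
# maximum event time observed so far.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "5 minutes"))
          .count())

query = counts.writeStream.outputMode("append").format("console").start()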
Stability
SPARK-17267: Long running structured streaming requirements
Programming guide: Structured Streaming Programming Guide.
MLlib
API updates
SPARK-5992: Locality Sensitive Hashing
SPARK-7159: Multiclass Logistic Regression in the DataFrame-based API (see the sketch after this list)
SPARK-16000: ML persistence: Make model loading backwards-compatible with Spark 1.x with saved models using spark.mllib.linalg.Vector columns in DataFrame-based API
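A minimal sketch of multiclass logistic regression (SPARK-7159) on toy data; the family parameter and coefficientMatrix attribute are assumed to be exposed in the Python API as they are in Scala:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mlr-sketch").getOrCreate()

train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (2.0, Vectors.dense(3.0, 0.5)),
], ["label", "features"])

mlr = LogisticRegression(family="multinomial", maxIter=20, regParam=0.01)
model = mlr.fit(train)
print(model.coefficientMatrix)  # one row of coefficients per class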
Performance and stability
SPARK-17748: Faster, more stable LinearRegression for < 4096 features
SPARK-16719: RandomForest: communicate fewer trees on each iteration
Programming guide: Machine Learning Library (MLlib) Guide.
SparkR
The main focus of SparkR in the 2.1.0 release was adding extensive support for ML algorithms, which include:
New ML algorithms in SparkR including LDA, Gaussian Mixture Models, ALS, Random Forest, Gradient Boosted Trees, and more
Support for multinomial logistic regression providing similar functionality as the glmnet R package
Enable installing third party packages on workers using spark.addFile (SPARK-17577).
Standalone installable package built with the Apache Spark release. We will be submitting this to CRAN soon.
Programming guide: SparkR (R on Spark).
GraphX
SPARK-11496: Personalized pagerank
Programming guide: GraphX Programming Guide.
Deprecations
MLlib
SPARK-18592: Deprecate unnecessary Param setter methods in tree and ensemble models
Changes of behavior
Core and SQL
SPARK-18360: The default table path of tables in the default database will be under the location of the default database instead of always depending on the warehouse location setting.
SPARK-18377: spark.sql.warehouse.dir is now a static configuration. Users need to set it before the first SparkSession starts, and its value is shared by all sessions in the same application.
SPARK-14393: Values generated by non-deterministic functions will not change after coalesce or union.
SPARK-18076: The default Locale used in DateFormat and NumberFormat is now fixed to Locale.US.
SPARK-16216: The CSV and JSON data sources write timestamp and date values as ISO 8601 formatted strings. Two options, timestampFormat and dateFormat, were added to these two data sources to let users control the string representation of timestamp and date values, respectively. Please refer to the API docs of DataFrameReader and DataFrameWriter for details about these two options, and see the sketch after this list.
SPARK-17427: Function SIZE returns -1 when its input parameter is null.
SPARK-16498: LazyBinaryColumnarSerDe is fixed as the SerDe for RCFile.
SPARK-16552: If a user does not specify the schema of a table and relies on schema inference, the inferred schema is stored in the metastore. The schema will not be inferred again when the table is used.
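A minimal sketch of three of the changes above (SPARK-18377, SPARK-16216, SPARK-17427); the warehouse directory and output path are hypothetical:

from pyspark.sql import SparkSession

# spark.sql.warehouse.dir is static, so set it before the first SparkSession is created.
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
         .getOrCreate())

df = spark.range(1).selectExpr("current_timestamp() AS ts", "current_date() AS d")

# Control the string representation of timestamps and dates written by the
# CSV and JSON data sources.
(df.write
   .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
   .option("dateFormat", "yyyy/MM/dd")
   .mode("overwrite")
   .csv("/tmp/ts-csv"))

# size() of a null input returns -1.
spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>))").show()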
Structured Streaming
SPARK-18516: Separate instantaneous state from progress performance statistics
MLlib
SPARK-17870: ChiSqSelector now accounts for degrees of freedom by using the pValue rather than the raw statistic to select the top features.
Known Issues
SPARK-17647: In the SQL LIKE clause, the wildcard characters ‘%’ and ‘_’ right after backslashes are always escaped.
SPARK-18908: If a StreamExecution fails to start, users need to check stderr for the error.