Tag: spark

Spark General Data Access

## Data abstractions

RDD is the core abstraction in Apache Spark. It is an immutable, fault-tolerant distributed collection of statically typed objects that are usually stored in memory. The DataFrame abstraction is built on top of RDD and adds "named" columns. Under the hood, the Catalyst optimizer compiles the operations and generates JVM bytecode for efficient execution.

xkrivzooh, about 10 minutes

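Understanding Compressed Sparse Column Format (CSC)

I have recently been reading the book "Spark for Data Science", and when I reached the "Machine Learning" section, the CSC storage format for sparse matrices left me thoroughly confused. So I am writing this post to record my understanding of the format.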

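## Purpose

The purpose of Compressed Sparse Column Format (CSC) is to compress a matrix and reduce the space needed to store it. The idea is simple: add some "metadata" that describes, column by column, where the nonzero elements are located, and combine it with the nonzero values themselves to represent the matrix. When most entries of the matrix are zero, this takes less space than storing every element.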
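
As a rough illustration of the idea, here is a minimal pure-Python sketch of a CSC encoding. The function names `to_csc` and `from_csc` are hypothetical helpers invented for this example, not part of Spark. The three arrays follow the usual CSC layout: the nonzero values in column-major order, the row index of each value, and one pointer per column marking where that column starts in the values array.

```python
def to_csc(dense):
    """Encode a dense matrix (list of rows) as CSC arrays.

    Returns (values, row_indices, col_ptrs). Hypothetical helper for illustration.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    values, row_indices, col_ptrs = [], [], [0]
    for j in range(n_cols):              # walk the matrix column by column
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_indices.append(i)    # remember which row the value came from
        col_ptrs.append(len(values))     # end of column j == start of column j + 1
    return values, row_indices, col_ptrs


def from_csc(n_rows, values, row_indices, col_ptrs):
    """Rebuild the dense matrix from CSC arrays."""
    n_cols = len(col_ptrs) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        # the entries of column j live in values[col_ptrs[j]:col_ptrs[j + 1]]
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            dense[row_indices[k]][j] = values[k]
    return dense


matrix = [
    [1, 0, 4],
    [0, 3, 5],
    [2, 0, 6],
]
values, row_indices, col_ptrs = to_csc(matrix)
print(values)       # [1, 2, 3, 4, 5, 6]
print(row_indices)  # [0, 2, 1, 0, 1, 2]
print(col_ptrs)     # [0, 2, 3, 6]
```

Note that `col_ptrs` has one more entry than the number of columns, and its last entry equals the number of nonzero values. This small matrix is too dense to save any space; the benefit only appears when zeros dominate.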

xkrivzooh, about 4 minutes