稀疏矩阵的存储

如果将整个稀疏矩阵都存储进入内存中的话, 那将会占用相当大的空间, 根据稀疏矩阵中非0元素在矩阵中的分布以及个数, 可以使用不同的数据结构来储存矩阵, 这样能大大减小占用的内存数量, 但这种轻量化存储的劣势是如果要访问其中某个特定位置的元素, 往往更复杂, 而且需要额外的structure从而将原始矩阵清晰地还原, 再去访问其中特定的元素。

大体来说, 促成稀疏矩阵的格式共有两大类:

支持对原始稀疏矩阵高效修改的数据结构, 例如DOK (Dictionary of keys), LIL (List of lists), or COO (Coordinate list)等, 用这种数据结构可以很方便地对往原始矩阵中假如新的元素, 或者从原始矩阵中删去元素 (增加元素指的是将原来为0的元素变为非0, 同样删除同理)
支持快速访问到特定位置的元素的数据结构以及支持基于矩阵形式的操作, 即两个稀疏矩阵相加或者相减等, 例如CSR (Compressed Sparse Row) 和 CSC (Compressed Sparse Column).

Dictionary of keys (DOK)

DOK 是一个字典, 其中每个key是(row_index, column_index), 而value则是在该位置上非0元素的值。这种格式的储存方式非常适合增删改, 即如果要将增加某个位置的非0元素直接向字典中加一个键值对就行, 同样删除同理, 但如果要查某一个位置的元素是否是非0元素, 那么就需要遍历整个字典的keys, 较为耗时, 而且字典的key是无序的, 如下为Scipy中实现DOK的代码, 注意DOK代码的意思是我们输入进去一个稀疏矩阵, scipy帮你把这个大的稀疏矩阵压缩成dok_matrix中的形式, 输出的mtx对象就是压缩后的DOK矩阵对象, 访问DOK矩阵中的元素的话需要用(row_index, col_index)的元组进行索引。

from scipy import sparse
import numpy as np
#除此之外dok_matrix还接收数组形式的array作为输入
mtx = sparse.dok_matrix((5, 5), dtype=np.float64)
print(mtx)

for ir in range(5):
    for ic in range(5):
        mtx[ir, ic] = 1.0 * (ir != ic)

print(mtx)
print(mtx.todense())

#结果是(注意第一个print输出的是空值, 因为其就构造了一个(5,5)的零矩阵, 其中没有非0元素自然没有非零元素键值对):


  (0, 1)    1.0
  (0, 2)    1.0
  (0, 3)    1.0
  (0, 4)    1.0
  (1, 0)    1.0
  (1, 2)    1.0
  (1, 3)    1.0
  (1, 4)    1.0
  (2, 0)    1.0
  (2, 1)    1.0
  (2, 3)    1.0
  (2, 4)    1.0
  (3, 0)    1.0
  (3, 1)    1.0
  (3, 2)    1.0
  (3, 4)    1.0
  (4, 0)    1.0
  (4, 1)    1.0
  (4, 2)    1.0
  (4, 3)    1.0

[[0. 1. 1. 1. 1.]
 [1. 0. 1. 1. 1.]
 [1. 1. 0. 1. 1.]
 [1. 1. 1. 0. 1.]
 [1. 1. 1. 1. 0.]]

如果想从mtx中取键值对的话, 可以采用如下方法:

print(mtx[1, 1])
>>> 0.0

print(mtx[1, 1:3])
>>> (0, 1)  1.0

print(mtx[1, 1:3].todense())
>>> [[0. 1.]]

print(mtx[[2,1], 1:3].todense())
#注意此处是先取列表中第一个元素2, 取得[2,1:3]的值, 然后再取第二个元素1, 取得[1,1:3]的值
>>> [[1. 0.]
    [0. 1.]]

List of lists (LIL)

LIL的数据结构是列表, 每一行对应着的矩阵中该行, 而LIL中每一行都是一个列表, 包含在该行出现非0元素的列位置以及非0元素的值, 可能的实现方式是每一行都包含一个列表, 该列表是个升序排序后的列表, 每个元素代表在原矩阵中在该行非0元素出现的列位置

Advantages of the LIL format

supports flexible slicing
changes to the matrix sparsity structure are efficient

Disadvantages of the LIL format

arithmetic operations LIL + LIL are slow (consider CSR or CSC)
slow column slicing (consider CSC)
slow matrix vector products (consider CSR or CSC)

Intended Usage

LIL is a convenient format for constructing sparse matrices
once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
consider using the COO format when constructing large matrices

Data Structure

An array (self.rows) of rows, each of which is a sorted list of column indices of non-zero elements.
The corresponding nonzero values are stored in similar fashion in self.data.

Scipy代码如下:

from scipy import sparse
from numpy.random import rand
import numpy as np
mtx = sparse.lil_matrix((4, 5))
data = np.round(rand(2, 3))

print(data)
>>> [[0. 1. 0.]
    [0. 1. 0.]]
mtx[:2, [1, 2, 3]] = data

print(mtx)
>>> (0, 2)  1.0
    (1, 2)  1.0

print(mtx.todense())
>>> [[0. 0. 1. 0. 0.]
     [0. 0. 1. 0. 0.]
     [0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0.]]

print(mtx.toarray())
>>> [[0. 0. 1. 0. 0.]
     [0. 0. 1. 0. 0.]
     [0. 0. 0. 0. 0.]
     [0. 0. 0. 0. 0.]]
mtx = sparse.lil_matrix([[0, 1, 2, 0], [3, 0, 1, 0], [1, 0, 0, 1]])

print(mtx.todense()) 
>>> [[0 1 2 0]
     [3 0 1 0]
     [1 0 0 1]]
     
print(mtx)
>>>   (0, 1)    1
      (0, 2)    2
      (1, 0)    3
      (1, 2)    1
      (2, 0)    1
      (2, 3)    1
      
print(mtx[:2, :])
>>> (0, 1)  1
    (0, 2)  2
    (1, 0)  3
    (1, 2)  1

print(mtx[:2, :].todense())
>>> [[0 1 2 0]
     [3 0 1 0]]

print(mtx[1:2, [0,2]].todense())
>>> [[3 1]]

print(mtx.todense())
>>> [[0 1 2 0]
     [3 0 1 0]
     [1 0 0 1]]

Coordinate list (COO)

COO是一个列表, 其元素为元组, 每个元组的格式是(row_index,column_index,value), 整个列表首先按row_index排序, 之后在此基础上再按column_index排序, 排序的意义在于可以更快的查找到特定元素, 同样这种数据结构很适合在构建matrix的时候往matrix中一步一步加元素, 即incremental matrix construction。

Advantages of the COO format

facilitates fast conversion among sparse formats
permits duplicate entries (see example)
very fast conversion to and from CSR/CSC formats

Disadvantages of the COO format

does not directly support:
- arithmetic operations
- slicing

Intended Usage

COO is a fast format for constructing sparse matrices
Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)

代码如下:

from scipy import sparse
import numpy as np
mtx = sparse.coo_matrix((3, 4), dtype=np.int8)
print(mtx.todense())
>>> [[0 0 0 0]
     [0 0 0 0]
     [0 0 0 0]]

row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
mtx = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
print(mtx)
>>>   (0, 0)    4
      (3, 3)    5
      (1, 1)    7
      (0, 2)    9

print(mtx.todense())
>>> [[4 0 9 0]
     [0 7 0 0]
     [0 0 0 0]
     [0 0 0 5]]

row = np.array([0, 0, 1, 3, 1, 0, 0])
col = np.array([0, 2, 1, 3, 1, 0, 0])
data = np.array([1, 1, 1, 1, 1, 1, 1])
mtx = sparse.coo_matrix((data, (row, col)), shape=(4, 4))
print(mtx.todense())
>>> [[3 0 1 0]
     [0 2 0 0]
     [0 0 0 0]
     [0 0 0 1]]

#不支持索引
print(mtx[2, 3])
>>> Traceback (most recent call last):
  File "/Users/shenyi/Documents/Untitled.py", line 21, in <module>
    print(mtx[2, 3])
TypeError: 'coo_matrix' object is not subscriptable

可以用scipy中的sparse中的csr_matrix和csc_matrix, 其中CSR代表Compressed Sparse Row matrix而CSC代表Compressed Sparse Column matrix

Compressed sparse row和Compressed sparse column

CSR或者 compressed row storage (CRS)用三个列表来表达稀疏矩阵, 其跟COO较为相似, 但比COO更进一步的是其还将每一行的空间压缩了, 这种类型的数据结构支持快速row access和矩阵的相加, 同理 CSC支持快速column access和矩阵的相加:

我们用CSR举例, 首先第一种最简单的即生成全是0的矩阵:

import numpy as np
from scipy.sparse import csr_matrix
csr_matrix((3, 4), dtype=np.int8).toarray()

#结果是:
>>> array([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]], dtype=int8)

如果想往这里边加非0元素, 可以传入这些非0元素的数值以及其所以在行和列的下标位置:

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
#即第(0,0)位置上的元素是1, 第(0,2)位置上的元素是2, 以此类推
#注意此处调用csr_matrix的时候如果是传入行和列的列表的话要以元组形式传入, 然后再将该元组与data组成一个新的元祖
csr_matrix((data, (row, col)), shape=(3, 3)).toarray()

#结果是:
>>> array([[1, 0, 2],
           [0, 0, 3],
           [4, 5, 6]])

另外一种调用方式是更复杂一些的, 首先传入数据列表, 接着传入indices和indptr, 如下:

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 1, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
#注意此处的csr_matrix调用时传参方法与上述不同, 上面用col和row的方法中是将col和row打包成一个元组, 此处没有如此操作
csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()

#结果是:
>>> array([[1, 0, 2],
           [0, 3, 0],
           [4, 5, 6]])

其中我们首先看indices, indices代表每一行的非0元素的位置, indptr列表中最后一个元素代表整个矩阵中所有非0元素的数量, 除了最后一个元素外, 其他每个元素从第一个元素开始依次指代每一行的数据在indices中开始的位置。

我们来看indptr中第一个元素, 第一个元素代表最后生成的矩阵的第一行, 而该元素等于0代表矩阵中第一行是从indices中第0个元素开始; indptr中第二个元素代表矩阵中的第二列, 而该元素等于2代表在indices这个列表中第二个元素是矩阵中第二列的起始元素

也就是indptr中第二个元素代表其对应着矩阵中第二列, 该元素等于2, 代表indices[2]是第二列的第一个非零元素的位置, 发现indices[2]=1, 说明在矩阵中该行第一个非0元素出现在该行的第一个位置(以0为起始下标), 因为indices和data的下标是一一对应的, 所以此时data[2]=3, 所以矩阵的(1,1)的值为3

附上一张图:

CSR

其中用csr矩阵的形式的优势和劣势为:

Advantages of the CSR format

efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
efficient row slicing
fast matrix vector products

Disadvantages of the CSR format

slow column slicing operations (consider CSC)
changes to the sparsity structure are expensive (consider LIL or DOK)

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> csc_matrix((3, 4), dtype=np.int8).toarray()
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

>>> row = np.array([0, 2, 2, 0, 1, 2])
>>> col = np.array([0, 0, 1, 2, 2, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csc_matrix((data, (row, col)), shape=(3, 3)).toarray()
array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

csc矩阵与csr矩阵的构造方式雷同, 只不过之前是一行一行开始扫描indptr和indices的, 现在变为一列一列了, 如下, 我们看indptr[1]=2, 代表indices中第2个元素开始是矩阵第二列的范畴, 而indices[2]=2, 所以说矩阵第二列中从上往下数第二个元素是data[2]=3, 所以矩阵(2,1)元素为3

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> csc_matrix((3, 4), dtype=np.int8).toarray()
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

>>> row = np.array([0, 2, 2, 0, 1, 2])
>>> col = np.array([0, 0, 1, 2, 2, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csc_matrix((data, (row, col)), shape=(3, 3)).toarray()
array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

>>> indptr = np.array([0, 2, 3, 6])
>>> indices = np.array([0, 2, 2, 0, 1, 2])
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

而其唯一区别即一个是slow column slicing operation一个是slow row slicing operation

如果想从matrix形式变回原来row和col的形式, 可以用如下方法:

Here's your array mtr:

In [11]: mtr Out[11]: \<3x3 sparse matrix of type '\<class 'numpy.int64'\>' with 6 stored elements in Compressed Sparse Row format\> In [12]: mtr.A Out[12]: array([[1, 0, 2], [0, 0, 3], [4, 5, 6]], dtype=int64)

Convert to COO format, and access the data, row and col attributes.

In [13]: c = mtr.tocoo() In [14]: c.data Out[14]: array([1, 2, 3, 4, 5, 6], dtype=int64) In [15]: c.row Out[15]: array([0, 0, 1, 2, 2, 2], dtype=int32) In [16]: c.col Out[16]: array([0, 2, 2, 0, 1, 2], dtype=int32)

或者用如下方法:

import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mtr = csr_matrix((data, (row, col)))

print(mtr.todense())

rows, cols = mtr.nonzero()
data = mtr[rows, cols]

print(rows, cols, data)

#结果是:
[[1 0 2]
 [0 0 3]
 [4 5 6]]
[0 0 1 2 2 2] [0 2 2 0 1 2] [[1 2 3 4 5 6]]