ShoelessCai - A Walkthrough of Neural Network Model Code

Published: 2023.03.28  Author: Jingyi  Source: ShoelessCai  Views: 160

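[Image caption] September 9, 2022, Luoyang, Henan: the excavation site west of the sacrificial area at the Erlitou site. On March 27 and 28, 2023, after project presentations and a comprehensive review, the judges voted to select the Top Ten New Archaeological Discoveries of 2022. Guan Qiang, deputy director of the National Cultural Heritage Administration, announced the ten selected projects: the Xuetangliangzi site in Shiyan, Hubei; the Zhaojiaxuyao site in Linzi, Shandong; the Bicun site in Xing County, Shanxi; the multi-grid layout of the Erlitou capital in Yanshi, Henan; the Shang royal tombs and surrounding remains at Yinxu in Anyang, Henan; the Xitou site in Xunyi, Shaanxi; the Dasongshan cemetery in Gui'an New Area, Guizhou; the temple site at Gucheng Village in Hunchun, Jilin; the Zhouqiao bridge and nearby Bian River site in Kaifeng, Henan; and the Shuomen ancient port site in Wenzhou, Zhejiang. Photo: Visual China Group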

01 Code for the BP Neural Network and Its Variants

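This post draws on Zhou Zhihua's "Machine Learning" and the Zhihu article "Machine Learning (Zhou Zhihua) Exercises, Chapter 5: Neural Networks". The code is available through the link on the original page.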

02 Pseudocode Walkthrough

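Besides initialization, fit(), and predict(), the BPNN class contains the following groups of functions.

Network propagation: forward propagation through one layer, forward propagation through L layers, backward propagation through one layer, and backward propagation through L layers.

Optimizers: Gradient Descent, Stochastic Gradient Descent (SGD), SGD with Momentum, and SGD with Adam.

Data: Iris, with 4 features (X_) and 3 output classes (y_).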

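1 Function - Initialization: xavier_initializer

Function: xavier_initializer(layer_dims_, seed=16)
output: parameters_

tmp_w = np.random.randn(layer_dims_[1], layer_dims_[0]) * np.sqrt(1 / layer_dims_[0])
tmp_b = np.zeros((1, layer_dims_[1]))

parameters_['W1'] = tmp_w
parameters_['b1'] = tmp_b
# The same pattern repeats for every layer: for a 20-layer network, layer_dims_ is indexed 0 to 20 and the keys run from W1, b1 to W20, b20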
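
The initialization above can be sketched as a loop over all layers. This is a minimal sketch rather than the original implementation: the weight shape (n_out, n_in) is inferred from the parameter shapes listed in section 8, and the use of np.random.default_rng for seeding is an assumption.

```python
import numpy as np

def xavier_initializer(layer_dims_, seed=16):
    """Return {'W1', 'b1', ..., 'WL', 'bL'} for the given layer sizes."""
    # Assumption: a seeded Generator; the original code may seed differently.
    rng = np.random.default_rng(seed)
    parameters_ = {}
    for l in range(1, len(layer_dims_)):
        n_in, n_out = layer_dims_[l - 1], layer_dims_[l]
        # Standard-normal draws scaled by sqrt(1 / n_in), as in the pseudocode.
        parameters_['W' + str(l)] = rng.standard_normal((n_out, n_in)) * np.sqrt(1 / n_in)
        parameters_['b' + str(l)] = np.zeros((1, n_out))
    return parameters_

# Iris-sized network: 4 inputs, 3 hidden units, 3 output classes.
params = xavier_initializer([4, 3, 3])
for name, value in params.items():
    print(name, value.shape)
```

Scaling by sqrt(1 / n_in) keeps the variance of each layer's linear output roughly independent of the number of inputs, which is the point of this initialization.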

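2 Function - Loss: compute_cost

compute_cost( y_hat_, y_ )
output: cost # a single value

if (case 1: binary classification, sigmoid output):
    cost = cross_entry_sigmoid( y_hat_, y_ ) # value
elif (case 2: multi-class classification, softmax output):
    cost = cross_entry_softmax( y_hat_, y_ ) # value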

3 Function - Cross-Entropy: cross_entry_sigmoid

cross_entry_sigmoid(y_hat_, y_)
output: loss

m = y_.shape[0] # number of samples
loss = -(np.dot(y_.T, np.log(y_hat_)) + np.dot(1 - y_.T, np.log(1 - y_hat_))) / m
# loss is a single value

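4 Function - Cross-Entropy: cross_entry_softmax

cross_entry_softmax(y_hat_, y_)
output: loss

m = y_.shape[0] # number of samples
loss = -np.sum(y_ * np.log(y_hat_)) / m
# loss is a single value

# The two functions above, cross_entry_sigmoid and cross_entry_softmax, compute the cross-entropy sample by sample and append it to loss_vec
# They are expanded here mainly to help readers understand how cross-entropy is computed. BTW, this differs somewhat from Zhou Zhihua's book (whose BP derivation uses a squared-error loss), so I will leave it as an open question for now.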
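
To make the two losses concrete, here is a small runnable sketch. It assumes y_hat_ and y_ are arrays with one row per sample (shape (m, 1) for the sigmoid case and (m, 3) with one-hot labels for the softmax case); the sample values are made up for illustration.

```python
import numpy as np

def cross_entry_sigmoid(y_hat_, y_):
    """Binary cross-entropy averaged over m samples; inputs have shape (m, 1)."""
    m = y_.shape[0]
    loss = -(np.dot(y_.T, np.log(y_hat_)) + np.dot(1 - y_.T, np.log(1 - y_hat_))) / m
    return float(np.squeeze(loss))  # the dot products give a 1x1 array

def cross_entry_softmax(y_hat_, y_):
    """Multi-class cross-entropy averaged over m samples; y_ is one-hot."""
    m = y_.shape[0]
    return float(-np.sum(y_ * np.log(y_hat_)) / m)

# Two samples, binary labels.
y_bin = np.array([[1.0], [0.0]])
p_bin = np.array([[0.9], [0.2]])
loss_bin = cross_entry_sigmoid(p_bin, y_bin)

# Two samples, three classes.
y_multi = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
p_multi = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
loss_multi = cross_entry_softmax(p_multi, y_multi)

print(loss_bin, loss_multi)
```

In both cases only the probability assigned to the true label contributes to the loss, so a confident correct prediction gives a value near zero.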

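5 Function - Activation Backward Pass: sigmoid_backward

sigmoid_backward( da_, cache_z )
output: dz_

a = 1 / (1 + np.exp(-cache_z)) # sigmoid output recomputed from the cached z
dz_ = da_ * a * (1-a) # error term; see the Watermelon Book (Zhou Zhihua, "Machine Learning"), pp. 103-105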

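6 Function - Activation Backward Pass: softmax_backward

softmax_backward( y_, cache_z )
output: dz_

tmp_z = cache_z - np.max(cache_z, axis=1, keepdims=True) # subtract the row maximum so that np.exp cannot overflow
# cache = ( previous layer's activation output a_pre_, weights w_, bias b_, linear output z_ ); see the functions later in this article
# linear output: z_ = np.dot(a_pre_, w_.T) + b_
a = np.exp(tmp_z) / np.sum( np.exp(tmp_z), axis=1, keepdims=True )
dz_ = a - y_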
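
The result dz_ = a - y_ can be checked numerically. The sketch below assumes one-hot labels and a summed cross-entropy loss, and it shifts each row by its maximum before exponentiating, a standard stabilization that leaves the softmax output unchanged. The helper names softmax and loss are illustrative.

```python
import numpy as np

def softmax(z):
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_backward(y_, cache_z):
    # Gradient of the cross-entropy with respect to z for one-hot y_.
    return softmax(cache_z) - y_

def loss(z, y):
    # Summed cross-entropy; the division by m happens later, in dw and db.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([[1.0, 2.0, 0.5], [-1.0, 0.3, 0.2]])
y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

analytic = softmax_backward(y, z)

# Central finite differences, one entry of z at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for idx in np.ndindex(z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (loss(z_plus, y) - loss(z_minus, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))

# Large inputs stay finite thanks to the shift.
z_big = np.array([[1000.0, 1001.0, 999.0]])
print(softmax(z_big))
```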

7 Function - Activation Backward Pass: relu_backward

relu_backward( da_, cache_z )
output: dz

dz = np.array(da_, copy=True) # da_ is the gradient with respect to the activation output
dz[cache_z<=0]=0
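
The sigmoid and ReLU backward steps from sections 5 and 7 can be written out as follows. This is a sketch under the assumption that cache_z holds the linear output z of the layer and that da_ has the same shape; the input values are made up.

```python
import numpy as np

def sigmoid_backward(da_, cache_z):
    a = 1.0 / (1.0 + np.exp(-cache_z))  # sigmoid output recomputed from z
    return da_ * a * (1.0 - a)          # chain rule: da/dz = a * (1 - a)

def relu_backward(da_, cache_z):
    dz = np.array(da_, copy=True)       # copy so the caller's da_ is untouched
    dz[cache_z <= 0] = 0                # ReLU passes no gradient where z <= 0
    return dz

z = np.array([[-2.0, 0.0, 3.0]])
da = np.array([[0.5, 0.5, 0.5]])

dz_relu = relu_backward(da, z)
dz_sigmoid = sigmoid_backward(da, z)
print(dz_relu)
print(dz_sigmoid)
```

At z = 0 the sigmoid output is 0.5, so its slope is 0.25, the largest value it can take; this is why gradients shrink as they pass through stacked sigmoid layers.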

8 Parameters: parameters_

W1: 3x4 matrix # used as np.dot(a_pre_, W1.T)
b1: 1x3 vector
W2: 3x3 matrix
b2: 1x3 vector

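9 Function - Forward Propagation: forward_L_layer

forward_L_layer( X_, parameters_ )
output: a_last, caches # see the function body for what each variable means

a_last: 1x3 vector (activation of the last layer)
caches: list of 2 caches (one per layer), each a 4-tuple

a_ = X_
caches = []
FOR each hidden layer in turn (here only layer 1):
    w_ = parameters_['W1']
    b_ = parameters_['b1']
    a_pre_ = a_
    a_, cache_ = forward_one_layer( a_pre_, w_, b_, 'relu' )
    caches.append( cache_ )

# the last w, b are those of the output layer
w_last = the last weight matrix
b_last = the last bias
a_last, cache_ = forward_one_layer(a_, w_last, b_last, 'softmax') # 'sigmoid' for binary classification
caches.append(cache_)

# The basic idea: propagate through L layers, one layer at a time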
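
The layer-by-layer forward pass can be sketched as runnable code. This sketch assumes ReLU hidden layers and a softmax output layer for the 3-class Iris case; the random inputs and weights are placeholders, not data from the original code.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def forward_one_layer(a_pre_, w_, b_, activation_):
    z_ = np.dot(a_pre_, w_.T) + b_
    a_ = relu(z_) if activation_ == 'relu' else softmax(z_)
    return a_, (a_pre_, w_, b_, z_)

def forward_L_layer(X_, parameters_):
    L = len(parameters_) // 2          # one W and one b per layer
    caches = []
    a_ = X_
    for l in range(1, L):              # hidden layers use ReLU
        a_, cache_ = forward_one_layer(
            a_, parameters_['W' + str(l)], parameters_['b' + str(l)], 'relu')
        caches.append(cache_)
    # Output layer (assumed softmax for multi-class classification).
    a_last, cache_ = forward_one_layer(
        a_, parameters_['W' + str(L)], parameters_['b' + str(L)], 'softmax')
    caches.append(cache_)
    return a_last, caches

rng = np.random.default_rng(0)
parameters_ = {
    'W1': rng.standard_normal((3, 4)) * 0.5, 'b1': np.zeros((1, 3)),
    'W2': rng.standard_normal((3, 3)) * 0.5, 'b2': np.zeros((1, 3)),
}
X_ = rng.standard_normal((5, 4))       # 5 samples, 4 features
a_last, caches = forward_L_layer(X_, parameters_)
print(a_last.shape, len(caches))
```

Each cache keeps exactly what the matching backward step needs: the layer input, the weights, the bias, and the linear output.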

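10 Function - Backward Propagation: backward_L_layer

backward_L_layer( a_last, y_, caches )
output: grads = ['da2', 'dW2', 'db2', 'da1', 'dW1', 'db1']

# Layer 2: output layer
da2: 1x3 vector
dW2: 3x3 matrix
db2: 1x3 vector

# Layer 1: input-to-hidden layer
da1: 1x4 vector # np.dot(dz_, W1), the gradient passed back to the input
dW1: 3x4 matrix # same shape as W1
db1: 1x3 vector

# Last layer
if (binary classification):
    da_last = -(y_ / a_last - (1 - y_) / (1 - a_last)) # derivative of the sigmoid cross-entropy with respect to a_last
    da_pre_L_1, dwL_, dbL_ = backward_one_layer( da_last, caches[-1], 'sigmoid' )
elif (multi-class):
    da_pre_L_1, dwL_, dbL_ = backward_one_layer( y_, caches[-1], 'softmax' ) # softmax_backward takes y_ directly
# Last layer: store the values

grads['da' + str(L)] = da_pre_L_1
grads['dW' + str(L)] = dwL_
grads['db' + str(L)] = dbL_

FOR l = L-1 down to 1:
    da_pre_, dw, db = backward_one_layer( grads['da' + str(l+1)], caches[l-1], 'relu' )
    grads['da' + str(l)] = da_pre_
    grads['dW' + str(l)] = dw
    grads['db' + str(l)] = db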

11 Function - Forward Propagation: forward_one_layer

forward_one_layer( a_pre_, w_, b_, activation_ )
output: a_, cache_

# the previous layer's activated value is the input
z_ = np.dot( a_pre_, w_.T ) + b_
a_ = activation_func( z_ )
cache_ = (a_pre_, w_, b_, z_)
# definition of the cache layout (the author notes that this function is not yet used in this algorithm)

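12 Function - Backward Propagation: backward_one_layer

backward_one_layer( da_, cache_, activation_ )

output: da_pre, dw, db

# unpack the cache to get a_pre_, the previous layer's activation value
(a_pre_, w_, b_, z_) = cache_
m = da_.shape[0] # number of samples in the batch

# backward propagation: every quantity below is a gradient
dz_ = activations_func( da_, z_ ) # stands for sigmoid_backward, softmax_backward, or relu_backward, chosen by activation_
dw = np.dot( dz_.T, a_pre_ ) / m # same shape as w_
db = np.sum( dz_, axis=0, keepdims=True ) / m
da_pre = np.dot( dz_, w_ )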
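
The dw formula above can be verified with a finite-difference check. The sketch below specializes the layer to a softmax output so that the loss is well defined; the function names backward_one_layer_softmax and cost, and the random data, are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def backward_one_layer_softmax(y_, cache_):
    a_pre_, w_, b_, z_ = cache_
    m = y_.shape[0]
    dz_ = softmax(z_) - y_                         # softmax_backward
    dw = np.dot(dz_.T, a_pre_) / m                 # same shape as w_
    db = np.sum(dz_, axis=0, keepdims=True) / m
    da_pre = np.dot(dz_, w_)                       # passed to the previous layer
    return da_pre, dw, db

def cost(a_pre_, w_, b_, y_):
    y_hat = softmax(np.dot(a_pre_, w_.T) + b_)
    return -np.sum(y_ * np.log(y_hat)) / y_.shape[0]

rng = np.random.default_rng(1)
a_pre_ = rng.standard_normal((6, 4))               # 6 samples, 4 inputs
w_ = rng.standard_normal((3, 4))
b_ = np.zeros((1, 3))
y_ = np.eye(3)[rng.integers(0, 3, size=6)]         # random one-hot labels
cache_ = (a_pre_, w_, b_, np.dot(a_pre_, w_.T) + b_)

da_pre, dw, db = backward_one_layer_softmax(y_, cache_)

# Central finite differences on every entry of w_.
eps = 1e-6
dw_num = np.zeros_like(w_)
for idx in np.ndindex(w_.shape):
    w_plus, w_minus = w_.copy(), w_.copy()
    w_plus[idx] += eps
    w_minus[idx] -= eps
    dw_num[idx] = (cost(a_pre_, w_plus, b_, y_) - cost(a_pre_, w_minus, b_, y_)) / (2 * eps)

print(np.max(np.abs(dw - dw_num)))
```

A check like this is a cheap way to catch transposition or sign mistakes before training.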

13 Function - Parameter Update: update_parameters_with_gd

update_parameters_with_gd( parameters_, grads, learning_rate )

output: parameters_

W -= learning_rate * dW
b -= learning_rate * db
# W, b come from parameters_; dW, db come from grads

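14 Function - Parameter Update: update_parameters_with_sgd

Function: update_parameters_with_sgd( parameters_, grads, learning_rate )

output: parameters_

W -= learning_rate * dW
b -= learning_rate * db
# W, b come from parameters_; dW, db come from grads
# No random term is added when W, b are updated here: the rule is the same as in gradient descent, and the randomness comes from the sample that the optimizer draws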

15 Function - Parameter Update: update_parameters_with_momentum

update_parameters_with_momentum( parameters, grads, velocity, beta, learning_rate )

output: parameters, velocity

# Input layer
velocity['dW1'] = beta * velocity['dW1'] + (1-beta) * grads['dW1']
velocity['db1'] = beta * velocity['db1'] + (1-beta) * grads['db1']
parameters['W1'] -= learning_rate * velocity['dW1']
parameters['b1'] -= learning_rate * velocity['db1']

# Output layer
velocity['dW2'] = beta * velocity['dW2'] + (1-beta) * grads['dW2']
velocity['db2'] = beta * velocity['db2'] + (1-beta) * grads['db2']
parameters['W2'] -= learning_rate * velocity['dW2']
parameters['b2'] -= learning_rate * velocity['db2']
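
The two per-layer blocks above follow the same pattern, so they can be folded into a loop. A minimal sketch, assuming velocity starts at zero and the dictionaries use the key names shown in the pseudocode; the one-layer toy values are made up.

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, velocity, beta, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for name in ('W', 'b'):
            key = 'd' + name + str(l)
            # Exponentially weighted average of past gradients.
            velocity[key] = beta * velocity[key] + (1 - beta) * grads[key]
            parameters[name + str(l)] -= learning_rate * velocity[key]
    return parameters, velocity

parameters = {'W1': np.ones((3, 4)), 'b1': np.zeros((1, 3))}
grads = {'dW1': np.full((3, 4), 2.0), 'db1': np.full((1, 3), 1.0)}
velocity = {key: np.zeros_like(value) for key, value in grads.items()}

parameters, velocity = update_parameters_with_momentum(
    parameters, grads, velocity, beta=0.9, learning_rate=0.1)

# First step: velocity = 0.1 * grad, so W1 moves from 1.0 to 0.98.
print(parameters['W1'][0, 0], parameters['b1'][0, 0])
```

Because the velocity starts at zero, the first few steps are much smaller than plain SGD steps; Adam's bias correction addresses exactly this.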

16 Function - Parameter Update: update_parameters_with_sgd_adam

update_parameters_with_sgd_adam( parameters_, grads_, velocity, square_grad, epoch, learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8 )

output: parameters_, velocity, square_grad

# (1) Input layer
# Update velocity: weighted average of velocity's dW1, db1 and grads_' dW1, db1
velocity['dW1'] = beta1 * velocity['dW1'] + (1-beta1) * grads_['dW1']
velocity['db1'] = beta1 * velocity['db1'] + (1-beta1) * grads_['db1']

# Bias-correct velocity's dW1, db1
vw_corr = velocity['dW1'] / (1-beta1**epoch)
vb_corr = velocity['db1'] / (1-beta1**epoch)

# Update square_grad: weighted average of square_grad's dW1, db1 and the squares of grads_' dW1, db1
square_grad['dW1'] = beta2 * square_grad['dW1'] + (1-beta2) * (grads_['dW1']**2)
square_grad['db1'] = beta2 * square_grad['db1'] + (1-beta2) * (grads_['db1']**2)

# Bias-correct square_grad's dW1, db1
sw_corr = square_grad['dW1'] / (1-beta2**epoch)
sb_corr = square_grad['db1'] / (1-beta2**epoch)

parameters_['W1'] -= learning_rate * vw_corr / np.sqrt( sw_corr + epsilon )
parameters_['b1'] -= learning_rate * vb_corr / np.sqrt( sb_corr + epsilon )

# (2) Output layer
# Update velocity: weighted average of velocity's dW2, db2 and grads_' dW2, db2
velocity['dW2'] = beta1 * velocity['dW2'] + (1-beta1) * grads_['dW2']
velocity['db2'] = beta1 * velocity['db2'] + (1-beta1) * grads_['db2']

# Bias-correct velocity's dW2, db2
vw_corr = velocity['dW2'] / (1-beta1**epoch)
vb_corr = velocity['db2'] / (1-beta1**epoch)

# Update square_grad: weighted average of square_grad's dW2, db2 and the squares of grads_' dW2, db2
square_grad['dW2'] = beta2 * square_grad['dW2'] + (1-beta2) * (grads_['dW2']**2)
square_grad['db2'] = beta2 * square_grad['db2'] + (1-beta2) * (grads_['db2']**2)

# Bias-correct square_grad's dW2, db2
sw_corr = square_grad['dW2'] / (1-beta2**epoch)
sb_corr = square_grad['db2'] / (1-beta2**epoch)

parameters_['W2'] -= learning_rate * vw_corr / np.sqrt( sw_corr + epsilon )
parameters_['b2'] -= learning_rate * vb_corr / np.sqrt( sb_corr + epsilon )

# The parameter-update functions above (update_parameters_with_gd, _sgd, _momentum, _sgd_adam) are called by the optimizers below
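
The Adam step can be sketched as a loop over layers. This sketch keeps epsilon inside the square root, as the pseudocode does, and assumes epoch counts from 1 and that velocity and square_grad start at zero. Note that exponentiation in Python is `**`; `^` is bitwise XOR and fails on floats.

```python
import numpy as np

def update_parameters_with_sgd_adam(parameters_, grads_, velocity, square_grad, epoch,
                                    learning_rate=0.1, beta1=0.9, beta2=0.999,
                                    epsilon=1e-8):
    L = len(parameters_) // 2
    for l in range(1, L + 1):
        for name in ('W', 'b'):
            key = 'd' + name + str(l)
            # First and second moment estimates.
            velocity[key] = beta1 * velocity[key] + (1 - beta1) * grads_[key]
            square_grad[key] = beta2 * square_grad[key] + (1 - beta2) * grads_[key] ** 2
            # Bias correction; epoch must start at 1 to avoid dividing by zero.
            v_corr = velocity[key] / (1 - beta1 ** epoch)
            s_corr = square_grad[key] / (1 - beta2 ** epoch)
            parameters_[name + str(l)] -= learning_rate * v_corr / np.sqrt(s_corr + epsilon)
    return parameters_, velocity, square_grad

parameters_ = {'W1': np.ones((3, 4)), 'b1': np.zeros((1, 3))}
grads_ = {'dW1': np.full((3, 4), 2.0), 'db1': np.full((1, 3), -3.0)}
velocity = {key: np.zeros_like(value) for key, value in grads_.items()}
square_grad = {key: np.zeros_like(value) for key, value in grads_.items()}

parameters_, velocity, square_grad = update_parameters_with_sgd_adam(
    parameters_, grads_, velocity, square_grad, epoch=1)

# On the first step the corrected ratio is close to sign(grad),
# so each parameter moves by roughly learning_rate.
print(parameters_['W1'][0, 0], parameters_['b1'][0, 0])
```

The first step moves every parameter by about the learning rate regardless of the gradient's magnitude, which illustrates how Adam normalizes step sizes per parameter.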

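17 Function - Optimization Algorithm: optimizer_gd

optimizer_gd( X_, y_, parameters_, epochs, learning_rate, seed )

output: parameters_, cost history (cost_vec)

# Note: as written, each iteration draws one random sample and calls update_parameters_with_sgd, so the loop below performs stochastic gradient descent

for epoch in range(epochs):

    # (1) Random index
    generate the random index random_index

    # (2) Sample and update grads
    # forward_L_layer returns the output-layer activation a_last and the per-layer caches (inputs, weights, biases, linear outputs)
    a_last, caches = self.forward_L_layer(X_[[random_index], :], parameters_)

    # takes the forward results a_last, caches and returns the gradients grads['dW', 'db']
    grads = self.backward_L_layer(a_last, y_[[random_index], :], caches)

    # update Wi, bi using the learning rate
    parameters_ = update_parameters_with_sgd(parameters_, grads, learning_rate)

    # (3) Full data set: compute the cost
    # run one more forward pass over all samples and use its output to compute the cost
    a_last_cost, _ = self.forward_L_layer(X_, parameters_)
    cost = self.compute_cost(a_last_cost, y_)
    cost_vec.append(cost)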
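
The whole loop (sample one example, forward, backward, update, then evaluate the cost on all data) can be put together in a short end-to-end sketch. Everything here is an illustrative assumption rather than the original code: synthetic 3-class data stands in for Iris, the network has one ReLU hidden layer of width 8, and the helper names forward, backward, and optimizer_sgd are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def forward(X_, p):
    z1 = X_ @ p['W1'].T + p['b1']
    a1 = np.maximum(0, z1)                       # ReLU hidden layer
    z2 = a1 @ p['W2'].T + p['b2']
    return softmax(z2), (X_, z1, a1)

def backward(a_last, y_, cache, p):
    X_, z1, a1 = cache
    m = y_.shape[0]
    dz2 = a_last - y_                            # softmax + cross-entropy
    grads = {'dW2': dz2.T @ a1 / m, 'db2': dz2.sum(axis=0, keepdims=True) / m}
    dz1 = (dz2 @ p['W2']) * (z1 > 0)             # ReLU backward
    grads['dW1'] = dz1.T @ X_ / m
    grads['db1'] = dz1.sum(axis=0, keepdims=True) / m
    return grads

def optimizer_sgd(X_, y_, p, epochs, learning_rate, seed):
    rng = np.random.default_rng(seed)
    m_ = X_.shape[0]
    cost_vec = []
    for epoch in range(epochs):
        i = rng.integers(0, m_)                  # (1) random index
        a_last, cache = forward(X_[[i], :], p)   # (2) one-sample gradient step
        grads = backward(a_last, y_[[i], :], cache, p)
        for key in p:
            p[key] -= learning_rate * grads['d' + key]
        a_all, _ = forward(X_, p)                # (3) cost on the full data set
        cost_vec.append(float(-np.sum(y_ * np.log(a_all + 1e-12)) / m_))
    return p, cost_vec

# Synthetic stand-in for Iris: 3 well-separated clusters in 4 dimensions.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 30)
X_ = 2.0 * np.eye(3, 4)[labels] + 0.3 * rng.standard_normal((90, 4))
y_ = np.eye(3)[labels]
p = {'W1': rng.standard_normal((8, 4)) * np.sqrt(1 / 4), 'b1': np.zeros((1, 8)),
     'W2': rng.standard_normal((3, 8)) * np.sqrt(1 / 8), 'b2': np.zeros((1, 3))}

p, cost_vec = optimizer_sgd(X_, y_, p, epochs=1000, learning_rate=0.05, seed=16)
print(cost_vec[0], cost_vec[-1])
```

Computing the cost on the full data set after every single-sample step is expensive for large data; here it only serves to record a smooth learning curve in cost_vec.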

18 Function - Optimization Algorithm: optimizer_sgd_momentum

optimizer_sgd_momentum( X_, y_, parameters_, beta, epochs, learning_rate, seed )

output: parameters_, cost history (cost_vec)

Initialize velocity and cost_vec

for epoch in range(epochs):

    # (1) Random index
    random_index = np.random.randint(0, m_)

    # (2) Sample and update grads
    # forward_L_layer returns the output-layer activation a_last and the per-layer caches
    # backward_L_layer takes a_last, caches and returns the gradients grads['dW', 'db']
    # Wi, bi are then updated using the learning rate
    a_last, caches = self.forward_L_layer(X_[[random_index], :], parameters_)
    grads = self.backward_L_layer(a_last, y_[[random_index], :], caches)
    parameters_, velocity = update_parameters_with_momentum(
        parameters_, grads, velocity, beta, learning_rate)

    # (3) Full data set: compute the cost
    a_last_cost, _ = self.forward_L_layer(X_, parameters_)
    cost = self.compute_cost(a_last_cost, y_)
    cost_vec.append(cost)

19 Function - Optimization Algorithm: optimizer_sgd_adam

optimizer_sgd_adam( X_, y_, parameters_, beta1, beta2, epsilon, epochs, learning_rate, seed )

output: parameters_, cost history (cost_vec)

Initialize velocity, square_grad, and cost_vec

for epoch in range(epochs):

    # (1) Random index
    random_index = np.random.randint(0, m_)

    # (2) Sample and update grads
    # forward_L_layer returns the output-layer activation a_last and the per-layer caches
    # backward_L_layer takes a_last, caches and returns the gradients grads['dW', 'db']
    # Wi, bi are then updated using the learning rate
    a_last, caches = self.forward_L_layer(X_[[random_index], :], parameters_)
    grads = self.backward_L_layer(a_last, y_[[random_index], :], caches)
    parameters_, velocity, square_grad = update_parameters_with_sgd_adam(
        parameters_, grads, velocity, square_grad,
        epoch + 1, learning_rate, beta1, beta2, epsilon) # epoch + 1 so the bias correction starts at step 1

    # (3) Full data set: compute the cost
    a_last_cost, _ = self.forward_L_layer(X_, parameters_)
    cost = self.compute_cost(a_last_cost, y_)
    cost_vec.append(cost)
