Use PyTorch (Python) to train a reinforcement learning model (probably Meta Reinforcement Learning, i.e. a meta-learned policy, or perhaps Episodic Meta-RL; other RL methods may be tried if those do not work, but the course requirements call for at least one meta-RL approach). The goal is for the model to train faster after being exposed to similar tasks (for example, the first task might need 100 training runs while the 50th similar task might need only 50; these numbers are just an illustration).
A similar setup (the two-armed bandit problem) is described in this article: `https://medium.com/hackernoon/learning-policies-for-learning-policies-meta-reinforcement-learning-rl%C2%B2-in-tensorflow-b15b592a2ddf`
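For context, below is a minimal sketch of the kind of recurrent (RL²-style) policy that article describes, written in PyTorch. The class name `RL2Policy`, the GRU core, and the hidden size are my own assumptions for illustration, not something fixed by the requirements; the key idea is that the network receives the previous action and previous reward at each step, so its recurrent state can adapt to the current bandit within a task.

```python
import torch
import torch.nn as nn


class RL2Policy(nn.Module):
    """Recurrent policy for a 2-armed bandit: input = [one-hot prev action, prev reward]."""

    def __init__(self, n_actions=2, hidden_size=48):
        super().__init__()
        self.core = nn.GRU(input_size=n_actions + 1, hidden_size=hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, n_actions)  # action logits
        self.value_head = nn.Linear(hidden_size, 1)           # state-value estimate

    def forward(self, prev_action_onehot, prev_reward, hidden=None):
        # prev_action_onehot: (batch, 1, n_actions), prev_reward: (batch, 1, 1)
        x = torch.cat([prev_action_onehot, prev_reward], dim=-1)
        out, hidden = self.core(x, hidden)
        return self.policy_head(out), self.value_head(out), hidden
```

Keeping the hidden state across trials (and only resetting it between tasks) is what lets the recurrent core implement a fast, within-task learning rule, which is the mechanism the RL² article relies on.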
In this example, suppose a mouse is facing two water spouts (the left one dispenses water with 80% probability, the right one with 20%). Each time the accuracy reaches 70%, the two probabilities are swapped (left 20%, right 80%).
Check whether a meta-RL model can reduce the number of trials needed to reach the required accuracy (and use matplotlib to plot the number of trials needed to reach that accuracy after each reversal, plus other relevant data); a rough plotting sketch follows below.
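For the plotting requirement, here is a small matplotlib sketch. The argument `trials_per_reversal` is a hypothetical list that would be collected during training (the i-th entry being the number of trials the agent needed to reach the 70% criterion after the i-th reversal):

```python
import matplotlib.pyplot as plt


def plot_trials_per_reversal(trials_per_reversal):
    """Plot how many trials were needed to reach the 70% criterion after each reversal."""
    reversal_index = range(1, len(trials_per_reversal) + 1)
    plt.figure(figsize=(6, 4))
    plt.plot(reversal_index, trials_per_reversal, marker="o")
    plt.xlabel("Reversal index")
    plt.ylabel("Trials to reach 70% criterion")
    plt.title("Trials needed per reversal")
    plt.tight_layout()
    plt.show()
```

If meta-learning works as hoped, this curve should trend downward as the agent experiences more reversals.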
Based on the description from class, here is a rough sketch of the environment (it can be modified):
```python
"""
env: WaterPipeEnv
The task consisted of two blocks:
the left high return block, in which there was an 80% probability of obtaining a reward for licking the left spout
and a 20% probability for licking the right spout,
and the right high return block, in which there was an 80% probability of obtaining a reward for licking the right spout
and a 20% probability for licking the left spout.
At the start of each training day, mice were randomly assigned to one of the two blocks,
and the block type was switched when the mouse obtained rewards from the high-return side in at least 70% of the last 30 attempts.
(used in reverse method)
We assumed that each attempt made by the mouse required a certain amount of energy,
so if the mouse did not receive a reward for that attempt, the gain value was -0.25,
while if the mouse received a reward, the gain value was 1.
Therefore, if the mouse continues to lick the low-return side, the expected gain value should be 0,
because the mouse has a 20% chance of receiving water by licking that side repeatedly.
TODO:
During testing, probs can be any pair of probabilities as long as they sum to 1, such as [0.1, 0.9], [0.15, 0.85], ...; 0.8 with 0.2 is fixed here only for simplicity.
"""
import random


class WaterPipeEnv:
    def __init__(self):
        # Start in a random block: [P(left spout rewards), P(right spout rewards)]
        self.probs = [0.8, 0.2] if random.random() < 0.5 else [0.2, 0.8]
        self.history = []        # Rewards of the attempts since the last reversal
        self.reverse_count = 0   # Total number of reversals so far

    def reverse(self):
        # Swap the reward probabilities of the two spouts
        self.probs = [self.probs[1], self.probs[0]]
        self.reverse_count += 1
        # save the history (maybe use reverse_count `data\xxx_{reverse_count}.xxx`) as a file
        self.history = []  # Reset history after reversal

    def should_reverse(self):
        # The block is switched once the mouse has been rewarded on at least
        # 70% of the last 30 attempts
        if len(self.history) >= 30:
            success_rate = sum(self.history[-30:]) / 30
            return success_rate >= 0.7
        return False

    def step(self, action):
        # Simulate one lick: action 0 = left spout, action 1 = right spout
        reward = 1 if random.random() < self.probs[action] else 0
        self.history.append(reward)
        # Each attempt costs energy: gain is +1 if rewarded, -0.25 otherwise
        gain = 1 if reward == 1 else -0.25
        return gain, reward, action
```
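Finally, a rough sketch (not a definitive implementation) of how this environment could be wired to a recurrent meta-RL agent, assuming the hypothetical `RL2Policy` and `plot_trials_per_reversal` helpers sketched earlier. It uses a plain REINFORCE update on the immediate gain purely for illustration; an actual RL²/A2C setup would use discounted returns, a value baseline, entropy regularisation, and truncated backprop through time, and the trial budget below is only a placeholder:

```python
import torch
import torch.nn.functional as F

env = WaterPipeEnv()
policy = RL2Policy()                       # hypothetical recurrent policy from the sketch above
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

trials_per_reversal = []                   # how many trials each block lasted before the criterion
trials_in_block = 0
prev_action = torch.zeros(1, 1, 2)         # one-hot of the previous action
prev_reward = torch.zeros(1, 1, 1)
hidden = None

for trial in range(5000):                  # total number of licks (placeholder budget)
    logits, _value, hidden = policy(prev_action, prev_reward, hidden)
    dist = torch.distributions.Categorical(logits=logits.squeeze())
    action = dist.sample()

    gain, reward, _ = env.step(action.item())
    trials_in_block += 1

    # Plain REINFORCE on the immediate gain (placeholder update; a real RL²/A2C
    # agent would use returns, a value baseline and truncated BPTT)
    loss = -dist.log_prob(action) * gain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    hidden = hidden.detach()               # do not backprop through the whole history

    prev_action = F.one_hot(action, num_classes=2).float().view(1, 1, 2)
    prev_reward = torch.tensor([[[float(reward)]]])

    if env.should_reverse():
        trials_per_reversal.append(trials_in_block)
        trials_in_block = 0
        env.reverse()

plot_trials_per_reversal(trials_per_reversal)
```

The loop records how many trials each block lasted before the 70% criterion was met, which is exactly the quantity the matplotlib sketch above is meant to display.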