support for info dict (backward incompatible)
yannbouteiller committed Mar 28, 2021
1 parent fad8aee commit df4a3e0
Showing 4 changed files with 32 additions and 28 deletions.
24 changes: 13 additions & 11 deletions README.md
@@ -59,7 +59,7 @@ This method defines the core mechanism of Real-Time Gym environments:
Time-steps are elastically constrained to their nominal duration. When this elastic constraint cannot be satisfied, the previous time-step times out and the new time-step starts from the current timestamp.
This happens either because the environment has been 'paused', or because the system is ill-designed:
- The inference duration of the model, i.e. the elapsed duration between two calls of the step() function, may be too long for the time-step duration that the user is trying to use.
- The procedure that retrieves observations may take too much time or may be called too late (the latter can be tweaked in the configuration dictionary). Remember that, if observation capture is too long, it must not be part of the get_obs_rew_done() method of your interface. Instead, this method must simply retrieve the latest available observation from another process, and the action buffer must be long enough to handle the observation capture duration. This is described in the Appendix of [Reinforcement Learning with Random Delays](https://arxiv.org/abs/2010.02966).
- The procedure that retrieves observations may take too much time or may be called too late (the latter can be tweaked in the configuration dictionary). Remember that, if observation capture is too long, it must not be part of the get_obs_rew_done_info() method of your interface. Instead, this method must simply retrieve the latest available observation from another process, and the action buffer must be long enough to handle the observation capture duration. This is described in the Appendix of [Reinforcement Learning with Random Delays](https://arxiv.org/abs/2010.02966).
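
When observation capture is slower than one time-step, a common pattern is to run the capture in a background thread or process and have ```get_obs_rew_done_info()``` simply return the most recent snapshot. Below is a minimal, illustrative sketch of this pattern; the ```ObservationPoller``` name and the sensor's ```read()``` API are assumptions for the example, not part of ```rtgym```:
```python
import threading

class ObservationPoller:
    """Continuously captures observations in a background thread so that the
    rtgym interface can return the latest one without blocking the time-step."""

    def __init__(self, sensor):
        self._sensor = sensor          # hypothetical object with a slow .read() method
        self._latest = None
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._capture_loop, daemon=True)
        self._thread.start()

    def _capture_loop(self):
        while True:
            obs = self._sensor.read()  # slow capture happens here, off the rtgym thread
            with self._lock:
                self._latest = obs

    def latest(self):
        with self._lock:
            return self._latest
```
An interface built this way would call ```poller.latest()``` inside ```get_obs_rew_done_info()``` and return immediately, while the action buffer length accounts for the capture delay as explained above.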


## Tutorial
@@ -96,7 +96,7 @@ from rtgym import RealTimeGymInterface

The [RealTimeGymInterface](https://github.com/yannbouteiller/rtgym/blob/969799b596e91808543f781b513901426b88d138/rtgym/envs/real_time_env.py#L12) is all you need to implement in order to create your custom real-time Gym environment.

This class has 6 abstract methods that you need to implement: ```get_observation_space```, ```get_action_space```, ```get_default_action```, ```reset```, ```get_obs_rew_done``` and ```send_control```.
This class has 6 abstract methods that you need to implement: ```get_observation_space```, ```get_action_space```, ```get_default_action```, ```reset```, ```get_obs_rew_done_info``` and ```send_control```.
It also has ```wait``` and ```render``` methods that you may want to override.
We will implement them all to understand their respective roles.

@@ -218,7 +218,7 @@ class MyRealTimeInterface(RealTimeGymInterface):
def reset(self):
pass

def get_obs_rew_done(self):
def get_obs_rew_done_info(self):
pass

def wait(self):
@@ -305,7 +305,7 @@ Ok, in this case this is actually equivalent, but you get the idea. You may want

---
The ```get_observation_space``` method outputs a ```gym.spaces.Tuple``` object.
This object describes the structure of the observations returned from the ```reset``` and ```get_obs_rew_done``` methods of our interface.
This object describes the structure of the observations returned from the ```reset``` and ```get_obs_rew_done_info``` methods of our interface.

In our case, the observation will contain ```pos_x``` and ```pos_y```, which are both constrained between ```-1.0``` and ```1.0``` in our simple 2D world.
It will also contain target coordinates ```tar_x``` and ```tar_y```, constrained between ```-0.5``` and ```0.5```.
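
As a rough sketch of what this could look like (the tutorial's actual implementation may group the components differently), each scalar of the observation gets its own ```Box```:
```python
import gym
import numpy as np

def get_observation_space(self):
    # one Box per scalar in the observation: [pos_x, pos_y, tar_x, tar_y]
    pos_x_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
    pos_y_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
    tar_x_space = gym.spaces.Box(low=-0.5, high=0.5, shape=(1,), dtype=np.float32)
    tar_y_space = gym.spaces.Box(low=-0.5, high=0.5, shape=(1,), dtype=np.float32)
    return gym.spaces.Tuple((pos_x_space, pos_y_space, tar_x_space, tar_y_space))
```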
@@ -324,7 +324,7 @@ def get_observation_space(self):

---
We can now implement the RL mechanics of our environment (i.e. the reward function and whether we consider the task ```done``` in the episodic setting), and a procedure to retrieve observations from our dummy drone.
This is done in the ```get_obs_rew_done``` method.
This is done in the ```get_obs_rew_done_info``` method.

For this tutorial, we will implement a simple task.

@@ -339,22 +339,24 @@ The task is easy, but not as straightforward as it looks.
Indeed, the presence of random communication delays and the fact that the drone keeps moving in real time make it difficult to reach the target precisely.

---
```get_obs_rew_done``` outputs 3 values:
```get_obs_rew_done_info``` outputs 4 values:
- ```obs```: a list of all the components of the last retrieved observation, except for the action buffer
- ```rew```: a float that is our reward
- ```done```: a boolean that tells whether the episode is finished (always False in the non-episodic setting)
- ```info```: a dictionary that contains any additional information you may want to provide

For our simple task, the implementation is fairly straightforward.
```obs``` contains the last available coordinates and the target, ```rew``` is the negative distance to the target, and ```done``` is True when the target has been reached:
```obs``` contains the last available coordinates and the target, ```rew``` is the negative distance to the target, ```done``` is True when the target has been reached, and since we don't need more information, ```info``` is empty:
```python
def get_obs_rew_done(self):
def get_obs_rew_done_info(self):
pos_x, pos_y = self.rc_drone.get_observation()
tar_x = self.target[0]
tar_y = self.target[1]
obs = [pos_x, pos_y, tar_x, tar_y]
rew = -np.linalg.norm(np.array([pos_x, pos_y], dtype=np.float32) - self.target)
done = rew > -0.01
return obs, rew, done
info = {}
return obs, rew, done, info
```
We did not implement the 100 time-step limit here because this will be done later in the configuration dictionary.

@@ -416,8 +418,8 @@ The ```rtgym``` environment will ensure that the control frequency sticks to thi

The ```"start_obs_capture"``` entry is usually the same as the ```"time_step_duration"``` entry.
It defines the time at which an observation starts being retrieved, which should usually happen instantly at the end of the time-step.
However, in some situations, you will want to actually capture an observation in ```get_obs_rew_done``` and the capture duration will not be negligible.
In such situations, if observation capture is less than 1 time-step, you can do this and use ```"start_obs_capture"``` in order to tell the environment to call ```get_obs_rew_done``` before the end of the time-step.
However, in some situations, you will want to actually capture an observation in ```get_obs_rew_done_info``` and the capture duration will not be negligible.
In such situations, if observation capture is less than 1 time-step, you can do this and use ```"start_obs_capture"``` in order to tell the environment to call ```get_obs_rew_done_info``` before the end of the time-step.
If observation capture is more than 1 time-step, it needs to be performed in a parallel process and the last available observation should be used at each time-step.

In any case, keep in mind that when observation capture is not instantaneous, you should add its maximum duration to the maximum delay, and increase the size of the action buffer accordingly. See the [Reinforcement Learning with Random Delays](https://arxiv.org/abs/2010.02966) appendix for more details.
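
As an illustration, a custom configuration can start from the default dictionary and override only the entries discussed above. This is a sketch: it assumes ```DEFAULT_CONFIG_DICT``` is importable from the ```rtgym``` package (it is defined in ```rtgym/envs/real_time_env.py```), and the numeric values are arbitrary examples:
```python
from copy import deepcopy

from rtgym import DEFAULT_CONFIG_DICT  # assumed import path; defined in rtgym/envs/real_time_env.py

my_config = deepcopy(DEFAULT_CONFIG_DICT)    # deep copy so the shared default dict is not mutated
my_config["time_step_duration"] = 0.05       # nominal duration of one time-step (example value)
my_config["start_obs_capture"] = 0.05        # start capturing the observation at the end of the time-step
my_config["time_step_timeout_factor"] = 1.0  # maximum allowed elasticity
my_config["ep_max_length"] = 100             # implements the 100 time-step limit used in the tutorial
my_config["real_time"] = True
```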
27 changes: 14 additions & 13 deletions rtgym/envs/real_time_env.py
@@ -64,17 +64,18 @@ def wait(self):
"""
self.send_control(self.get_default_action())

def get_obs_rew_done(self):
"""Returns observation, reward and done from the device.
def get_obs_rew_done_info(self):
"""Returns observation, reward, done and info from the device.
Returns:
obs: list
rew: scalar
done: boolean
done: bool
info: dict
Note: Do NOT put the action buffer in obs (automated).
"""
# return obs, rew, done
# return obs, rew, done, info

raise NotImplementedError

@@ -129,7 +130,7 @@ def render(self):
# start_obs_capture should be the same as "time_step_duration" unless observation capture is non-instantaneous and
# smaller than one time-step, and you want to capture it directly in your interface for convenience. Otherwise,
# you need to perform observation capture in a parallel process and simply retrieve the last available observation
# in the get_obs_rew_done() and reset() methods of your interface
# in the get_obs_rew_done_info() and reset() methods of your interface
"time_step_timeout_factor": 1.0, # maximum elasticity in (fraction or number of) time-steps
"ep_max_length": 1000, # maximum episode length
"real_time": True, # True unless you want to revert to the usual turn-based RL setting (not tested yet)
@@ -317,6 +318,7 @@ def __init__(self, config: dict=DEFAULT_CONFIG_DICT):
self.__obs = None
self.__rew = None
self.__done = None
self.__info = None
self.__o_set_flag = False

# environment benchmark:
@@ -431,31 +433,31 @@ def __update_obs_rew_done(self):
observation of this step()
"""
self.__o_lock.acquire()
o, r, d = self.interface.get_obs_rew_done()
o, r, d, i = self.interface.get_obs_rew_done_info()
if not d:
d = (self.current_step >= self.ep_max_length)
elt = o
if self.obs_prepro_func:
elt = self.obs_prepro_func(elt)
elt = tuple(elt)
self.__obs, self.__rew, self.__done = elt, r, d
self.__obs, self.__rew, self.__done, self.__info = elt, r, d, i
self.__o_set_flag = True
self.__o_lock.release()

def _retrieve_obs_rew_done(self):
"""Waits for new available o r d and retrieves them.
def _retrieve_obs_rew_done_info(self):
"""Waits for new available o r d i and retrieves them.
"""
c = True
while c:
self.__o_lock.acquire()
if self.__o_set_flag:
elt, r, d = self.__obs, self.__rew, self.__done
elt, r, d, i = self.__obs, self.__rew, self.__done, self.__info
self.__o_set_flag = False
c = False
self.__o_lock.release()
if self.act_in_obs:
elt = tuple((*elt, *tuple(self.act_buf),))
return elt, r, d
return elt, r, d, i

def init_action_buffer(self):
for _ in range(self.act_buf_len):
@@ -502,8 +504,7 @@ def step(self, action):
self.act_buf.append(action)
if not self.real_time:
self._run_time_step(action)
obs, rew, done = self._retrieve_obs_rew_done()
info = {}
obs, rew, done, info = self._retrieve_obs_rew_done_info()
if self.real_time:
self._run_time_step(action)
if done and self.wait_on_done:
5 changes: 3 additions & 2 deletions rtgym/tuto/tuto.py
@@ -57,14 +57,15 @@ def reset(self):
self.target[1] = np.random.uniform(-0.5, 0.5)
return [pos_x, pos_y, self.target[0], self.target[1]]

def get_obs_rew_done(self):
def get_obs_rew_done_info(self):
pos_x, pos_y = self.rc_drone.get_observation()
tar_x = self.target[0]
tar_y = self.target[1]
obs = [pos_x, pos_y, tar_x, tar_y]
rew = -np.linalg.norm(np.array([pos_x, pos_y], dtype=np.float32) - self.target)
done = rew > -0.01
return obs, rew, done
info = {}
return obs, rew, done, info

def wait(self):
self.send_control(self.get_default_action())
4 changes: 2 additions & 2 deletions tests/test_all.py
@@ -17,8 +17,8 @@ def send_control(self, control):
def reset(self):
return [time.time(), self.control, self.control_time]

def get_obs_rew_done(self):
return [time.time(), self.control, self.control_time], 0.0, False
def get_obs_rew_done_info(self):
return [time.time(), self.control, self.control_time], 0.0, False, {}

def get_observation_space(self):
ob = gym.spaces.Box(low=np.array([0.0]), high=np.array([np.inf]), dtype=np.float32)
