Deep Reinforcement Learning techniques demonstrate exciting results in robotic applications such as dexterous in-hand manipulation. One of the challenging factors that comes into play when performing real-world tasks is the combination of different sensory modalities (vision, proprioception, and haptics), which support each other in identifying the position and orientation of the manipulated object. This mutual support is important when some of the modalities are occluded or noisy. Furthermore, the update rates of different sensory modalities may not match each other. While we assume that vision alone can determine the state perfectly, it suffers from slow update rates and is susceptible to dropout due to visual occlusion (e.g., the palm covering the object). On the other hand, haptic perception, by means of touch and proprioceptive information, is always present at a high update rate but suffers from ambiguity (e.g., the cube in Fig. 1 can take various possible orientations without any change in the haptic and proprioceptive perception).

Therefore, we present an approach to infer the state of the object through a unified, synchronized, multisensory perception of the position and orientation of a manipulated object. Our approach builds upon the recent work on learning dexterous in-hand manipulation [1], where an agent with a model-free policy was able to learn complex in-hand manipulation tasks using proprioceptive and touch feedback together with visual information about the manipulated object. In that work, the agent performs vision-based object reorientation with a policy trained in a simulated environment and deployed on a physical Shadow Hand. However, pose reconstruction of a manipulated …