Authors: Farrajota, Miguel; Rodrigues, João; du Buf, J. M. H.
Date accessioned: 2020-07-24
Date available: 2020-07-24
Date issued: 2019-11
ISSN: 1433-7541
URI: http://hdl.handle.net/10400.1/14203
Language: eng
Title: Human action recognition in videos with articulated pose information by deep networks
Type: journal article
DOI: 10.1007/s10044-018-0727-y
Subjects: Human action; Human pose; ConvNet; Neural networks; Auto-encoders; LSTM

Abstract: Action recognition is of great importance in understanding human motion from video. It is an important topic in computer vision due to its many applications, such as video surveillance, human-machine interaction and video retrieval. One key problem is to automatically recognize low-level actions and high-level activities of interest. This paper proposes a way to cope with low-level actions by combining information about human body joints to aid action recognition. This is achieved by using high-level features, computed by a convolutional neural network pre-trained on ImageNet, together with articulated body joints as low-level features. These features are then fed into a Long Short-Term Memory (LSTM) network to learn the temporal dependencies of an action. For pose prediction, we focus on articulated relations between body joints. We employ a series of residual auto-encoders to produce multiple predictions, which are then combined to provide a likelihood map of body joints. In the network topology, features are processed across all scales, capturing the various spatial relationships associated with the body. Repeated bottom-up and top-down processing with intermediate supervision of each auto-encoder network is applied. We demonstrate state-of-the-art results on the popular FLIC, LSP and UCF Sports datasets.
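
The action-recognition pipeline the abstract describes (ImageNet-pretrained CNN features concatenated with articulated joint coordinates, fed frame by frame to an LSTM) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the VGG-16 backbone choice and all sizes (num_joints, num_classes, hidden_size) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseActionLSTM(nn.Module):
    """Per-frame CNN features + 2D joint coordinates, fed to an LSTM (illustrative)."""
    def __init__(self, num_joints=14, num_classes=10, hidden_size=256):
        super().__init__()
        # ImageNet-pretrained backbone; VGG-16 is an assumption, not the paper's choice.
        backbone = models.vgg16(weights="IMAGENET1K_V1")
        # Keep everything up to the last classifier layer -> 4096-d high-level features.
        self.cnn = nn.Sequential(backbone.features, backbone.avgpool, nn.Flatten(),
                                 *list(backbone.classifier.children())[:-1])
        # Concatenate high-level CNN features with low-level (x, y) joint coordinates.
        self.lstm = nn.LSTM(input_size=4096 + 2 * num_joints,
                            hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, frames, joints):
        # frames: (B, T, 3, 224, 224); joints: (B, T, num_joints, 2)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(B, T, -1)   # per-frame features
        x = torch.cat([feats, joints.flatten(2)], dim=-1)       # fuse with joints
        out, _ = self.lstm(x)                                   # temporal dependencies
        return self.fc(out[:, -1])                              # classify last time step
```

A cross-entropy loss over the action labels would train this end to end; freezing the pretrained backbone and training only the LSTM and classifier is a common variant when video data is limited.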
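
The pose branch (a stack of residual auto-encoders with repeated bottom-up/top-down processing, each stage emitting a joint likelihood map under intermediate supervision) could look roughly like the sketch below. Again a hedged illustration: the stage design, channel counts, and the way each stage's prediction is remapped and fed forward are assumptions in the spirit of the abstract, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoderStage(nn.Module):
    """One bottom-up / top-down auto-encoder stage with a residual skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        return x + self.up(self.down(x))  # residual connection across the stage

class StackedPoseNet(nn.Module):
    """Series of auto-encoder stages; every stage predicts joint heatmaps."""
    def __init__(self, num_joints=14, ch=64, num_stages=4):
        super().__init__()
        self.stem = nn.Conv2d(3, ch, 7, stride=2, padding=3)
        self.stages = nn.ModuleList(AutoEncoderStage(ch) for _ in range(num_stages))
        self.heads = nn.ModuleList(nn.Conv2d(ch, num_joints, 1) for _ in range(num_stages))
        self.remap = nn.ModuleList(nn.Conv2d(num_joints, ch, 1) for _ in range(num_stages))

    def forward(self, img):
        # img: (B, 3, 256, 256); spatial sizes here are illustrative.
        x = F.relu(self.stem(img))
        heatmaps = []
        for stage, head, remap in zip(self.stages, self.heads, self.remap):
            x = stage(x)
            hm = head(x)            # per-stage likelihood map of body joints
            heatmaps.append(hm)
            x = x + remap(hm)       # feed the prediction back into the next stage
        return heatmaps             # supervise every stage, not just the last
```

Intermediate supervision would then amount to summing a regression loss (e.g. MSE against Gaussian ground-truth heatmaps) over all stages, and the multiple per-stage predictions combine into the final joint likelihood map.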