Benefiting from the development of deep generative networks, modern fake-news generation methods known as Deepfakes have spread rapidly across the Internet, calling for efficient detection methods. Existing Deepfake detection methods mostly rely on binary classification networks trained on frame-level inputs and do not exploit the temporal information in videos. Moreover, their accuracy drops sharply on low-quality data. In this work, we propose a two-stream network that detects Deepfakes at the video level and can handle low-quality data. The proposed architecture first divides the input video into segments and then feeds selected frames from each segment into two streams. The first stream takes RGB information as input and learns semantic inconsistencies. The second stream, running in parallel, leverages noise features extracted by spatial rich model (SRM) filters. Because our experiments show that traditional SRM filters with fixed weights yield only insignificant improvement, we design novel learnable SRM filters, which better fit the noise inconsistency in tampered regions. Finally, segmental fusion and stream fusion combine the information across segments and streams. We evaluate our method on FaceForensics++, the largest existing Deepfake dataset, and the experimental results show that it achieves state-of-the-art performance.
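The pipeline described above (segment sampling, an SRM noise residual, and two levels of fusion) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. All function names are hypothetical. Picking the middle frame of each segment and fusing by plain averaging are assumptions, since the abstract does not specify the selection rule or the fusion operators. The kernel is the commonly used 3x3 second-order SRM high-pass filter. "Learnable" is represented only by using it as an initial value, and no training is shown.

```python
from typing import List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values

# Commonly used 3x3 second-order SRM high-pass kernel (entries sum to zero).
SRM_3X3 = [[-1.0, 2.0, -1.0],
           [2.0, -4.0, 2.0],
           [-1.0, 2.0, -1.0]]
SRM_SCALE = 4.0


def init_srm_kernel() -> List[List[float]]:
    """Return a fresh copy of the scaled SRM kernel.

    Assumption: a learnable SRM filter would start from these values and
    let an optimizer update them; the update step is not shown here.
    """
    return [[w / SRM_SCALE for w in row] for row in SRM_3X3]


def noise_residual(img: Image, kernel: Sequence[Sequence[float]]) -> Image:
    """Cross-correlate an image with a 3x3 kernel (no padding)."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(kernel[a][b] * img[i + a][j + b]
                for a in range(3) for b in range(3))
            for j in range(w - 2)
        ]
        for i in range(h - 2)
    ]


def sample_segment_frames(num_frames: int, num_segments: int) -> List[int]:
    """Split a video into equal segments and pick each segment's middle frame.

    Assumption: the middle frame is one possible selection rule.
    """
    if not 0 < num_segments <= num_frames:
        raise ValueError("need 0 < num_segments <= num_frames")
    seg_len = num_frames / num_segments
    return [int(seg_len * (k + 0.5)) for k in range(num_segments)]


def fuse(scores: Sequence[float]) -> float:
    """Fuse scores by averaging (assumed fusion operator)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(scores) / len(scores)


def video_score(rgb_scores: Sequence[float],
                noise_scores: Sequence[float]) -> float:
    """Segmental fusion within each stream, then stream fusion across both."""
    if len(rgb_scores) != len(noise_scores):
        raise ValueError("both streams need one score per segment")
    return fuse([fuse(rgb_scores), fuse(noise_scores)])
```

For example, `sample_segment_frames(100, 4)` selects frames 12, 37, 62, and 87. Because the kernel entries sum to zero, `noise_residual` maps a uniform region to zeros, so only local deviations from smooth image content survive. This is the kind of noise signal the second stream is described as exploiting.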