Measuring user engagement in interactive tasks can facilitate numerous applications that optimize user experience, ranging from eLearning to gaming. However, a significant challenge is the lack of non-contact methods that are robust in unconstrained environments. We present FaceEngage, a non-intrusive engagement estimator that leverages user facial recordings captured during actual gameplay in naturalistic conditions. Our contributions are three-fold. First, we show the potential of using front-facing videos as training data for building the engagement estimator. We compile the FaceEngage Dataset, comprising over 700 user-contributed YouTube picture-in-picture gaming videos (i.e., full-screen game scenes with time-synchronized user facial recordings in subwindows). Second, we develop the FaceEngage system, which captures relevant gamer facial features from the front-facing recordings to infer task engagement. We implement two pipelines: an estimator trained on user facial motion features inspired by prior psychological studies, and a deep learning-enabled estimator. Lastly, we conduct extensive experiments and conclude that: (i) certain user facial motion cues (e.g., blink rates) are indicative of engagement; (ii) our deep learning-enabled pipeline automatically extracts informative features and outperforms the facial motion feature-based pipeline; (iii) FaceEngage is robust to varying video lengths, users, and game genres. Despite the challenging nature of realistic videos, FaceEngage attains an accuracy of 83.8% and a leave-one-user-out precision of 79.9%, both superior to those of the facial motion feature-based pipeline.