Alexandre BENOIT

   PhD / Engineer in image and signal processing

Updated
29 November 2011
Advanced Video and Signal Based Surveillance
AVSS 2005:207-212
Hypovigilance analysis: open or closed eye or mouth?
Blinking or yawning frequency?
Benoit, A. Caplier, A.    
LIS-INPG, Grenoble, France
IEEE Conference on 15-16 Sept. 2005


This paper proposes a frequency-based method to estimate the open or closed state of the eyes and mouth and to detect associated motion events such as blinking and yawning. The context of this work is the detection of the hypovigilance state of a user such as a driver or a pilot. In [1] we proposed a method for motion detection and estimation based on the processing performed by the human visual system.
The motion analysis algorithm models the filtering step occurring at the retina level and the analysis done at the visual cortex level. This method is used to estimate the motion of the eyes and mouth: blinking is related to fast vertical motion of the eyelid, and yawning to a large vertical mouth opening. The detection of the open or closed state of a feature is based on the analysis of the total energy of the image at the output of the retina filter: this energy is higher for open features.
Because the absolute energy level associated with a given state differs from one person to another and with illumination conditions, the energy level associated with each state (open or closed) is adaptive and is updated each time a motion event (blink or yawn) is detected. No constraint on motion is required. The system works in real time and under all types of lighting conditions, since the retina filtering is able to cope with illumination variations. This makes it possible to estimate blinking and yawning frequencies, which are clues of hypovigilance.
Introduction
The aim of the presented work is the development of a real-time algorithm for hypovigilance analysis. The degree of vigilance of a user can be related to the open or closed state of his eyes and mouth and to the frequency of his blinks and yawns. Work on eye blink detection is generally based on the temporal image derivative (for motion detection) followed by image binarization analysis [2].
Feature point tracking on the eyes and mouth is also used to detect the open/closed state and motion [3]. All these methods are based on spatial analysis of the eye/mouth region; they are sensitive to image noise and generally require a sufficient number of pixels to be accurate. Moreover, these methods often require morphological operations to avoid false blink detections generated by global head motion. Other methods can be used, such as one based on «second order change» [4], but they always need binarization and thresholding, and the choice of the threshold has a critical influence on the results. Work on mouth shape detection is generally based on lip segmentation: lip-model approaches such as [5] use color and edge information, but these methods are sensitive to lighting and contrast conditions.
Other methods such as parametric curves [6] have been studied. Recently, statistical model approaches such as active shape and appearance models [7, 8] have been proposed and give accurate results for lip segmentation. Nevertheless, none of these methods gives information on the mouth state. For mouth motion detection, lip segmentation or feature point tracking [9] can be used, but these methods require much processing power and yield a mouth shape estimate rather than yawn detection. In this paper, we use the spectral analysis method described in [1], which allows the detection of eye and mouth states and of blinks and yawns with the same method. It involves a spatio-temporal filter that models the human retina and is dedicated to the detection of motion stimuli. It is used to estimate the motion of the eyes and mouth: blinking is related to fast vertical motion of the eyelid, and yawning to a large vertical mouth opening.
The detection of the open or closed state of a feature is based on the analysis of the total energy at the output of the retina filter: this energy is higher for open features. In section 2, the general principle of the motion estimation method is explained and the properties of the motion estimator are given (see [1] for more details). Section 3 describes the proposed method to detect eye and mouth motion events (blinks and yawns), and section 4 describes how to detect the open or closed state of a feature, which involves an adaptive update of the related energy level of the image spectrum. Section 5 presents some results.
Video illustrating the mouth open/closed detection system, together with an illustrative curve showing how the analyzed criterion makes it possible to distinguish not only the open and closed states of the mouth but also the actions: static, speech and yawning.
The video shows the normal operation of the system: it adapts to the analysis situation with no prior knowledge, hence a few errors at the beginning of the sequence, after which the system is initialized and works.
The illustrative curve shows the temporal evolution of the analysis criterion. This criterion is an energy, computed at each instant, related to the amount of contours in the analysis area (the mouth). The illustration shows that the temporal evolution of this criterion makes it possible to identify a yawn unambiguously with respect to the other cases (a strong and lasting rise in energy during the opening, and likewise during the closing, of a yawn).
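As a rough illustration of an energy criterion tied to the amount of contours, the sketch below sums squared differences between neighbouring pixels of a small region. This gradient energy is only a stand-in chosen for illustration: it is not the retina filter or the spectral energy used by the authors, and the function name `contour_energy` and the toy 6 × 6 regions are assumptions.

```python
def contour_energy(region):
    """Sum of squared horizontal and vertical pixel differences.

    `region` is a 2D list of grey levels. This is a simple stand-in for
    an energy that grows with the amount of contours in the region.
    """
    rows, cols = len(region), len(region[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # difference with the right-hand neighbour
                energy += (region[r][c + 1] - region[r][c]) ** 2
            if r + 1 < rows:  # difference with the neighbour below
                energy += (region[r + 1][c] - region[r][c]) ** 2
    return energy


# Toy "closed mouth": a single dark line between the lips.
closed_mouth = [[1.0] * 6 for _ in range(6)]
closed_mouth[3] = [0.0] * 6

# Toy "open mouth": a dark cavity containing two bright "teeth" pixels.
open_mouth = [[1.0] * 6 for _ in range(6)]
for r in range(1, 5):
    for c in range(1, 5):
        open_mouth[r][c] = 0.0
open_mouth[2][2] = open_mouth[2][3] = 1.0

print(contour_energy(closed_mouth), contour_energy(open_mouth))
```

On these toy regions the open mouth yields the larger value, which matches the qualitative behaviour described above.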
Mouth state and yawning detection
The same method is applied to yawn detection on the mouth. Figure 8 shows the results on a sequence in which the mouth exaggerates its open and closed states from frame 1 to frame 300, is closed from frame 301 to frame 500, and opens and closes normally after frame 500 because of natural speech. The algorithm self-adjusts its parameters HighEnergyLevel and LowEnergyLevel before frame 200: this is the initialization period, during which the user holds each mouth state for more than 0.5 second in order to initialize these parameters correctly. The algorithm then updates them according to the evolution of the OPL output spectral energy.
The LowEnergyLevel corresponds to the closed mouth, because closed lips generate fewer contours. The HighEnergyLevel corresponds to the open mouth, which reveals teeth and/or internal mouth details, or a dark area that generates high-energy contours along the lip boundary. Note that during stable open/closed mouth periods the HighEnergyLevel and LowEnergyLevel values are adjusted, whereas during speech periods (from frame 500 to the end) these levels are updated little or not at all. This allows correct detection of the mouth state even with the fast mouth shape variations that occur during speech.
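The idea of two self-adjusting energy levels can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' update rule: the class name, the midpoint threshold and the smoothing factor `rate` are hypothetical, and the sketch updates a level at every frame, whereas the text above describes updates mainly during stable periods.

```python
class AdaptiveStateDetector:
    """Two-level adaptive classifier for a scalar energy signal (illustrative)."""

    def __init__(self, rate=0.5):
        self.rate = rate                # hypothetical smoothing factor
        self.low_energy_level = None    # running estimate for the closed state
        self.high_energy_level = None   # running estimate for the open state

    def update(self, energy):
        # First sample: nothing is known yet, both levels start here.
        if self.low_energy_level is None:
            self.low_energy_level = self.high_energy_level = energy
            return "unknown"
        # Classify against the midpoint of the two current levels.
        threshold = (self.low_energy_level + self.high_energy_level) / 2
        if energy > threshold:
            # Pull the high level towards the observed energy.
            self.high_energy_level += self.rate * (energy - self.high_energy_level)
            return "open"
        # Otherwise pull the low level towards the observed energy.
        self.low_energy_level += self.rate * (energy - self.low_energy_level)
        return "closed"


detector = AdaptiveStateDetector()
energies = [10] * 5 + [30] * 5 + [10] * 3 + [30] * 3
states = [detector.update(e) for e in energies]
print(states)
```

No absolute threshold is given to the detector: the two levels are learned from the signal itself, which is why the first decisions can be unreliable until both states have been observed.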
PERFORMANCES AND APPLICATION
The performance of this facial feature state and motion event detector has been evaluated under various test conditions: it detects states and motion events with up to 99% success in standard office lighting conditions, with the object of interest occupying from 60% to 100% of the captured frame (currently 100 × 100 pixels). In low-light conditions or with noisy captured frames (Gaussian white noise of variance 0.04), the algorithm detects motion events and states with 80% success.
Moreover, even if the algorithm is 'lost' at some moment, since it is adaptive it automatically corrects its energy levels and works correctly again when the sequence returns to normal conditions. The algorithm works in real time, reaching up to 80 frames per second on a standard desktop PC (Pentium 4 running at 3.0 GHz) equipped with a webcam. The algorithm automatically adjusts its parameters during the analysis. The proposed approach is inspired by the capacities of the human visual system, which is adaptive and able to cope with various illumination and motion conditions.
Conclusion
A real-time method for facial feature state and motion event detection has been proposed; it works in the same way for the eyes and the mouth. The algorithm, inspired by the biological model of the human visual system, shows its efficiency in terms of motion detection and analysis: the retina filter prepares the data and yields a spectrum that is easy to analyze. The proposed algorithm proves efficient at estimating the open or closed state of the eyes and mouth and the frequency of blinking and yawning. It is well suited to the analysis of a user's vigilance. The performance of the algorithm on video sequences of a car driver is under study.
A graphic method of recording the act of yawning and other forms of movement of the mouth
Iu. N. Bordiushkov Biull Eksp Biol Med 1958;46(7):885-887
[1] A. Benoit, A. Caplier. "Motion estimator inspired from biological model for head motion interpretation", WIAMIS 2005, Montreux, Switzerland, April 2005.
[2] J. Coutaz, F. Berard, and J. L. Crowley. "Coordination of perceptual processes for computer mediated communication". In Proc. of 2nd Intl Conf. Automatic Face and Gesture Rec., pages 106-111, 1996.
[3] P. Smith, M. Shah, N. da Vitoria Lobo, "Determining Driver Visual Attention with One Camera", accepted for IEEE Transactions on Intelligent Transportation Systems, 2004.
[4] D. Gorodnichy, "Towards Automatic Retrieval of Blink-Based Lexicon for Persons Suffered from Brain-Stem Injury Using Video Cameras," Proceedings of the First IEEE Computer Vision and Pattern Recognition (CVPR) Workshop on Face Processing in Video. Washington, District of Columbia, USA. June 28, 2004. NRC 47138.
[5] P. Delmas, N. Eveno, and M. Lievin, "Towards Robust Lip Tracking", International Conference on Pattern Recognition (ICPR'02), Québec City, Canada, August 2002.
[6] N. Eveno, A. Caplier, and P-Y Coulon, "Jumping Snakes and Parametric Model for Lip Segmentation", International Conference on Image Processing, Barcelona, Spain, September 2003.
[7] T. F. Cootes. "Statistical models of appearance for computer vision"
[8] P. Gacon, P.-Y. Coulon, G. Bailly. "Statistical Active Model for Mouth Components Segmentation", 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05), Philadelphia, USA, 2005.
[9] Y. Tian, T. Kanade, J.F. Cohn "Robust Lip Tracking by Combining Shape, Color and Motion" Proc. of the 4th Asian Conference on Computer Vision (ACCV'00), January, 2000
[10] W.H.A. Beaudot, "The neural information processing in the vertebrate retina: A melting pot of ideas for artificial vision", PhD Thesis in Computer Science, INPG (France), December 1994.
[11] J. Richter and S. Ullman. "A model for temporal organization of X- and Y-type receptive fields in the primate retina". Biological Cybernetics, 43:127-145, 1982.
[12] Barron J.L., Fleet D.J. and Beauchemin S.S., "Performance of Optical Flow Techniques", International Journal of Computer Vision, Vol. 12, No. 1, pp. 43-77, 1994.

Monitoring mouth movement for driver fatigue or distraction with one camera
Wang Rongben Guo Lie Tong Bingliang Jin Lisheng
Transp. Coll., Jilin Univ., Changchun, China
Intelligent Transportation Systems, 2004. Proceedings. The 7th International IEEE Conference (3-6 Oct. 2004;314 -319)
Abstract This paper proposes locating and tracking a driver's mouth movement using a dashboard-mounted CCD camera, in order to monitor and recognize a driver's yawning fatigue state and the distraction state due to talking or conversation. The area of interest for the mouth is first determined by detecting the face using color analysis; skin and lip pixels are then segmented by a Fisher classifier, the driver's mouth is detected and lip features are extracted by connected component analysis, and the mouth is tracked in real time via Kalman filtering. The geometric features of the mouth region make up an eigenvector used as the input of a BP ANN, whose output gives three different mouth states representing the normal, yawning or talking state respectively. The experimental results show that this new method can inspect the driver's mouth region accurately and quickly, and gives a warning sign when it finds the driver in a yawning fatigue state or a distraction state due to talking or conversation.

Yawning detection for determining driver drowsiness
Tiesheng Wang Pengfei Shi
Inst. of Image Process. & Pattern Recognition, Shanghai Jiao Tong Univ., China
VLSI Design and Video Technology, 2005. Proceedings of 2005 IEEE International Workshop (28-30 May 2005:373-376)
Abstract A system aiming at detecting driver drowsiness or fatigue on the basis of video analysis is presented. The focus of this paper is on how to detect driver yawning. A real-time face detector is implemented to locate the driver's face region. Subsequently, a Kalman filter is adapted to track the face region. Further, a mouth window is localized within the face region, and the degree of mouth openness is extracted from mouth features to determine driver yawning in video. The system reinitializes when occlusion or misdetection happens. Experiments are conducted to evaluate the validity of the described method.

Determining driver visual attention with one camera
Smith, P. Shah, M. da Vitoria Lobo, N.
Dept. of Comput. Sci., Central Florida Univ., Orlando, FL, USA
Intelligent Transportation Systems, IEEE 2003;4(4):205-218
Abstract This paper presents a system for analyzing human driver visual attention. The system relies on estimation of global motion and color statistics to robustly track a person's head and facial features. The system is fully automatic, it can initialize automatically, and reinitialize when necessary. The system classifies rotation in all viewing directions, detects eye/mouth occlusion, detects eye blinking and eye closure, and recovers the three dimensional gaze of the eyes. In addition, the system is able to track both through occlusion due to eye blinking, and eye closure, large mouth movement, and also through occlusion due to rotation. Even when the face is fully occluded due to rotation, the system does not break down. Further the system is able to track through yawning, which is a large local mouth motion. Finally, results are presented, and future work on how this system can be used for more advanced driver visual attention monitoring is discussed.

Comparison of impedance and inductance ventilation sensors on adults during breathing, motion, and simulated airway obstruction
Cohen, K.P. Ladd, W.M. Beams, D.M. Sheers, W.S. Radwin, R.G. Tompkins, W.J. Webster, J.G.
Lincoln Lab., MIT, Lexington, MA, USA
Biomedical Engineering, IEEE 1997; 44(7 ):555-566
Abstract The goal of this study was to compare the relative performance of two noninvasive ventilation sensing technologies on adults during artifacts. The authors recorded changes in transthoracic impedance and cross-sectional area of the abdomen (abd) and ribcage (rc) using impedance pneumography (IP) and respiratory inductance plethysmography (RIP) on ten adult subjects during natural breathing, motion artifact, simulated airway obstruction, yawning, snoring, apnea, and coughing. The authors used a pneumotachometer to measure air flow and tidal volume as the standard. They calibrated all sensors during natural breathing, and performed measurements during all maneuvers without changing the calibration parameters. No sensor provided the most-accurate measure of tidal volume for all maneuvers. Overall, the combination of inductance sensors [RIP(sum)] calibrated during an isovolume maneuver had a bias (weighted mean difference) as low or lower than all individual sensors and all combinations of sensors. The IP(rc) sensor had a bias as low or lower than any individual sensor. The cross-correlation coefficient between sensors was high during natural breathing, but decreased during artifacts. The cross correlation between sensor pairs was lower during artifacts without breathing than it was during maneuvers with breathing for four different sensor combinations. The authors tested a simple breath-detection algorithm on all sensors and found that RIP(sum) resulted in the fewest number of false breath detections, with sensitivity of 90.8% and positive predictivity of 93.6%.

Public speaking in virtual reality: facing an audience of avatars
Slater, M. Pertaub, D.-P. Steed, A.
Dept. of Comput. Sci., Univ. Coll. London, UK
Computer Graphics and Applications, IEEE 1999;19(2):6-9
Abstract What happens when someone talks in public to an audience they know to be entirely computer generated, to an audience of avatars? If the virtual audience seems attentive, well-behaved, and interested, if they show positive facial expressions with complimentary actions such as clapping and nodding, does the speaker infer correspondingly positive evaluations of performance and show fewer signs of anxiety? On the other hand, if the audience seems hostile, disinterested, and visibly bored, if they have negative facial expressions and exhibit reactions such as head-shaking, loud yawning, turning away, falling asleep, and walking out, does the speaker infer correspondingly negative evaluations of performance and show more signs of anxiety? We set out to study this question during the summer of 1998. We designed a virtual public speaking scenario, followed by an experimental study. We wanted mainly to explore the effectiveness of virtual environments (VEs) in psychotherapy for social phobias. Rather than plunge straight in and design a virtual reality therapy tool, we first tackled the question of whether real people's emotional responses are appropriate to the behavior of the virtual people with whom they may interact. The project used DIVE (Distributive Interactive Virtual Environment) as the basis for constructing a working prototype of a virtual public speaking simulation. We constructed, as a Virtual Reality Modeling Language (VRML) model, a virtual seminar room that matched the actual seminar room in which subjects completed their various questionnaires and met with the experimenters.

Hidden markov model based dynamic facial action recognition
Arsic, D. Schenk, J. Schuller, B. Wallhoff, F. Rigoll, G.
Technische Universität München, Institute for Human Machine Communication, Arcisstrasse 16, 80333 München, Germany.
Image Processing, 2006 IEEE 2006: 673-676
Abstract Video-based analysis of a person's mood or behavior is in general performed by interpreting various features observed on the body. Facial actions such as speaking, yawning or laughing are considered key features. Dynamic changes within the face can be modeled with the well-known Hidden Markov Models (HMM). Unfortunately, even within one class, examples can show a high variance because of the unknown start and end state or the length of a facial action. In this work we therefore decompose those actions into so-called submotions. These can be robustly recognized with HMMs, applying selected points in the face and their geometrical distances. Additionally, the first and second derivatives of the distances are included. A sequence of submotions is then interpreted with a dictionary and dynamic programming, as the order may be crucial. Analyzing the frequency of sequences shows the relevance of the submotion order. In an experimental section we show that our novel submotion approach outperforms a standard HMM with the same set of features by nearly 30% in absolute recognition rate.
A non-rigid motion estimation algorithm for yawn detection in human drivers
Mohanty, M, Mishra, A, Routray, A.
Int. J. Computational Vision and Robotics
This work focuses on the estimation of possible fatigue or drowsiness by detecting the occurrence of yawns with human drivers. An image processing technique has been proposed to analyse the deformation occurring on driver's face and accurately identify the yawn from other types of mouth opening such as talking and singing. The algorithm quantifies the degree of deformation on lips when a driver yawns.
The image processing methodology is based on study of non-rigid motion patterns on 2D images. The analysis is done on a temporal sequence of images acquired by a camera. A shape-based correspondence of templates on contours of a particular region is established on the basis of curvature information. The shape similarity between the contours is analysed, after decomposing with wavelets at different levels.
Finally, the yawn is correlated with fatigue-induced behaviour of drivers on simulation
Detecting driver yawns
This Indian research team is trying to develop a yawn detector that could reduce the number of road accidents caused by a drowsy driver at the wheel.
Still under development, this Indo-American technology is built into the car, said the designers, whose work is published in The International Journal of Computational Vision and Robotics.
The new system consists of a camera and software that instantly analyzes images of the face captured at regular intervals. In addition to analyzing changes in the driver's face, the device distinguishes yawns from other facial movements such as smiling, talking or singing.
As soon as the driver starts yawning, the software begins to compute the yawn frequency. If yawns recur too often, a warning signal is triggered. In the United States alone, 100,000 road accidents are caused each year by driver fatigue, according to the National Highway Traffic Safety Administration (NHTSA).
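A yawn-frequency warning of this kind can be sketched with a sliding time window. The article gives no numbers, so the window length `window_s`, the limit `max_yawns` and the class name are purely hypothetical choices made for illustration.

```python
from collections import deque


class YawnFrequencyAlarm:
    """Raise a warning when too many yawns fall inside a sliding time window."""

    def __init__(self, window_s=60.0, max_yawns=3):
        self.window_s = window_s    # hypothetical window length, in seconds
        self.max_yawns = max_yawns  # hypothetical tolerated number of yawns
        self.times = deque()        # timestamps of the recent yawns

    def register_yawn(self, t):
        """Record a yawn detected at time t (seconds); return True to warn."""
        self.times.append(t)
        # Forget yawns that are older than the window.
        while t - self.times[0] > self.window_s:
            self.times.popleft()
        return len(self.times) > self.max_yawns


alarm = YawnFrequencyAlarm(window_s=60.0, max_yawns=3)
warnings = [alarm.register_yawn(t) for t in [0, 100, 200, 210, 220, 230]]
print(warnings)
```

Isolated yawns far apart in time never trigger the warning; only the fourth yawn within one minute does in this example.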
From electrode to lens: in recent years, other fatigue detection systems have been developed. These recorded, in particular, brain activity or heart rate. The inventors of the yawn detector argue that their camera system is less cumbersome than these devices, which often involve a headset loaded with electrodes that the driver must wear.
In general, drivers tend to underestimate their fatigue and its consequences on their fitness to drive. On the other hand, they overestimate their ability to fight off the sleep that overtakes them.
Multimodal focus attention and stress detection and feedback in an augmented driver simulator
Alexandre Benoit, Laurent Bonnaud, Alice Caplier, Phillipe Ngo, Lionel Lawson, Daniela G. Trevisan, Vjekoslav Levacic, Céline Mancas, Guillaume Chanel
Pers Ubiquit Comput (2009) 13:33-41
This paper presents a driver simulator which takes into account information about the user's state of mind (level of attention, fatigue state, stress state). The analysis of the user's state of mind is based on video data and biological signals. Facial movements such as eye blinking, yawning, head rotations, etc., are detected in the video data: they are used to evaluate the fatigue and the attention level of the driver. The user's electrocardiogram and galvanic skin response are recorded and analyzed in order to evaluate the stress level of the driver. The driver simulator software is modified so that the system is able to react appropriately to these critical situations of fatigue and stress: audio and visual messages are sent to the driver, wheel vibrations are generated, and the driver is supposed to react to the alert messages. A multi-threaded system is proposed to support the multiple messages sent by the different modalities. Strategies for data fusion and fission are also provided. Some of these components are integrated within the first prototype of OpenInterface: the multimodal similar platform.
The major goal of this project is the use of multimodal signals and video processing to provide an augmented user's interface for driving. In this paper, we are focusing on passive modalities. The term augmented here can be understood as an attentive interface supporting the user interaction. So far at the most basic level, the system should contain at least five components:
1. sensors for determining the user's state of mind;
2. modules for features or data extraction;
3. a fusion process to evaluate incoming sensor information;
4. an adaptive user interface based on the results of step 3;
5. an underlying computational architecture to integrate these components.
In this paper, we address the following issues:
• Which driver simulator to use?
• How to characterize a user's state of fatigue or stress?
• Which biological signals to take into account?
• What kind of alarms to send to the user?
• How to integrate all these pieces (data fusion and fission mechanism)?
• Which software architecture is the most appropriate to support this kind of integration?
A software architecture supporting real-time processing is the first requirement of the project, because the system has to be interactive. A distributed approach with a multi-threaded server can address such needs. We focus on stress and fatigue detection. The detection is based on video information and/or biological information. From the video data we extract relevant information to detect fatigue states, while the biological signals provide data for stress detection. The next step is the definition of the alarms to be provided to the user. Textual and vocal messages and wheel vibrations are considered to alert the user. The rest of the paper is organized as follows: first, we present the global architecture of the demonstrator; then we describe how the driver's hypo-vigilance states can be detected by the analysis of video data; then we present how to detect the driver's stress states by the analysis of some biological signals. Finally, the data fusion and fission strategies are presented and details about the demonstrator implementation are given.
3 Hypo-vigilance detection based on video data
The state of hypo-vigilance (related either to fatigue or to inattention) is detected by the analysis of video data. The required sensor is a camera facing the driver. Three indices are considered as signs of hypo-vigilance: yawning, head rotations and eye closure lasting more than 1 s.
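The third index, eye closure lasting more than 1 s, can be derived from a per-frame open/closed decision and the frame rate. The sketch below is illustrative only: the function name `long_closures` and its input format (a sequence of booleans, True meaning "closed") are assumptions, not part of the paper.

```python
def long_closures(closed_flags, fps, min_duration_s=1.0):
    """Return (start_frame, end_frame) pairs of closures longer than min_duration_s.

    closed_flags: per-frame booleans, True when the eye is detected closed.
    fps: frame rate of the video, in frames per second.
    """
    events = []
    start = None
    # A trailing False acts as a sentinel that terminates a closure
    # still in progress at the end of the sequence.
    for i, closed in enumerate(list(closed_flags) + [False]):
        if closed and start is None:
            start = i                      # a closure begins
        elif not closed and start is not None:
            if (i - start) / fps > min_duration_s:
                events.append((start, i - 1))
            start = None                   # the closure has ended
    return events


# 30 fps sequence: a 5-frame blink, then a 45-frame (1.5 s) closure.
flags = [False] * 10 + [True] * 5 + [False] * 10 + [True] * 45 + [False] * 5
events = long_closures(flags, fps=30)
print(events)
```

The short blink is ignored, while the 1.5 s closure is reported with its first and last frame indices.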
3.1 Face detection
In this paper, we do not focus on face localization. The face detector should be robust (no error in face localization) and should work in real time. We chose to use the free toolbox MPT [5]. This face detector extracts a square bounding box around each face in the processed image. The MPT face detector works at nearly 30 frames per second for pictures of size 320 × 200 pixels, which is not the case for other face detectors such as OpenCV [13].
3.2 Head motion analysis
Once a bounding box around the driver's face has been detected, motions such as head rotations, eye closing and yawning are detected using an algorithm that works in a way close to the human visual system. In a first step, a filter derived from a model of the human retina is applied. This filter enhances moving contours and cancels static ones. In a second step, the FFT of the filtered image is computed in the log-polar domain, as a model of the processing occurring in the primary visual cortex. Details about the proposed method are described in [1, 2]. As a result of retinal filtering, noise and luminance variations are attenuated and moving contours are enhanced. For example, for the image in Fig. 2, after retina filtering all the details are visible even in the darkest area of the image (Fig. 3).
The modeling of the primary visual cortex consists of a frequency analysis of the spectrum of the retina filter's output in each region of interest of the face: global head, eye areas and mouth area only. In order to estimate rigid head rotations, the proposed method analyses the spectrum of the retina filter output in the log-polar domain. It detects head motion events and is able to extract their orientation (see [1, 3]). The main idea is that the spectrum reports high energy only for the moving contours perpendicular to the motion direction. Indeed, the retina filter removes static contours and enhances contours perpendicular to the motion direction. As a result, in the log-polar spectrum, the orientation related to the highest energy also gives the motion direction. For the detection of yawning or eye closing, the same processing is done on each region of interest (each eye and the mouth) [4]. A spectrum analysis is carried out, but this time we look for vertical motion only, since eye closure and yawning are related to such a motion.
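The link between dominant spectral orientation and contour orientation can be illustrated with a plain 2D FFT. This is a simplified sketch under stated assumptions: it bins the Cartesian spectrum by angle instead of using the log-polar transform of [1], it takes an already filtered image as input (no retina filter is applied), and the function name `dominant_orientation` and the bin count are hypothetical.

```python
import numpy as np


def dominant_orientation(img, n_bins=8):
    """Return the orientation (degrees, in [0, 180)) carrying the most spectral energy."""
    img = np.asarray(img, dtype=float)
    # Centered power spectrum, with the mean removed to suppress the DC term.
    spectrum = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    energy = np.abs(spectrum) ** 2
    h, w = img.shape
    fy, fx = np.indices((h, w))
    fy -= h // 2  # frequency coordinates relative to the spectrum center
    fx -= w // 2
    # Orientation of each frequency sample, folded into [0, pi).
    angle = np.mod(np.arctan2(fy, fx), np.pi)
    # Nearest orientation bin, with wrap-around at 180 degrees.
    bins = np.floor(angle / np.pi * n_bins + 0.5).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=energy.ravel(), minlength=n_bins)
    return float(np.argmax(hist)) * 180.0 / n_bins


x = np.arange(32)
# Vertical contours (intensity varies along x), as left by a horizontal motion.
vertical_stripes = np.tile(np.sin(2 * np.pi * 4 * x / 32), (32, 1))
# Horizontal contours (intensity varies along y), as left by a vertical motion.
horizontal_stripes = vertical_stripes.T

print(dominant_orientation(vertical_stripes))
print(dominant_orientation(horizontal_stripes))
```

Vertical contours concentrate the energy on the horizontal frequency axis (0°) and horizontal contours on the vertical axis (90°), so the dominant orientation points along the direction perpendicular to the contours, that is, along the motion direction in the reasoning above.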
3.3 Eyes and mouth detection
The mouth is assumed to lie in the lower half of the detected bounding box of the face.
Concerning the eyes, the spectrum analysis in the region of interest is accurate only if each eye is correctly localized. Indeed, around the eyes, several vertical or horizontal contours can generate false detections (the hair boundary, for example). The MPT toolbox provides an eye detector, but it requires too much computing time (frame rate of 22 fps) and has therefore been discarded. We use another solution: the eye region is assumed to be the area containing the most energized contours in the log-polar domain. Assuming that the eyes are located in the two upper quarters of the detected face, we use the retina output. The retina output gives the contours in these areas and, because the eye region (containing the iris and eyelid) is the only area in which there are both horizontal and vertical contours, eye detection can be achieved easily. We use two oriented low-pass filters, a horizontal one and a vertical one, and we multiply their responses. The maximum response is obtained in the area with the most horizontal and vertical contours, that is, the eye region. The eye area detection is performed at 30 frames per second.
3.4 Hypo-vigilance alarms generation
Several situations are considered signs of hypo-vigilance: eye closure detection, yawning detection and global head motion detection.
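Returning to the eye localization of section 3.3, the product of two oriented low-pass responses can be sketched as follows. This is one possible reading of that description, not the authors' implementation: the moving-average filters, the window size and the function names are assumptions, and the input is a synthetic binary contour map rather than a real retina output.

```python
import numpy as np


def box_lowpass(img, size, axis):
    """Moving-average (low-pass) filter applied along one axis of a 2D array."""
    kernel = np.ones(size) / size
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img)


def locate_eye(contours, size=9):
    """Return the (row, col) where horizontal and vertical contours coincide most."""
    horizontal = box_lowpass(contours, size, axis=1)  # smooth along each row
    vertical = box_lowpass(contours, size, axis=0)    # smooth along each column
    response = horizontal * vertical                  # high only where both respond
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)


# Synthetic contour map of an upper face quarter.
contours = np.zeros((40, 50))
contours[5, :] = 1.0        # lone horizontal contour (e.g. hair boundary)
contours[20, 25:36] = 1.0   # horizontal contour of the "eye"
contours[15:26, 30] = 1.0   # vertical contour of the "eye"

print(locate_eye(contours))  # → (20, 30)
```

The lone horizontal contour gives a strong horizontal response but a weak vertical one, so its product stays low; only the point where both orientations are present reaches the maximum.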