The difference of 10,000 hours: Expert surveillance viewers know exactly what to look for

A photograph of a lone person running on a trail with a spectacular backdrop of craggy mountains. The trail runs along the side of a steep hillside, and the sun is slipping out of view.
Trail running is a fantastic way to experience the outdoors. Unfortunately, some trails (especially those in New England, where I live) are rife with rocks, roots, and other hazards that require a specially tuned visual system to spot while on the move. Photo source: Pexels.com

I started trail running a year ago. I was an avid hiker, so I assumed that a marginal increase in my speed wouldn’t pose too much of a challenge. The bruises on my hands and legs served as stark reminders of my naiveté. However, despite the rocky start (pun intended), I gradually learned to navigate trails littered with leaves, rocks, and roots with more ease. With each run, I noticed my visual attention changing. No longer did I linger on peripheral details like bushes, trees, or the stream next to me. Instead, I was quickly scanning ahead of me—fixating briefly on hazards before focusing on the safest path forward. It was as if my entire visual system had learned to home in on the most critical parts of my environment to help me achieve my goal: to slowly shuffle-run through the forest without falling over.

This process of visual learning is certainly not unique to me—many careers and hobbies require this kind of honing to perform well. One such career is that of a CCTV (closed-circuit television) operator, who scans live video footage to intuit people’s intentions and prevent crimes.

The big open question is what types of visual information these operators rely on—the answer to which will give insight into how expertise changes our visual system writ large.

A photograph of a person wearing a purple cardigan and a white blouse looking out from a large binocular stand on the viewing deck of a tall building. The binoculars are pointed toward the left side of the photograph.
How are CCTV operators able to discern people’s intentions and stop crimes before they happen? One clue may be in how they view surveillance footage. Previous work shows that experts tend to focus on a few key areas in videos, a finding expanded upon by the article featured in this post. Photo source: Pexels.com

In a recent study published in Psychonomic Bulletin & Review, Yujia Peng, Joseph Burling, Greta Todorova, Catherine Neary, Frank Pollick, and Hongjing Lu (pictured below) sought to answer this question by tracking the eyes of expert CCTV operators and novices while they watched CCTV footage. Using computational analyses and machine-learning techniques, the researchers found that experts focused more on certain types of low-level information and agreed more with one another about where to look than novices did. In short, the experts had finely tuned their visual systems to pick up the most relevant information for determining people’s intentions, benign or otherwise.

“CCTV operators [experts] actively attend to visual contents that may be the most effective for the detection of harmful intentions,” said the authors about the importance of their findings.

An array of photographs of five of the six authors of the paper that is the subject of this post. The top row has three photos, and the bottom row has two.
Authors of the featured article “Patterns of saliency and semantic features distinguish gaze of expert and novice viewers of surveillance footage.” Top row: Yujia Peng (left), Joseph Burling (center), Greta Todorova (right). Bottom row: Frank Pollick (left), Hongjing Lu (right). Catherine Neary is not pictured.

In the study, two groups of viewers—expert CCTV operators with an average of 10,000 hours of viewing experience and novice community members—watched 36 different videos. The videos were pulled from CCTV cameras pointed at a street and were grouped into four categories based on the behavior of the people in the clips: fight, confrontation, playful, and neutral. The researchers tracked participants’ eyes while they watched the videos to identify what types of visual information participants fixated on most.

An illustration of the low-level visual features extracted from the surveillance footage in the current study. The figure is an array of six images, arranged in two rows of three. The non-salient backgrounds of the images are shown in dark purple, while brighter green colors indicate the parts of the scene with higher salience for the selected feature. Starting at the top left, these features are: luminance (brightness), red-green color channel, yellow-blue color channel, orientation (of lines and edges), texture, and optical flow (motion).
The six low-level visual features extracted from the video surveillance footage. Brighter green colors indicate higher salience of that feature compared to the background. These features were, starting at the top left: luminance (brightness), red-green color channel, yellow-blue color channel, orientation (of lines and edges), texture, and optical flow (motion).

To do this, they broke the videos down into two types of visual information: low-level sensory information (e.g., colors, brightness, motion; see above) and high-level object information (i.e., what is actually present in the scene: the street, people, lampposts, etc.). See the video below for more details about how they extracted this information using two different computational models.
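For readers curious about what such feature extraction looks like in practice, here is a minimal sketch of how two of those low-level features, luminance and motion, might be pulled from a video with OpenCV. This is an illustration under my own assumptions, not the authors’ actual pipeline, and the file name is hypothetical.

```python
# Illustrative sketch only: extracting two low-level features
# (luminance and motion) from a video, frame by frame, with OpenCV.
import cv2
import numpy as np

cap = cv2.VideoCapture("cctv_clip.mp4")  # hypothetical file name
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not read video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Luminance feature: per-pixel brightness, normalized to [0, 1].
    luminance = gray.astype(np.float32) / 255.0

    # Motion feature: dense optical flow between consecutive frames;
    # the flow magnitude at each pixel indexes how much is moving there.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion = np.linalg.norm(flow, axis=2)

    prev_gray = gray

cap.release()
```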

They then reported results from two analyses: 1) whether people in the same group (experts versus novices) looked at the same types of information, and 2) whether a machine-learning algorithm, trained on each group’s eye-tracking data, could reliably tell experts from novices better than chance.

They found that, overall, experts had higher agreement with each other about what types of information they tended to focus on compared to novices. Notably, experts tended to prioritize low-level information more than novices did. Further, experts focused more on certain high-level object information, especially at the start and end of the videos. This strategic focus likely contributes to their ability to rapidly determine the intentions of the people seen in the footage.
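One simple way to quantify that kind of within-group agreement, sketched here under my own assumptions rather than the paper’s exact metric, is to correlate each pair of viewers’ fixation maps and average the result.

```python
# Hypothetical sketch: within-group gaze agreement as the mean pairwise
# correlation between viewers' fixation density maps for one video.
from itertools import combinations
import numpy as np

def mean_pairwise_agreement(fixation_maps):
    """fixation_maps: list of 2-D arrays, one per viewer, same shape."""
    corrs = [np.corrcoef(a.ravel(), b.ravel())[0, 1]
             for a, b in combinations(fixation_maps, 2)]
    return float(np.mean(corrs))

# Stand-in data: higher values mean viewers looked at the same places.
rng = np.random.default_rng(0)
experts = [rng.random((36, 64)) for _ in range(5)]
print(mean_pairwise_agreement(experts))
```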

When the researchers passed the eye-tracking data through a machine-learning algorithm, it successfully determined which data belonged to experts versus novices. This was true not only for the fight videos but for the other three content types as well (confrontation, playful, and neutral). Upon closer examination, the algorithm relied on attention to motion as the primary feature distinguishing the two groups. In other words, the expert operators were paying very close attention to what was moving in the videos, more so than the novices.
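To make that classification step concrete, here is a minimal sketch, assuming each trial is summarized as a vector of how much gaze landed on each feature map; the model and data below are stand-ins, not the authors’ actual classifier or results.

```python
# Minimal sketch, not the authors' model: classify expert vs. novice from
# per-trial gaze-on-feature vectors, then inspect which feature matters most.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

FEATURES = ["luminance", "red-green", "yellow-blue",
            "orientation", "texture", "motion"]

rng = np.random.default_rng(1)
X = rng.random((200, len(FEATURES)))   # stand-in gaze-on-feature scores
y = rng.integers(0, 2, size=200)       # 1 = expert, 0 = novice (fake labels)

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()  # chance is ~0.5 here
print(f"cross-validated accuracy: {acc:.2f}")

clf.fit(X, y)
for name, w in zip(FEATURES, clf.coef_[0]):
    print(f"{name:>12}: {w:+.2f}")  # larger |weight| = more diagnostic feature
```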

They also found that experts were more likely to use both types of visual information—low-level and high-level—at the beginning of the videos, but they shifted focus to more high-level objects as the videos progressed.

The authors summarized this finding as “… intention inference may start with low-level visual cues and gradually move on to semantic-level visual processing.”

For these CCTV operators, their visual systems have learned exactly what visual information to focus on, and when to focus on it, to determine what people might do in a given situation.

Speaking for myself, my visual system has learned to tune out information that doesn’t help me stay moving (and upright) and instead runs a stream of rapid identification (“Rock. Hole. Rock. Roots. Leaves. Rock.”) to keep the objects most likely to trip me up in full focus. Thank you, visual system, for helping me run safely.

Featured Psychonomic Society article

Peng, Y., Burling, J. M., Todorova, G. K., Neary, C., Pollick, F. E., & Lu, H. (2024). Patterns of saliency and semantic features distinguish gaze of expert and novice viewers of surveillance footage. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-024-02454-y

Author

  • Hannah Mechtenberg is a doctoral student at the University of Connecticut in the Language and Cognition division within the Department of Psychology, advised by Dr. Emily Myers. She studies how our brains untangle and comprehend spoken language, especially when there is uncertainty about what is being said and by whom.