I started trail running a year ago. I was an avid hiker, so I assumed that a marginal increase in my speed wouldn’t pose too much of a challenge. The bruises on my hands and legs served as stark reminders of my naiveté. However, despite the rocky start (pun intended), I gradually learned how to navigate trails littered with leaves, rocks, and roots with more ease. With each run, I noticed my visual attention changing. No longer did I linger on peripheral details like bushes, most trees, or the stream next to me. Instead, I was quickly scanning ahead of me—fixating briefly on hazards before focusing on the safest path forward. It was like my entire visual system learned to home in on the most critical parts of my environment to help me achieve my goal: to shuffle-run slowly through the forest without falling over.
This process of visual learning is certainly not unique to me—many careers and hobbies require this type of honing to perform well. One such career is that of a CCTV (Closed-Circuit TV) operator, who scans live video footage to intuit people’s intentions and prevent crimes.
The big open question is what types of visual information these operators rely on—the answer to which will give insight into how expertise changes our visual system writ large.
In a recent study published in Psychonomic Bulletin & Review, conducted by Yujia Peng, Joseph Burling, Greta Todorova, Catherine Neary, Frank Pollick, and Hongjing Lu (pictured below), the researchers sought to answer this question by tracking the eyes of expert CCTV operators and novices while they watched CCTV footage. The findings, discovered through computational analyses and machine learning techniques, showed that experts focused more on certain types of low-level information and that experts had higher agreement for where they looked than novices. In short, the experts had finely tuned their visual systems to pick up the most relevant information for determining people’s intentions, benign or otherwise.
“CCTV operators [experts] actively attend to visual contents that may be the most effective for the detection of harmful intentions,” said the authors about the importance of their findings.
In the study, two groups of viewers—expert CCTV operators with an average of 10,000 hours of viewing experience and novice community members—watched 36 different videos. Each video was pulled from CCTV cameras pointed at a street, and the videos were grouped into four categories based on the behavior of the people in the clips: fight, confrontation, playful, and neutral. The researchers tracked participants’ eyes while they watched the videos to identify what type of visual information participants fixated on the most.
To do this, they broke down the videos into two groups of visual information: low-level sensory information (e.g., colors, brightness, motion, see above) and high-level object information (i.e., what is actually present in the scene: street, people, light posts, etc.). See the video below for more details about how they extracted this information using two different computational models.
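To get a feel for what “low-level sensory information” means here, consider a rough sketch of how two such features—brightness and motion—can be computed frame by frame. This is purely illustrative, with tiny invented frames; the saliency models the researchers actually used are far more sophisticated.

```python
# Illustrative sketch only: two simple low-level features per video frame,
# mean brightness and motion energy (mean absolute frame-to-frame change).
# Frames are toy grayscale images given as lists of pixel rows.

def mean_brightness(frame):
    """Average pixel intensity of a grayscale frame."""
    total = sum(sum(row) for row in frame)
    n_pixels = len(frame) * len(frame[0])
    return total / n_pixels

def motion_energy(prev, curr):
    """Mean absolute per-pixel difference between consecutive frames."""
    diff = sum(
        abs(a - b)
        for row_prev, row_curr in zip(prev, curr)
        for a, b in zip(row_prev, row_curr)
    )
    n_pixels = len(curr) * len(curr[0])
    return diff / n_pixels

# Two tiny synthetic 2x2 "frames" (invented values)
f0 = [[10, 20], [30, 40]]
f1 = [[12, 18], [34, 40]]

print(mean_brightness(f0))    # 25.0
print(motion_energy(f0, f1))  # 2.0
```

Stacking features like these for every frame yields a time-varying map of where the low-level signal is strongest—the kind of signal that can then be compared against where viewers actually looked.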
They then reported results from two analyses: 1) whether people in the same group (experts versus novices) looked at the same types of information and 2) whether a machine learning algorithm, trained on the eye-tracking output for each group, could reliably tell experts from novices beyond merely guessing.
They found that, overall, experts had higher agreement with each other about what types of information they tended to focus on compared to novices. Notably, experts tended to prioritize low-level information more than novices did. Further, experts focused more on certain high-level object information, especially at the start and end of the videos. This strategic focus likely contributes to their ability to rapidly determine the intentions of the people seen in the footage.
When the researchers passed the eye-tracking data through a machine-learning algorithm, it successfully determined which data belonged to experts versus novices. This was true not only for the fight-coded videos but also for the other three content types (confrontation, playful, and neutral). Upon closer examination, the algorithm relied on attention to motion as the primary feature distinguishing the two groups. In other words, the expert operators were paying much closer attention than the novices to what was moving in the videos.
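The idea of telling groups apart from gaze features can be sketched with a toy nearest-centroid classifier. This is not the authors’ method, and the numbers below are invented; the sketch just shows how feature vectors like [attention to motion, attention to objects] could separate two groups when one group (here, “experts”) weights motion more heavily.

```python
# Toy sketch (not the study's classifier): nearest-centroid classification
# over hypothetical gaze-feature vectors [attention_to_motion, attention_to_objects].
# All training numbers are invented for illustration.

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(sample, centroids):
    """Return the label whose group centroid is closest to the sample."""
    return min(centroids, key=lambda label: sq_dist(sample, centroids[label]))

# Invented training data: "experts" attend to motion (first feature) more.
experts = [[0.82, 0.55], [0.78, 0.60], [0.85, 0.52]]
novices = [[0.45, 0.58], [0.50, 0.62], [0.42, 0.57]]
cents = {"expert": centroid(experts), "novice": centroid(novices)}

print(classify([0.80, 0.56], cents))  # expert
print(classify([0.47, 0.60], cents))  # novice
```

In this toy setup, the motion dimension does nearly all the separating work—mirroring, in spirit, the finding that attention to motion was the primary distinguisher between the two groups.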
They also found that experts were more likely to use both types of visual information—low-level and high-level—at the beginning of the videos, but they shifted focus to more high-level objects as the videos progressed.
The authors summarized this finding as “… intention inference may start with low-level visual cues and gradually move on to semantic-level visual processing.”
For these CCTV operators, their visual systems have learned exactly what type of visual information to focus on, and when to focus on it, to determine what people might do in a given situation.
Speaking for myself, my visual system has learned to tune out information that doesn’t help me keep moving (and stay upright) and instead performs rapid identification—“Rock. Hole. Rock. Roots. Leaves. Rock.”—keeping the objects most likely to trip me up in full focus. Thank you, visual system, for helping me run safely.
Featured Psychonomic Society article
Peng, Y., Burling, J. M., Todorova, G. K., Neary, C., Pollick, F. E., & Lu, H. (2024). Patterns of saliency and semantic features distinguish gaze of expert and novice viewers of surveillance footage. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-024-02454-y