The new calling for video calling
For the class of people referred to as “knowledge workers”, the coronavirus pandemic is the biggest experiment in working remotely. Ever since the 70s, when computers first popped up, there has been talk of “tele-commuting”. We’ve come a long way from the days when this was a dreamy fantasy. Going to the office is increasingly unnecessary; you can get work done anywhere with just a laptop and an internet connection.
One of the focal points of the recent transition has been Zoom, the video calling tool that has exploded both in usage and in the cultural lexicon. At this moment, Zoom is almost like Xerox or Google, a single brand so ubiquitous that it represents the whole category. We now have Zoom happy hours, Zoom marriages, Zoom fatigue, and more.
The availability of reliable video calling has been both a boon and a bane. Thanks to video calling, many categories of workers could get on with their occupations in some form. However, the massive increase in usage has brought the flaws in video calling tools to the fore. The frequency of usage is now so much higher that occasional annoyances have become aching sores.
More fundamentally, you could argue that the essential need these tools serve has changed, i.e. the “job to be done” has changed.
When we could visit offices, travel to meet each other and see each other in person, video calling was a one-off activity for most teams. Only distributed teams had an ongoing need for it. The tool’s purpose was to provide high-quality transmission for an occasional burst of communication. In a world where we’re all distributed and meeting each other is not an option, video calling tools have a much taller ask to handle: simulating human presence and providing access to people. Although video calling tools today provide decently good video transmission, you’ll still hear people say, “It’s just not the same when you’re interacting over a call”.
Let’s dig into those differences from two perspectives that I’ve personally experienced — as a speaker and as a co-worker.
Speaking as a Speaker
Presence
I enjoy public speaking. Sometime in my early twenties I heard the phrase “You can double your net worth by improving your public speaking skills” and took it to heart. I’ve managed to do a decent number of talks about my work since 2015. Public speaking requires all the usual elements you’d imagine: great content, voice modulation for emphasis, using space to create a good physical presence, engaging the audience, and reading the room. Giving a good talk is a theatrical performance, and some of the best speakers I know do approach it that way, with prepped jokes and strong rehearsals.
So, what happens when this experience is squished down into a video call?
The speaker and content compete for the 1366 × 768 pixels of space. Zoom and Teams both make it hard to blend the speaker and the content (the way a television newscast does). This means the speaker either gives up on showing content altogether or becomes a mouse trapped in a little box in the corner.
What a typical talk looks like on video. Yes, [that’s me in the corner](https://www.youtube.com/watch?v=xwtdhWltSIg).
I can’t stroll about and bring my body language into the picture. I’m forced to sit and only my face is visible. The speaker is deprived of the opportunity of gesturing, moving about, or expressing their presence physically.
Audience reactions
Stand-up comedy is a good example of audience reactions embedded into the art form. In stand-up, while the audience reacts to the jokes, the comedian is also observing and reacting to the audience: the deadpan silences and outbursts of laughter are signals to the comedian about how the audience is feeling at the moment.
Look at this short clip by Deon Cole (I’ve linked to the exact moment and only the first 8 seconds are relevant). Deon Cole does explicitly what most comedians do implicitly: he tells the joke and pays hard attention to the audience reaction, using that as feedback for changing the material. Here he is visibly holding a pad and crossing things off; others do it off stage. There isn’t one particular person that he is paying attention to. Rather, he gets a sense that the joke has bombed because of a few muffled laughs and a general non-reaction from the crowd. The space between him and the audience serves as a signal processor that converts every individual reaction into a collective mood of the audience.
In today’s VC tools, when I can’t hear those reactions, I’m completely robbed of this feedback signal. Unmuting yourself and saying something is not as spontaneous as a snarky comment or a round of laughter that just bursts out. Flying emojis are a poor replacement; they just aren’t as visceral as loud claps or boos. The audience loses a way to communicate too: in a room, my laughter or reaction can become part of a larger crowd reaction, while VC tools largely facilitate only 1:1 audio communication.
Scale and space
A call with 10 people feels the same as one with 100. VC tools are just beginning to incorporate a sense of scale. Livestreams have a total viewer count, and Google Meet and Microsoft Teams show a row of faces at the bottom; yet a strong and intuitive notion of space and (relatedly) scale still eludes us. There are two aspects to this: first, letting the audience know what setting they are in, and second, encouraging or enforcing behaviour appropriate to the setting.
In the real world again, this role is played by space. A hundred people gathered in an auditorium know they are there mostly to listen. The distance between them and the speaker (who is usually the only one with a mic) enforces this dynamic. Five people gathered around a table automatically know that they are mostly equal participants in a conversation, and they can speak up when they need to.
VC tools “flatten” out the 3D space we’re used to; proximity to the speaker is not a scarce quantity anymore. However, with that we lose an opportunity for audience differentiation. Be it for economic reasons, experience quality or status signalling, audience differentiation is everywhere. In VC tools the audience is now one blended mass.
While the flattening is good and will create new social norms, I also imagine many use-cases where differentiation is needed and that’s something we need to figure out. We’re getting early forms of these in the live-streaming apps, with names flashing across the screen when you give large gifts to the streamer. We also need to figure out how to encourage (or enforce) social norms previously enforced by space. An example of software enforcing the norm of a setting: both Google Meet and Microsoft Teams nudge you to switch off your mic when you’re joining a call with 5+ participants in it.
Jankiness and discontinuity
Finally, there are the unpolished edges of today’s tools, which are still adapting to the new use-cases we’re putting them through. If you share your screen and full-screen the window, you can’t interact with the video calling tool itself. So, you might not notice if you’ve gotten disconnected. The transition from pre-session to the session isn’t as smooth as walking from backstage to the stage. Zoom has a waiting area; elsewhere you’re just on. Airmeet and Teams have just added the notion of a backstage area, where hosts and speakers can wait until the session begins.
Catching the Co-Workers
Availability
Even as many companies have gone remote, only some are adopting remote anywhere. The most widespread practice is allowing remote work but with time zone restrictions and, relatedly, expectations of availability during a common set of working hours. This makes sense on multiple counts. While a lot of communication can be asynchronous, sometimes you just need to sort things out quickly and there’s nothing better than getting onto a call (previously this would’ve been jumping into a meeting room). For co-workers interacting with each other, a social norm that drives availability and eases interaction does seem beneficial.
But there’s a little catch.
The real question of availability is answered only by knowing the exact state of a person at this very moment. When I need to talk to someone, I need an indication of presence and availability. Can I interact with this person? How deeply are they engaged in what they are doing? The physical world answered both these questions together. You could observe this directly and see if your co-worker is typing away furiously or gazing at the roof. I also got to observe my co-worker’s mood: are they flustered? Or up for some coffee and chit-chat? Availability, as we realise, is not binary. I might be ok to chat about important stuff but not really in the mood for banter.
Today’s availability and presence indicators still resemble what we created with the first instant messengers: they have an Available/Busy/Away setting and nothing more. They’re very rudimentary and fail to communicate the nature of my interruptibility.
Presence
Beyond availability for business purposes, there is a much deeper human need that offices and other shared physical spaces provided for: human presence. You may not want to know what everyone is up to, but you just want to know that they’re there. This is something we absorb through our peripheral vision. As I walk to my desk, I automatically scan to see if people are in. I’m not looking for anyone specific; I just want a sense of whether people are in today. Thoughts can go from “Ah! Everybody’s in. It’s a normal day.” to “Where is everybody? Are they sick? Is today a holiday?” depending on what I see.
Our single-person presence indicators again fall short of radiating the warmth of human presence. We don’t know if today’s a busy day with everybody in early and working furiously to meet a deadline, or a lazy Friday with people bumbling around.
Purpose
A space also softly demarcates purpose or at least sets the norm for it. People have more work-related discussions near desks and in their work bays; gossip and banter happen in the cafeterias and hallways. Presence in a space is not only an indicator of purpose to others, but also to self. Being at work was an automatic signal to me that I was in work mode. One big complaint about post-pandemic work has been about work being always on. “I don’t work from home, I live at work”, goes the gripe. People rarely shut their computers off completely, and their phones are always connected. There is no spatial demarcation between work and play, so there’s nothing that tells me whether I’m supposed to be in work mode or chill mode. I personally solve this dilemma with Toggl, a time-tracker for freelancers. The running clock reminds me that I’m supposed to be in work mode. But the connection of purpose and space, and thus the ability to change purpose by changing space, is something we really miss.
Jankiness at work
Have you even lived if you haven’t said “Can everybody see my screen?” It’s not like we don’t do soundchecks for physical gatherings, but this looks like another area for software magic to come in.
**Bonus: Speculations on what comes next**
People are now living on video and there’s a rush to solve the problems of presence. There have already been a host of attempts to plug the gaps I mentioned above. Below I stretch out that line completely and try to predict what we might end up with.
Bye presence, hello presence-simulation
The need for a stronger and more high-fidelity human presence will drive the rise of video walls: either dedicated video rooms with screens or portable projectors that cast on a wall in front of you. The larger screen size makes the experience more life-like. This could also be achieved using VR headsets, but I feel a video wall is a less burdensome, more continuous, and more natural experience for most use-cases.
Video-calling software will do increasing amounts of presence simulation, incorporating aspects and likenesses of physical spaces to make the interactions feel more comfortable. Visual simulation will inject people (or rather their streams) into video streams to provide the appearance of a physical space. This is already being done by Around, Together mode in Teams, and mmhmm. We’re blending not just video streams of people but also screenshares. We’ll soon also do this with audio; muting will become unnecessary. ML algorithms will do noise cancellation much better, especially with access to camera streams.
Our approach to building video conferencing software will move away from just faithfully relaying information to generating and simulating presence. We will move beyond flying emojis on screen to signal engagement and start creating signals that speakers can parse peripherally and intuitively. The work done by space and air will be done by signal processing algorithms. Video streams will use signals like viewer eye direction, pose detection, keyboard fidgetiness, window changes and more to read the room; these signals will be combined to generate the appropriate audio, such as chatter when there are 10s to 100s of distracted people in a call, or collective chants and boos when there are 1000s to millions watching sports streams. Along with generated audio that captures the mood of a room, we’ll also get visual metaphors like thermometers or energy meters representing crowd engagement. And just like any other metric, people will try to hack it to prove they’re the best, and associate high engagement thermometer levels with speaker quality.
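To make the idea of combining signals a little more concrete, here is a toy sketch. Everything in it is an assumption on my part: the signal names, the equal weighting, the thresholds and the cue labels are made up for illustration and are not taken from any real product.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class ViewerSignals:
    """Hypothetical per-viewer readings, each already scaled to 0.0-1.0."""
    eyes_on_screen: float   # share of time the viewer looks at the call
    window_focused: float   # share of time the call window is in front
    fidgetiness: float      # keyboard/mouse restlessness, higher = more distracted


def viewer_engagement(s: ViewerSignals) -> float:
    """Equal-weight blend of the signals (an arbitrary choice for this sketch)."""
    return (s.eyes_on_screen + s.window_focused + (1.0 - s.fidgetiness)) / 3.0


def room_mood(viewers: List[ViewerSignals]) -> Tuple[float, str]:
    """Collapse individual readings into one crowd-level score and an ambient cue."""
    if not viewers:
        return 0.0, "silence"
    score = mean(viewer_engagement(v) for v in viewers)
    # Thresholds are invented; a real system would have to tune them.
    if score < 0.4:
        cue = "distracted chatter"
    elif score < 0.75:
        cue = "quiet attention"
    else:
        cue = "applause"
    return score, cue
```

The point of the sketch is only that no single viewer is singled out: the speaker gets one blended reading of the room, much like the crowd reaction a comedian picks up on stage.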
Bye Smart TVs, hello digital windows
The TV was, in a way, the device that defined the home. “You don’t have a TV?! What’s all your furniture pointed at?” (See this clip from FRIENDS). The TV used to be an electronic cave fire, a gathering point for groups and families. Analog television had to be watched together; while it took away individual choice, it provided a unifying common experience, giving a sense of oneness. The pendulum has swung the other way. Dropping costs and the subsequent proliferation of devices in the home mean everyone has their own screen. Between snapping, tweeting, and tiktoking, watching things together has become a rarer event, unless it is the FIFA World Cup or Game of Thrones.
From devices to social media profiles, all our newfound tools in the past 2 decades have been centred around the individual. The individual is defined loose of any association, one of the many shouting in a public square. Recent times have seen a retreat of that notion. Yancey Strickler called it the internet’s transformation into a “dark forest”. Between self-censorship on social media, alt accounts, private/invite-only groups, paid newsletters and more, we’re bringing the notion of private spaces into the digital world. Our desire for private spaces didn’t die away; it was just underserved as we built out our first set of tools on the internet. This makes me bullish on the next generation of TVs providing those collective experiences.
TVs, even “Smart TVs”, are still very much devices meant for video consumption. As video calling explodes and becomes the norm, TVs will become the next video calling tools for families. They’ll come with high-quality cameras built for group video calling. Laptops and phones will become remote controls for Smart TVs. The new TVs will bring human presence into the home with their huge, awe-inspiring screens. The television will be back as the center of the home, but this time not as a passive consumption device, but as an interaction tool. TVs will have multiple cameras and will be touchscreens. It’ll be a new kind of window and everyone will need one. TVs won’t be portals into the world, they will become portals into our world: a new digital window through which we talk to friends. Awkward family video calls where we pass the phone around will get replaced with chatter sitting in front of the TV camera. Distributed families will schedule Netflix streaming together and chat about things over an ensuing lunch. The sad but true picture of a family buried in 4 different devices will become less common. Devices and individual consumption experiences will still hold their allure, but we’ll see more group interactions as the TV serves as the hub of the home. The Digital Window will be the rebirth of the landline; it will belong to the family. Voice-based input and digital assistants will make interacting with TVs feel natural.
Of course, this prediction could be totally right yet totally wrong if the same collective need manifests itself in purely digital/augmented reality spaces (think virtual walls or wearing spectacles to join a group video call).
Remote presence
The need for human presence in a remote setup will lead to the development of refined presence and availability indicators. Our cameras will always be on and the OS will try to gauge the complexity of the task we’re handling. This will help compute mood and presence, allowing apps to signal the nature of our availability. These computations will be local and privacy-preserving.
Increasing network reliability and constant online presence will mean that the notion of a call will go away. Everyone will always be reachable, like they are in the real world (isn’t that quite true already?). We’ll instead knock on the doors of the spaces, walking in and out of conversations. Interacting will be smoother.
People will start their workday explicitly, be it through device mode or time tracking or other rituals, unlike today when you just stumble into it when you open your email app. This will be a method to self-signal a work mode. We’ll set up digital rooms/video spaces that employees will join to signal their availability at work to both the employer and to themselves.
Scale-sensitivity
Our interfaces will become scale-sensitive: automatically muting participants, generating chatter, tracking focus levels of participants and more to create the appropriate setup for an interaction. The purpose of changing these knobs and toggles will be to set the appropriate norm for a conversation and recreate the appropriate space.
This would be something like the below:
- 2–7 participants (small room/people around a table): Everyone can speak, no signal tuning.
- 7–30 participants (medium-sized meeting room): Focus tracking available, some level of generated chatter.
- 30+ participants (large hall): Heavy amounts of generated chatter. Participants can’t speak without holding a button persistently or making a request to the host.
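As a rough illustration, the tiers above could be encoded as a small lookup. This is only a sketch under my own assumptions: the names `RoomProfile` and `profile_for` are hypothetical, the tiers overlap at 7 and 30 so I arbitrarily place those counts in the smaller tier, and the mic rule for the medium room and focus tracking for the large hall are guesses, since the list does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoomProfile:
    """Hypothetical bundle of settings for one tier of call size."""
    name: str
    open_mic: bool          # can anyone speak without asking first?
    focus_tracking: bool    # is focus tracking switched on?
    generated_chatter: str  # "none", "some" or "heavy"


SMALL = RoomProfile("small room", open_mic=True,
                    focus_tracking=False, generated_chatter="none")
# Assumption: the medium room still lets everyone speak freely.
MEDIUM = RoomProfile("medium meeting room", open_mic=True,
                     focus_tracking=True, generated_chatter="some")
# Assumption: the large hall keeps focus tracking on; speaking needs
# push-to-talk or a request to the host, so open_mic is False.
LARGE = RoomProfile("large hall", open_mic=False,
                    focus_tracking=True, generated_chatter="heavy")


def profile_for(participants: int) -> RoomProfile:
    """Pick a profile from the participant count.

    Assumption: counts of exactly 7 and 30 fall into the smaller tier.
    """
    if participants < 2:
        raise ValueError("a call needs at least 2 participants")
    if participants <= 7:
        return SMALL
    if participants <= 30:
        return MEDIUM
    return LARGE
```

For example, `profile_for(5)` gives the small-room settings, while `profile_for(100)` gives the large hall, where the interface rather than physical distance enforces that most people are there to listen.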
Live video production
The variance of the live video experience will increase. You’ll get increasing amounts of polished pre-recorded videos (like the ones at Apple WWDC 2020) as well as more chaotic live streams, each with its own purpose and character. Many speakers will see the efficiency benefits of recording once and playing anywhere, instead of doing talks live repeatedly. This will increase the demand for video editing, and cheap recording studios with prosumer equipment will become a thing. Speaking will become a multi-device experience, with the interface incorporating room for notes, visibility into what’s being shown, levels of engagement, screen controls, etc. Green screens will become a niche consumer product (or ML will just get good).
**End notes**
Innovative products are just a step ahead of the market and the consumer. Veer too far ahead and consumers can’t comprehend you, and you fail. The coronavirus has gifted the video-calling industry an inversion of this dynamic: consumers are now far ahead, teaching, meeting, catching up and otherwise living their lives across various video, audio, chat and other communication tools which are generic and catch-all. Most critically, all our tools fall far short of the rich human presence we’re used to. If you’re not convinced about that, feel free to read this essay again!
The tech industry has its attention on this problem. It has been an early adopter of remote working and a power user of these tools, so the gripes with these tools are something we personally experience. Also, there’s a shit ton of money to be made in solving these problems, whether it’s building a better enterprise communication tool or the opportunity of building new kinds of social networks.
Upgrading digital presence by making digital interactions more life-like and enjoyable is going to unleash new levels of human connection. I look forward to tons of experiments, products, and inventions in this area in the coming years.
The UX Collective donates US$1 for each article published in our platform. This story contributed to UX Para Minas Pretas (UX For Black Women), a Brazilian organization focused on promoting equity of Black women in the tech industry through initiatives of action, empowerment, and knowledge sharing. Silence against systemic racism is not an option. Build the design community you believe in.