  This confusion between humans and machines may feel new, but it predates such digital forms. The early modern era was the heyday of clockwork automata, which are complex mechanical devices designed to automatically follow a sequence of operations. The term “automata” has long been associated with automatic puppets resembling animals or people, such as the Jaquet-Droz automata I discussed in the previous chapter. From the cuckoo clock to a defecating duck, these proto-robots aped life in sometimes astonishing ways.

  In 1906, the German psychiatrist Ernst Jentsch theorized that clockwork automata could induce an uncanny feeling due to their unnerving resemblance to living beings. Such beings, he argued, embody a grotesque breakdown between life and its absence. “Among all the psychical uncertainties that can become an original cause of the uncanny feeling,” he wrote, “there is one in particular that is able to develop a fairly regular, powerful and very general effect: namely, doubt as to whether an apparently living being is animate and, conversely, doubt as to whether a lifeless object may not in fact be animate.”[11] Jentsch cited German fantasy and horror author E.T.A. Hoffmann’s 1816 tale “The Sandman,” in which the protagonist, Nathaniel, falls in love with a lifelike automaton he believes to be human. This confusion, Jentsch claimed, is “one of the most reliable artistic devices for producing uncanny effects” in storytelling.

  Sigmund Freud, the father of psychoanalysis, later expanded on this idea through the etymology of the German unheimlich, or unhomely, defining the uncanny as “that class of the frightening which leads back to what is known of old and long familiar.”[12] In other words, uncanny feelings are aroused by the familiar cohabiting with the alien, when what we know intimately well becomes estranged.

  This logic underpins robotics professor Masahiro Mori’s concept of the uncanny valley, which describes the eerie, unsettled sensation a person feels when they encounter an artificial object that closely resembles a human being but falls short. More specifically, it hypothesizes a phenomenon whereby the more human a robot appears, the more empathy an observer will feel—until a perceptual cliff is reached, and empathy gives way to revulsion. The valley in question here refers to that plunge: the moment when artificial likeness becomes too close for comfort, and yet is not close enough. In other words, something is off with automata, robots, and simulations in the uncanny valley; they are neither human nor machine, neither living nor dead. In an age of digital technologies, this effect extends beyond humanoid robots to virtual reality, augmented reality, 3D avatars, and yes, to voices. These disembodied voices, neither fully human nor purely machine, neither living nor dead, haunt our soundscapes with a deep strangeness under the veneer of humanity.

  In 2021, a film chronicling Anthony Bourdain’s life resurrected the renowned chef and travel documentarian from the dead. In Roadrunner: A Film About Anthony Bourdain, he can be heard reading from a letter he wrote to a friend: “My life is sort of shit now,” he says in his familiar gravelly voice. “You are successful, and I am successful, and I’m wondering: Are you happy?” Was this a recording miraculously unearthed from his effects after his tragic suicide? Hardly: the director of the film, Morgan Neville, commissioned the sound bite from a company that uses artificial intelligence to produce vocal deepfakes from sonic archives. Neville defended this creative choice, but an outraged public decried it as a creepy blasphemy of Bourdain’s memory, the worst kind of artistic necrophilia.

  Despite this outcry, Bourdain’s virtual Lazarus is but the latest incarnation of technological reanimation. Reproduction technologies have been used to conjure dead celebrities in the past: famous examples include Tupac Shakur’s hologram performing at Coachella and a CGI Fred Astaire tap-dancing through a Dirt Devil vacuum cleaner commercial. Hologram resurrections are almost an industry unto themselves, with high-definition laser and digital technologies conscripted in service of music hologram concerts that merge these crisp reproductions with live orchestration. The aim is to bring these resurrections closer and closer to life. As David Rowell writes in The Endless Refrain, for their audiences, these digital performances can foster a visceral sense of connection with an artist’s onstage persona, and at their best, make old music new again.

  Deepfakes, however, bring such revivals into uncanny new terrain. As voice clones generated by machine learning systems hew ever closer to their human originals, they become indistinguishable to the human ear. For some, this nascent technology holds great promise, offering realistic vocal models for people with speech impairments, more convincing voice assistants, intimate chatbots, and myriad uses in the entertainment industry. For others, a foreboding future looms in which vocal deepfakes erode trust in traditional forms of evidence and herald even more annoying robocalls and phone scams. Yet such technologies never cease to surprise, and they can certainly be used for good: in late 2024, the UK mobile operator Virgin Media O2 introduced Daisy, an AI “granny” assembled from several models, who answers phone scammers as a kind, elderly woman. By rambling on about her cats, her family, and her knitting, Daisy wastes their time, a computational effort that translates into fewer people being scammed. Because the voice is so essentially bound up with human identity, these vocal avatars threaten to change the very meaning of being human.

  Corporate initiatives in AI voice synthesis have proliferated over the past few years. These systems all learn to speak in essentially the same way: by analyzing and replicating human vocal nuance from massive caches of audio data. Researchers have mined existing audio archives to generate voice clones of celebrities and other public figures. In 2019, a pair of Facebook AI researchers, Mike Lewis and Sean Vasquez, released the results of their speech synthesizer, MelNet. Trained on a 452-hour dataset including more than 2,000 TED talks, the machine learning system generated uncanny vocal clones of Bill Gates, Jane Goodall, and George Takei, among other famous voices.

  Similarly, WaveNet, a 2016 project from Google DeepMind, synthesized voices by modeling the raw waveforms of existing human speech. Since then, a number of international start-ups and research groups have continued to develop the technology and its applications in ways that test traditional boundaries of identity. Cambridge-based Modulate builds voice skins that allow you to cloak yourself in someone else’s voice. Baidu’s Deep Voice can swap a voice’s gender or accent. Other projects are more altruistic. Through Project Revoice, a partnership with the ALS Association, the Montreal-based AI start-up Lyrebird, named for the Australian bird with the remarkable ability to mimic natural and artificial sounds, aims to restore digital voices to people with the disease who might lose their own.

  Like sound recording and Auto-Tune before it, this latest evolution of digital vocality stirs cultural anxieties about the authenticity of the voice and its capacity to survive death. AI-generated voice clones lie beyond the frontier of the uncanny valley. By detaching the human voice from the body and turning it into an algorithmic object, sound technologies become ever more able to indulge fantasies of immortality. Vocal deepfakes, the spectral afterlife of sound recording and reproduction, animate our voices in the hereafter even as they erase our bodies, our breath, from speech and song.

  Musician Holly Herndon, who uses AI as a compositional tool, has invoked artistic necrophilia to describe some contemporary uses of artificial intelligence in vocal deepfakes and generative music. But Herndon does not denounce AI wholesale. Instead, she treats machine learning as a creative medium, pressing it to perform in new and imaginative ways. Her approach stands in sharp contrast to projects like OpenAI’s Jukebox, a neural net trained on vast datasets spanning music from almost every genre, which generates songs and lyrics in the style of artists both living and dead. In such systems, machine learning becomes a tool of ventriloquy, pirating the voices and styles of dead performers and consummating the desire for resurrection that has haunted sound media from the beginning. Used to reanimate the dead, vocal deepfakes raise pressing ethical questions about the agency of the deceased.

  For the living, vocal clones can be a means of wresting life from loss. After losing his voice to throat cancer, Val Kilmer partnered with the company Sonantic to create a synthetic voice model trained on his audio archive. Coupled with text-to-speech technology, the resulting voice clone gave him the capacity to speak with his original voice, or something very much like it. But vocal likeness alone need not be the measure of someone’s humanity. Synthetic vocality has been a feature of Augmentative and Alternative Communication (AAC) for decades. The physicist Stephen Hawking, one of the technology’s most prominent users, was readily identifiable by his synthesized voice, even making several cameos on The Simpsons.

  Nonetheless, even as these systems give voice, they also stifle its full expression. In her essay “I Still Have a Voice,” disability activist Alice Wong describes the limits of the text-to-speech app she uses to communicate: “The voice options are robotic, clinical, and white. It mispronounces slang and Chinglish, a mix of Mandarin and English which is part of my culture. It also fails to capture my personality, cadence, and emotions.”[13] Wong’s critique of her AAC tools points to what voice clones might become: tools that affirm rather than flatten identity, capable of conveying not just intelligibility but cadence, feeling, culture, and self. This would require the enthusiastic participation of the original speakers in the data collection and encoding, and the integration of cultural context into the process—personal, social, and political meaning—which many current systems sorely lack.

  Much of the public outcry over Anthony Bourdain’s voice clone stemmed from a shared sense of violation: this crass reincarnation could not be what the man himself would have wanted. In contrast, Kilmer’s collaboration with Sonantic to create a voice model that resonated with his own, then very much alive, body suggests a different paradigm for the future of human vocality. Built from the accumulated recordings of a life, a voice model offers not just a digital artifact but a form of agency: an opportunity to define that afterlife in advance. The question becomes whether the ghost in the deepfake machine is genuinely our own or just a database zombie. While the legal terrain remains murky, we can look proactively beyond our own deaths and collaborate with AI from beyond the veil.

  Here, musicians serve as compelling test cases for reimagining the future of vocality. AI is us; it is human labor concealed. Choices about how to curate and label the data used to train deep learning algorithms are just as central to a legacy as the archive itself. Projects like Jukebox demonstrate how musical corpora, spanning genres, cultures, and decades, feed data-hungry platforms not just sound but style, phrasing, and feel. Since that first musical waveform was etched into soot, millions upon millions of songs have been recorded. These massive databases offer machine learning algorithms blueprints for reproducing elements both tangible and ephemeral, such as style, tone, and mood. These creative parameters suggest one approach to installing guardrails on vocal data. Artists might choose to reflect on the affordances and vulnerabilities of their recorded selves, not simply to preserve, but to sculpt, a version of their voice for the future. This “vocal will” could encompass the entire landscape of their personal archive, mere fragments, or nothing at all. By drafting estate plans that reckon with the more ephemeral qualities of their vocal identities, their own artistic styles, and the possibility of new creative contexts generated by artificial intelligence, artists could fashion a future from beyond the grave. Such a plan would set the terms for posthumous creativity by delimiting how, why, where, and by whom a sonic archive is used as training data.

  One approach would be to reject the extractive logic of big data. According to researcher Kate Crawford, AI systems and the industry that creates them are motivated by a ruthless logic that “everything is data and is there for the taking.”[14] By this logic, discrete objects, whether a mugshot or a sound bite, are stripped of personal, social, and political meaning and harvested as raw data in service of profit. Easily lifting these artifacts from databases to serve as training data, AI systems appropriate and erase the human labor that went into their making. On these terms, an AI resurrection can corrode a person’s living memory with the demands of capital and infrastructure. In imagining a creative afterlife, someone might approach their sonic archive selectively. Rather than allowing all their data to become fodder for vocal models, they might permit only the parts of their oeuvre that represent particular eras, themes, or concepts in their biography. A vocal model might then reflect a tightly choreographed slice of a life.

  A person might also restrict, or at least anticipate, the contexts in which their digital resurrection appears, whether in terms of genre, purpose, or ethical alignment. They could define the conditions of deployment, forbidding commercial uses or the voicing of views contrary to their beliefs. They might even preemptively rebuff attempts, like Bourdain’s digital doppelganger, to ventriloquize their own words. While legal frameworks in intellectual property and copyright law have worked to keep pace with new technologies, not all jurisdictions recognize postmortem rights of publicity. Still, by planning ahead, people and their heirs can resist the transformation of their voices into pure commodities and infuse their hereafters with creative intention.

  For most people, who do not have famous voices, this kind of preemptive wrangling might seem irrelevant. But the stakes are real. Kurt Vonnegut once likened artists to canaries in the coal mine—early detectors of the social and technological shocks that others have yet to notice. In an era characterized by the capture and manipulation of data, artists are also especially vulnerable to the public availability of their archives. Song has been central to the evolution of the human voice from life force to data stream. The very first recordings were musical, and music has scored several turning points in this trajectory. We are not all musicians, but many of us have voices, and increasingly we leave astonishing caches of data in the wake of our lives: voicemails, podcasts, video calls. In the century and a half since the plaintive lyrics of “Au Clair de la Lune” were captured on the phonautograph, we have come to generate countless such artifacts of our embodied existence as a matter of course. In the broadest sense, how we allow that data to be used is how we choose to live, even in death. AI speech synthesis may threaten the unique status of the human voice. But it could also help us to find new ways of expressing our humanity.

  Just as John Philip Sousa received the arrival of recorded music with alarm, today we hear AI’s voice clones with awe and unease. But Sousa’s trepidation was misplaced, if not without warrant. Sound technologies have created as many possibilities as they have foreclosed. It’s not simply what a technology is that determines our future but how we use it to speak, to remember, and to create.

  Technologies, like the humans who create them, will continue to spawn as-yet-unimagined futures. However thoroughly we prepare for that future, these parameters will be tested by the ever-expanding capacity of technologies, AI or otherwise, to generate new social, creative, and commercial terrains. Media history is a library of ghost stories, where echoes of the past haunt the present and script the future.

  3

  EAR

  ACROSS NEW YORK CITY, SINGERS GATHER IN BASEMENTS, museums, gymnasiums, boats, and art spaces. A chorus forms, as strangers assume assigned parts sent by email, along with scores for that day’s song. They meet, rehearse for a few hours, and then explode in resplendent harmony. Then they post it to Instagram. The Gaia Music Collective gathers one-day choirs composed of effusive strangers who open their voices to one another in one of the most atavistic forms of musical community. Joined in voice, they recall a time before sound technologies remade our ears, and with them, our heard world.

  If the voice is how we reach outward, the ear is how we are reached. Where the last chapter followed the voice as it was captured and remade, this one turns to listening—to how it, too, has been reconfigured by technology, and with it, how we attend, connect, and feel close to one another. Our ears, once tuned by acoustic communities, are now calibrated by machines. From choirs to cochlear implants, music boxes to algorithmic playlists, listening has become a mediated act—private, curated, and data-driven. This chapter traces the transformation of listening from an embodied, social experience into a computational process, and asks what that shift means for how we know and love one another.

  For millennia, music was a primal, immediate, and often communal experience. Around fires, in homes, theaters, and places of worship, in war and in peace, punctuating birth and death, rituals across the globe often included music. Always performed live, song and instrumentation took place in ephemeral moments shared between performer and audience. Before the emergence of musical notation, songs were also a form of cultural transmission. Borne aloft on sound waves, music is living memory embodied in the breath.

  To hear is to be touched.

  Sound is immersive. It surrounds and fills our bodies. Our ears are but one part of the experience of hearing. The low end—frequencies that plummet below the threshold of human hearing—still vibrates through and infuses our bodies. On dance floors, sub-bass vibrations ripple through skin, muscle, and bone; people hug speakers not just to hear the music, but to become part of it. Listening to someone else sing live is profoundly intimate; it is to be alive in the presence of each other’s bodies. A voice—that singular expression of an individual—emanates from a body’s interior and enters that of another.

  As choral assemblies like the Gaia Music Collective demonstrate, our ears are social organs. They allow us to communicate with one another and to explore and interpret the world around us. Long central to human survival, our sense of hearing also empowers us to participate in two fundamentally human expressions of the voice: speech and song.

 
