Building the Robot Economy: DePIN Tokenization as Capital Formation for Embodied AI

From unimodal to multimodal LLMs
LLMs evolved from unimodal systems that process single data types (e.g., text) to multimodal systems capable of understanding and generating diverse modalities such as text, images, audio, video, and more.
The introduction of the transformer architecture in 2017 marked a pivotal moment in the history of LLMs. Google’s seminal paper, "Attention Is All You Need", presented an architecture built entirely on attention, enabling models to focus dynamically on different parts of input sequences. This architecture laid the foundation for modern LLMs by enabling scalable and efficient learning from large datasets.
In 2019, OpenAI released GPT-2, a generative model that showcased remarkable capabilities in text generation. In 2020, GPT-3 further expanded the possibilities of unimodal systems by scaling up parameters (175 billion) and training data. Its ability to perform zero-shot and few-shot learning made it a general-purpose model for text-based applications, ranging from creative writing to code generation.
These models were limited to textual inputs and outputs. The shift toward multimodal LLMs began with efforts to integrate multiple data types into unified frameworks.
In 2023, OpenAI introduced GPT-4, a groundbreaking model capable of processing both text and images. While GPT-4’s initial release focused on text-based tasks, its vision component (GPT-4V) later enabled applications such as image captioning, document analysis, and visual question answering. Concurrent advancements included Google's PaLM-E, which integrated multimodal inputs for robotic control, and Flamingo by DeepMind, designed for visual-language tasks like image-based dialogue.
Several state-of-the-art multimodal models have emerged:
- OpenAI GPT-4V: Processes text and images for tasks such as captioning and document analysis.
- Meta’s ImageBind: Integrates six modalities, including thermal imaging and motion sensors, into a unified representation space for comprehensive understanding.
- Microsoft Kosmos-1: Aligns perception with language understanding for visual-textual tasks.
- Salesforce BLIP-2: Combines visual-language pretraining efficiently, enabling scalable multimodal applications.
Embodied AI: Definition and Multimodal Extension
Embodied AI refers to intelligent systems that integrate physical or virtual bodies to perceive, act, and interact with their environments. Unlike traditional AI models that operate in abstract domains, embodied agents engage directly with the physical world, learning through sensory-motor experiences and adapting their behavior based on environmental feedback. This concept builds on theories of embodied cognition, which emphasize that intelligence emerges from the dynamic interplay between an agent's body and its surroundings.
With the advent of multimodal large models, embodied AI has evolved into Embodied Multimodal Large Models (EMLMs). These systems combine multiple sensory modalities—such as vision, language, audio, and touch—with embodied capabilities to enable rich interactions within physical environments. EMLMs leverage multimodal environment memory (MEM) modules to bridge the gap between high-level reasoning (via large models) and actionable control in real-world tasks. This integration allows embodied agents to perceive diverse inputs, make decisions, and execute actions in dynamic settings.
Architecture of Sensorimotor Loop for Embodied AI
The fundamental difference between embodied systems and traditional agents lies in their ability to deal with the physical consequences of actions and learn from them.
The Sensorimotor Loop Stack
1. Perception Layer: Embodied agents gather multimodal inputs from their environment using sensors for vision, language, audio, and touch. These inputs form the basis for understanding the current state of the world.
2. Multimodal Environment Memory (MEM): MEM modules store critical information about the environment—such as object locations, scene layouts, and historical trajectories—in a multimodal format. This memory serves as a bridge between the perception and reasoning layers.
3. Reasoning Layer: Large models process multimodal data to generate actionable plans. They align sensory inputs across modalities (e.g., visual-language alignment) and formulate high-level strategies for task execution.
4. Control Layer: The control layer translates high-level plans into executable motor commands or navigation instructions. This layer interacts directly with robotic actuators or other physical systems.
5. Physical Interaction: Actions are executed in the physical world, leading to environmental changes. Feedback from these interactions is looped back into the perception layer for continuous learning and adaptation.
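The loop above can be sketched in code. This is a minimal illustration of the control flow only, not any real system's API; all class and function names (MultimodalMemory, perceive, reason, control, act) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalMemory:
    """Hypothetical MEM module: stores multimodal observations over time."""
    entries: list = field(default_factory=list)

    def store(self, observation: dict) -> None:
        self.entries.append(observation)

    def recall(self, n: int = 3) -> list:
        # Hand the most recent environment states to the reasoning layer.
        return self.entries[-n:]

def perceive(environment: dict) -> dict:
    # Perception layer: gather multimodal inputs from sensors.
    return {m: environment.get(m) for m in ("vision", "language", "audio", "touch")}

def reason(memory: MultimodalMemory, goal: str) -> str:
    # Reasoning layer: a large model would turn memory + goal into a high-level plan.
    context = memory.recall()
    return f"plan:{goal} using {len(context)} remembered states"

def control(plan: str) -> str:
    # Control layer: translate the high-level plan into a motor command.
    return f"motor-command for [{plan}]"

def act(environment: dict, command: str) -> dict:
    # Physical interaction: the command changes the environment, and that
    # change feeds back into perception on the next loop iteration.
    environment["last_command"] = command
    return environment

# Three passes through the sensorimotor loop.
env = {"vision": "rgb-frame", "language": "pick up the cup"}
mem = MultimodalMemory()
for _ in range(3):
    obs = perceive(env)
    mem.store(obs)
    plan = reason(mem, goal="pick-up-cup")
    cmd = control(plan)
    env = act(env, cmd)
```

The key structural point is the feedback edge: `act` mutates the environment that `perceive` reads on the next iteration, which is what distinguishes an embodied loop from a one-shot model call.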
Market Outlook: The Rise of Embodied AI
Morgan Stanley analysts estimate that by 2040, the U.S. may have 8 million working humanoid robots, with a $357 billion impact on wages. By 2050, this number could rise to 63 million, potentially affecting 75% of occupations, 40% of employees, and roughly $3 trillion in payroll.
We are reaching a point where economists may no longer use population growth or decline as a leading indicator of economic output. Countries that deploy humanoid robots can offset shrinking workforces; countries that do not risk being left behind economically.
Key Challenges in Embodied AI Development
- Simulation vs. Reality Gap: Training embodied agents in simulated environments can be more efficient and cost-effective than training in the real world. However, discrepancies between simulation and reality, such as inaccurate physics models or sensor noise, can lead to poor performance when transferring trained agents to real-world tasks.
- Data Acquisition: Embodied AI requires vast amounts of multimodal data that capture real-world scenarios. Acquiring diverse, high-quality data can be difficult and expensive, particularly for rare events or edge cases.
- Mass production: Embodied AI development requires significant upfront capital investment in hardware, software, and specialized training infrastructure. The financial burden of mass production, including the costs of materials, manufacturing, and ongoing maintenance, can present a barrier to entry for many innovators.
This challenge spans both the robots' hardware and the data required to train them; training itself is highly capital intensive, demanding either substantial funding or substantial manpower.
Crypto has proven effective at efficient, large-scale capital formation, as seen with Helium and Hivemapper.
DePINs enable decentralized fundraising by allowing individuals to invest in a project through token purchases. This distributed ownership structure creates a strong incentive layer for people to purchase and train embodied AI agents: with a token, owners can be rewarded when their robots are rented or used by industry for a fee.
Specifically, a future may emerge where retail investors can purchase or co-own robots that are trained by them and/or the community, who collectively earn tokens for their contributions. These robots can then be deployed and rented to industry or individuals to carry out specific tasks, generating revenue for their owners and contributors. This model facilitates faster data collection, better training, and wider accessibility for embodied AI agents, effectively democratizing access to these powerful tools.
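The revenue-sharing arithmetic this model implies can be made concrete with a toy example. The function below, its split ratio, and the holder names are invented for illustration; they are not parameters of any real DePIN protocol.

```python
# Toy model of DePIN-style revenue sharing for a co-owned robot.
# All figures and split ratios are illustrative assumptions.

def distribute_rental_revenue(revenue: float, ownership: dict,
                              trainer_share: float = 0.20) -> dict:
    """Split a rental fee between training contributors and token holders.

    ownership maps each holder to their fraction of the robot's tokens
    (fractions must sum to 1.0); trainer_share is the cut reserved for
    whoever supplied training demonstrations.
    """
    assert abs(sum(ownership.values()) - 1.0) < 1e-9
    trainer_pool = revenue * trainer_share
    owner_pool = revenue - trainer_pool
    payouts = {holder: owner_pool * frac for holder, frac in ownership.items()}
    payouts["trainer_pool"] = trainer_pool
    return payouts

# A robot earns a $500 rental fee; three retail co-owners hold its tokens.
payouts = distribute_rental_revenue(
    500.0, {"alice": 0.5, "bob": 0.3, "carol": 0.2})
```

Here 20% ($100) flows to the pool rewarding training contributions, and the remaining $400 is divided pro rata among token holders, so ownership stake and training effort are rewarded separately.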
Case in point: FrodoBots is building a decentralized robotic gaming platform where players can remotely control real-world robots while simultaneously generating valuable training data for embodied AI. By gamifying robot operation through its "Earth Rovers" and "Octo Arms" titles, FrodoBots effectively crowdsources massive amounts of human demonstration data that can be used to train embodied AI systems. Its DePIN tokenization model allows retail investors to purchase or co-own robots, earn tokens for contributing to training through gameplay, and generate revenue by renting these trained robots to industry partners.
This creates a sustainable ecosystem where gamers, robot owners, AI researchers, and industry users all benefit from the shared infrastructure. It shows how DePIN can bridge the capital formation and data acquisition challenges of embodied AI by transforming what would typically be expensive research into an accessible and rewarding experience for a global community. Their approach effectively democratizes both robot ownership and the training process for embodied AI, potentially accelerating development in this capital-intensive field.
Join the Sei Research Initiative
We invite developers, researchers, and community members to join us in this mission. This is an open invitation for open source collaboration to build a more scalable blockchain infrastructure. Check out Sei Protocol’s documentation, and explore Sei Foundation grant opportunities (Sei Creator Fund, Japan Ecosystem Fund). Get in touch - collaborate[at]seiresearch[dot]io
References
https://www.morganstanley.com/ideas/humanoid-robot-market-outlook-2024
https://arxiv.org/abs/2103.04918