UnifoLM-VLA
Implemented Skills
UnifoLM-VLA-0 is a Vision-Language-Action (VLA) large model in the UnifoLM series, designed for general-purpose humanoid robot manipulation. It goes beyond the limitations of conventional Vision-Language Models (VLMs) in physical interaction: through continued pre-training on robot manipulation data, the model evolves from "vision-language understanding" into an "embodied brain" equipped with physical common sense. It features spatial semantic enhancement and generalizes across 12 categories of complex manipulation tasks.