
Creating Images and Videos with Multimodal LLMs

Introduction

Imagine having a conversation with an AI that not only understands your words but can see images, create visual content, and even work with videos. That’s exactly what multimodal Large Language Models (LLMs) can do. These advanced AI systems are revolutionizing how we interact with artificial intelligence by breaking free from text-only limitations.

Multimodal LLMs are AI models that can process and generate multiple types of content – text, images, and in some cases, videos. Think of them as ChatGPT with eyes and artistic abilities. While traditional LLMs like early versions of ChatGPT could only work with text, these new systems can understand and create visual content, making them incredibly versatile tools for creative work.

This guide is designed for those who are already familiar with basic ChatGPT interactions but want to explore the exciting world of AI-powered visual content creation. If you’ve used ChatGPT before and are ready to take your AI journey to the next level, you’re in the right place.

Understanding Multimodal LLMs

Basic Concept Explanation

Multimodal LLMs are like translators that work across content types rather than languages. Just as a human can understand both written words and visual scenes, these AI systems can process and connect information across different formats. They use neural networks trained on paired text and visual data to learn the relationships between text descriptions and visual elements.

How They Differ from Text-only LLMs

Unlike their text-only predecessors, multimodal LLMs can:

  • Analyze and understand images
  • Generate visual content from text descriptions
  • Combine visual and textual understanding for more comprehensive responses
  • Process multiple types of input simultaneously

Getting Started with Image Creation

Basic Requirements

To begin creating images with multimodal LLMs, you’ll need:

  • Access to a supported platform (like Midjourney, DALL-E, or Stable Diffusion)
  • A basic understanding of prompt writing
  • A clear idea of what you want to create
  • A payment method for services that require subscriptions or credits

DALL-E (OpenAI)

Overview: Created by OpenAI, DALL-E is known for its user-friendly interface and high-quality image generation. It’s now integrated into ChatGPT and available as a standalone service.

Getting Started:

  1. Visit OpenAI’s website and create an account
  2. Navigate to the DALL-E section or access it through ChatGPT Plus
  3. Purchase credits or subscribe to ChatGPT Plus for integrated access
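OpenAI also exposes DALL-E through its developer API, so if you prefer working programmatically rather than through the web interface, you can generate images from a short script. The following is a minimal sketch using the official openai Python package; the model name, image size, and response fields are current at the time of writing and may change:

# Minimal sketch: generating a single image with OpenAI's Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",                       # model name at the time of writing
    prompt="A serene Japanese garden at sunset with cherry blossoms, "
           "traditional stone lanterns, and a small wooden bridge over a koi pond",
    size="1024x1024",
    n=1,
)

print(response.data[0].url)  # each generated image comes back as a URL (or base64 data)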

Pricing:

  • Pay-as-you-go model with credit purchases
  • Free tier available with limited generations
  • ChatGPT Plus ($20/month) includes DALL-E access

Strengths:

  • Excellent at photorealistic images
  • Easy to use for beginners
  • Integrated with text capabilities through ChatGPT
  • Strong safety filters for appropriate content

Limitations:

  • Less community support compared to alternatives
  • Some creative restrictions due to content policies
  • Limited customization options for advanced users

Midjourney

Overview: Midjourney excels at creating artistic, stylized images with impressive aesthetic quality. It operates primarily through Discord.

Getting Started:

  1. Create a Discord account if you don’t have one
  2. Join the Midjourney Discord server
  3. Navigate to one of the “newbie” channels
  4. Use the “/imagine” command followed by your prompt
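For example, typing “/imagine prompt: a peaceful beach at sunset, watercolor style” in a newbie channel will queue your first image generation.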

Pricing:

  • Subscription-based model ($10-50/month depending on tier)
  • No permanent free tier (occasional trial periods)
  • Higher tiers offer faster generation and more features

Strengths:

  • Exceptional artistic quality and aesthetic results
  • Strong community of users sharing prompts and techniques
  • Regular updates with new features and improvements
  • Excellent for creative and artistic applications

Limitations:

  • Discord-based interface might be unfamiliar to some users
  • Learning curve for advanced features
  • No free tier for ongoing use

Stable Diffusion

Overview: An open-source image generation model that can be run locally or accessed through various web interfaces, offering maximum flexibility and customization.

Getting Started:

  1. Choose your access method: a hosted web interface or a local installation
  2. Create an account on your chosen platform, or install the software locally (see the sketch below)
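If you go the local route, the open-source diffusers library from Hugging Face is one common way to run Stable Diffusion on your own hardware. The sketch below assumes a CUDA-capable GPU and that torch, diffusers, transformers, and accelerate are installed; the checkpoint name is only an example, and newer models may serve you better:

# Minimal sketch: running Stable Diffusion locally with Hugging Face diffusers.
# Assumes a CUDA-capable GPU and `pip install torch diffusers transformers accelerate`.
import torch
from diffusers import StableDiffusionPipeline

# Download (or load from the local cache) an example Stable Diffusion checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",       # example checkpoint; newer ones exist
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate one image from a text prompt and save it to disk.
image = pipe("A vintage leather book on a wooden desk, soft window light").images[0]
image.save("vintage_book.png")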

Pricing:

  • Web services: Credit-based systems (typically $10-25 for hundreds of images)
  • Local installation: Free after hardware costs
  • Some free options available through community implementations

Strengths:

  • Complete control and customization potential
  • No content restrictions when run locally
  • Active open-source community developing extensions
  • Can run without internet once installed locally

Limitations:

  • Technical knowledge required for local setup
  • Higher hardware requirements for local installation
  • Image quality can vary between implementations
  • Interface experience less polished on some platforms

First Steps

Setting Up Your Workspace

  • Choose a platform that matches your needs and skill level
  • Create necessary accounts and ensure proper access
  • Familiarize yourself with the interface and basic commands
  • Consider setting up folders or collections to organize your generated images
  • Bookmark resources and communities for learning and troubleshooting

Choosing the Right Platform

When selecting a platform, consider these factors:

For Beginners:

  • DALL-E offers the simplest entry point with intuitive interfaces
  • Midjourney provides excellent results with simpler prompts
  • Web interfaces for Stable Diffusion require less technical knowledge

For Professional/Commercial Use:

  • Check licensing terms carefully – they differ across platforms
  • Midjourney offers commercial licenses with higher-tier subscriptions
  • DALL-E now permits commercial use of generated images
  • Stable Diffusion’s open license offers flexibility for commercial projects

For Artistic Projects:

  • Midjourney excels at artistic styles and aesthetic quality
  • Stable Diffusion offers more customization for specific artistic visions
  • DALL-E performs well with clear, detailed artistic descriptions

For Technical/Scientific Visualization:

  • DALL-E generally produces more accurate technical images
  • Stable Diffusion with specific fine-tuned models can excel in specialized domains
  • Consider the platform’s ability to understand domain-specific terminology

Budget Considerations:

  • For occasional use: DALL-E’s pay-as-you-go model is economical
  • For frequent use: Subscription services like Midjourney may be more cost-effective
  • For unlimited use: Local Stable Diffusion installation has no ongoing costs

Writing Effective Prompts

Key Elements of a Good Image Prompt

  • Be specific about subject matter
  • Include style references
  • Specify composition details
  • Mention lighting and mood
  • Include technical preferences
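One way to keep these elements straight while you experiment is to draft each one separately and join them into a single prompt. The snippet below is purely illustrative; the component text is made up, and a plain notes file works just as well:

# Illustrative sketch: assembling an image prompt from the elements listed above.
prompt_parts = {
    "subject":     "a ceramic coffee mug on a rustic wooden table",
    "style":       "styled like a cozy product photograph",
    "composition": "close-up, shallow depth of field, mug slightly off-center",
    "lighting":    "warm morning light from a window on the left",
    "technical":   "high resolution, 1:1 aspect ratio",
}

prompt = ", ".join(prompt_parts.values())
print(prompt)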

Common Mistakes to Avoid

  • Being too vague
  • Overcomplicating prompts
  • Ignoring important details
  • Using ambiguous terms
  • Forgetting to specify image style

Example Prompts and Results

Example 1: “A serene Japanese garden at sunset with cherry blossoms, traditional stone lanterns, and a small wooden bridge over a koi pond”

Example 2: “A modern minimalist workspace with a MacBook, white desk, and natural lighting from large windows, shot from above”

Creating Your First Images

Step-by-Step Guide

  1. Selecting Your Tool
    • Choose a platform based on your specific needs
    • Consider starting with user-friendly options like DALL-E
    • Ensure you have necessary credits or subscription
  2. Writing Your Prompt
    • Start with a clear, detailed description
    • Include specific style references
    • Mention important details about composition
  3. Reviewing and Refining Results
    • Analyze the generated image
    • Identify areas for improvement
    • Adjust your prompt accordingly

Practice Exercises

Simple Object Generation

Start with basic objects:

  • A red apple on a white background
  • A vintage leather book
  • A ceramic coffee mug
LLM Image Generation Example 1

Scene Creation

Progress to more complex scenes:

  • A cluttered office with bookshelf
  • A bustling city street
  • A peaceful beach at sunset
LLM Image Generation Example 2
LLM Image Generation Example 3

Style Variations

Practice generating the same subject in different styles:

  • Photorealistic
  • Watercolor painting
  • Digital art
  • Abstract interpretation
Watercolor Styled LLM Image Generation Example

Working with Videos

Understanding Current Limitations

While video generation technology is rapidly evolving, it’s important to understand that current capabilities are still limited compared to image generation. Most systems focus on short clips (typically 2-10 seconds), simple animations, or image-to-video conversions. Full-length, complex video generation with perfect coherence is still in development.

Available Tools and Platforms

Runway Gen-2

Overview: One of the most advanced AI video generation platforms currently available, offering text-to-video, image-to-video, and sophisticated editing capabilities.

Getting Started:

  1. Visit Runway’s website and create an account
  2. Choose a subscription plan that fits your needs
  3. Use their intuitive interface to generate videos from text prompts or images

Pricing:

  • Free tier: Limited generations with watermarks
  • Standard plan: $12/month for more generations and higher quality
  • Pro plan: $28/month for maximum quality and priority access

Strengths:

  • Excellent video quality compared to competitors
  • Intuitive interface with powerful editing capabilities
  • Multiple generation methods (text-to-video, image-to-video)
  • Robust professional features for content creators

Pika Labs

Overview: A rapidly improving AI video generation platform known for its creative capabilities and accessible Discord interface.

Getting Started:

  1. Join Pika’s Discord server
  2. Use simple commands in designated channels to generate videos
  3. Choose between text-to-video or image-to-video generation

Pricing:

  • Free tier with daily limited generations
  • Premium plans available with increased generation limits and quality

Strengths:

  • Community-focused with active Discord user base
  • Frequent updates and improvements
  • Accessible interface for beginners
  • Supports various video styles and animations

Luma AI Dream Machine

Overview: Specializes in high-quality, realistic 3D video generation with emphasis on natural motion and lighting.

Getting Started:

  1. Visit Luma AI’s website and request access
  2. Once approved, use their web interface to create video content
  3. Download and use the generated videos

Pricing:

  • Credit-based system with free starter credits
  • Premium plans for professional use

Strengths:

  • Superior 3D environment generation
  • Realistic motion and physics
  • High-quality lighting and textures
  • Good for product visualization and virtual environments

Sora (OpenAI)

Overview: Though still not publicly available at scale, OpenAI’s Sora represents the cutting edge of video generation technology, capable of creating highly realistic one-minute videos from text prompts.

Status: Currently in limited testing with select creators; broader access expected in the future.

Capabilities:

  • Creates longer, more coherent videos (up to 60 seconds)
  • Maintains impressive scene consistency
  • Handles complex motion and physics
  • Generates realistic human movements

Basic Video Generation Techniques

Text-to-Video Generation

The simplest way to create AI videos is through text prompts. Here are practical examples with explanations:

Basic Example: “A red balloon floating up into a blue sky”

  • Why it works: Simple subject, clear motion, uncomplicated background
  • Best platform: Works well on all platforms, good starting point
  • Actual prompt to try: “A single red balloon slowly floating upward against a clear blue sky, gentle breeze, cinematic lighting, 4K quality”
Red Balloon LLM Generated Video

Intermediate Example: “Camera slowly panning through an autumn forest with golden leaves falling”

  • Why it works: Specifies camera movement, setting, and action
  • Best platform: Runway Gen-2 handles natural environments particularly well
  • Actual prompt to try: “Slow horizontal camera pan through a dense autumn forest, golden leaves gently falling, dappled sunlight, film look, 24fps”

Advanced Example: “Time-lapse of a city transitioning from day to night with lights turning on”

  • Why it works: Includes time transformation, multiple lighting states
  • Best platform: Luma AI and Runway handle lighting changes effectively
  • Actual prompt to try: “Time-lapse sequence of downtown Chicago skyline transitioning from late afternoon to night, buildings gradually illuminating, golden hour to blue hour, aerial perspective, cinematic quality”

Image-to-Video Techniques

Start with a reference image and specify how it should animate:

Static Object Animation:

  1. Upload an image of a still object (like a flower or product)
  2. Prompt: “Add gentle movement as if in a slight breeze”
  3. Example: For a product image, try “Rotate the [product] 360 degrees slowly, maintain professional lighting”

Scene Extension:

  1. Upload a landscape or environment image
  2. Prompt: “Extend this scene with gentle camera movement to the right”
  3. Example: For a beach sunset, try “Pan slowly across the beach, add subtle wave movement, maintain golden hour lighting”

Character Animation:

  1. Upload an image containing a character or animal
  2. Prompt: “Add subtle lifelike movement while maintaining the pose”
  3. Example: For a pet photo, try “Add gentle breathing movement and slight head tilt while maintaining the composition”

Animation Parameters to Specify

For more control over your generated videos, specify these parameters:

  • Motion type: “Smooth and flowing motion like underwater movement” vs “Sharp, mechanical movements like a robot”
  • Camera movement: “Steady drone shot slowly rising above the scene” or “Handheld camera following the subject with subtle shake”
  • Timing: “Slow motion capture of water droplets falling” or “Hyperlapse of clouds moving across the sky”
  • Transitions: “Gradual dissolve from the forest scene to the beach scene” or “Quick cuts between urban environments”
  • Style consistency: “Maintain the watercolor painting style throughout all scene changes”

Example of a Complete Parameter Set:

Generate a video with:
- Motion: Smooth, dreamlike movement
- Camera: Gentle dolly zoom out
- Timing: Slightly slower than real-time
- Style: Maintain film noir aesthetic with high contrast
- Subject: A detective walking through fog at night
- Duration: 4 seconds at 24fps

Working with Different Platforms

Runway Gen-2 Specific Tips:

  • Use the “Director Mode” for more precise camera control
  • Specify aspect ratio in your prompt (16:9, 9:16 for vertical, 1:1)
  • Add “high resolution, detailed texture” for better quality

Pika Labs Specific Tips:

  • Use the /animate command for image-to-video
  • Add “style=cinematic” or “style=anime” parameters
  • Specify “4K” or “HD” for resolution control

Luma AI Specific Tips:

  • Use “Dream Machine” mode for most creative freedom
  • Specify “realistic lighting” for photorealistic results
  • Use “extend scene in [direction]” for spatial expansion

Practical Video Generation Examples

For Professional Use Cases:

  • Product Showcase: “A sleek smartphone rotating 360 degrees on a minimalist white surface with subtle lighting changes highlighting its features”
  • Real Estate Tour: “Smooth camera movement through a modern living room with natural lighting, focusing on architectural details and spacious design”
  • Explainer Video Base: “Simple 3D animation of a gear mechanism demonstrating how a watch movement works, with close-up details”

For Creative Projects:

  • Music Visualization: “Abstract color patterns flowing and pulsing in rhythm, transitioning from cool blues to warm oranges, dreamlike quality”
  • Artistic Expression: “Impressionist-style animation of rain falling on a Parisian street, colors blending and shifting with each raindrop”
  • Story Element: “A mystical glowing doorway in a forest clearing, with magical particles drifting through, suggesting a portal to another world”

For Social Media:

  • Short Loop: “A steaming cup of coffee on a rustic table with morning light streaming in, perfect loop for profile background”
  • Attention Grabber: “Text appearing dramatically through fog with dynamic lighting, revealing a product name”
  • Mood Setting: “Calming ocean waves washing onto shore at sunset, slow motion, meditative quality, 9:16 aspect ratio”

Limitations to Be Aware Of

  • Temporal coherence: Objects may change appearance between frames
  • Human figures: Faces, hands, and complex movements remain challenging
  • Text rendering: Words and text often appear distorted
  • Extended duration: Quality typically decreases in longer videos
  • Physical accuracy: Complex physics interactions may look unnatural

Best Practices and Tips

Optimizing Your Prompts

  • Use clear, descriptive language
  • Include specific details about composition
  • Reference artistic styles or techniques
  • Specify technical parameters when needed
  • Build prompts in logical segments

Our prompt engineering guide has more information on improving LLM prompts.

Troubleshooting Common Issues

  1. Unclear or distorted outputs
    • Break down complex prompts into simpler elements
    • Use more specific descriptions
    • Avoid contradictory instructions
  2. Inconsistent results
    • Maintain consistent style references
    • Use standardized prompt structures
    • Document successful prompts for future use

Getting Better Results

Iterative Improvement

  • Start with basic versions
  • Analyze results carefully
  • Make incremental adjustments
  • Keep track of successful prompts
  • Learn from unsuccessful attempts
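A simple way to keep track of successful prompts is a small log you append to after every session. Here is a minimal sketch; the filename and fields are arbitrary choices:

# Sketch: appending successful prompts to a small JSON log for later reuse.
import json
from datetime import date
from pathlib import Path

LOG_FILE = Path("prompt_library.json")      # arbitrary filename

def log_prompt(platform: str, prompt: str, notes: str = "") -> None:
    entries = json.loads(LOG_FILE.read_text()) if LOG_FILE.exists() else []
    entries.append({
        "date": date.today().isoformat(),
        "platform": platform,
        "prompt": prompt,
        "notes": notes,
    })
    LOG_FILE.write_text(json.dumps(entries, indent=2))

log_prompt("DALL-E", "A red apple on a white background", "clean result on first try")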

Conclusion

Key Takeaways

  • Multimodal LLMs are powerful tools for visual content creation
  • Success requires understanding both technical aspects and creative principles
  • Practice and experimentation are essential for improvement
  • The field is rapidly evolving with new capabilities emerging regularly

Next Steps

  1. Choose your primary platform
  2. Start with simple projects
  3. Build a prompt library
  4. Join community discussions
  5. Experiment with different styles and techniques
