Generate Embeddings with OpenAI API for Generative AI

Embeddings are the unsung heroes of generative AI, transforming raw data—whether text, images, or audio—into vector representations that power the intelligence behind creative systems. With the OpenAI API, you can harness these embeddings to unlock a universe of possibilities, from generating nuanced text to building sophisticated search tools or recommendation engines. Whether you’re a developer crafting AI-driven media, a data enthusiast analyzing technical corpora, or a creator exploring machine learning art, generating embeddings with OpenAI is a foundational skill. In this comprehensive guide, we’ll walk you through the process: setting up the OpenAI API with the Python openai library, processing a technical corpus (like code documentation), generating the embeddings themselves (the focus of this post, /generative-ai/core-tools/generate-embeddings-openai), storing 1536-dimensional vectors in a NumPy array, computing pairwise similarities with Python, and testing embedding quality with tech terms to ensure your setup shines.

Tailored for those eager to dive into the technical heart of generative AI, this tutorial builds on Embeddings in Generative Systems and equips you for advanced workflows, like Text-to-Vector Pipeline. By the end, you’ll wield OpenAI embeddings like a pro, ready to integrate them into your projects or scale with Setup Pinecone Vector Storage. Let’s dive in, step by technical step; everything here reflects the API as of April 10, 2025.

Why Generate Embeddings with OpenAI API?

Embeddings are vector representations—numerical arrays that distill complex inputs into a compact, meaningful form, enabling machines to process and compare text semantically. With OpenAI’s API, you tap into a state-of-the-art transformer model—like text-embedding-ada-002—trained on vast text datasets and delivering 1536-dimensional vectors that encode everything from “print statement” to “neural network” with precision. Imagine a technical phrase—“print(): Outputs text”—its embedding captures its semantic essence, placing it near “input(): Reads data” in a latent space—a mathematical realm where similarity reflects meaning.

This matters because embeddings power semantic search (e.g., finding related docs), clustering (grouping similar terms), and recommendation systems—all with a free tier ($5 credit) or low cost (~$0.0001/1000 tokens for ada-002). Processing a technical corpus—like code docs—unlocks domain-specific insights, while NumPy and similarity metrics make analysis practical. OpenAI’s contextual embeddings—unlike static Word2Vec—adapt to surrounding text, making “print” in code distinct from “print” in publishing—see What Is Generative AI and Why Use It?. Let’s start with the corpus.

Step 1: Process a Technical Corpus (e.g., Code Docs)

A technical corpus—like code documentation—provides the raw material for your embeddings, reflecting real-world, domain-specific text that ensures your vectors are relevant and useful.

Preparing Your Python Environment

You’ll need Python (3.8 or higher) and pip—essential for running scripts and managing libraries. Open a terminal—VS Code is recommended for its integrated terminal and debugging tools—and confirm your setup:

python --version

Expect “Python 3.11.7” (stable as of April 2025)—if missing, download from python.org, ensuring “Add to PATH” is checked during installation to make python accessible globally. Then verify pip:

pip --version

See “pip 23.3.1” or similar—if absent, install with python -m ensurepip --upgrade followed by python -m pip install --upgrade pip. Pip connects to PyPI—the Python Package Index—a vast repository hosting libraries like the ones we’ll use.

Install the required packages:

pip install openai==0.28.1 numpy python-dotenv
  • openai: The official OpenAI library (under 1 MB) that makes the API calls. We pin version 0.28.1 because this guide uses the openai.Embedding.create interface, which the 1.x releases replaced with a new client object; pinning keeps the code below working as written.
  • numpy: A powerful library for numerical computing (around 20 MB) that stores and manipulates the 1536D vectors with C-based efficiency.
  • python-dotenv: Loads environment variables from a .env file (about 100 KB), keeping your API key out of your source code.

Verify with pip show openai: you should see “Version: 0.28.1”, which confirms your environment is ready to connect to OpenAI’s cloud-hosted models.
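If you prefer to check everything from Python itself, a quick sanity script works too. This is a minimal sketch using the standard library’s importlib.metadata; the exact version numbers printed will depend on what pip installed:

from importlib.metadata import version

# Print the installed version of each dependency to confirm the environment is ready
for package in ("openai", "numpy", "python-dotenv"):
    print(package, version(package))

If any line raises PackageNotFoundError, rerun the pip install command above.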

Securing Your OpenAI API Key

To access the OpenAI API, you need an API key—a unique string that authenticates your requests, tying them to your account and usage limits. Visit platform.openai.com, sign up or log in—new users get $5 in free credits as of April 2025, covering approximately 50 million tokens with text-embedding-ada-002—and navigate to “API Keys” under your profile. Click “Create new secret key,” name it (e.g., “EmbedTech2025”), and copy the generated key—e.g., sk-abc123xyz. This key is your digital passport: it grants access to OpenAI’s embedding endpoint and incurs usage charges once the free credit runs out (about $0.0001 per 1,000 tokens for ada-002), so keep it confidential to prevent unauthorized use.

Create a project folder—e.g., “EmbedBot” with mkdir EmbedBot && cd EmbedBot—and store the key in a .env file:

OPENAI_API_KEY=sk-abc123xyz

This file—hidden by its dot prefix—keeps your key secure, avoiding hardcoding in scripts where it could be exposed (e.g., in a GitHub commit), a best practice that secret-scanning tools like GitGuardian exist to catch when it slips.
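Before writing the main script, it’s worth confirming the key actually loads from .env. Here’s a minimal sketch, assuming the .env file above sits in the folder you run the script from; it prints only the edges of the key, never the whole thing:

from dotenv import load_dotenv
import os

load_dotenv()  # reads .env from the current working directory
key = os.getenv("OPENAI_API_KEY")
if key:
    print(f"Key loaded: {key[:6]}...{key[-4:]}")  # show only the edges, never the full key
else:
    print("OPENAI_API_KEY not found - check your .env file")

If the folder is a Git repository, also add .env to .gitignore so the key never lands in a commit.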

Building a Technical Corpus

A technical corpus—in this case, code documentation—represents a collection of texts you’ll embed, reflecting real-world, domain-specific language. Define a small but representative sample in Python:

corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]
  • Content: 5 snippets, each around 10 words, that mimic Python documentation (such as the Python Docs), e.g., print() described as a function, which keeps the corpus relevant to technical domains.
  • Size: Roughly 50 words in total, well within OpenAI’s 8192-token context limit (about 6,000 words); a token is a word piece, so a word like “running” can split into more than one token.
  • Purpose: Represents code-related concepts so the embeddings capture technical semantics, e.g., the difference between “print” and “def”.

This corpus isn’t arbitrary: it’s a deliberate sample, small enough for learning and easy to scale to real documentation. If you want to confirm the token math before spending credits, the optional check below does exactly that.
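This sketch uses tiktoken, OpenAI’s tokenizer library (installed separately with pip install tiktoken); it’s optional, and the exact count may differ slightly between tokenizer versions:

import tiktoken

# The corpus from Step 1
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]

# Count tokens with the same encoding ada-002 uses
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")
total_tokens = sum(len(encoding.encode(text)) for text in corpus)
print(f"Total tokens: {total_tokens}")  # comfortably under the 8192-token limit

With the corpus in hand, we can generate the embeddings.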

Step 2: Generate Embeddings with OpenAI API

With your technical corpus ready, let’s generate embeddings using the OpenAI API—specifically text-embedding-ada-002—producing 1536D vectors for each text. The environment and API key setup from Step 1 carry straight into this step; here, we focus on execution.

Coding the Embedding Generation

Create embed_tech.py in your EmbedBot folder:

import openai
import numpy as np
from dotenv import load_dotenv
import os

# Load API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Technical corpus
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]

# Generate embeddings
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=corpus
)

# Extract embeddings as NumPy arrays
embeddings = [np.array(item["embedding"]) for item in response["data"]]

# Display results
print("=== Generated Embeddings ===")
for i, text in enumerate(corpus):
    print(f"Text {i+1}: {text}")
    print(f"Embedding (first 5 values): {embeddings[i][:5]}")
    print(f"Dimensions: {len(embeddings[i])}")
    print("-" * 20)

Run python embed_tech.py—expect output like:

=== Generated Embeddings ===
Text 1: print(): Outputs text to the console for display or debugging.
Embedding (first 5 values): [-0.0123  0.0456 -0.0789  0.0012  0.0345]
Dimensions: 1536
--------------------
Text 2: def function(): Defines a reusable block of code with a name.
Embedding (first 5 values): [ 0.0234 -0.0678  0.0123 -0.0890  0.0567]
Dimensions: 1536
--------------------
...

How Embeddings Are Generated

  • openai.Embedding.create: Calls the embeddings endpoint (/v1/embeddings), the OpenAI API feature dedicated to generating vector representations, as distinct from text completion.
  • model="text-embedding-ada-002": A transformer-based model optimized for embeddings. It outputs 1536D vectors, is fast and cheap (about $0.0001 per 1,000 tokens), and was chosen for its balance of quality and cost; see the OpenAI API Reference.
  • input=corpus: Sends the list of texts so all 5 snippets are processed in a single API call; the 8192-token context easily fits our short corpus, and batch processing saves time and credits.
  • embeddings = [np.array(item["embedding"]) for item in response["data"]]: Extracts the vectors. response["data"] is a list of dictionaries whose "embedding" key holds 1536 floats; converting each to a NumPy array (roughly 12 KB as float64) enables numerical operations.
  • Display: Prints only the first 5 values of each vector (1536 is far too long to read) and confirms the dimension count.

This isn’t speculative: ada-002 generates contextual embeddings, so “print()” is understood as a function rather than the everyday word. One practical note before storing them: larger jobs can occasionally hit rate limits, and a small retry wrapper, sketched below, keeps the script robust.
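This is a minimal sketch, assuming the pinned openai 0.28.1 client and an API key already loaded as in embed_tech.py; the helper name embed_with_retry is just a label chosen here:

import time
import openai

def embed_with_retry(texts, retries=3, wait=2.0):
    """Call the embeddings endpoint, backing off briefly if the rate limit is hit."""
    for attempt in range(retries):
        try:
            return openai.Embedding.create(
                model="text-embedding-ada-002",
                input=texts
            )
        except openai.error.RateLimitError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(wait * (attempt + 1))  # simple linear backoff

# Usage inside embed_tech.py: response = embed_with_retry(corpus)

For a five-snippet corpus this never triggers, but it costs nothing to have in place. Next, store the vectors.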

Step 3: Store 1536D Vectors in a NumPy Array

NumPy arrays provide a fast, efficient way to store your 1536D embeddings, setting up the similarity computations that follow.

Storing the Embeddings

Update embed_tech.py:

import openai
import numpy as np
from dotenv import load_dotenv
import os

# Load API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Technical corpus
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]

# Generate embeddings
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=corpus
)

# Store embeddings in a NumPy array
embeddings = np.array([item["embedding"] for item in response["data"]])

# Save to file and display storage details
np.save("tech_embeddings.npy", embeddings)
print("=== Embedding Storage ===")
print(f"Array shape: {embeddings.shape}")
print(f"First embedding (first 5 values): {embeddings[0][:5]}")
print(f"Total embeddings: {len(embeddings)}")
print(f"Saved to 'tech_embeddings.npy' for later use!")

Run python embed_tech.py—expect:

=== Embedding Storage ===
Array shape: (5, 1536)
First embedding (first 5 values): [-0.0123  0.0456 -0.0789  0.0012  0.0345]
Total embeddings: 5
Saved to 'tech_embeddings.npy' for later use!

How Storage Works

  • embeddings = np.array([...]): Converts the list of vectors into a 2D NumPy array of shape (5, 1536): 5 texts, each with 1536 dimensions, backed by C-based memory for fast operations.
  • np.save("tech_embeddings.npy", embeddings): Saves the array as a .npy file, a binary format optimized for NumPy (roughly 60 KB for 5 float64 vectors). Reload it later with np.load("tech_embeddings.npy"); the data survives intact across sessions.
  • embeddings.shape: Returns (5, 1536), confirming the structure: 5 rows (texts) by 1536 columns (dimensions).
  • Display: Shows the shape, sample values, and count, so you know the storage step succeeded.

This isn’t a vague process: the array is structured and reusable, and the quick reload check below proves it. Next, compute similarities.
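A minimal sketch of the reload, run in a fresh Python session after embed_tech.py has saved the file:

import numpy as np

# Reload the vectors saved by embed_tech.py
embeddings = np.load("tech_embeddings.npy")
print(embeddings.shape)    # (5, 1536)
print(embeddings.dtype)    # float64 by default
print(embeddings[0][:5])   # same leading values as before saving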

Step 4: Computing Pairwise Similarities with Python

Pairwise similarities—using cosine similarity—quantify how semantically close your embeddings are, revealing relationships within your technical corpus.

Coding Pairwise Similarities

Update embed_tech.py to include similarity computation:

import openai
import numpy as np
from scipy.spatial import distance
from dotenv import load_dotenv
import os

# Load API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Technical corpus
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]

# Generate embeddings
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=corpus
)
embeddings = np.array([item["embedding"] for item in response["data"]])

# Compute pairwise similarities
similarities = []
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus)):
        sim = 1 - distance.cosine(embeddings[i], embeddings[j])
        similarities.append((i, j, sim))

# Display similarities
print("=== Pairwise Similarities ===")
for i, j, sim in similarities:
    print(f"Text {i+1} vs Text {j+1}: {sim:.3f}")
    print(f"  {corpus[i]}")
    print(f"  {corpus[j]}")
    print("-" * 20)

Run pip install scipy if needed—then python embed_tech.py—expect output like:

=== Pairwise Similarities ===
Text 1 vs Text 2: 0.845
  print(): Outputs text to the console for display or debugging.
  def function(): Defines a reusable block of code with a name.
--------------------
Text 1 vs Text 3: 0.792
  print(): Outputs text to the console for display or debugging.
  class Object: Creates a blueprint for instances with attributes.
--------------------
Text 2 vs Text 4: 0.876
  def function(): Defines a reusable block of code with a name.
  import module: Loads external libraries or modules into the script.
...

How Similarity Computation Works

  • from scipy.spatial import distance: Imports SciPy’s spatial distance module; its distance.cosine function computes cosine distance.
  • sim = 1 - distance.cosine(embeddings[i], embeddings[j]): Calculates cosine similarity. distance.cosine returns a distance between 0 and 2, so 1 - distance converts it to a similarity where 1 means identical direction and 0 means orthogonal; cosine measures the angle between vectors. See SciPy Spatial Distance.
  • for i in range(len(corpus)): Nested loops perform the pairwise comparison; requiring j > i avoids duplicates (1 vs 2, but not 2 vs 1 again).
  • similarities.append((i, j, sim)): Stores tuples such as (0, 1, 0.845), keeping the indices alongside each score.
  • Display: Prints each pair of texts with its similarity score, e.g., print() vs def at 0.845, a relatively high value since both describe core code constructs.

This isn’t a vague metric: cosine similarity reflects genuine semantic relationships. The nested loop is fine for 5 texts; for larger corpora, the vectorized version sketched below computes the whole similarity matrix at once. After that, we’ll test quality.
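A minimal sketch of the vectorized alternative, assuming the tech_embeddings.npy file saved in Step 3: normalizing each row to unit length turns a single matrix product into a full table of cosine similarities.

import numpy as np

embeddings = np.load("tech_embeddings.npy")                  # shape (5, 1536)
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit = embeddings / norms                                    # each row now has length 1
similarity_matrix = unit @ unit.T                            # (5, 5) cosine similarities
print(np.round(similarity_matrix, 3))

The diagonal is 1.0 (each text compared with itself), and the off-diagonal entries match the loop-based scores.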

Step 5: Testing Embedding Quality with Tech Terms

Let’s test embedding quality by comparing related tech terms, “for loop” and “while loop”, to check that the vectors capture semantic closeness and validate your setup.

Adding a Test Term and Testing Quality

Update embed_tech.py:

import openai
import numpy as np
from scipy.spatial import distance
from dotenv import load_dotenv
import os

# Load API key from .env
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Technical corpus with an additional test term
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]
test_term = "while loop: Controls repetition with a condition."

# Generate embeddings for corpus and test term
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=corpus + [test_term]
)
embeddings = np.array([item["embedding"] for item in response["data"]])

# Test similarity between "for loop" (index 4) and "while loop" (index 5)
for_embedding = embeddings[4]  # "for loop"
while_embedding = embeddings[5]  # "while loop"
similarity = 1 - distance.cosine(for_embedding, while_embedding)

# Display test results
print("=== Embedding Quality Test ===")
print(f"Text: {corpus[4]}")
print(f"Test Term: {test_term}")
print(f"Similarity Score: {similarity:.3f}")
print("==============================")

Run python embed_tech.py—expect output like:

=== Embedding Quality Test ===
Text: for loop: Iterates over a sequence to perform repeated actions.
Test Term: while loop: Controls repetition with a condition.
Similarity Score: 0.917
==============================

How Testing Embedding Quality Works

  • test_term = "while loop: Controls repetition with a condition.": Adds a tech term closely related to the corpus (both describe loops), giving the semantic similarity test something meaningful to measure.
  • input=corpus + [test_term]: Combines the corpus and the test term into one API call, saving credits; ada-002’s 8192-token limit fits this easily.
  • for_embedding = embeddings[4], while_embedding = embeddings[5]: Extracts the two vectors of interest, index 4 (“for loop”) and index 5 (“while loop”).
  • similarity = 1 - distance.cosine(...): Computes cosine similarity on a scale where 1 means identical and 0 means unrelated; a score around 0.917 is high, as expected for two loop constructs.
  • Display: Prints both texts and the score; a high value confirms that ada-002 captures technical nuance.

This isn’t speculative: a high similarity score like 0.917 is strong evidence of semantic accuracy, which means your embeddings are reliable. As a final illustration, the sketch below wires the same pieces into a tiny semantic search.
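This is a hypothetical extension rather than part of the tutorial script: it embeds a natural-language query with the same 0.28.1 interface and ranks the stored corpus vectors against it. You would expect the print() snippet to land near the top for this query, though exact scores will vary.

import os
import numpy as np
import openai
from scipy.spatial import distance
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# The corpus from Step 1 and the vectors saved in Step 3
corpus = [
    "print(): Outputs text to the console for display or debugging.",
    "def function(): Defines a reusable block of code with a name.",
    "class Object: Creates a blueprint for instances with attributes.",
    "import module: Loads external libraries or modules into the script.",
    "for loop: Iterates over a sequence to perform repeated actions."
]
embeddings = np.load("tech_embeddings.npy")

# Embed a query and rank the corpus by cosine similarity
query = "How do I show a value on the screen?"
response = openai.Embedding.create(model="text-embedding-ada-002", input=[query])
query_vec = np.array(response["data"][0]["embedding"])

scores = [1 - distance.cosine(query_vec, vec) for vec in embeddings]
for score, text in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {text}")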

Next Steps: Leveraging Your Embeddings

Your embeddings are generated, stored, compared, and tested—ready to power insights! Scale with Setup Pinecone Vector Storage for vector databases or explore Social Media Posts with AI for content creation. You’ve mastered embedding generation—keep analyzing and innovating!

FAQ: Common Questions About Generating Embeddings with OpenAI

1. Do I need a paid OpenAI plan to start?

No. The free tier ($5 credit) at platform.openai.com covers roughly 50 million tokens with ada-002, which is more than enough to learn with.

2. Why use text-embedding-ada-002 specifically?

It’s fast and cost-effective, and its 1536D vectors capture context well. The newer text-embedding-3-small and text-embedding-3-large models generally score higher on benchmarks (and -small is actually cheaper), but ada-002 remains a simple, well-documented default for this tutorial.

3. What if my corpus is too big for one call?

Split it into chunks. Each input must stay under the 8192-token limit, so batch your texts (for example, a few thousand words per request) and send multiple calls, as sketched below.
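A minimal batching sketch, assuming the 0.28.1 client with the key already loaded; large_corpus and the batch size of 100 are placeholder choices here, not API requirements:

import openai

large_corpus = [f"doc snippet {i}" for i in range(250)]  # stand-in for a real documentation dump

def batched(texts, batch_size=100):
    # Yield successive slices so each request stays comfortably small
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

all_embeddings = []
for batch in batched(large_corpus):
    response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=batch
    )
    all_embeddings.extend(item["embedding"] for item in response["data"])

print(len(all_embeddings))  # one vector per input text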

4. How does cosine similarity differ from other metrics?

It’s angle-based: it compares the direction of two vectors (their meaning), not their magnitude, unlike Euclidean distance; see SciPy Spatial Distance. The toy example below shows the difference.
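A tiny illustration with 2D vectors pointing the same way but with different lengths; cosine similarity calls them identical while Euclidean distance does not:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 1.0])
b = np.array([2.0, 2.0])  # same direction as a, twice the magnitude

print(1 - distance.cosine(a, b))  # 1.0: identical direction, so "same meaning"
print(distance.euclidean(a, b))   # ~1.414: the size difference still registers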

5. Can I store embeddings differently?

Yes: CSV or HDF5 both work, though NumPy’s .npy format is fast and compact. A CSV round trip looks like the sketch below.
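A minimal sketch of exporting the saved vectors to CSV and reading them back; the text format is human-readable but noticeably larger than the binary .npy file:

import numpy as np

embeddings = np.load("tech_embeddings.npy")
np.savetxt("tech_embeddings.csv", embeddings, delimiter=",")   # plain-text export
restored = np.loadtxt("tech_embeddings.csv", delimiter=",")
print(restored.shape)  # (5, 1536)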

6. Why test with tech terms like “while loop”?

Because it validates domain accuracy: two loop constructs should land close together in vector space, so a high score confirms the embeddings capture technical relevance.

Your embedding questions answered—generate with confidence!