Patent Analysis Assistant - USA

How is the patent data indexed and how are the embeddings created?

Patent data is downloaded from the USPTO Bulk Data Storage System (BDSS), which provides access to issued patents and published applications.
The data contains the full text of all patents from 1790 to the present.
The text is preprocessed by removing special characters and stop words and by stemming the remaining words.
It is then split into chunks using LangChain’s RecursiveCharacterTextSplitter with chunk_size set to 1000, chunk_overlap set to 100, and length_function set to len, so chunk lengths are measured in characters.
Embeddings are created with OpenAI's text-embedding-ada-002 model and have 1536 dimensions.
The embeddings are then stored in a vector database, which is designed for efficient storage and retrieval of high-dimensional data.
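The preprocessing and chunking steps described above can be sketched in plain Python. This is a simplified illustration, not the production pipeline: the stop-word list and the `naive_stem` helper are hypothetical stand-ins for a real stop-word list and stemmer, and `split_with_overlap` uses fixed-width character windows rather than LangChain's separator-aware recursive splitting. Only the chunk size of 1000 and the overlap of 100 come from the description above.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the actual list is not documented.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> str:
    """Drop special characters and stop words, then stem what remains."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()
    tokens = [naive_stem(t) for t in cleaned.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut text into character windows; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With these defaults, each new chunk starts 900 characters after the previous one, so the last 100 characters of a chunk reappear at the start of the next. That overlap keeps a sentence that straddles a chunk boundary retrievable from at least one side.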

How do we keep the data updated?

Our system is integrated with the USPTO Bulk Data Storage System (BDSS). Every week, a process checks for new patents and for updates to existing patents. New and updated data is preprocessed, split into chunks, and embedded before it is stored in our vector database. The embeddings therefore reflect the BDSS data as of the most recent weekly run.
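One way to decide which patents need re-embedding in a weekly run is to compare content hashes. The sketch below is an assumption about how such a check could work, not a description of the actual update job: `plan_weekly_update`, `content_hash`, and the patent IDs in the usage example are all hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a patent's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_weekly_update(
    indexed_hashes: dict[str, str], latest_release: dict[str, str]
) -> dict[str, list[str]]:
    """Compare the latest release against what is already indexed.

    indexed_hashes maps patent ID -> hash of the text that was embedded last time.
    latest_release maps patent ID -> current full text from the weekly download.
    """
    new_ids, updated_ids = [], []
    for patent_id, text in latest_release.items():
        if patent_id not in indexed_hashes:
            new_ids.append(patent_id)  # never embedded before
        elif indexed_hashes[patent_id] != content_hash(text):
            updated_ids.append(patent_id)  # text changed since the last run
    return {"new": sorted(new_ids), "updated": sorted(updated_ids)}


# Hypothetical IDs and texts, for illustration only.
indexed = {"US-1": content_hash("old text"), "US-2": content_hash("same text")}
latest = {"US-1": "new text", "US-2": "same text", "US-3": "brand new"}
plan = plan_weekly_update(indexed, latest)
```

Only the IDs listed under "new" and "updated" would then go through preprocessing, chunking, and embedding again; unchanged patents are skipped.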

Python Code

You can access the embeddings using the EmbedElite Python package or a curl request.

# this returns a list of embeddings
from embedelite import load_embedding

embeddings = load_embedding("uspto-patents")
print(len(embeddings))
print(embeddings[0])  # first embedding in the list

# this returns an object which can be directly inserted into Qdrant
result = load_embedding("uspto-patents", embed_for="qdrant")
# result is {"embeddings": [], "documents": [], "ids": []}
print(result["embeddings"])
print(result["documents"])
print(result["ids"])
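Once the embeddings are loaded, a similarity lookup can be run directly over the returned dictionary. The sketch below assumes only the `{"embeddings", "documents", "ids"}` shape shown in the comment above; `top_k` and `cosine_similarity` are hypothetical helpers, and the three-dimensional vectors are stand-in data (real vectors have 1536 dimensions, and the query vector would need to come from the same text-embedding-ada-002 model).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, result, k=3):
    """Return the k (score, id, document) triples most similar to query."""
    scored = [
        (cosine_similarity(query, vec), doc_id, doc)
        for vec, doc_id, doc in zip(result["embeddings"], result["ids"], result["documents"])
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Stand-in data in the same shape as the load_embedding(..., embed_for="qdrant") result.
sample = {
    "embeddings": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]],
    "documents": ["chunk about batteries", "chunk about antennas", "chunk about both"],
    "ids": ["a", "b", "c"],
}
best = top_k([1.0, 0.1, 0.0], sample, k=2)
```

A brute-force scan like this is fine for small samples. For the full dataset, inserting the result into a vector database such as Qdrant is the more practical route.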

Curl

Request

curl -X POST -H "Content-Type: application/json" -d '{
"doc_id": "uspto-patents"
}' https://api.embedelite.com/v1/embeddings/download/

Response

{
    "mappings": {
        "properties": {
            "doc_source": { "type": "keyword" },
            "sentence": { "type": "text" },
            "embeddings": { "type": "dense_vector", "dims": 1536, "index": false }
        }
    }
}
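The curl request above can also be issued from Python using only the standard library. This is a minimal sketch: the URL and the `doc_id` payload are taken from the curl example, while `build_download_request` and `download` are hypothetical helper names, and the code assumes the service replies with a JSON body.

```python
import json
import urllib.request

# URL taken from the curl example.
API_URL = "https://api.embedelite.com/v1/embeddings/download/"


def build_download_request(doc_id: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"doc_id": doc_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def download(doc_id: str, timeout: float = 30.0) -> dict:
    """Send the request and parse the reply; assumes the service returns JSON."""
    with urllib.request.urlopen(build_download_request(doc_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `download("uspto-patents")` would perform the network request. Building the request in a separate function makes it possible to inspect the method, headers, and body before anything is sent.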

Data credit: USPTO Bulk Data Storage System (BDSS), without which these embeddings would not be possible.
