Setup and Load Data into Elasticsearch
This script reads a CSV file containing documents, generates embeddings for a specified "contents" field using a Sentence Transformers model, and indexes the documents into an Elasticsearch index.
Features
- .env Configuration: Optionally reads Elasticsearch host, credentials, index name, and CSV path from a `.env` file.
- Index Management: Can optionally create a new index using a default mapping file if `CREATE_INDEX` is set to `True`. If `CREATE_INDEX` is `False`, the script verifies that the index exists.
- CSV Ingestion: Reads documents from a CSV file and verifies the existence of a `contents` column. If the column is not found, the script exits.
- Embeddings Generation: Uses a `SentenceTransformer` model (`paraphrase-MiniLM-L6-v2`) to generate 384-dimensional embeddings for each document’s contents.
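As a rough illustration of the CSV check and embedding steps listed above, the sketch below shows one way they could fit together. It is not the script's actual code: the function names, the `embedding` field name, and the mapping layout are assumptions made for this example, and the sketch raises an error where the script itself exits.

```python
import csv
import io


def load_documents(csv_file):
    """Read rows from an open CSV file and require a 'contents' column.

    Hypothetical helper: the real script exits when the column is missing;
    this sketch raises ValueError so the caller can decide what to do.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or "contents" not in reader.fieldnames:
        raise ValueError("CSV file has no 'contents' column")
    return list(reader)


def build_mapping(dims=384):
    """Return an index mapping with a dense_vector field.

    The field name 'embedding' is an assumption; the default mapping file
    shipped with the script may use a different name or extra fields.
    """
    return {
        "mappings": {
            "properties": {
                "contents": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": dims},
            }
        }
    }


def embed_contents(documents):
    """Attach an embedding to each document using paraphrase-MiniLM-L6-v2.

    The import is done lazily so the helpers above can be used without
    sentence-transformers installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    vectors = model.encode([doc["contents"] for doc in documents])
    for doc, vector in zip(documents, vectors):
        doc["embedding"] = vector.tolist()
    return documents


# Example with an in-memory CSV instead of a file on disk.
sample = io.StringIO("id,contents\n1,hello world\n")
docs = load_documents(sample)
```

Calling `embed_contents(docs)` would download the model on first use, so it is left out of the example run above.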
Requirements
- Python 3.8+
- `requests`
- `python-dotenv`
- `sentence-transformers`
- A running instance of Elasticsearch (e.g., Elasticsearch 7.x+ or Elasticsearch 8.x+), accessible at the specified `ELASTIC_HOST`.
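To show how the settings mentioned in this README might be picked up, here is a minimal sketch using `python-dotenv`. Only `ELASTIC_HOST` and `CREATE_INDEX` are names taken from this document; the helper names, the dictionary keys, and the default host `http://localhost:9200` are assumptions for illustration and may not match what the script does.

```python
import os


def parse_bool(value):
    """Interpret common truthy strings such as 'True', 'true', '1', 'yes'."""
    return str(value).strip().lower() in {"1", "true", "yes"}


def read_settings(env=None):
    """Build a settings dict from environment-style key/value pairs.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    return {
        # Default host is an assumption, not taken from the script.
        "host": env.get("ELASTIC_HOST", "http://localhost:9200"),
        "create_index": parse_bool(env.get("CREATE_INDEX", "False")),
    }


def read_settings_from_dotenv():
    """Load a .env file (if present) into the environment, then read settings."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()
    return read_settings()


# Example with explicit values rather than a real .env file.
settings = read_settings({"ELASTIC_HOST": "http://es:9200", "CREATE_INDEX": "True"})
```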
Setup
- Install Dependencies

Install Python dependencies using:

```bash
pip install -r requirements.txt
```

Running the script

```bash
python set_up_elasticsearch.py
```