group-wbl/.venv/lib/python3.13/site-packages/chromadb/experimental/density_relevance.ipynb

2026-01-09 09:12:25 +08:00
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Density based retrieval relevance\n",
"\n",
"An important aspect of using embeddings-based retrieval systems like Chroma is knowing whether the existing dataset contains results relevant to a given query. As application developers, we would like to know when the system doesn't have enough information to complete a given query or task: we want to know what we don't know. \n",
"\n",
"This is particularly important in the case of retrieval-augmented generation, since it's [often been observed](https://arxiv.org/abs/2302.00093) that supplying irrelevant context serves to confuse the generative model, leading to the degradation of application performance in ways that are difficult to detect. \n",
"\n",
"Unlike a relational database which will not return results if none match the query, a vector search based retrieval system will return the $k$ nearest neighbors to any given query, whether they are relevant or not. \n",
"\n",
"One possible approach is to tune a distance threshold and reject any results that fall further away from the query. This might be suitable for certain kinds of fixed datasets, but in practice such thresholds tend to be very brittle, and often exclude many relevant results while not always excluding irrelevant ones. The threshold also needs to be continuously adapted as the data changes. Moreover, such distance thresholds are not comparable across embedding models for a given dataset, nor across datasets for a given embedding model. \n",
"\n",
"We would prefer to find a data-driven approach which can:\n",
"- produce a uniform and comparable measure of relevance for any dataset \n",
"- automatically adapt as the underlying data changes \n",
"- be relatively inexpensive to compute\n",
"\n",
"This notebook demonstrates one possible such approach, which relies on the distribution of distances (pseudo 'density') between points in a given dataset. For a given result, we compute the percentile that the result's distance to the query falls into with respect to the overall distribution of distances in the dataset. This approach produces a uniform measure of relevance for any dataset, is relatively cheap to compute, and can be computed online as data mutates. \n",
"\n",
"This approach is still very preliminary, and we welcome contributions and alternative approaches - some ideas are listed at the end of this notebook."
]
},
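{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of why a percentile is more comparable across datasets than a raw distance, the sketch below maps the same distance onto two toy reference distributions. The helper `distance_percentile` and the toy values are hypothetical and are not part of Chroma; this is only a minimal empirical-CDF sketch using `numpy`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def distance_percentile(reference_dists, dist):\n",
"    # Fraction of reference distances that are <= dist (empirical CDF)\n",
"    ref = np.sort(np.asarray(reference_dists, dtype=float))\n",
"    return np.searchsorted(ref, dist, side='right') / len(ref)\n",
"\n",
"# Toy nearest-neighbor distances for a tightly packed and a spread-out dataset\n",
"dense = [0.1, 0.2, 0.2, 0.3, 0.4]\n",
"sparse = [0.5, 0.8, 1.0, 1.2, 1.5]\n",
"\n",
"# The same raw distance is far for the dense dataset and near for the sparse one\n",
"print(distance_percentile(dense, 0.45))   # 1.0\n",
"print(distance_percentile(sparse, 0.45))  # 0.0\n",
"```\n",
"\n",
"A fixed threshold of 0.45 would accept everything in one dataset and reject everything in the other, whereas the percentile expresses the distance relative to each dataset's own distribution."
]
},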
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preliminaries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install required packages\n",
"\n",
"import sys\n",
"!{sys.executable} -m pip install chromadb numpy umap-learn[plot] matplotlib tqdm datasets"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset\n",
"\n",
"As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq). \n",
"\n",
"Dataset description, from HuggingFace:\n",
"\n",
"> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Found cached dataset sciq (/Users/antontroynikov/.cache/huggingface/datasets/sciq/default/0.1.0/50e5c6e3795b55463819d399ec417bfd4c3c621105e00295ddb5f3633d708493)\n",
"Loading cached processed dataset at /Users/antontroynikov/.cache/huggingface/datasets/sciq/default/0.1.0/50e5c6e3795b55463819d399ec417bfd4c3c621105e00295ddb5f3633d708493/cache-9181e6e3516ba4ed.arrow\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of questions with support: 10481\n"
]
}
],
"source": [
"# Get the SciQ dataset from HuggingFace\n",
"from datasets import load_dataset\n",
"\n",
"dataset = load_dataset(\"sciq\", split=\"train\")\n",
"\n",
"# Filter the dataset to only include questions with a support\n",
"dataset = dataset.filter(lambda x: x['support'] != '')\n",
"\n",
"print(\"Number of questions with support: \", len(dataset))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data loading \n",
"\n",
"We load the dataset into a local persistent instance of Chroma, into a collection called `sciq`. We use Chroma's [default embedding function](https://docs.trychroma.com/embeddings#default-all-minilm-l6-v2), all-MiniLM-L6-v2 from [Sentence Transformers](https://www.sbert.net/)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import chromadb\n",
"from chromadb.config import Settings\n",
"\n",
"chroma_client = chromadb.PersistentClient(path=\"./chroma\")\n",
"\n",
"collection = chroma_client.get_or_create_collection(name=\"sciq\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data into Chroma and persist it, if it hasn't already been loaded previously. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0df53f502e3a450783f7cbc3b3c658ea",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/11 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the data and persist \n",
"collection.delete()\n",
"\n",
"from tqdm.notebook import tqdm\n",
"batch_size = 1000\n",
"for i in tqdm(range(0, len(dataset), batch_size)):\n",
"    # Clamp the end of the batch so the final, shorter batch stays in range\n",
"    end = min(i + batch_size, len(dataset))\n",
"    collection.add(\n",
"        ids=[str(j) for j in range(i, end)],\n",
"        documents=dataset['support'][i:end],\n",
"        metadatas=[{'type': 'support'} for _ in range(i, end)],\n",
"    )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Computing a distribution over distances (pseudo density function)\n",
"\n",
"We would like to understand the distribution of distances between points in the dataset. \n",
"\n",
"To do so, we:\n",
"\n",
"1. Get the computed embeddings of each supporting sentence in the dataset. \n",
"2. Use Chroma to efficiently find the distance to each sentence's nearest neighbor.\n",
"3. Compute a cumulative density function over distances. \n",
"\n",
"Subsequently we can use this cumulative density function to estimate query relevance, by finding the percentile of a given result's distance from the query according to the CDF. \n",
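"A lower percentile means that the result lies closer to the query than most nearest-neighbor distances observed in the dataset, and is therefore more likely to be relevant."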
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# Get the embeddings for the support documents from the collection\n",
"support_embeddings = collection.get(include=['embeddings'])['embeddings']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We query the collection using the embeddings for each element, returning the distances. Note that we query for two results, since the first (nearest) result will be the element we're querying with. "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"dists = collection.query(query_embeddings=support_embeddings, n_results=2, include=['distances'])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Flatten the distances list, excluding the first element (which is an element's distance to itself)\n",
"flat_dists = [item for sublist in dists['distances'] for item in sublist[1:]]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There are some details to note here. Because we query with each element, when two elements are each other's nearest neighbors, the same distance will appear in the result twice. This isn't necessarily a problem if we're computing a cumulative density, as the doubling is taken into account when we normalize to get a cumulative distribution function. \n",
"\n",
"However, it is not always the case that the nearest neighbor of some element $a$, will have $a$ as its own nearest neighbor. This could be taken into account by appropriately filtering pairwise matches using the element IDs, but for simplicity we ignore it here. "
]
},
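{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The ID-based filtering mentioned above could be sketched roughly as follows. This assumes the IDs and distances are shaped like the lists of lists returned by `collection.query`; the helper `unique_pair_distances` and the toy data are hypothetical and are not used elsewhere in this notebook.\n",
"\n",
"```python\n",
"def unique_pair_distances(query_ids, result_ids, result_dists):\n",
"    # Keep one nearest-neighbor distance per unordered pair of IDs\n",
"    seen = {}\n",
"    for qid, ids, dists in zip(query_ids, result_ids, result_dists):\n",
"        for rid, dist in zip(ids, dists):\n",
"            if rid == qid:\n",
"                continue  # skip the element's match with itself\n",
"            seen.setdefault(tuple(sorted((qid, rid))), dist)\n",
"    return list(seen.values())\n",
"\n",
"# Toy data: 'a' and 'b' are mutual nearest neighbors; the nearest neighbor of 'c' is 'b'\n",
"query_ids = ['a', 'b', 'c']\n",
"result_ids = [['a', 'b'], ['b', 'a'], ['c', 'b']]\n",
"result_dists = [[0.0, 0.3], [0.0, 0.3], [0.0, 0.5]]\n",
"print(unique_pair_distances(query_ids, result_ids, result_dists))  # [0.3, 0.5]\n",
"```\n",
"\n",
"Skipping the self-match by ID rather than by position is a design choice in this sketch: it does not rely on the queried element always being returned first."
]
},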
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization\n",
"\n",
"It can be helpful to visualize the embeddings to get a sense of how they might be distributed and see if there is any obvious structure. We use the [UMAP library](https://umap-learn.readthedocs.io/en/latest/plotting.html) to fit a 2D manifold to the high-dimensional embedding data, and visualize it. \n",
"A brighter color indicates a shorter distance to the nearest neighbor."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoQAAAKACAYAAAAFJmlZAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzddZzd133n/9f5wmUaZkkzGjFbtmXGmOIkDlPbQNNs02zatNtu++tuu6Utw7bbNltIkzhuYsexY2YmgSVZzBpmvIxfOL8/riywWBrPyNJ5Ph5+aDz3+/3ec+/MvfO+Bz5HSCkliqIoiqIoyiVLm+kGKIqiKIqiKDNLBUJFURRFUZRLnAqEiqIoiqIolzgVCBVFURRFUS5xKhAqiqIoiqJc4lQgVBRFURRFucSpQKgoiqIoinKJM871RNd1GRwcJBwOI4SYyjYpiqIoiqIoU0BKSTqdprGxEU07eT/gOQfCwcFBWlpazvV0RVEURVEUZZr09fXR3Nx80tvPecg4HA6f66mKoiiKoijKNDpdbjvnQKiGiRVFURRFUT4YTpfb1KISRVEURVGUS5wKhIqiKIqiKJc4FQgVRVEURVEucSoQKoqiKIqiXOJUIFQURVEURbnEqUCoKIqiKIpyiVOBUFEURVEU5RKnAqGiKIqiKMolTgVCRVEURVGUS5wKhIqiKIqiKJc4FQgVRVEURVEucSoQKoqiKIqiXOJUIFQURVEURbnEqUCoKIqiKIpyiVOBUFEURVEU5RKnAqGiKIqiKMolTgVCRVEURVGUS5wKhIqiKIqiKJc4FQgVRVEURVEucSoQKoqiKIqiXOJUIFQURVEURbnEqUCoKIqiKIpyiVOBUFEURVEU5RKnAqHyvpob9lHrM2e6GYqiKIqinIIKhMqUi2gVCDRuaYjy8l3LeP3Dy6n3q1CoKIqiKBcqY6YboFxcLvffxDzvcgatHmYH3gHAb2hUeAyG89YMt05RFEVRlBNRgVCZUrVGHRUeiJizGBoP8Veb48S0EJPZPnRKODgz3URFURRFUd5DBUJlytwSvoMmXx1eHUAQL+V5eyRFTCvy+ZrPMm6N86PxH890MxVFURRFeQ81h1CZMi2euZRccKUk7zgMOalDtwgMJH4RocqonNE2KoqiKIpyPNVDqEyZcXeCsKhgR9YiZNpcG67CdW08Th05R+BIwc2R23kx+TSJw2FRURRFUZSZpnoIlbMmEDQajWh4AXH4+9uLIyQZJWLmuKdO49bKIDV6kLxjkLRgqFSi1VfFz9V+fMbariiKoijK8VQgVM7a7cHPE5cCTQ/j02sBWBFYSr1hsMTXTLuvhsG8xHYFw6UJpJTY0iakp/HqGiXHZn6gnogWxCNUORpFURRFmWlqyFg5Kx4RYlzmsCgCIBH8zad1wsNzeGO/l7xbBARPDW/nnurLmBdoZX8+wXPxF8k6o7QX2rg+dBs3BD/D0sYsHVmN15Jr6Sn2k1TDyIqiKIoyI1QPoXKWJCZBTOFHoPHRuoV87coILVU5VoZrafZLvj/6A0adfQR0DwCNXg9evRZbOtiOB4kBCAqOIKD7uavyVr5U+xnEUcPPiqIoiqJMH9VDqJwVicWKcIhPB+9GAl4dHnmgmZYV63FkO33FEfJuAYlJhTeJRoAt8SGSVg8Afi1K3pYIARtSvXhEkDm+Fkyh49dMcm5pZh+goiiKolyCVCBUzshtFWtYGmhja7qHoCHpLhVp83oAwUAcfvmHHThOBy4SgLxr8bsdDzPLv5Rxy8FyswCEdAMBaEiennwDR1r8SdsXqPea/KJ2Nf/U/9rMPUhFURRFuUSpQKiclo7GLbHLAWjxFXg6OY4LLPB6qSZG0grxpaqvsCWznXdymw+f5xCiSD1hEwruJCmrh4ybJGwK8m4BR9rYOHj0HB49giPdGXqEiqIoinJpU4FQOS0HlzeSW2n1NfNkciclCTG9kZGSF9stF5peU53Esa9gwO5gpJ
QAoOgkKbkZdEzy9gQAO/LbGLaGSDtpbGw0BH/S9QztgRq2pwdm6iEqiqIoyiVNBULljLyZfAdTExSkhUSiiUl04No5NnMjHtYPVDNREuhHrVNysejLvXrctcbsUQCaPFX8avM95Nwif9P7ECWp9jlWFEVRlJmgAqFyRq6OLuRDFYvpKmUZKmVZ4K3mjeRa1u3aQUhUsjpwHVm5m6AhWe6tZXt69LTXnOtvwKd78OkeGryVdOSHpuGRHG9VuIGfb1zOixMdPDN+cEbaoCiKoigzSQVC5Yz05RM4UvKFqst5dkSSztsU3Ff4RMU9NHka6bG287X65VT55yGE4E8OvkGnHOB/3GOyqdPlvjeP7/3bmN5Pk7eajFOgMz88bY/l8kgzvzbrenZkhvjr7lf5XMMS5geraPFFzjgQxvQoq4IrOFjopK/U/z63+P0n0JCoOZyKoiiXKhUIFQACWpC53vn0lrpJOvHjbq/UZ7Ev5QdAkCPlDuATPub4ZgMw2zsHTYAQ5VqC7d4lfO62cf7rbeUh5Od35BlNaizwXI2Gxt7SWvJuiftHX52eB3iUq6Oz8esmV0ZnEdBMXhjvYJYvynOnCIM1Rj3t3oXsL+xiwhnjpugNtPlaWRJYzHdGvkeLp5WEPULSSUzfA5kiEa2JNu/NFGSCfYWnVTBUFEW5BKlAqABwY/g2mj2zWOKs4IHJHxxzm0BwXWUFc/w50Iq0RQxeHtfYnSiwNrWeRk8DO/KbqQ+spKcoSeSasIs1dHbvx5XDHByWxLNQq8+mzbMSF0nCHWHQ3j8jj/Xxsd1UmAG2Z4bIuRYvTXbx0mTXSY/X0Lin8uO4roeFvmXYYpy8zAGwObePWu8irgxcRlDz8+OJ71KSM1NL0SM853DfghpjIVI4+EUMQ/iwDj02RVEU5dKhAqECQO5QncB3/z1azPAzL1gNQIXfZmXtJM9PJKnW65ltXE7OThP16CwIlVccv5m1sKXOfeuy/OfWAqk82A4kxTgDcoy8sJBaaPoe3Hv0FOL8cecLZ3x8rc/gtkVdPLdrAUJApVGDR5PcO/oAGS2MEIKdxYNc5V9+wt1WWjzNzPO3syW7jbh9fO/rmYpodXhFkDGnEwAdg1ZjFSniXBteylz/LOJWhrdLj/Fvt1WSdYt8+olxasRSJp1BJpzjh7avDX6CmFnJZMnCBYIiRkIFQkVRlEuOCoQKAG+kX2JfYRfj1hjzPatp9SxlZ+EtBuyDWK5gXXySpeEo7f4s8ZLLnsw4VUY9VV6NNeEgk9Y1FBwbTQgejT9JfyGJgwOZI/fh0SvICwsAr1ENH5BNSX6u6VZuXthBLFCkY6SKA0NNTNqTjNvjmKaBjp+EPc6j8QcoyuLh81p9NSTsLB+tvAuf5qNCj/Lw5GNnff86Jqbws8p7D0II9pVeZ8jew3X+zzAoJjHtUdr8swCIGAFqcp/kT5/T+KM7N7I6NB+vtQApXXaWnmPcHqYkizg4xLQKVoaaEEKgiRzdxTRRrZaEOzhlz52iKIrywaACoQKAi8uwVQ4CC72XYwgPK/zXoRVTfK3hk2QsjX/p3UztsJ+AnEe7OYst+UdxtFE0UUW1J0bAmOQPD75KT2GSqB7Fr/kZto4sFhGHStJIKZko7ZmRx3ku0pkqXtjuI+LPs3FA8tj4fUwemmdZtAYpz6qUx5xzXXQ+X228gaJjAS62LLIpc/qV11Aeor86vAYdjYGCxSzzMkbsDiQuAp2wbvL15rvZPF5BjTSI+aopuhKPgL2ZAnlXYrvw0F6TiaKgUQND0/hQ5G50TZBx0vx08j4ujzaji/JVc66FgyQuu6f2yVMURVE+EFQgVI7hJUzamcCjVWHg5euNH6bJb5Ms6SwKVvONq3IsaXmN//fmIrI9FTw4/jS/YNzIVRXV+A1BhekhpIX4Us3PYwidp+LP0FPs5lvNHyasB/jB8CY8ZpG/vsuPVmHya/dbFKyZftSn9tDEk6Sdm6k2m1ifev1wGDxCHn
dOxCgvwDE0A9AwkHQVTz1ncq6/jq/W30JPPktYmwNAkAkKDsS0ejYVHsIjAtxeXU+FWUPSKdHuj6ELwUTBJV6yGXU
"text/plain": [
"<Figure size 800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from umap.umap_ import UMAP\n",
"import umap.plot as umap_plot\n",
"import numpy as np\n",
"\n",
"mapper = UMAP().fit(support_embeddings)\n",
"umap_plot.points(mapper, values=np.array(flat_dists), show_legend=False, theme='inferno')"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing the density function over distances \n",
"\n",
"Using the returned distances, we compute the density function using `numpy`. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# Compute a density function over the distances\n",
"import numpy as np\n",
"hist, bin_edges = np.histogram(flat_dists, bins=100, density=True)\n",
"cumulative_density = np.cumsum(hist) / np.sum(hist)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGwCAYAAAB7MGXBAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAAB0n0lEQVR4nO3dd3iUVfrG8e/MpHdCKhASOkgvUqWKoCIruiqiK6Kiq2Jff7a1N1zLWrEruAXFjitFKSJVlA4CoQVCSSFAep95f39MSUISTEKSSbk/15UrzMw7M+cNkNw55znPazIMw0BERETETczuHoCIiIg0bwojIiIi4lYKIyIiIuJWCiMiIiLiVgojIiIi4lYKIyIiIuJWCiMiIiLiVh7uHkBV2Gw2jh07RmBgICaTyd3DERERkSowDIOsrCxatWqF2Vz5/EejCCPHjh0jJibG3cMQERGRGjh8+DBt2rSp9PFGEUYCAwMB+8kEBQW5eTQiIiJSFZmZmcTExLh+jlemUYQR59JMUFCQwoiIiEgj80clFipgFREREbdSGBERERG3UhgRERERt2oUNSMiIo2Z1WqlqKjI3cMQqXWenp5YLJazfh2FERGROmIYBsnJyaSnp7t7KCJ1JiQkhKioqLPqA6YwIiJSR5xBJCIiAj8/PzVtlCbFMAxyc3NJTU0FIDo6usavpTAiIlIHrFarK4i0bNnS3cMRqRO+vr4ApKamEhERUeMlGxWwiojUAWeNiJ+fn5tHIlK3nP/Gz6YuSmFERKQOaWlGmrra+DeuMCIiIiJupTAiIiIibqUwIiIiDdK0adOYNGmSu4ch9aBZh5HjWQUcPplLbmGxu4ciItJgTJs2DZPJhMlkwtPTk8jISC644AI+/vhjbDZbvY3j9ddfZ86cOa7bo0aN4p577qm395f606zDyPR/bWD4iz+xdt8Jdw9FRKRBufDCC0lKSuLgwYMsWrSI0aNHc/fdd3PJJZdQXFw/v8AFBwcTEhJSL+8l7tWsw4iH2V4BXFyPSV9Emi/DMMgtLHbLh2EY1Rqrt7c3UVFRtG7dmn79+vHII48wf/58Fi1a5JqtSE9PZ/r06YSHhxMUFMSYMWPYunWr6zWefPJJ+vTpw7///W/i4uIIDg7m6quvJisry3XMl19+Sc+ePfH19aVly5aMHTuWnJwcoOwyzbRp0/j55595/fXXXbM2CQkJdOzYkZdffrnM2Lds2YLJZGLfvn01+FsSd2jWTc9Kwkj1/pOKiNREXpGVcx7/wS3vvfPp8fh5nd23/DFjxtC7d2++/vprpk+fzpVXXomvry+LFi0iODiY9957j/PPP589e/YQGhoKwP79+/n222/5/vvvOXXqFFdddRUvvPACzz33HElJSUyZMoUXX3yRyy67jKysLFatWlVhcHr99dfZs2cPPXr04OmnnwYgPDycG2+8kdmzZ3P//fe7jp09ezYjRoygY8eOZ3W+Un+a98yIxR5GrAojIiJV0rVrVw4ePMjq1av59ddf+eKLLxgwYACdOnXi5ZdfJiQkhC+//NJ1vM1mY86cOfTo0YPhw4dz3XXXsWzZMgCSkpIoLi7m8ssvJy4ujp49e3L77bcTEBBQ7n2Dg4Px8vLCz8+PqKgooqKisFgsTJs2jfj4eH799VfA3nhr7ty53HjjjfXzBZFa0axnRixmexYrsiqMiEjd8/W0sPPp8W5779pgGAYmk4mtW7eSnZ1drtV9Xl4e+/fvd92Oi4sjMDDQdTs6Otp1LZPevXtz/vnn07NnT8aPH8+4ceO44ooraNGiRZXH06pVKyZMmMDHH3/MwIED+d///kdBQQFXXnnlWZ6p1KdmHUY8zc6ZEdWMiEjdM5lMZ71U4m67du2iXbt2ZGdnEx0dzYoVK8odU7ro1NPTs8xjJpPJtSPHYrGwZMkS1q5dy48//sibb77J3//+d9avX0+7du2qPKbp06dz3XXX8eqrrzJ79mwmT56sNvyNTO
P+X3GWLKoZERGpsuXLl7N9+3buvfde2rRpQ3JyMh4eHsTFxdX4NU0mE8OGDWPYsGE8/vjjxMbG8s0333DfffeVO9bLywur1Vru/osvvhh/f3/eeecdFi9ezMqVK2s8HnGPZh1GnDUjxVqmEREpo6CggOTkZKxWKykpKSxevJiZM2dyySWXMHXqVMxmM0OGDGHSpEm8+OKLdO7cmWPHjrFgwQIuu+wyBgwY8IfvsX79epYtW8a4ceOIiIhg/fr1HD9+nG7dulV4fFxcHOvXr+fgwYMEBAQQGhqK2Wx21Y48/PDDdOrUiSFDhtT2l0PqWPMuYHXUjGhmRESkrMWLFxMdHU1cXBwXXnghP/30E2+88Qbz58/HYrFgMplYuHAhI0aM4IYbbqBz585cffXVHDp0iMjIyCq9R1BQECtXruTiiy+mc+fOPProo7zyyitcdNFFFR5///33Y7FYOOeccwgPDycxMdH12E033URhYSE33HBDrZy/1C+TUd3N526QmZlJcHAwGRkZBAUF1drr3jdvC19vPsojF3fllhEdau11RUTy8/NJSEigXbt2+Pj4uHs4Td6qVas4//zzOXz4cJXDkNSOM/1br+rP72a9TOOsGdFuGhGRxqmgoIDjx4/z5JNPcuWVVyqINFLNe5nGYj999RkREWmcPv30U2JjY0lPT+fFF19093Ckhpp3GNFuGhGRRm3atGlYrVY2btxI69at3T0cqaFmHUZcW3ut6jMiIiLiLs06jHiqHbyIiIjbNeswYtHWXhEREbdr1mHEQ8s0IiIibte8w4hFBawiIiLu1rzDiFk1IyIijZHJZOLbb79tMK/TWMTFxfHaa6+5exjlNOsw4qwZUdMzEZGykpOTufPOO2nfvj3e3t7ExMQwceJEli1b5u6h1ciTTz5Jnz59yt2flJRUafv52hIXF4fJZMJkMuHr60tcXBxXXXUVy5cvr9P3rchvv/3GLbfc4rrdUMJYsw4jJbtpVDMiIuJ08OBB+vfvz/Lly3nppZfYvn07ixcvZvTo0cyYMcPdw6tVUVFReHt71/n7PP300yQlJREfH8+//vUvQkJCGDt2LM8991ydv3dp4eHh+Pn51et7VkWzDiMWNT0TESnn9ttvx2Qy8euvv/LnP/+Zzp070717d+677z5++eUXwB5YTCYTW7ZscT0vPT0dk8nEihUrAFixYgUmk4kffviBvn374uvry5gxY0hNTWXRokV069aNoKAgrrnmGnJzc12vU9FSQp8+fXjyyScrHfODDz5I586d8fPzo3379jz22GMUFRUBMGfOHJ566im2bt3qmqGYM2cOUHZmYOjQoTz44INlXvf48eN4enqycuVKwN5+/v7776d169b4+/szaNAg1/meSWBgIFFRUbRt25YRI0bw/vvv89hjj/H4448THx/vOm7Hjh1cdNFFBAQEEBkZyXXXXUdaWprr8VGjRnHXXXfxwAMPEBoaSlRUVJmvi2EYPPnkk7Rt2xZvb29atWrFXXfdVeHXNi4uDoDLLrsMk8lEXFwcBw8exGw2s2HDhjLjf+2114iNjcVWR7+8N+swUrKbRmFEROqBYUBhjns+qnhN1JMnT7J48WJmzJiBv79/ucdDQkKqfdpPPvkkb731FmvXruXw4cNcddVVvPbaa8ydO5cFCxbw448/8uabb1b7dUsLDAxkzpw57Ny5k9dff50PPviAV199FYDJkyfzt7/9je7du5OUlERSUhKTJ08u9xrXXnstn332GaWvHztv3jxatWrF8OHDAbjjjjtYt24dn332Gdu2bePKK6/kwgsvZO/evdUe8913341hGMyfPx+wh7kxY8bQt29fNmzYwOLFi0lJSeGqq64q87xPPvkEf39/1q9fz4svvsjTTz/NkiVLAPjqq6949dVXee+999i7dy/ffvstPXv2rPD9f/vtNwBmz55NUlISv/32G3FxcYwdO5bZs2eXOXb27NlMmzYNs7luYkOzvlCe89o0mh
kRkXpRlAvPt3LPez9yDLzKh4vT7du3D8Mw6Nq1a6299bPPPsuwYcMAuOmmm3j44YfZv38/7du3B+CKK67gp59+Kjc
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot the density function\n",
"import matplotlib.pyplot as plt\n",
"plt.plot(bin_edges[1:], hist, label=\"Density\")\n",
"plt.plot(bin_edges[1:], cumulative_density, label=\"Cumulative Density\")\n",
"plt.legend(loc=\"upper right\")\n",
"plt.xlabel(\"Distance\")\n",
"plt.show()\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing relevance using the density function\n",
"\n",
"To estimate the relevance of a query result, we use the percentile that its distance to the query falls into with respect to the overall distribution of distances between elements of the dataset. Intuitively, results which are less relevant to the query should be in higher percentiles than those which are more relevant. \n",
"\n",
"By using the distribution of distances in this way, we eliminate the need to tune an explicit distance threshold, and can instead reason in terms of likelihoods. We could either apply a threshold to the percentile-based relevance directly, or else feed this information into a re-ranking model, or take a sampling approach. "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
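"def compute_percentile(dist):\n",
"    # Count the bins whose right edge is <= dist\n",
"    index = np.searchsorted(bin_edges[1:], dist, side='right')\n",
"    # In the first bin index is 0; return 0.0 instead of wrapping around to cumulative_density[-1]\n",
"    return cumulative_density[index - 1] if index > 0 else 0.0"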
]
},
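{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to apply a threshold to the percentile-based relevance directly is sketched below. It is self-contained and uses a sorted array of reference distances rather than the histogram above; the helper `filter_by_percentile`, the `max_percentile` default of 0.9 and the toy data are all hypothetical choices made for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def filter_by_percentile(documents, distances, reference_dists, max_percentile=0.9):\n",
"    # Keep only results whose distance percentile is at most max_percentile\n",
"    ref = np.sort(np.asarray(reference_dists, dtype=float))\n",
"    kept = []\n",
"    for doc, dist in zip(documents, distances):\n",
"        percentile = np.searchsorted(ref, dist, side='right') / len(ref)\n",
"        if percentile <= max_percentile:\n",
"            kept.append((doc, float(percentile)))\n",
"    return kept\n",
"\n",
"# Toy reference distribution and query results\n",
"reference = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]\n",
"docs = ['close', 'borderline', 'far']\n",
"dists = [0.25, 0.9, 1.5]\n",
"print(filter_by_percentile(docs, dists, reference))  # [('close', 0.1), ('borderline', 0.8)]\n",
"```\n",
"\n",
"Unlike a raw distance threshold, `max_percentile` is expressed relative to the dataset's own distance distribution, so the same value can be reused when the data or the embedding model changes."
]
},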
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation\n",
"\n",
"We evaluate the percentile based relevance score using the SciQ dataset. \n",
"\n",
"1. We query the collection of supporting sentences using the questions from the dataset, returning the 10 nearest results, along with their distances.\n",
"2. We check the results for whether the supporting sentence is present or absent. If it's present in the results, we record the percentile that the support falls into, otherwise we record the percentile of the nearest result. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"question_results = collection.query(query_texts=dataset['question'], n_results=10, include=['documents', 'distances'])"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"support_percentiles = []\n",
"missing_support_percentiles = []\n",
"for i, q in enumerate(dataset['question']):\n",
" support = dataset['support'][i]\n",
" if support in question_results['documents'][i]:\n",
" support_index = question_results['documents'][i].index(support)\n",
" percentile = compute_percentile(question_results['distances'][i][support_index])\n",
" support_percentiles.append(percentile)\n",
" else:\n",
" missing_support_percentiles.append(compute_percentile(question_results['distances'][i][0]))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Visualization\n",
"\n",
"We plot histograms of the percentiles for the cases where the support was found, and the case where it wasn't. A lower percentile is more relevant. "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGdCAYAAADAAnMpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAAArl0lEQVR4nO3de3RU1d3G8WcIZBIkF265ABG5hjuBIBBQgRabAkVifZGKr1wE1BoqGMUaFMKlGhARrCKICKFWGkUlKFA0jUZeIaggqYAIIkiwJgEsJBgggeS8f1imjiSBGTLZuXw/a81anTP77PObHZbzdJ99zrFZlmUJAADAkDqmCwAAALUbYQQAABhFGAEAAEYRRgAAgFGEEQAAYBRhBAAAGEUYAQAARhFGAACAUXVNF3AlSkpK9N1338nPz082m810OQAA4ApYlqXTp0+rWbNmqlOn7PmPahFGvvvuO4WFhZkuAwAAuOHo0aNq0aJFmZ9XizDi5+cn6ccv4+/vb7gaAABwJfLz8xUWFub4HS9LtQgjF0/N+Pv7E0YAAKhmLrfEggWsAADAKJfCyNKlS9WtWzfHDEVUVJT+/ve/l7vP2rVr1aFDB/n4+Khr167atGnTVRUMAABqFpfCSIsWLTRv3jzt3LlTO3bs0C9+8QuNGDFCe/fuLbX9tm3bdMcdd2jChAnatWuXYmJiFBMToz179lRI8QAAoPqzWZZlXU0HjRo10oIFCzRhwoRLPhs1apQKCgq0YcMGx7a+ffsqIiJCy5Ytu+Jj5OfnKyAgQHl5eawZAYAqzLIsXbhwQcXFxaZLQSXw8vJS3bp1y1wTcqW/324vYC0uLtbatWtVUFCgqKioUttkZGQoLi7OaVt0dLRSUlLK7buwsFCFhYWO9/n5+e6WCQCoJEVFRcrOztaZM2dMl4JKVL9+fYWGhsrb29vtPlwOI7t371ZUVJTOnTunBg0aaN26derUqVOpbXNychQcHOy0LTg4WDk5OeUeIzExUbNnz3a1NACAISUlJTp8+LC8vLzUrFkzeXt7c5PKGs6yLBUVFen48eM6fPiw2rVrV+6NzcrjchgJDw9XZmam8vLy9MYbb2js2LH68MMPywwk7oiPj3eaUbl4nTIAoGoqKipSSUmJwsLCVL9+fdPloJL4+vqqXr16OnLkiIqKiuTj4+NWPy6HEW9vb7Vt21aSFBkZqU8//VTPPvusXnzxxUvahoSEKDc312lbbm6uQkJCyj2G3W6X3W53tTQAgGHu/j9jVF8V8Te/6h5KSkqc1nf8VFRUlNLS0py2paamlrnGBAAA1D4uzYzEx8dryJAhuvbaa3X69GmtWbNG6enpevfddyVJY8aMUfPmzZWYmChJmjJligYMGKCFCxdq2LBhSk5O1o4dO7R8+fKK/yYAAKBacimMHDt2TGPGjFF2drYCAgLUrVs3vfvuu7r55pslSVlZWU7TNf369dOaNWv0+OOPa/r06WrXrp1SUlLUpUuXiv0WAIAqa1HqgUo71oM3t3d5n+PHj2vmzJnauHGjcnNz1bBhQ3Xv3l0zZ85U//79PVBlxUpPT9egQYN08uRJBQYGmi7HLS6FkZdffrncz9PT0y/ZNnLkSI0cOdKlogAAqCy33XabioqKtHr1arVu3Vq5ublKS0vT999/b7q0yzp//rzpEioEK40AALXWqVOn9H//93+aP3++Bg0apJYtW6p3796Kj4/XLbfcom+++UY2m02ZmZlO+9hsNsf/AU9PT5fNZtPGjRvVrVs3+fj4qG/fvk53G09KSlJgYKBSUlLUrl07+fj4KDo6WkePHnWqZ+nSpWrTpo28vb0VHh6uV155xelzm82mpUuX6pZbbtE111yjSZMmadCgQZKkhg0bymazady4cR4ZK08ijAAAaq0GDRqoQYMGSklJKfNijCs1bdo0LVy4UJ9++qmaNm2q4cOHO81cnDlzRk888YT+8pe/aOvWrTp16pR+97vfOT
5ft26dpkyZooceekh79uzRvffeq/Hjx+uDDz5wOs6sWbN06623avfu3Zo9e7befPNNSdL+/fuVnZ2tZ5999qq+hwmEEQBArVW3bl0lJSVp9erVCgwMVP/+/TV9+nR9/vnnLveVkJCgm2++WV27dtXq1auVm5urda+ulPKzpbOndP78eT0/f5aiOl+nyHbNtHrJAm3btk2ffLBJys/W0/Of1LjRt+v+/71V7UP8FDfxDv12+FA9Pe+JH/vIz5YkjR49WuPHj1fr1q3VsmVLNWrUSJIUFBSkkJAQBQQEVOgYVQbCCACgVrvtttv03Xff6e2339avf/1rpaenq2fPnkpKSnKpn5/etqJRo0YKDw/XvgNfObbVrVtX1/eMcLzv0L6dAgMCtG//j2327T+o/n2vd+qzf9/rHZ9f1KtXL5fqqg4IIwCAWs/Hx0c333yzZsyYoW3btmncuHFKSEhwXCH602fKml40es011xg9vicQRgAA+JlOnTqpoKBATZs2lSRlZ2c7PvvpYtaf2r59u+N/nzx5UgcOHFDH9u0c2y5cuKAdu/7peL//q4M6lZenjuE/tukY3lZbt3/q1OfW7Z+qU4fyL1e++IC66vykZLef2gsAQHX3/fffa+TIkbr77rvVrVs3+fn5aceOHXrqqac0YsQI+fr6qm/fvpo3b55atWqlY8eO6fHHHy+1rzlz5qhx48YKDg7WY489piZNmijmN792fF6vXj39Ydrj+vNTc1XXq64mT3tMfa+PVO/IHpKkaQ/8XrePu089unXR4EE36p2/p+qtdzbpH+tfK/c7tGzZUjabTRs2bNDQoUPl6+urBg0aVNwgVQJmRgAAtVaDBg3Up08fLVq0SDfddJO6dOmiGTNmaNKkSXr++eclSStXrtSFCxcUGRmpqVOn6k9/+lOpfc2bN09TpkxRZGSkcnJy9M477zhmLSSpfn1f/XFqrEZPiFX/6BFqcE19vbZqqePzmN8M0bPz5ujp55apc59BenHVK1r1wiINvLFfud+hefPmmj17th599FEFBwdr8uTJFTAylctm/fREWBWVn5+vgIAA5eXlyd/f33Q5AICfOXfunA4fPqxWrVq5/eTW6qrcO6D+5wqYpFdf09T4BJ3K+vLqD+gfevV9VKDy/vZX+vvNzAgAADCKMAIAAIwijAAAcBUGDhwoy7LKfUjduDtHVcwpmhqKq2kAANVLfvbl27ijiq3FqE2YGQEAAEYRRgAAgFGEEQAAYBRhBAAAGMUCVgAAqhNPLOA1vHiXmREAAGAUMyMAAM/6ILFi+ys8XfZn/f7gcnfjfj9Vq9e8rsTERD366KOO7SkpKbr11ltVDZ6a4jHjxo3TqVOnlJKS4tHjMDMCAKj1fHx8NH/+fJ08edJ0KVVCcXGxSkpKKu14hBEAQK03eOANCgkJUWJi+bM4b775pjp37iy73a7rrrtOCxcuLLf9P3fv1aDf/I/8mreTf4v2irwpWjs++6ckaVbi04q4YbBT+8UvvKTruvZ2vB/3+6mKGT1es+ctVNPWXeTfor3um/pHFRUVOdoMHHabJj88XZMfnq6AsHA1adVZM/70lNOMzsmTpzTm3gfU8NqOqh/SWkNuu1NffX3I8XlSUpICAwP19ttvq1OnTrLb7br77ru1evVqrV+/XjabTTabTenp6ZcdS3dwmgYAUOt5eXnpySef1OjRo/XAAw+oRYsWl7TZuXOnbr/9ds2aNUujRo3Stm3bdP/996tx48YaN25cqf3eOWmyenTroqXPJMrLy0uZn+9VvXqu/fSmffiRfOx2pW98U99kHdX4+x9U40YN9cTM/55SWv23tZpw1x365P2N2rHrc90zZZqubdFck8bdKUkad/9UffX1Yb2dnCR/vwb6Y8ITGvo/d+mLT9JVr149SdKZM2c0f/58rVixQo0bN1ZoaKjOnj2r/Px8rVq1SpLUqFEjl2q/UoQRAAAk3XrrrYqIiFBCQoJefvnlSz5/5pln9M
tf/lIzZsyQJLVv315ffPGFFixYUGYYyfr2X5r2wO/VoX07SVK7Nq1drsu7nrdWLnlG9evXV+eO4ZozfZqmzZyruY8
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot normalized histograms of the percentiles\n",
"plt.hist(support_percentiles, bins=20, density=True, alpha=0.5, label='Support')\n",
"plt.hist(missing_support_percentiles, bins=20, density=True, alpha=0.5, label='No support')\n",
"plt.legend(loc='upper right')\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preliminary results\n",
"\n",
"While we don't observe a clear separation of the two classes, we do note that in general, supports tend to be in lower percentiles, and hence more relevant, than results which aren't the support. \n",
"\n",
"One possible confounding factor is that in some cases, the result does contain the answer to the query question, but is not itself the support for that question. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt? \n",
"Support: Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine. \n",
"Top result: Bacteria can be used to make cheese from milk. The bacteria turn the milk sugars into lactic acid. The acid is what causes the milk to curdle to form cheese. Bacteria are also involved in producing other foods. Yogurt is made by using bacteria to ferment milk ( Figure below ). Fermenting cabbage with bacteria produces sauerkraut.\n",
"\n",
"Question: Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what? \n",
"Support: Summary Changes of state are examples of phase changes, or phase transitions. All phase changes are accompanied by changes in the energy of a system. Changes from a more-ordered state to a less-ordered state (such as a liquid to a gas) areendothermic. Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always exothermic. The conversion of a solid to a liquid is called fusion (or melting). The energy required to melt 1 mol of a substance is its enthalpy of fusion (ΔHfus). The energy change required to vaporize 1 mol of a substance is the enthalpy of vaporization (ΔHvap). The direct conversion of a solid to a gas is sublimation. The amount of energy needed to sublime 1 mol of a substance is its enthalpy of sublimation (ΔHsub) and is the sum of the enthalpies of fusion and vaporization. Plots of the temperature of a substance versus heat added or versus heating time at a constant rate of heating are calledheating curves. Heating curves relate temperature changes to phase transitions. A superheated liquid, a liquid at a temperature and pressure at which it should be a gas, is not stable. A cooling curve is not exactly the reverse of the heating curve because many liquids do not freeze at the expected temperature. Instead, they form a supercooled liquid, a metastable liquid phase that exists below the normal melting point. Supercooled liquids usually crystallize on standing, or adding a seed crystal of the same or another substance can induce crystallization. \n",
"Top result: Under the right pressure conditions, lowering the temperature of a substance in the liquid state causes the substance to solidify. The opposite effect occurs if the temperature is increased.\n",
"\n",
"Question: Kilauea in hawaii is the worlds most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this? \n",
"Support: Example 3.5 Calculating Projectile Motion: Hot Rock Projectile Kilauea in Hawaii is the worlds most continuously active volcano. Very active volcanoes characteristically eject red-hot rocks and lava rather than smoke and ash. Suppose a large rock is ejected from the volcano with a speed of 25.0 m/s and at an angle 35.0º above the horizontal, as shown in Figure 3.40. The rock strikes the side of the volcano at an altitude 20.0 m lower than its starting point. (a) Calculate the time it takes the rock to follow this path. (b) What are the magnitude and direction of the rocks velocity at impact?. \n",
"Top result: Volcanoes can be active, dormant, or extinct.\n",
"\n",
"Question: When a meteoroid reaches earth, what is the remaining object called? \n",
"Support: Meteoroids are smaller than asteroids, ranging from the size of boulders to the size of sand grains. When meteoroids enter Earths atmosphere, they vaporize, creating a trail of glowing gas called a meteor. If any of the meteoroid reaches Earth, the remaining object is called a meteorite. \n",
"Top result: A meteoroid is dragged toward Earth by gravity and enters the atmosphere. Friction with the atmosphere heats the object quickly, so it starts to vaporize. As it flies through the atmosphere, it leaves a trail of glowing gases. The object is now a meteor. Most meteors vaporize in the atmosphere. They never reach Earths surface. Large meteoroids may not burn up entirely in the atmosphere. A small core may remain and hit Earths surface. This is called a meteorite .\n",
"\n",
"Question: What kind of a reaction occurs when a substance reacts quickly with oxygen? \n",
"Support: A combustion reaction occurs when a substance reacts quickly with oxygen (O 2 ). For example, in the Figure below , charcoal is combining with oxygen. Combustion is commonly called burning, and the substance that burns is usually referred to as fuel. The products of a complete combustion reaction include carbon dioxide (CO 2 ) and water vapor (H 2 O). The reaction typically gives off heat and light as well. The general equation for a complete combustion reaction is:. \n",
"Top result: A combustion reaction occurs when a substance reacts quickly with oxygen (O 2 ). You can see an example of a combustion reaction in Figure below . Combustion is commonly called burning. The substance that burns is usually referred to as fuel. The products of a combustion reaction include carbon dioxide (CO 2 ) and water (H 2 O). The reaction typically gives off heat and light as well. The general equation for a combustion reaction can be represented by:.\n",
"\n",
"Question: Organisms categorized by what species descriptor demonstrate a version of allopatric speciation and have limited regions of overlap with one another, but where they overlap they interbreed successfully?. \n",
"Support: Ring species Ring species demonstrate a version of allopatric speciation. Imagine populations of the species A. Over the geographic range of A there exist a number of subpopulations. These subpopulations (A1 to A5) and (Aa to Ae) have limited regions of overlap with one another but where they overlap they interbreed successfully. But populations A5 and Ae no longer interbreed successfully are these populations separate species?  In this case, there is no clear-cut answer, but it is likely that in the link between the various populations will be broken and one or more species may form in the future. Consider the black bear Ursus americanus. Originally distributed across all of North America, its distribution is now much more fragmented. Isolated populations are free to adapt to their own particular environments and migration between populations is limited. Clearly the environment in Florida is different from that in Mexico, Alaska, or Newfoundland. Different environments will favor different adaptations. If, over time, these populations were to come back into contact with one another, they might or might not be able to interbreed successfully - reproductive isolation may occur and one species may become many. \n",
"Top result: Allopatric speciation occurs when groups from the same species are geographically isolated for long periods. Imagine all the ways that plants or animals could be isolated from each other:.\n",
"\n",
"Question: Zinc is more easily oxidized than iron because zinc has a lower reduction potential. since zinc has a lower reduction potential, it is a more what? \n",
"Support: One way to keep iron from corroding is to keep it painted. The layer of paint prevents the water and oxygen necessary for rust formation from coming into contact with the iron. As long as the paint remains intact, the iron is protected from corrosion. Other strategies include alloying the iron with other metals. For example, stainless steel is mostly iron with a bit of chromium. The chromium tends to collect near the surface, where it forms an oxide layer that protects the iron. Zinc-plated or galvanized iron uses a different strategy. Zinc is more easily oxidized than iron because zinc has a lower reduction potential. Since zinc has a lower reduction potential, it is a more active metal. Thus, even if the zinc coating is scratched, the zinc will still oxidize before the iron. This suggests that this approach should work with other active metals. Another important way to protect metal is to make it the cathode in a galvanic cell. This is cathodic protection and can be used for metals other than just iron. For example, the rusting of underground iron storage tanks and pipes can be prevented or greatly reduced by connecting them to a more active metal such as zinc or magnesium (Figure 17.18). This is also used to protect the metal parts in water heaters. The more active metals (lower reduction potential) are called sacrificial anodes because as they get used up as they corrode (oxidize) at the anode. The metal being protected serves as the cathode, and so does not oxidize (corrode). When the anodes are properly monitored and periodically replaced, the useful lifetime of the iron storage tank can be greatly extended. \n",
"Top result: In the reaction above, the zinc is being oxidized by losing electrons. However, there must be another substance present that gains those electrons and in this case that is the sulfur. In other words, the sulfur is causing the zinc to be oxidized. Sulfur is called the oxidizing agent. The zinc causes the sulfur to gain electrons and become reduced and so the zinc is called the reducing agent. The oxidizing agent is a substance that causes oxidation by accepting electrons. The reducing agent is a substance that causes reduction by losing electrons. The simplest way to think of this is that the oxidizing agent is the substance that is reduced, while the reducing agent is the substance that is oxidized. The sample problem below shows how to analyze a redox reaction.\n",
"\n",
"Question: What are used to write nuclear equations for radioactive decay? \n",
"Support: Nuclear symbols are used to write nuclear equations for radioactive decay. Lets consider the example of the beta-minus decay of thorium-234 to protactinium-234. This reaction is represented by the equation:. \n",
"Top result: Nuclear symbols are used to write nuclear equations for radioactive decay. Lets consider an example. Uranium-238 undergoes alpha decay to become thorium-234. (The numbers following the chemical names refer to the number of protons plus neutrons. ) In this reaction, uranium-238 loses two protons and two neutrons to become the element thorium-234. The reaction can be represented by this nuclear equation:.\n",
"\n",
"Question: What is controlled by regulatory proteins that bind to regulatory elements on dna? \n",
"Support: Gene transcription is controlled by regulatory proteins that bind to regulatory elements on DNA. The proteins usually either activate or repress transcription. \n",
"Top result: As shown in Figure below , transcription is controlled by regulatory proteins . The proteins bind to regions of DNA, called regulatory elements , which are located near promoters. After regulatory proteins bind to regulatory elements, they can interact with RNA polymerase, the enzyme that transcribes DNA to mRNA. Regulatory proteins are typically either activators or repressors.\n",
"\n",
"Question: What occurs when the immune system attacks a harmless substance that enters the body from the outside? \n",
"Support: An allergy occurs when the immune system attacks a harmless substance that enters the body from the outside. A substance that causes an allergy is called an allergen. It is the immune system, not the allergen, that causes the symptoms of an allergy. \n",
"Top result: The second line of defense attacks pathogens that manage to enter the body. It includes the inflammatory response and phagocytosis by nonspecific leukocytes.\n",
"\n",
"Question: The plants alternation between haploid and diploud generations allow it to do what? \n",
"Support: All plants have a characteristic life cycle that includes alternation of generations . Plants alternate between haploid and diploid generations. Alternation of generations allows for both asexual and sexual reproduction. Asexual reproduction with spores produces haploid individuals called gametophytes . Sexual reproduction with gametes and fertilization produces diploid individuals called sporophytes . A typical plants life cycle is diagrammed in Figure below . \n",
"Top result: Plants alternate between diploid-cell plants and haploid-cell plants. This is called alternation of generations , because the plant type alternates from generation to generation. In alternation of generations, the plant alternates between a sporophyte that has diploid cells and a gametophyte that has haploid cells.\n",
"\n"
]
}
],
"source": [
"for i, q in enumerate(dataset['question'][:20]):\n",
" support = dataset['support'][i]\n",
" top_result = question_results['documents'][i][0]\n",
"\n",
" if support != top_result:\n",
" print(f\"Question: {q} \\nSupport: {support} \\nTop result: {top_result}\\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"This notebook presents one possible approach to computing a relevance score for embeddings-based retreival, based on the distribution of distances between embeddings in the dataset. We have done some initial evaluation, but there is a lot left to do. \n",
"\n",
"Some things to try include:\n",
"- Construct the distance distribution on the basis of the query-support pairs, rather than between nearest neighbor supports. \n",
"- Additional evaluations comparing different embedding models for the same dataset, as well as datasets with less redundancy. \n",
"- Using the distance distribution to deduplicate data, by finding low-percentile outliers. One idea is to use an LLM in the loop to create summaries of document pairs, creating a single point from several which are near one another. \n",
"- Using relevance as a signal for automatically fine-tuning embedding space. One approach may be to learn an affine transform based on question/answer pairs, to increase the relevance of the correct points relative to others. \n",
"\n",
"We welcome contributions and ideas! "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "chroma",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}