Silas new site search and SDK 1.0 release #1317

Merged
72 commits merged on Feb 28, 2024

Commits (72)
be03897
New site search
SilasMarvin Jan 10, 2024
9df3528
Working fast site search and vector search
SilasMarvin Jan 13, 2024
f9cb8a1
Cleaned tests and remote fallback working for search and vector_search
SilasMarvin Jan 17, 2024
b04ead6
Clean up vector search
SilasMarvin Jan 17, 2024
44ab0ed
Switched to a transactional version of upsert documents and syncing p…
SilasMarvin Jan 17, 2024
9aaa31b
Working conditional pipeline running on document upsert
SilasMarvin Jan 18, 2024
6979f69
Really good upsert documents
SilasMarvin Jan 18, 2024
c8e1af8
Cleaned up some tests
SilasMarvin Jan 18, 2024
9df12b5
Switching old pipeline to be a pass through for the new multi field p…
SilasMarvin Jan 19, 2024
f75a2ec
Finished pipeline as a pass through and more tests
SilasMarvin Jan 22, 2024
59f4419
Working site search with doc type filtering
SilasMarvin Jan 22, 2024
ec351ff
Working site search with doc type filtering
SilasMarvin Jan 23, 2024
027080f
collection query_builder now a wrapper around collection.vector_search
SilasMarvin Jan 23, 2024
44cc8a0
Verifying on Python and JavaScript
SilasMarvin Jan 24, 2024
6a9fd14
Working with JavaScript and Python
SilasMarvin Jan 25, 2024
099ea60
Cleaned up
SilasMarvin Jan 25, 2024
412fb57
Move MultiFieldPipeline to Pipeline and added batch uploads for docum…
SilasMarvin Jan 25, 2024
9781766
Added SingleFieldPipeline function shoutout to Lev
SilasMarvin Jan 25, 2024
b87a654
Working on fixing query
SilasMarvin Jan 27, 2024
17b81e7
Working recursive query
SilasMarvin Feb 5, 2024
7339cd5
Added smarter chunking and search results table
SilasMarvin Feb 5, 2024
84e621a
Updated deps, added debugger for queries
SilasMarvin Feb 9, 2024
d745fc6
Logging search results done
SilasMarvin Feb 9, 2024
2d75d98
Correct return type with search inserts
SilasMarvin Feb 9, 2024
bed7144
Updated tests to pass with new sqlx version
SilasMarvin Feb 9, 2024
0e06ce1
Added a way for users to provide search_events
SilasMarvin Feb 12, 2024
1677a51
Quick fix on remote embeddings search
SilasMarvin Feb 12, 2024
a5599e5
Quick fix and change the upsert query to be more efficient
SilasMarvin Feb 13, 2024
f47002e
Fix for JS after updating tokio
SilasMarvin Feb 13, 2024
f39b94c
Updated extractive_question_answering example for Python
SilasMarvin Feb 13, 2024
f2c5f61
Updated question_answering for Python
SilasMarvin Feb 13, 2024
6ec6df5
Updated question_answering_instructor for Python
SilasMarvin Feb 13, 2024
c9a24e6
Updated semantic_search for Python
SilasMarvin Feb 14, 2024
6c7f05a
Updated summarizing_question_answering for Python
SilasMarvin Feb 14, 2024
119807f
Updated table question answering for Python
SilasMarvin Feb 14, 2024
71d4915
Updated table question answering for Python
SilasMarvin Feb 14, 2024
6dfd0d7
Updated rag question answering for Python
SilasMarvin Feb 14, 2024
70f1ac0
Updated question_answering for JavaScript
SilasMarvin Feb 14, 2024
67fae04
Updated question_answering_instructor for JavaScript
SilasMarvin Feb 14, 2024
0dd0027
Updated question_answering_instructor for JavaScript
SilasMarvin Feb 14, 2024
7afea01
Updated extractive_question_answering example for JavaScript
SilasMarvin Feb 14, 2024
95188a4
Updated summarizing_question_answering for JavaScript
SilasMarvin Feb 14, 2024
8807489
Updated semantic_search for JavaScript
SilasMarvin Feb 14, 2024
c9e5d04
Updated versions and removed unused clone
SilasMarvin Feb 14, 2024
c71143f
Cleaned up search query
SilasMarvin Feb 14, 2024
f4d261e
Edit test
SilasMarvin Feb 14, 2024
3d1a6ce
Added the stress test
SilasMarvin Feb 14, 2024
692c252
Updated to use new sdk
SilasMarvin Feb 14, 2024
fc5658f
Updated test
SilasMarvin Feb 15, 2024
4c38aca
Removed document_id
SilasMarvin Feb 16, 2024
4167e32
Removed document_id and updated all searches to work without it
SilasMarvin Feb 16, 2024
0cadd8c
Fixed python test
SilasMarvin Feb 16, 2024
077ce1b
Updated stress test
SilasMarvin Feb 16, 2024
7f53b93
Updated to clean up pool access
SilasMarvin Feb 16, 2024
144da42
Added test for bad collection names
SilasMarvin Feb 16, 2024
039c9cc
Cleaned up tests
SilasMarvin Feb 16, 2024
bd983cf
Add migration error
SilasMarvin Feb 26, 2024
4fb0149
Updated text
SilasMarvin Feb 26, 2024
b4f1edd
Add dockerfile to build javascript
SilasMarvin Feb 26, 2024
c41597a
Working dockerfile for build
SilasMarvin Feb 26, 2024
3f53e9c
Test github docker build
SilasMarvin Feb 26, 2024
679b995
Iterating on gh action
SilasMarvin Feb 26, 2024
c614e4e
Iterating on gh action
SilasMarvin Feb 26, 2024
7169596
Iterating on gh action
SilasMarvin Feb 26, 2024
8de7727
Iterating on gh action
SilasMarvin Feb 26, 2024
25fe41c
Iterating on gh action
SilasMarvin Feb 26, 2024
271e1e4
Updated collection test
SilasMarvin Feb 26, 2024
9e4c2a1
Finished boosting and working with the new sdk
SilasMarvin Feb 27, 2024
c46957c
Made document search just use semantic search and boosted title
SilasMarvin Feb 27, 2024
0d963a8
Updated the chatbot to use the new chat history
SilasMarvin Feb 27, 2024
d9b241d
Small cleanups
SilasMarvin Feb 27, 2024
a34619b
Adjust boosting
SilasMarvin Feb 27, 2024
Working conditional pipeline running on document upsert
SilasMarvin committed Feb 28, 2024
commit 9aaa31b7caaf6d0e84706aaa0e36eca03e4dad9d
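
The heart of this commit is making pipeline execution conditional on upsert: when a document is re-upserted, an active pipeline only re-processes it if at least one field that pipeline reads has actually changed. A minimal sketch of that decision, assuming pipeline schemas are represented as a map from field name to field configuration and documents as serde_json values (the helper name and signature below are illustrative, not part of the SDK):

use std::collections::HashMap;

// Decide whether a pipeline must re-process an upserted document.
// `parsed_schema` holds the field names the pipeline reads (e.g. "title", "body");
// only those fields are compared between the old and new document versions.
fn should_rerun_pipeline(
    parsed_schema: &HashMap<String, serde_json::Value>,
    previous_document: Option<&serde_json::Value>,
    new_document: &serde_json::Value,
) -> bool {
    match previous_document {
        // First time this document is seen: the pipeline has no derived data yet.
        None => true,
        // Otherwise re-run only if some schema field differs between versions.
        Some(prev) => parsed_schema
            .keys()
            .any(|key| new_document[key] != prev[key]),
    }
}

In the diff below this check is inlined inside the for_each_concurrent closure in upsert_documents, using each pipeline's parsed_schema and the previous document returned by the upsert query.
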
205 changes: 148 additions & 57 deletions pgml-sdks/pgml/src/collection.rs
@@ -9,6 +9,8 @@ use serde_json::json;
use sqlx::postgres::PgPool;
use sqlx::Executor;
use sqlx::PgConnection;
use sqlx::Postgres;
use sqlx::Transaction;
use std::borrow::Cow;
use std::path::Path;
use std::sync::Arc;
@@ -274,18 +276,43 @@ impl Collection {
/// ```
#[instrument(skip(self))]
pub async fn add_pipeline(&mut self, pipeline: &mut MultiFieldPipeline) -> anyhow::Result<()> {
// The flow for this function:
// 1. Create collection if it does not exist
// 2. Create the pipeline if it does not exist and add it to the collection.pipelines table with ACTIVE = FALSE
// 3. Create the tables for the collection_pipeline schema
// 4. Start a transaction
// 5. Sync the pipeline
// 6. Set the pipeline ACTIVE = TRUE
// 7. Commit the transaction
self.verify_in_database(false).await?;
let project_info = &self
.database_data
.as_ref()
.context("Database data must be set to add a pipeline to a collection")?
.project_info;
pipeline.set_project_info(project_info.clone());
pipeline.verify_in_database(true).await?;
pipeline.verify_in_database(false).await?;
pipeline.create_tables().await?;

let pool = get_or_initialize_pool(&self.database_url).await?;
let transaction = pool.begin().await?;
let transaction = Arc::new(Mutex::new(transaction));

let mp = MultiProgress::new();
mp.println(format!("Added Pipeline {}, Now Syncing...", pipeline.name))?;
self.sync_pipeline(pipeline).await?;
eprintln!("Done Syncing {}\n", pipeline.name);
pipeline.execute(None, transaction.clone()).await?;
let mut transaction = Arc::into_inner(transaction)
.context("Error transaction dangling")?
.into_inner();
sqlx::query(&query_builder!(
"UPDATE %s SET active = TRUE WHERE name = $1",
self.pipelines_table_name
))
.bind(&pipeline.name)
.execute(&mut *transaction)
.await?;
transaction.commit().await?;
mp.println(format!("Done Syncing {}\n", pipeline.name))?;
Ok(())
}

@@ -308,28 +335,28 @@ impl Collection {
/// }
/// ```
#[instrument(skip(self))]
pub async fn remove_pipeline(
&mut self,
pipeline: &mut MultiFieldPipeline,
) -> anyhow::Result<()> {
let pool = get_or_initialize_pool(&self.database_url).await?;
pub async fn remove_pipeline(&mut self, pipeline: &MultiFieldPipeline) -> anyhow::Result<()> {
// The flow for this function:
// Create collection if it does not exist
// Begin a transaction
// Drop the collection_pipeline schema
// Delete the pipeline from the collection.pipelines table
// Commit the transaction
self.verify_in_database(false).await?;
let project_info = &self
.database_data
.as_ref()
.context("Database data must be set to remove pipeline from collection")?
.context("Database data must be set to remove a pipeline from a collection")?
.project_info;
pipeline.set_project_info(project_info.clone());
pipeline.verify_in_database(false).await?;

let pool = get_or_initialize_pool(&self.database_url).await?;
let pipeline_schema = format!("{}_{}", project_info.name, pipeline.name);

let mut transaction = pool.begin().await?;
transaction
.execute(query_builder!("DROP SCHEMA IF EXISTS %s CASCADE", pipeline_schema).as_str())
.await?;
sqlx::query(&query_builder!(
"UPDATE %s SET active = FALSE WHERE name = $1",
"DELETE FROM %s WHERE name = $1",
self.pipelines_table_name
))
.bind(&pipeline.name)
@@ -344,7 +371,7 @@ ///
///
/// # Arguments
///
/// * `pipeline` - The [Pipeline] to remove.
/// * `pipeline` - The [Pipeline] to enable
///
/// # Example
///
@@ -359,22 +386,18 @@
/// }
/// ```
#[instrument(skip(self))]
pub async fn enable_pipeline(&self, pipeline: &Pipeline) -> anyhow::Result<()> {
sqlx::query(&query_builder!(
"UPDATE %s SET active = TRUE WHERE name = $1",
self.pipelines_table_name
))
.bind(&pipeline.name)
.execute(&get_or_initialize_pool(&self.database_url).await?)
.await?;
Ok(())
pub async fn enable_pipeline(
&mut self,
pipeline: &mut MultiFieldPipeline,
) -> anyhow::Result<()> {
self.add_pipeline(pipeline).await
}

/// Disables a [Pipeline] on the [Collection]
///
/// # Arguments
///
/// * `pipeline` - The [Pipeline] to remove.
/// * `pipeline` - The [Pipeline] to disable
///
/// # Example
///
@@ -389,14 +412,38 @@
/// }
/// ```
#[instrument(skip(self))]
pub async fn disable_pipeline(&self, pipeline: &Pipeline) -> anyhow::Result<()> {
pub async fn disable_pipeline(&mut self, pipeline: &MultiFieldPipeline) -> anyhow::Result<()> {
// Our current system for keeping documents, chunks, embeddings, and tsvectors in sync
// does not play nice with disabling and then re-enabling pipelines.
// For now, when disabling a pipeline, simply delete its schema and remake it later
// The flow for this function:
// 1. Create the collection if it does not exist
// 2. Begin a transaction
// 3. Set the pipeline's ACTIVE = FALSE in the collection.pipelines table
// 4. Drop the collection_pipeline schema (this will get remade if they enable it again)
// 5. Commit the transaction
self.verify_in_database(false).await?;
let project_info = &self
.database_data
.as_ref()
.context("Database data must be set to remove a pipeline from a collection")?
.project_info;
let pool = get_or_initialize_pool(&self.database_url).await?;
let pipeline_schema = format!("{}_{}", project_info.name, pipeline.name);

let mut transaction = pool.begin().await?;
sqlx::query(&query_builder!(
"UPDATE %s SET active = FALSE WHERE name = $1",
self.pipelines_table_name
))
.bind(&pipeline.name)
.execute(&get_or_initialize_pool(&self.database_url).await?)
.execute(&mut *transaction)
.await?;
transaction
.execute(query_builder!("DROP SCHEMA IF EXISTS %s CASCADE", pipeline_schema).as_str())
.await?;
transaction.commit().await?;

Ok(())
}

@@ -442,13 +489,21 @@ impl Collection {
/// Ok(())
/// }
/// ```
// TODO: Make it so if we upload the same document twice it doesn't do anything
#[instrument(skip(self, documents))]
pub async fn upsert_documents(
&mut self,
documents: Vec<Json>,
_args: Option<Json>,
) -> anyhow::Result<()> {
// The flow for this function
// 1. Create the collection if it does not exist
// 2. Get all pipelines where ACTIVE = TRUE
// 3. Create each pipeline and the collection_pipeline schema and tables if they don't already exist
// 4. Foreach document
// -> Begin a transaction
// -> Insert the document, returning the old document if it existed
// -> Foreach pipeline check if we need to resync the document and if so sync the document
// -> Commit the transaction
let pool = get_or_initialize_pool(&self.database_url).await?;
self.verify_in_database(false).await?;
let mut pipelines = self.get_pipelines().await?;
@@ -468,20 +523,55 @@ impl Collection {
let md5_digest = md5::compute(id.as_bytes());
let source_uuid = uuid::Uuid::from_slice(&md5_digest.0)?;

let document_id: i64 = sqlx::query_scalar(&query_builder!("INSERT INTO %s (source_uuid, document) VALUES ($1, $2) ON CONFLICT (source_uuid) DO UPDATE SET document = $2 RETURNING id", self.documents_table_name)).bind(source_uuid).bind(document).fetch_one(&mut *transaction).await?;
let (document_id, previous_document): (i64, Option<Json>) = sqlx::query_as(&query_builder!(
"WITH prev AS (SELECT document FROM %s WHERE source_uuid = $1) INSERT INTO %s (source_uuid, document) VALUES ($1, $2) ON CONFLICT (source_uuid) DO UPDATE SET document = EXCLUDED.document RETURNING id, (SELECT document FROM prev)",
self.documents_table_name,
self.documents_table_name
))
.bind(&source_uuid)
.bind(&document)
.fetch_one(&mut *transaction)
.await?;

let transaction = Arc::new(Mutex::new(transaction));
if !pipelines.is_empty() {
use futures::stream::StreamExt;
futures::stream::iter(&mut pipelines)
// Need this map to get around moving the transaction
.map(|pipeline| (pipeline, transaction.clone()))
.for_each_concurrent(10, |(pipeline, transaction)| async move {
pipeline
.execute(Some(document_id), transaction)
.await
.expect("Failed to execute pipeline");
.map(|pipeline| {
(
pipeline,
previous_document.clone(),
document.clone(),
transaction.clone(),
)
})
.for_each_concurrent(
10,
|(pipeline, previous_document, document, transaction)| async move {
// Can unwrap here as we know it has parsed schema from the create_table call above
match previous_document {
Some(previous_document) => {
let should_run =
pipeline.parsed_schema.as_ref().unwrap().iter().any(
|(key, _)| document[key] != previous_document[key],
);
if should_run {
pipeline
.execute(Some(document_id), transaction)
.await
.expect("Failed to execute pipeline");
}
}
None => {
pipeline
.execute(Some(document_id), transaction)
.await
.expect("Failed to execute pipeline");
}
}
},
)
.await;
}

@@ -705,29 +795,30 @@ impl Collection {
// Ok(())
}

#[instrument(skip(self))]
async fn sync_pipeline(&mut self, pipeline: &mut MultiFieldPipeline) -> anyhow::Result<()> {
self.verify_in_database(false).await?;
let project_info = &self
.database_data
.as_ref()
.context("Database data must be set to get collection pipelines")?
.project_info;
pipeline.set_project_info(project_info.clone());
pipeline.create_tables().await?;

let pool = get_or_initialize_pool(&self.database_url).await?;
let transaction = pool.begin().await?;
let transaction = Arc::new(Mutex::new(transaction));
pipeline.execute(None, transaction.clone()).await?;

Arc::into_inner(transaction)
.context("Error transaction dangling")?
.into_inner()
.commit()
.await?;
Ok(())
}
// #[instrument(skip(self))]
// async fn sync_pipeline(
// &mut self,
// pipeline: &mut MultiFieldPipeline,
// transaction: Arc<Mutex<Transaction<'static, Postgres>>>,
// ) -> anyhow::Result<()> {
// self.verify_in_database(false).await?;
// let project_info = &self
// .database_data
// .as_ref()
// .context("Database data must be set to get collection pipelines")?
// .project_info;
// pipeline.set_project_info(project_info.clone());
// pipeline.create_tables().await?;

// pipeline.execute(None, transaction).await?;

// Arc::into_inner(transaction)
// .context("Error transaction dangling")?
// .into_inner()
// .commit()
// .await?;
// Ok(())
// }

#[instrument(skip(self))]
pub async fn search(
(remaining lines of collection.rs not shown)
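
The new upsert_documents query fetches the previous version of a document in the same statement that upserts the new one, by snapshotting the old row in a CTE before the INSERT ... ON CONFLICT overwrites it. A standalone sketch of the same pattern against a simplified documents table (id BIGSERIAL, source_uuid UUID UNIQUE, document JSONB), assuming sqlx with the postgres, uuid, and json features; the table, column names, and connection string here are illustrative, not the SDK's actual schema:

use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let pool = PgPoolOptions::new()
        .connect("postgres://localhost/example") // placeholder connection string
        .await?;

    // The CTE runs against the table state as of the start of the statement,
    // so `prev` still sees the old row even though the INSERT replaces it.
    // When the row is new, the scalar subquery in RETURNING yields NULL.
    let (id, previous): (i64, Option<serde_json::Value>) = sqlx::query_as(
        r#"
        WITH prev AS (
            SELECT document FROM documents WHERE source_uuid = $1
        )
        INSERT INTO documents (source_uuid, document)
        VALUES ($1, $2)
        ON CONFLICT (source_uuid) DO UPDATE SET document = EXCLUDED.document
        RETURNING id, (SELECT document FROM prev)
        "#,
    )
    .bind(uuid::Uuid::new_v4())
    .bind(serde_json::json!({ "title": "hello", "body": "world" }))
    .fetch_one(&pool)
    .await?;

    println!("id = {id}, previous version = {previous:?}");
    Ok(())
}

Returning the previous document this way is what lets upsert_documents hand each pipeline both versions without a separate SELECT per document.
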
16 changes: 14 additions & 2 deletions pgml-sdks/pgml/src/lib.rs
@@ -309,7 +309,7 @@ mod tests {
#[sqlx::test]
async fn can_add_pipeline_and_upsert_documents() -> anyhow::Result<()> {
internal_init_logger(None, None).ok();
let collection_name = "test_r_c_capaud_44";
let collection_name = "test_r_c_capaud_46";
let pipeline_name = "test_r_p_capaud_6";
let mut pipeline = MultiFieldPipeline::new(
pipeline_name,
@@ -361,13 +361,13 @@
.fetch_all(&pool)
.await?;
assert!(body_chunks.len() == 4);
collection.archive().await?;
let tsvectors_table = format!("{}_{}.body_tsvectors", collection_name, pipeline_name);
let tsvectors: Vec<models::TSVector> =
sqlx::query_as(&query_builder!("SELECT * FROM %s", tsvectors_table))
.fetch_all(&pool)
.await?;
assert!(tsvectors.len() == 4);
collection.archive().await?;
Ok(())
}

@@ -588,6 +588,18 @@ mod tests {
Ok(())
}

#[sqlx::test]
async fn can_update_documents() -> anyhow::Result<()> {
let collection_name = "test_r_c_cud_0";
let mut collection = Collection::new(collection_name, None);
let mut documents = generate_dummy_documents(1);
collection.upsert_documents(documents.clone(), None).await?;
documents[0]["body"] = json!("new body");
collection.upsert_documents(documents, None).await?;
// collection.archive().await?;
Ok(())
}

#[sqlx::test]
async fn can_search_with_local_embeddings() -> anyhow::Result<()> {
internal_init_logger(None, None).ok();
(remaining lines of lib.rs not shown)
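
With this commit the pipeline lifecycle on a collection becomes: add_pipeline registers the pipeline as inactive, creates its schema and tables, syncs existing documents inside a transaction, then flips it to active; disable_pipeline marks it inactive and drops its schema outright, because the sync machinery cannot cheaply resume; and enable_pipeline simply delegates to add_pipeline, rebuilding and re-syncing from scratch. A rough usage sketch, assuming Collection and MultiFieldPipeline are exported from the pgml crate and that the pipeline (whose schema argument is elided in this diff) has already been constructed:

async fn pipeline_lifecycle(
    collection: &mut pgml::Collection,
    pipeline: &mut pgml::MultiFieldPipeline,
) -> anyhow::Result<()> {
    // Registers the pipeline with active = FALSE, creates its schema and
    // tables, syncs existing documents in a transaction, then sets
    // active = TRUE and commits.
    collection.add_pipeline(pipeline).await?;

    // Sets active = FALSE and drops the pipeline's schema; derived chunks,
    // embeddings, and tsvectors are discarded rather than kept in sync.
    collection.disable_pipeline(pipeline).await?;

    // Re-enabling is just add_pipeline again: the schema is recreated and
    // every document is re-synced from scratch.
    collection.enable_pipeline(pipeline).await?;

    Ok(())
}

The can_update_documents test added in lib.rs exercises exactly this upsert path: insert a document, mutate one field, and upsert again.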