From 05f79b49dfbb8e16cf1570adab1bb4e81c634ae0 Mon Sep 17 00:00:00 2001 From: craigmoore1 <115486341+craigmoore1@users.noreply.github.com> Date: Thu, 2 May 2024 16:32:54 -0500 Subject: [PATCH 1/5] Initial_audit_changes Create some initial changes to the docs based on my audit suggestions. --- pgml-cms/docs/README.md | 8 ++- pgml-cms/docs/api/apis.md | 12 ++-- .../docs/api/sql-extension/pgml.deploy.md | 6 +- pgml-cms/docs/api/sql-extension/pgml.embed.md | 4 +- .../api/sql-extension/pgml.train/README.md | 4 +- .../pgml.train/classification.md | 42 ++++++------ .../sql-extension/pgml.train/clustering.md | 10 +-- .../sql-extension/pgml.train/regression.md | 64 +++++++++---------- .../pgml.transform/text-classification.md | 2 +- .../pgml.transform/text-generation.md | 12 ++-- pgml-cms/docs/api/sql-extension/pgml.tune.md | 16 ++--- .../introduction/getting-started/README.md | 14 +++- .../resources/developer-docs/installation.md | 12 ++-- pgml-cms/docs/use-cases/chatbots.md | 16 ++--- ...-with-application-data-in-your-database.md | 4 -- 15 files changed, 118 insertions(+), 108 deletions(-) diff --git a/pgml-cms/docs/README.md b/pgml-cms/docs/README.md index 1d993a933..e12ddf095 100644 --- a/pgml-cms/docs/README.md +++ b/pgml-cms/docs/README.md @@ -4,12 +4,14 @@ description: The key concepts that make up PostgresML. # Overview -PostgresML is a complete MLOps platform built on PostgreSQL. Our operating principle is: +PostgresML is a complete [MLOps platform](## "A Machine Learning Operations platform is a set of practices that streamlines bringing machine learning models to production") built on PostgreSQL. Our operating principle is: > _Move models to the database, rather than constantly moving data to the models._ Data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable and reliable to move models to the database, rather than continuously moving data to the models. 
+We offer both [managed-cloud](/docs/product/cloud-database/) and [local](/docs/resources/developer-docs/installation) installations to provide solutions for wherever you keep your data. + ## AI engine PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities: @@ -48,8 +50,8 @@ Some of the use cases include: ## Our mission -PostgresML strives to provide access to open source AI for everyone. We are continuously developping PostgresML to keep up with the rapidly evolving use cases for ML & AI, but we remain committed to never breaking user facing APIs. We welcome contributions to our [open source code and documentation](https://github.com/postgresml) from the community. +PostgresML strives to provide access to open source AI for everyone. We are continuously developing PostgresML to keep up with the rapidly evolving use cases for ML & AI, but we remain committed to never breaking user-facing APIs. We welcome contributions to our [open source code and documentation](https://github.com/postgresml target="_blank") from the community. ## Managed cloud -While our extension and pooler are open source, we also offer a managed cloud database service for production deployments of PostgresML. You can [sign up](https://postgresml.org/signup) for an account and get a free Serverless database in seconds. +While our extension and pooler are open source, we also offer a managed cloud database service for production deployments of PostgresML. You can [sign up](https://postgresml.org/signup target="_blank") for an account and get a free Serverless database in seconds. diff --git a/pgml-cms/docs/api/apis.md b/pgml-cms/docs/api/apis.md index a4a465d4f..b23be60e6 100644 --- a/pgml-cms/docs/api/apis.md +++ b/pgml-cms/docs/api/apis.md @@ -4,15 +4,15 @@ description: Overview of the PostgresML SQL API and SDK. 
# API overview -PostgresML is a PostgreSQL extension which adds SQL functions to the database where it's installed. The functions work with modern machine learning algorithms and latest open source LLMs while maintaining a stable API signature. They can be used by any application that connects to the database. +PostgresML is a PostgreSQL extension which adds SQL functions to the database where it is installed. The functions work with modern machine learning algorithms and latest open source LLMs while maintaining a stable API signature. They can be used by any application that connects to the database. -In addition to the SQL API, we built and maintain a client SDK for JavaScript, Python and Rust. The SDK uses the same extension functionality to implement common ML & AI use cases, like retrieval-augmented generation (RAG), chatbots, and semantic & hybrid search engines. +In addition to the SQL API, we built and maintain a client SDK for JavaScript, Python, and Rust. The SDK uses the same extension functionality to implement common ML & AI use cases, like retrieval-augmented generation (RAG), chatbots, and semantic & hybrid search engines. Using the SDK is optional, and you can implement the same functionality with standard SQL queries. If you feel more comfortable using a programming language, the SDK can help you to get started quickly. ## [SQL extension](sql-extension/) -The PostgreSQL extension provides all of the ML & AI functionality, like training models and inference, via SQL functions. The functions are designed for ML practitioners to use dozens of ML algorithms to train models, and run real time inference, on live application data. Additionally, the extension provides access to the latest Hugging Face transformers for a wide range of NLP tasks. +The PostgreSQL extension provides all of the ML & AI functionality, like training models and inference, via SQL functions. 
The functions are designed for ML practitioners to use dozens of ML algorithms to train models and run real time inference on live application data. Additionally, the extension provides access to the latest Hugging Face transformers for a wide range of NLP tasks. ### Functions @@ -21,18 +21,18 @@ The following functions are implemented and maintained by the PostgresML extensi | Function | Description | |------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | [pgml.embed()](sql-extension/pgml.embed) | Generate embeddings inside the database using open source embedding models from Hugging Face. | -| [pgml.transform()](sql-extension/pgml.transform/) | Download and run latest Hugging Face transformer models, like Llama, Mixtral, and many more to perform various NLP tasks like text generation, summarization, sentiment analysis and more. | +| [pgml.transform()](sql-extension/pgml.transform/) | Download and run the latest Hugging Face transformer models, like Llama, Mixtral, and many more to perform various NLP tasks like text generation, summarization, sentiment analysis, and more. | | pgml.transform_stream() | Streaming version of [pgml.transform()](sql-extension/pgml.transform/). Retrieve tokens as they are generated by the LLM, decreasing time to first token. | | [pgml.train()](sql-extension/pgml.train/) | Train a machine learning model on data from a Postgres table or view. Supports XGBoost, LightGBM, Catboost and all Scikit-learn algorithms. | | [pgml.deploy()](sql-extension/pgml.deploy) | Deploy a version of the model created with pgml.train(). | | [pgml.predict()](sql-extension/pgml.predict/) | Perform real time inference using a model trained with pgml.train() on live application data. 
| | [pgml.tune()](sql-extension/pgml.tune) | Run LoRA fine tuning on an open source model from Hugging Face using data from a Postgres table or view. | -Together with standard database functionality provided by PostgreSQL, these functions allow to create and manage the entire life cycle of a machine learning application. +Together with standard database functionality provided by PostgreSQL, these functions allow you to create and manage the entire life cycle of a machine-learning application. ## [Client SDK](client-sdk/) -The client SDK implements best practices and common use cases, using the PostgresML SQL functions and standard PostgreSQL features to do it. The SDK core is written in Rust, which manages creating and running queries, connection pooling, and error handling. +The client SDK implements best practices and common use cases using the PostgresML SQL functions and standard PostgreSQL features. The SDK core is written in Rust, which manages creating and running queries, connection pooling, and error handling. For each additional language we support (currently JavaScript and Python), we create and publish language-native bindings. This architecture ensures all programming languages we support have identical APIs and similar performance when interacting with PostgresML. diff --git a/pgml-cms/docs/api/sql-extension/pgml.deploy.md b/pgml-cms/docs/api/sql-extension/pgml.deploy.md index 3181f9d51..2fd0b9493 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.deploy.md +++ b/pgml-cms/docs/api/sql-extension/pgml.deploy.md @@ -87,7 +87,7 @@ SELECT * FROM pgml.deploy( ### Rolling Back -In case the new model isn't performing well in production, it's easy to rollback to the previous version. A rollback creates a new deployment for the old model. Multiple rollbacks in a row will oscillate between the two most recently deployed models, making rollbacks a safe and reversible operation. 
+If the new model is not performing well in production, it is easy to rollback to the previous version. A rollback creates a new deployment for the old model. Multiple rollbacks in a row will oscillate between the two most recently deployed models, making rollbacks a safe and reversible operation. #### Rollback @@ -101,7 +101,7 @@ SELECT * FROM pgml.deploy( #### Output ```sql - project | strategy | algorithm + project | strategy | algorithm ------------------------------------+----------+----------- Handwritten Digit Image Classifier | rollback | linear (1 row) @@ -129,7 +129,7 @@ SELECT * FROM pgml.deploy( ### Specific Model IDs -In the case you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by id. Model id's can be found in the `pgml.models` table. +In the case you need to deploy an exact model that is not the `most_recent` or `best_score`, you may deploy a model by id. Model ids can be found in the `pgml.models` table. #### SQL diff --git a/pgml-cms/docs/api/sql-extension/pgml.embed.md b/pgml-cms/docs/api/sql-extension/pgml.embed.md index b31c944b3..3aa74cbf2 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.embed.md +++ b/pgml-cms/docs/api/sql-extension/pgml.embed.md @@ -6,7 +6,7 @@ description: >- # pgml.embed() -The `pgml.embed()` function generates [embeddings](/docs/use-cases/embeddings/) from text, using in-database models downloaded from Hugging Face. Thousands of [open-source models](https://huggingface.co/models?library=sentence-transformers) are available and new and better ones are being published regularly. +The `pgml.embed()` function generates [embeddings](/docs/use-cases/embeddings/) from text, using in-database models downloaded from Hugging Face. Thousands of [open-source models](https://huggingface.co/models?library=sentence-transformers) are available, with new and better models being published regularly. 
## API @@ -69,7 +69,7 @@ VALUES {% endtab %} {% endtabs %} -In this example, we're using [generated columns](https://www.postgresql.org/docs/current/ddl-generated-columns.html) to automatically create an embedding of the `quote` column every time the column value is updated. +In this example, we are using [generated columns](https://www.postgresql.org/docs/current/ddl-generated-columns.html) to automatically create an embedding of the `quote` column every time the column value is updated. #### Using embeddings in queries diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/README.md b/pgml-cms/docs/api/sql-extension/pgml.train/README.md index ec49916cc..4074a1cac 100644 --- a/pgml-cms/docs/api/sql-extension/pgml.train/README.md +++ b/pgml-cms/docs/api/sql-extension/pgml.train/README.md @@ -37,7 +37,7 @@ pgml.train( | `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster` | | `relation_name` | `'public.search_logs'` | The Postgres table or view where the training data is stored or defined. | | `y_column_name` | `'clicked'` | The name of the label (aka "target" or "unknown") column in the training table. | -| `algorithm` | `'xgboost'` |
The algorithm to train on the dataset, see the task specific pages for available algorithms: regression.md classification.md clustering.md |
+| `algorithm` | `'xgboost'` | The algorithm to train on the dataset. |
| `hyperparams` | `{ "n_estimators": 25 }` | The hyperparameters to pass to the algorithm for training, JSON formatted. |
| `search` | `grid` | If set, PostgresML will perform a hyperparameter search to find the best hyperparameters for the algorithm. See [hyperparameter-search.md](hyperparameter-search.md "mention") for details. |
| `search_params` | `{ "n_estimators": [5, 10, 25, 100] }` | Search parameters used in the hyperparameter search, using the scikit-learn notation, JSON formatted. |
@@ -63,7 +63,7 @@ This will create a "My Classification Project", copy the `pgml.digits` table int
When used for the first time in a project, `pgml.train()` function requires the `task` parameter, which can be either `regression` or `classification`. The task determines the relevant metrics and analysis performed on the data. All models trained within the project will refer to those metrics and analysis for benchmarking and deployment.
-The first time it's called, the function will also require a `relation_name` and `y_column_name`. The two arguments will be used to create the first snapshot of training and test data. By default, 25% of the data (specified by the `test_size` parameter) will be randomly sampled to measure the performance of the model after the `algorithm` has been trained on the 75% of the data.
+The first time it is called, the function will also require a `relation_name` and `y_column_name`. The two arguments will be used to create the first snapshot of training and test data. By default, 25% of the data (specified by the `test_size` parameter) will be randomly sampled to measure the performance of the model after the `algorithm` has been trained on the remaining 75% of the data.
!!! tip
diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/classification.md b/pgml-cms/docs/api/sql-extension/pgml.train/classification.md
index 24df21c49..f92fef062 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.train/classification.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.train/classification.md
@@ -8,7 +8,7 @@ description: >-
## Example
-This example trains models on the sklean digits dataset which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for classification. You could do something similar with a vector column.
+This example trains models on the sklearn digits dataset, which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits target="_blank"). This demonstrates using a table with a single array feature column for classification. You could do something similar with a vector column.
```sql
-- load the sklearn digits dataset
@@ -33,16 +33,16 @@ LIMIT 10;
## Algorithms
-We currently support classification algorithms from [scikit-learn](https://scikit-learn.org/), [XGBoost](https://xgboost.readthedocs.io/), [LightGBM](https://lightgbm.readthedocs.io/) and [Catboost](https://catboost.ai/).
+We currently support classification algorithms from [scikit-learn](https://scikit-learn.org/ target="_blank"), [XGBoost](https://xgboost.readthedocs.io/ target="_blank"), [LightGBM](https://lightgbm.readthedocs.io/ target="_blank"), and [CatBoost](https://catboost.ai/ target="_blank").
### Gradient Boosting
| Algorithm | Reference |
| ----------------------- | -------------------------------------------------------------------------------------------------------------------------- |
-| `xgboost` | [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBClassifier) |
-| `xgboost_random_forest` | [XGBRFClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFClassifier) |
-| `lightgbm` | [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier) |
-| `catboost` | [CatBoostClassifier](https://catboost.ai/en/docs/concepts/python-reference\_catboostclassifier) |
+| `xgboost` | [XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBClassifier target="_blank") |
+| `xgboost_random_forest` | [XGBRFClassifier](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFClassifier target="_blank") |
+| `lightgbm` | [LGBMClassifier](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier target="_blank") |
+| `catboost` | [CatBoostClassifier](https://catboost.ai/en/docs/concepts/python-reference\_catboostclassifier target="_blank") |
#### Examples
@@ -57,12 +57,12 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'catboost', hyperpar
| Algorithm | Reference |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
-| `ada_boost` | [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) |
-| `bagging` | [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) |
-| `extra_trees` | [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) |
-| `gradient_boosting_trees` | [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) |
-| `random_forest` | [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) |
-| `hist_gradient_boosting` | [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html) |
+| `ada_boost` | [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html target="_blank") |
+| `bagging` | [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html target="_blank") |
+| `extra_trees` | [ExtraTreesClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html target="_blank") |
+| `gradient_boosting_trees` | [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html target="_blank") |
+| `random_forest` | [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html target="_blank") |
+| `hist_gradient_boosting` | [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html target="_blank") |
#### Examples
@@ -79,9 +79,9 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'hist_gradient_boost
| Algorithm | Reference |
| ------------ | ----------------------------------------------------------------------------------------- |
-| `svm` | [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) |
-| `nu_svm` | [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html) |
-| `linear_svm` | [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) |
+| `svm` | [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html target="_blank") |
+| `nu_svm` | [NuSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html target="_blank") |
+| `linear_svm` | [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html target="_blank") |
#### Examples
@@ -95,11 +95,11 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'linear_svm');
| Algorithm | Reference |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `linear` | [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LogisticRegression.html) |
-| `ridge` | [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RidgeClassifier.html) |
-| `stochastic_gradient_descent` | [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDClassifier.html) |
-| `perceptron` | [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Perceptron.html) |
-| `passive_aggressive` | [PassiveAggressiveClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveClassifier.html) |
+| `linear` | [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LogisticRegression.html target="_blank") |
+| `ridge` | [RidgeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RidgeClassifier.html target="_blank") |
+| `stochastic_gradient_descent` | [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDClassifier.html target="_blank") |
+| `perceptron` | [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Perceptron.html target="_blank") |
+| `passive_aggressive` | [PassiveAggressiveClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveClassifier.html target="_blank") |
#### Examples
@@ -114,7 +114,7 @@ SELECT * FROM pgml.train('Handwritten Digits', algorithm => 'passive_aggressive'
| Algorithm | Reference |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------- |
-| `gaussian_process` | [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessClassifier.html) |
+| `gaussian_process` | [GaussianProcessClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessClassifier.html target="_blank") |
#### Examples
diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md b/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md
index 16554f54a..5e3ffe58a 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.train/clustering.md
@@ -4,7 +4,7 @@ Models can be trained using `pgml.train` on unlabeled data to identify groups wi
## Example
-This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for clustering. You could do something similar with a vector column.
+This example trains models on the sklearn digits dataset -- which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits target="_blank"). This demonstrates using a table with a single array feature column for clustering. You could do something similar with a vector column.
```sql
SELECT pgml.load_dataset('digits');
@@ -31,10 +31,10 @@ All clustering algorithms implemented by PostgresML are online versions. You may
| Algorithm | Reference |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------- |
-| `affinity_propagation` | [AffinityPropagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) |
-| `birch` | [Birch](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html) |
-| `kmeans` | [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) |
-| `mini_batch_kmeans` | [MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html) |
+| `affinity_propagation` | [AffinityPropagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html target="_blank") |
+| `birch` | [Birch](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html target="_blank") |
+| `kmeans` | [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html target="_blank") |
+| `mini_batch_kmeans` | [MiniBatchKMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html target="_blank") |
### Examples
diff --git a/pgml-cms/docs/api/sql-extension/pgml.train/regression.md b/pgml-cms/docs/api/sql-extension/pgml.train/regression.md
index eb1a1d4de..959464339 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.train/regression.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.train/regression.md
@@ -6,11 +6,11 @@ description: >-
# Regression
-We currently support regression algorithms from [scikit-learn](https://scikit-learn.org/), [XGBoost](https://xgboost.readthedocs.io/), [LightGBM](https://lightgbm.readthedocs.io/) and [Catboost](https://catboost.ai/).
+We currently support regression algorithms from [scikit-learn](https://scikit-learn.org/ target="_blank"), [XGBoost](https://xgboost.readthedocs.io/ target="_blank"), [LightGBM](https://lightgbm.readthedocs.io/ target="_blank"), and [CatBoost](https://catboost.ai/ target="_blank").
## Example
-This example trains models on the sklean [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load\_diabetes.html#sklearn.datasets.load\_diabetes). This example uses multiple input features to predict a single output variable.
+This example trains models on the sklearn [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load\_diabetes.html#sklearn.datasets.load\_diabetes target="_blank"). It uses multiple input features to predict a single output variable.
```sql
-- load the dataset
@@ -34,10 +34,10 @@ LIMIT 10;
| Algorithm | Reference |
| ----------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| `xgboost` | [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRegressor) |
-| `xgboost_random_forest` | [XGBRFRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFRegressor) |
-| `lightgbm` | [LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor) |
-| `catboost` | [CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference\_catboostregressor) |
+| `xgboost` | [XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRegressor target="_blank") |
+| `xgboost_random_forest` | [XGBRFRegressor](https://xgboost.readthedocs.io/en/stable/python/python\_api.html#xgboost.XGBRFRegressor target="_blank") |
+| `lightgbm` | [LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor target="_blank") |
+| `catboost` | [CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference\_catboostregressor target="_blank") |
#### Examples
@@ -52,12 +52,12 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'catboost', hyperp
| Algorithm | Reference |
| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
-| `ada_boost` | [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html) |
-| `bagging` | [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html) |
-| `extra_trees` | [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) |
-| `gradient_boosting_trees` | [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) |
-| `random_forest` | [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) |
-| `hist_gradient_boosting` | [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html) |
+| `ada_boost` | [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html target="_blank") |
+| `bagging` | [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html target="_blank") |
+| `extra_trees` | [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html target="_blank") |
+| `gradient_boosting_trees` | [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html target="_blank") |
+| `random_forest` | [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html target="_blank") |
+| `hist_gradient_boosting` | [HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html target="_blank") |
#### Examples
@@ -74,9 +74,9 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'hist_gradient_boo
| Algorithm | Reference |
| ------------ | ----------------------------------------------------------------------------------------- |
-| `svm` | [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) |
-| `nu_svm` | [NuSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html) |
-| `linear_svm` | [LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) |
+| `svm` | [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html target="_blank") |
+| `nu_svm` | [NuSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html target="_blank") |
+| `linear_svm` | [LinearSVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html target="_blank") |
#### Examples
@@ -90,21 +90,21 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'linear_svm', hype
| Algorithm | Reference |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `linear` | [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LinearRegression.html) |
-| `ridge` | [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Ridge.html) |
-| `lasso` | [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lasso.html) |
-| `elastic_net` | [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ElasticNet.html) |
-| `least_angle` | [LARS](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lars.html) |
-| `lasso_least_angle` | [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LassoLars.html) |
-| `orthoganl_matching_pursuit` | [OrthogonalMatchingPursuit](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.OrthogonalMatchingPursuit.html) |
-| `bayesian_ridge` | [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.BayesianRidge.html) |
-| `automatic_relevance_determination` | [ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ARDRegression.html) |
-| `stochastic_gradient_descent` | [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDRegressor.html) |
-| `passive_aggressive` | [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveRegressor.html) |
-| `ransac` | [RANSACRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RANSACRegressor.html) |
-| `theil_sen` | [TheilSenRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.TheilSenRegressor.html) |
-| `huber` | [HuberRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.HuberRegressor.html) |
-| `quantile` | [QuantileRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.QuantileRegressor.html) |
+| `linear` | [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LinearRegression.html target="_blank") |
+| `ridge` | [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Ridge.html target="_blank") |
+| `lasso` | [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lasso.html target="_blank") |
+| `elastic_net` | [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ElasticNet.html target="_blank") |
+| `least_angle` | [LARS](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.Lars.html target="_blank") |
+| `lasso_least_angle` | [LassoLars](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LassoLars.html target="_blank") |
+| `orthogonal_matching_pursuit` | [OrthogonalMatchingPursuit](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.OrthogonalMatchingPursuit.html target="_blank") |
+| `bayesian_ridge` | [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.BayesianRidge.html target="_blank") |
+| `automatic_relevance_determination` | [ARDRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.ARDRegression.html target="_blank") |
+| `stochastic_gradient_descent` | [SGDRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.SGDRegressor.html target="_blank") |
+| `passive_aggressive` | [PassiveAggressiveRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.PassiveAggressiveRegressor.html target="_blank") |
+| `ransac` | [RANSACRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.RANSACRegressor.html target="_blank") |
+| `theil_sen` | [TheilSenRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.TheilSenRegressor.html target="_blank") |
+| `huber` | [HuberRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.HuberRegressor.html target="_blank") |
+| `quantile` | [QuantileRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.QuantileRegressor.html target="_blank") |
#### Examples
@@ -130,8 +130,8 @@ SELECT * FROM pgml.train('Diabetes Progression', algorithm => 'quantile');
| Algorithm | Reference |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
-| `kernel_ridge` | [KernelRidge](https://scikit-learn.org/stable/modules/generated/sklearn.kernel\_ridge.KernelRidge.html) |
-| `gaussian_process` | [GaussianProcessRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessRegressor.html) |
+| `kernel_ridge` | [KernelRidge](https://scikit-learn.org/stable/modules/generated/sklearn.kernel\_ridge.KernelRidge.html target="_blank") |
+| `gaussian_process` | [GaussianProcessRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.GaussianProcessRegressor.html target="_blank") |
#### Examples
diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md b/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md
index eb670b267..bacb0e77b 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.transform/text-classification.md
@@ -8,7 +8,7 @@ Text classification is a task which includes sentiment analysis, natural languag
### Sentiment analysis
-Sentiment analysis is a type of natural language processing technique which analyzes a piece of text to determine the sentiment or emotion expressed within. It can be used to classify a text as positive, negative, or neutral.
+Sentiment analysis is a type of natural language processing (NLP) technique which analyzes a piece of text to determine the sentiment or emotion expressed within. It can be used to classify a text as positive, negative, or neutral.
#### Example
diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md b/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md
index 8d84ca762..74d3c73c4 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.transform/text-generation.md
@@ -27,7 +27,7 @@ _Result_
### Model from hub
-To use a specific model from :hugging: model hub, pass the model name along with task name in task.
+To use a specific model from the Hugging Face model hub, pass the model name along with the task name in `task`.
```sql
SELECT pgml.transform(
@@ -109,7 +109,7 @@ _Result_
### Beam Search
-Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high probability word combinations. Beam search achieves this by retaining the num\_beams most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses reached the EOS token.
+Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high-probability word combinations. Beam search achieves this by retaining the `num_beams` most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation finishes when all beam hypotheses have reached the EOS token.
```sql
SELECT pgml.transform(
@@ -135,14 +135,16 @@ _Result_
]]
```
-Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w\_t$ according to its conditional probability distribution: $$w_t \approx P(w_t|w_{1:t-1})$$
+Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word `$w_t$` according to its conditional probability distribution: `$$w_t \sim P(w_t|w_{1:t-1})$$`.
-However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as temperature, top-k, or top-p. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
+However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as `temperature`, `top_k`, or `top_p`. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
You can pass `do_sample = True` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p` but not both.
### _Temperature_
+The `temperature` parameter controls the randomness of the model's output. The model's token scores are divided by the temperature before they are converted to probabilities, so values close to 0 concentrate probability on the most likely tokens (conservative, predictable output), while larger values flatten the distribution (more diverse, less predictable output). A `temperature` of 1 leaves the model's distribution unchanged and is considered a medium setting.
+
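To make the effect of `temperature` concrete, the following is a minimal, standalone Python sketch of temperature scaling. It is an illustration only, not PostgresML's implementation; the function name `softmax_with_temperature` and the example scores are assumptions made up for this sketch.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the highest-scoring token, while higher temperatures spread it more evenly across the candidates.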
```sql
SELECT pgml.transform(
task => '{
@@ -167,6 +169,8 @@ _Result_
### _Top p_
+Top-p sampling (also called nucleus sampling) restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches `top_p`, and discards the remaining low-probability tokens. The value of `top_p` is a number between 0 and 1, so a setting of `0.8` samples only from the most likely tokens that together account for 80 percent of the probability mass. Lower values make the output more focused, and higher values allow more diverse responses. If you are experiencing repetitive responses, raising this setting can improve the quality of the output.
+
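The filtering step behind `top_p` can be sketched in a few lines of standalone Python. This is an illustration under stated assumptions, not PostgresML's implementation; the function name `top_p_filter` and the example probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept = []
    cumulative = 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


# Hypothetical next-token probabilities for four candidate tokens.
probs = [0.4, 0.3, 0.2, 0.1]
nucleus = top_p_filter(probs, 0.8)
print(nucleus)
```

With these example values the three most likely tokens are kept and the least likely one is dropped; the next token would then be sampled from the renormalized probabilities.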
```sql
SELECT pgml.transform(
task => '{
diff --git a/pgml-cms/docs/api/sql-extension/pgml.tune.md b/pgml-cms/docs/api/sql-extension/pgml.tune.md
index 4c874893a..535e2e44c 100644
--- a/pgml-cms/docs/api/sql-extension/pgml.tune.md
+++ b/pgml-cms/docs/api/sql-extension/pgml.tune.md
@@ -10,11 +10,11 @@ Pre-trained models allow you to get up and running quickly, but you can likely i
### Translation Example
-The [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) organization provides more than a thousand pre-trained models to translate between different language pairs. These can be further fine tuned on additional datasets with domain specific vocabulary. Researchers have also created large collections of documents that have been manually translated across languages by experts for training data.
+The [Helsinki-NLP](https://huggingface.co/Helsinki-NLP target="_blank") organization provides more than a thousand pre-trained models to translate between different language pairs. These can be further fine tuned on additional datasets with domain specific vocabulary. Researchers have also created large collections of documents that have been manually translated across languages by experts for training data.
#### Prepare the data
-The [kde4](https://huggingface.co/datasets/kde4) dataset contains many language pairs. Subsets can be loaded into your Postgres instance with a call to `pgml.load_dataset`, or you may wish to create your own fine tuning dataset with vocabulary specific to your domain.
+The [kde4](https://huggingface.co/datasets/kde4 target="_blank") dataset contains many language pairs. Subsets can be loaded into your Postgres instance with a call to `pgml.load_dataset`, or you may wish to create your own fine tuning dataset with vocabulary specific to your domain.
```sql
SELECT pgml.load_dataset('kde4', kwargs => '{"lang1": "en", "lang2": "es"}');
@@ -149,7 +149,7 @@ Time: 126.837 ms
\===
-See the [task documentation](https://huggingface.co/tasks/translation) for more examples, use cases, models and datasets.
+See the [task documentation](https://huggingface.co/tasks/translation target="_blank") for more examples, use cases, models and datasets.
### Text Classification Example
@@ -190,7 +190,7 @@ SELECT * FROM pgml.imdb LIMIT 1;
#### Tune the model
-Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models) to start with, rather than training an algorithm from scratch.
+Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models target="_blank") to start with, rather than training an algorithm from scratch.
```sql
SELECT pgml.tune(
@@ -257,7 +257,7 @@ Time: 18.101 ms
This shows that there is a 6.26% chance for category 0 (negative sentiment), and a 93.73% chance it's category 1 (positive sentiment).
-See the [task documentation](https://huggingface.co/tasks/text-classification) for more examples, use cases, models and datasets.
+See the [task documentation](https://huggingface.co/tasks/text-classification target="_blank") for more examples, use cases, models and datasets.
### Summarization Example
@@ -265,7 +265,7 @@ At a high level, summarization uses similar techniques to translation. Both use
#### Prepare the data
-[BillSum](https://huggingface.co/datasets/billsum) is a dataset with training examples that summarize US Congressional and California state bills. You can pass `kwargs` specific to loading datasets, in this case we'll restrict the dataset to California samples:
+[BillSum](https://huggingface.co/datasets/billsum target="_blank") is a dataset with training examples that summarize US Congressional and California state bills. You can pass `kwargs` specific to loading datasets, in this case we'll restrict the dataset to California samples:
```sql
SELECT pgml.load_dataset('billsum', kwargs => '{"split": "ca_test"}');
@@ -377,7 +377,7 @@ LIMIT 10;
#### Tune the model
-Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models) to start with, rather than training an algorithm from scratch.
+Tuning has a nearly identical API to training, except you may pass the name of a [model published on Hugging Face](https://huggingface.co/models target="_blank") to start with, rather than training an algorithm from scratch.
```sql
SELECT pgml.tune(
@@ -443,7 +443,7 @@ Time: 18.101 ms
This shows that there is a 6.26% chance for category 0 (negative sentiment), and a 93.73% chance it's category 1 (positive sentiment).
-See the [task documentation](https://huggingface.co/tasks/text-classification) for more examples, use cases, models and datasets.
+See the [task documentation](https://huggingface.co/tasks/text-classification target="_blank") for more examples, use cases, models and datasets.
### Text Generation
diff --git a/pgml-cms/docs/introduction/getting-started/README.md b/pgml-cms/docs/introduction/getting-started/README.md
index 309e0ac64..bc89d7854 100644
--- a/pgml-cms/docs/introduction/getting-started/README.md
+++ b/pgml-cms/docs/introduction/getting-started/README.md
@@ -4,6 +4,16 @@ description: Getting starting with PostgresML, a GPU powered machine learning da
# Getting started
+This guide walks you through getting started with PostgresML in three steps:
+
+1. [Create a free cloud account with PostgresML](/docs/introduction/getting-started/create-your-database#sign-up-for-an-account). This also creates a PostgresML database and includes access to GPU-accelerated models and 5 GB of storage.
+2. [Select a plan](create-your-database#select-a-plan).
+3. [Connect your PostgreSQL client to your PostgresML database](connect-your-app).
+
+If you would prefer to run PostgresML locally, you can skip to our [Developer Docs](/docs/resources/developer-docs/quick-start-with-docker).
+
+## How PostgresML works
+
A PostgresML deployment consists of multiple components working in concert to provide a complete Machine Learning platform:
* PostgreSQL database, with [_pgml_](/docs/api/sql-extension/), _pgvector_ and many other extensions that add features useful in day-to-day and machine learning use cases
@@ -14,6 +24,4 @@ We provide a fully managed solution in [our cloud](create-your-database), and do
The algorithm to train on the dataset. |
+| `algorithm` | `'xgboost'` | The algorithm to train on the dataset, see the task specific pages for available algorithms: [classification.md](classification.md "mention"), [regression.md](regression.md "mention"), [clustering.md](clustering.md "mention") |