Commit 3b406d0

add PCA as first decomposition method (#1441)
1 parent c5d8c6f commit 3b406d0

File tree

20 files changed

+295
-49
lines changed

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+---
+description: Decompose an input vector into its principal components
+---
+
+# pgml.decompose()
+
+
+Decomposition reduces an input feature vector to its principal components, using a model previously trained with `pgml.train()`.
+
+## API
+
+```sql
+pgml.decompose(
+    project_name TEXT, -- project name
+    vector REAL[] -- features to decompose
+)
+```
+
+### Parameters
+
+| Parameter | Example | Description |
+|----------------|---------------------------------|----------------------------------------------------------|
+| `project_name` | `'My First PostgresML Project'` | The project name used to train models in `pgml.train()`. |
+| `vector` | `ARRAY[0.1, 0.45, 1.0]` | The feature vector that needs decomposition. |
+
+## Example
+
+```sql
+SELECT pgml.decompose('My PCA', ARRAY[0.1, 2.0, 5.0]);
+```
+
+!!! example
+
+```sql
+SELECT *,
+    pgml.decompose(
+        'Buy it Again',
+        ARRAY[
+            users.location_id,
+            EXTRACT(EPOCH FROM NOW() - users.created_at),
+            users.total_purchases_in_dollars
+        ]::REAL[]
+    ) AS buying_score
+FROM users
+WHERE tenant_id = 5
+ORDER BY buying_score
+LIMIT 25;
+```
+
+!!!

pgml-cms/docs/api/sql-extension/pgml.train/clustering.md

Lines changed: 3 additions & 3 deletions
@@ -16,8 +16,8 @@ SELECT image FROM pgml.digits;
 -- view the dataset
 SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;
 
--- train a simple model to classify the data
-SELECT * FROM pgml.train('Handwritten Digit Clusters', 'cluster', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');
+-- train a simple model to cluster the data
+SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');
 
 -- check out the predictions
 SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
@@ -27,7 +27,7 @@ LIMIT 10;
 
 ## Algorithms
 
-All clustering algorithms implemented by PostgresML are online versions. You may use the [pgml.predict](../../../api/sql-extension/pgml.predict/ "mention")function to cluster novel datapoints after the clustering model has been trained.
+All clustering algorithms implemented by PostgresML are online versions. You may use the [pgml.predict](../../../api/sql-extension/pgml.predict/ "mention")function to cluster novel data points after the clustering model has been trained.
 
 | Algorithm | Reference |
 | ---------------------- | ----------------------------------------------------------------------------------------------------------------- |
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# Decomposition
+
+Models can be trained using `pgml.train` on unlabeled data to identify important features within the data. To decompose a dataset into its principal components, we can use a table or a view. Since decomposition is an unsupervised algorithm, we don't need a column that represents a label as one of the inputs to `pgml.train`.
+
+## Example
+
+This example trains models on the sklearn digits dataset, which is a copy of the test set of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). This demonstrates using a table with a single array feature column for principal component analysis. You could do something similar with a vector column.
+
+```sql
+SELECT pgml.load_dataset('digits');
+
+-- create an unlabeled table of the images for unsupervised learning
+CREATE VIEW pgml.digit_vectors AS
+SELECT image FROM pgml.digits;
+
+-- view the dataset
+SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;
+
+-- train a simple model to decompose the data
+SELECT * FROM pgml.train('Handwritten Digit Components', 'decomposition', 'pgml.digit_vectors', hyperparams => '{"n_components": 3}');
+
+-- check out the components
+SELECT target, pgml.decompose('Handwritten Digit Components', image) AS pca
+FROM pgml.digits
+LIMIT 10;
+```
+
+Note that the input vectors have been reduced from 64 dimensions to 3 components, which together explain nearly half of the variance across all samples.
+
+## Algorithms
+
+All decomposition algorithms implemented by PostgresML are online versions. You may use the [pgml.decompose](../../../api/sql-extension/pgml.decompose "mention") function to decompose novel data points after the model has been trained.
+
+| Algorithm | Reference |
+|---------------------------|---------------------------------------------------------------------------------------------------------------------|
+| `pca` | [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) |
+
+### Examples
+
+```sql
+SELECT * FROM pgml.train('Handwritten Digit Components', algorithm => 'pca', hyperparams => '{"n_components": 10}');
+```
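
The documentation's remark that three components explain nearly half of the variance refers to a cumulative explained variance ratio: the variance captured by the kept components divided by the total variance. The following is a minimal Rust sketch of that arithmetic only. The helper name `cumulative_explained_variance` and the per-component variances are hypothetical and are not taken from the extension's code or from the digits dataset.

```rust
/// Fraction of the total variance captured by the first `n_components` components.
/// `variances` is assumed to hold one variance per principal component,
/// sorted from largest to smallest (hypothetical input for illustration).
fn cumulative_explained_variance(variances: &[f64], n_components: usize) -> f64 {
    let total: f64 = variances.iter().sum();
    if total == 0.0 {
        // No variance at all, so nothing can be explained.
        return 0.0;
    }
    // Sum only the components that are kept; `take` stops early if fewer exist.
    let kept: f64 = variances.iter().take(n_components).sum();
    kept / total
}

fn main() {
    // Illustrative numbers only, not measured from the digits dataset.
    let variances = [4.0, 2.0, 1.0, 1.0];
    let ratio = cumulative_explained_variance(&variances, 2);
    println!("first 2 components explain {:.0}% of the variance", ratio * 100.0);
}
```

A ratio near 0.5 for `n_components = 3` would correspond to the "nearly half" figure quoted in the documentation, assuming the reported metric is defined this way.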

pgml-dashboard/src/models.rs

Lines changed: 4 additions & 2 deletions
@@ -55,10 +55,11 @@ impl Project {
         match self.task.as_ref().unwrap().as_str() {
             "classification" | "text_classification" | "question_answering" => Ok("f1"),
             "regression" => Ok("r2"),
+            "clustering" => Ok("silhouette"),
+            "decomposition" => Ok("cumulative_explained_variance"),
             "summarization" => Ok("rouge_ngram_f1"),
             "translation" => Ok("bleu"),
             "text_generation" | "text2text" => Ok("perplexity"),
-            "cluster" => Ok("silhouette"),
             task => Err(anyhow::anyhow!("Unhandled task: {}", task)),
         }
     }
@@ -67,10 +68,11 @@ impl Project {
         match self.task.as_ref().unwrap().as_str() {
             "classification" | "text_classification" | "question_answering" => Ok("F<sup>1</sup>"),
             "regression" => Ok("R<sup>2</sup>"),
+            "clustering" => Ok("silhouette"),
+            "decomposition" => Ok("Cumulative Explained Variance"),
             "summarization" => Ok("Rouge Ngram F<sup>1</sup>"),
             "translation" => Ok("Bleu"),
             "text_generation" | "text2text" => Ok("Perplexity"),
-            "cluster" => Ok("silhouette"),
             task => Err(anyhow::anyhow!("Unhandled task: {}", task)),
         }
     }
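
The effect of the first hunk above can be checked with a stand-alone sketch of the task-to-metric lookup. This is a simplification under stated assumptions: the function name `key_metric_name` is hypothetical (the real method name is not visible in this excerpt), the task is passed as a plain `&str` instead of being read from `self.task`, and the error is a `String` rather than `anyhow::Error` so the block compiles without external crates.

```rust
/// Simplified copy of the match arms from the diff above.
/// `key_metric_name` is a hypothetical name used only for this sketch.
fn key_metric_name(task: &str) -> Result<&'static str, String> {
    match task {
        "classification" | "text_classification" | "question_answering" => Ok("f1"),
        "regression" => Ok("r2"),
        "clustering" => Ok("silhouette"),
        "decomposition" => Ok("cumulative_explained_variance"),
        "summarization" => Ok("rouge_ngram_f1"),
        "translation" => Ok("bleu"),
        "text_generation" | "text2text" => Ok("perplexity"),
        // Any other task name falls through to the error branch.
        other => Err(format!("Unhandled task: {}", other)),
    }
}

fn main() {
    // The new task added by this commit maps to its own metric key.
    assert_eq!(key_metric_name("decomposition"), Ok("cumulative_explained_variance"));
    // The renamed task is matched under its new name only.
    assert_eq!(key_metric_name("clustering"), Ok("silhouette"));
    assert!(key_metric_name("cluster").is_err());
}
```

In this sketch a task still named `cluster` reaches the error branch. Whether existing projects are migrated to the new `clustering` name is not shown in this excerpt.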
File renamed without changes.

pgml-extension/Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

pgml-extension/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "pgml"
-version = "2.8.3"
+version = "2.8.4"
 edition = "2021"
 
 [lib]

pgml-extension/examples/cluster.sql renamed to pgml-extension/examples/clustering.sql

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ SELECT image FROM pgml.digits;
 SELECT left(image::text, 40) || ',...}' FROM pgml.digit_vectors LIMIT 10;
 
 -- train a simple model to classify the data
-SELECT * FROM pgml.train('Handwritten Digit Clusters', 'cluster', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');
+SELECT * FROM pgml.train('Handwritten Digit Clusters', 'clustering', 'pgml.digit_vectors', hyperparams => '{"n_clusters": 10}');
 
 -- check out the predictions
 SELECT target, pgml.predict('Handwritten Digit Clusters', image) AS prediction
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+-- This example reduces the dimensionality of images in the sklearn digits dataset
+-- which is a copy of the test set of the UCI ML hand-written digits datasets
+-- https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
+--
+-- This demonstrates using a table with a single array feature column
+-- for decomposition to reduce dimensionality.
+--
+-- Exit on error (psql)
+-- \set ON_ERROR_STOP true
+\timing on
+
+SELECT pgml.load_dataset('digits');
+
+-- view the dataset
+SELECT left(image::text, 40) || ',...}', target FROM pgml.digits LIMIT 10;
+
+-- create a view of just the vectors for decomposition, without any labels
+CREATE VIEW digit_vectors AS
+SELECT image FROM pgml.digits;
+
+SELECT * FROM pgml.train('Handwritten Digits Reduction', 'decomposition', 'digit_vectors');
+
+-- check out the decomposed vectors
+SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
+FROM pgml.digits
+LIMIT 10;
+
+--
+-- After a project has been trained, omitted parameters will be reused from previous training runs
+-- In these examples we'll reuse the training data snapshots from the initial call.
+--
+
+-- We can reduce the image vectors from 64 dimensions to 3 components
+SELECT * FROM pgml.train('Handwritten Digits Reduction', hyperparams => '{"n_components": 3}');
+
+-- check out the reduced vectors
+SELECT target, pgml.decompose('Handwritten Digits Reduction', image) AS pca
+FROM pgml.digits
+LIMIT 10;
+
+-- check out all that hard work
+SELECT trained_models.* FROM pgml.trained_models
+JOIN pgml.models on models.id = trained_models.id
+ORDER BY models.metrics->>'cumulative_explained_variance' DESC LIMIT 5;
+
+-- deploy the PCA model for prediction use
+SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent', 'pca');
+-- check out that throughput
+SELECT * FROM pgml.deployed_models ORDER BY deployed_at DESC LIMIT 5;
+
+-- deploy the "best" model for prediction use
+SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score');
+SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'most_recent');
+SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'rollback');
+SELECT * FROM pgml.deploy('Handwritten Digits Reduction', 'best_score', 'pca');
+
+-- check out the improved predictions
+SELECT target, pgml.predict('Handwritten Digits Reduction', image) AS prediction
+FROM pgml.digits
+LIMIT 10;

pgml-extension/examples/image_classification.sql

Lines changed: 2 additions & 3 deletions
@@ -5,9 +5,8 @@
 -- This demonstrates using a table with a single array feature column
 -- for classification.
 --
--- The final result after a few seconds of training is not terrible. Maybe not perfect
--- enough for mission critical applications, but it's telling how quickly "off the shelf"
--- solutions can solve problems these days.
+-- Some algorithms converge on this trivial dataset in under a second, demonstrating the
+-- speed with which modern machines can "learn" from example data.
 
 -- Exit on error (psql)
 -- \set ON_ERROR_STOP true
