Grasp is a lightweight AI toolkit for Python, with tools for data mining, natural language processing (NLP), machine learning (ML) and network analysis. It has 300+ fast and essential algorithms, with ~25 lines of code per function, self-explanatory function names, no dependencies, bundled into one well-documented file: grasp.py (250KB). Or install with pip, including language models (25MB):
$ pip install git+https://github.com/textgain/grasp
Download stuff with download(url) (or dl), with built-in caching and logging:
src = dl('https://www.textgain.com', cached=True)
Parse HTML with dom(html) into an Element tree and search it with CSS Selectors:
for e in dom(src)('a[href^="http"]'): # external links
    print(e.href)
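The selector a[href^="http"] matches anchor tags whose href starts with http. As a rough illustration of that one filter, here is a sketch using only Python's standard html.parser. LinkCollector and external_links are hypothetical names for this example, not part of Grasp, and this is not how Grasp implements selectors:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix='http'):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and value.startswith(self.prefix):
                self.links.append(value)

def external_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

html = '<p><a href="https://example.org">out</a> <a href="/about">in</a></p>'
print(external_links(html))  # ['https://example.org']
```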
Strip HTML with plain(Element) to get a plain text string:
for word, count in wc(plain(dom(src))).items():
    print(word, count)
Find articles with wikipedia(str), in HTML:
for e in dom(wikipedia('cat', language='en'))('p'):
    print(plain(e))
Find opinions with twitter.search(str):
for tweet in first(10, twitter.search('from:textgain')): # latest 10
    print(tweet.id, tweet.text, tweet.date)
Deploy APIs with App. Works with WSGI and Nginx:
app = App()
@app.route('/')
def index(*path, **query):
    return 'Hi! %s %s' % (path, query)
app.run('127.0.0.1', 8080, debug=True)
Once this app is up, go check http://127.0.0.1:8080/app?q=cat.
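For a feel of what happens behind a route like this, here is a minimal plain-WSGI sketch using only the standard library. It assumes that URL path segments arrive as *path and query string parameters as **query; application and serve are hypothetical names, and Grasp's App does more than this:

```python
from urllib.parse import parse_qsl

def index(*path, **query):
    return 'Hi! %s %s' % (path, query)

def application(environ, start_response):
    """Minimal WSGI callable: path segments -> *path, query string -> **query."""
    path = [p for p in environ.get('PATH_INFO', '/').split('/') if p]
    query = dict(parse_qsl(environ.get('QUERY_STRING', '')))
    body = index(*path, **query).encode('utf-8')
    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ])
    return [body]

def serve(host='127.0.0.1', port=8080):
    """Run the app with the reference server from the standard library."""
    from wsgiref.simple_server import make_server
    make_server(host, port, application).serve_forever()
```

Calling serve() and visiting /app?q=cat should show Hi! ('app',) {'q': 'cat'}.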
Get language with lang(str) for 40+ languages, with ~92.5% accuracy:
print(lang('The cat sat on the mat.')) # {'en': 0.99}
Get locations with loc(str) for 25K+ EU cities:
print(loc('The cat lives in Catena.')) # {('Catena', 'IT', 43.8, 11.0): 1}
Get words & sentences with tok(str) (tokenize) at ~125K words/sec:
print(tok("Mr. etc. aren't sentence breaks! ;) This is:.", language='en'))
Get word polarity with pov(str) (point-of-view). Is it a positive or negative opinion?
print(pov(tok('Nice!', language='en'))) # +0.6
print(pov(tok('Dumb.', language='en'))) # -0.4
- For de, en, es, fr, nl, with ~75% accuracy.
- You'll need the language models in grasp/lm.
Tag word types with tag(str) in 10+ languages using robust ML models from UD:
for word, pos in tag(tok('The cat sat on the mat.'), language='en'):
    print(word, pos)
- Parts-of-speech include NOUN, VERB, ADJ, ADV, DET, PRON, PREP, ...
- For ar, da, de, en, es, fr, it, nl, no, pl, pt, ru, sv, tr, with ~95% accuracy.
- You'll need the language models in grasp/lm.
Tag keywords with trie, a compiled dict that scans ~250K words/sec:
t = trie({'cat*': 1, 'mat': 2})
for i, j, k, v in t.search('Cats love catnip.', etc='*'):
    print(i, j, k, v)
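To show the idea of a keyword trie with a trailing * wildcard, here is a small sketch in plain Python. It is an illustration, not Grasp's implementation: build_trie, lookup and search are hypothetical names, matching is case-insensitive by assumption, and the yielded tuple is assumed to be (start offset, end offset, matched word, value), which may differ from what Grasp's i, j, k, v contain:

```python
import re

END, ANY = object(), object()  # sentinel keys: exact match / prefix wildcard

def build_trie(keywords):
    """Compile {keyword: value} into nested dicts, one level per character."""
    root = {}
    for key, value in keywords.items():
        node = root
        for ch in key.rstrip('*').lower():
            node = node.setdefault(ch, {})
        node[ANY if key.endswith('*') else END] = value
    return root

def lookup(root, word):
    """Return the value for a word, or None if no keyword matches."""
    node = root
    for ch in word.lower():
        if ANY in node:          # a wildcard prefix already matched
            return node[ANY]
        node = node.get(ch)
        if node is None:
            return None
    return node[ANY] if ANY in node else node.get(END)

def search(root, text):
    """Yield (start, end, word, value) for every word that matches."""
    for m in re.finditer(r'\w+', text):
        value = lookup(root, m.group())
        if value is not None:
            yield m.start(), m.end(), m.group(), value

t = build_trie({'cat*': 1, 'mat': 2})
for match in search(t, 'Cats love catnip.'):
    print(*match)
```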
Get answers with gpt(). You'll need an OpenAI API key.
print(gpt("Why do cats sit on mats? (you're a psychologist)", key='...'))
Machine Learning (ML) algorithms learn by example. If you show them 10K spam and 10K real emails (i.e., train a model), they can predict whether other emails are also spam or not.
Each training example is a {feature: weight} dict with a label. For text, the features could be words, the weights could be word counts, and the label might be real or spam.
Quantify text with vec(str) (vectorize) into a {feature: weight} dict:
v1 = vec('I love cats! 😀', features=('c3', 'w1'))
v2 = vec('I hate cats! 😡', features=('c3', 'w1'))
- c1, c2, c3 count consecutive characters. For c2, cats → 1x ca, 1x at, 1x ts.
- w1, w2, w3 count consecutive words.
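The counting itself is easy to sketch in plain Python. The helpers below (char_ngrams and word_ngrams, hypothetical names) only illustrate the idea of counting consecutive characters or words. Grasp's vec may normalize, pad or weight the text differently:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count runs of n consecutive characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n):
    """Count runs of n consecutive whitespace-separated words."""
    words = text.split()
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

print(char_ngrams('cats', 2))         # Counter({'ca': 1, 'at': 1, 'ts': 1})
print(word_ngrams('I love cats', 2))  # Counter({'I love': 1, 'love cats': 1})
```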
Train models with fit(examples), save as JSON, predict labels:
m = fit([(v1, '+'), (v2, '-')], model=Perceptron) # DecisionTree, KNN, ...
m.save('opinion.json')
m = fit(open('opinion.json'))
print(m.predict(vec('She hates dogs.'))) # {'+': 0.4, '-': 0.6}
Once trained, Model.predict(vector) returns a dict with label probabilities (0.0–1.0).
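To make "learning by example" concrete, here is a minimal multi-class perceptron over {feature: weight} dicts in plain Python. It is a sketch, not Grasp's Perceptron: train, score and predict are hypothetical names, and turning the raw scores into probabilities with a softmax is a choice made for this example:

```python
import math
from collections import defaultdict

def score(weights, v):
    """Dot product of a label's weights with a sparse {feature: weight} vector."""
    return sum(weights.get(f, 0.0) * x for f, x in v.items())

def train(examples, epochs=10):
    """examples: list of ({feature: weight}, label) pairs."""
    labels = sorted({label for _, label in examples})
    w = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for v, label in examples:
            guess = max(labels, key=lambda l: score(w[l], v))
            if guess != label:
                # Reward the correct label, penalize the wrong guess.
                for f, x in v.items():
                    w[label][f] += x
                    w[guess][f] -= x
    return w

def predict(w, v):
    """Return {label: probability} via a softmax over the raw scores."""
    s = {label: score(weights, v) for label, weights in w.items()}
    m = max(s.values())
    e = {label: math.exp(x - m) for label, x in s.items()}
    z = sum(e.values())
    return {label: x / z for label, x in e.items()}

w = train([({'love': 1, 'cats': 1}, '+'), ({'hate': 1, 'cats': 1}, '-')])
p = predict(w, {'hate': 1, 'dogs': 1})
print(max(p, key=p.get))  # -
```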
Map networks with Graph, a {node1: {node2: weight}} dict subclass:
g = Graph(directed=True)
g.add('a', 'b') # a → b
g.add('b', 'c') # b → c
g.add('b', 'd') # b → d
g.add('c', 'd') # c → d
print(g.sp('a', 'd')) # shortest path: a → b → d
print(top(pagerank(g))) # strongest node: d, 0.8
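Because the graph is just a nested dict, a shortest-path search is short to write by hand. The sketch below runs Dijkstra's algorithm on a plain {node1: {node2: weight}} dict. shortest_path is a hypothetical name, the edge weight of 1.0 is an assumption, and Grasp's own sp method may work differently:

```python
import heapq

def shortest_path(g, start, goal):
    """Dijkstra's algorithm on a {node1: {node2: weight}} dict."""
    queue = [(0.0, start, [start])]  # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in g.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # goal is unreachable

# The same directed edges as above, each with an assumed weight of 1.0.
edges = {'a': {'b': 1.0}, 'b': {'c': 1.0, 'd': 1.0}, 'c': {'d': 1.0}, 'd': {}}
print(shortest_path(edges, 'a', 'd'))  # ['a', 'b', 'd']
```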
See networks with viz(graph):
with open('g.html', 'w') as f:
    f.write(viz(g, src='graph.js'))
You'll need to set src to the grasp/graph.js lib.
Easy date handling with date(v), where v is an int, a str, or another date:
print(date('Mon Jan 31 10:00:00 +0000 2000', format='%Y-%m-%d'))
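For comparison, the same string takes an explicit input pattern with the standard library. This sketch assumes that format in the call above controls how the date is printed, and that the day and month names are parsed in an English locale:

```python
from datetime import datetime

# Parse the timestamp, then print it as year-month-day.
d = datetime.strptime('Mon Jan 31 10:00:00 +0000 2000', '%a %b %d %H:%M:%S %z %Y')
print(d.strftime('%Y-%m-%d'))  # 2000-01-31
```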
Easy path handling with cd(...), which always points to the script's folder:
print(cd('kb', 'en-loc.csv'))
Easy CSV handling with csv([path]), a list of lists of values:
for code, country, _, _, _, _, _ in csv(cd('kb', 'en-loc.csv')):
    print(code, country)
data = csv()
data.append(('cat', 'Kitty'))
data.append(('cat', 'Simba'))
data.save(cd('cats.csv'))
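The "list of lists" idea maps directly onto Python's standard csv module (note that Grasp's csv function shares its name with that module). A minimal round trip, with an in-memory buffer standing in for cats.csv:

```python
import csv
import io

rows = [('cat', 'Kitty'), ('cat', 'Simba')]

buffer = io.StringIO()  # stands in for a file opened with newline=''
csv.writer(buffer).writerows(rows)

buffer.seek(0)
loaded = [tuple(row) for row in csv.reader(buffer)]
print(loaded)  # [('cat', 'Kitty'), ('cat', 'Simba')]
```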
A challenge in AI is bias introduced by human trainers. Remember the Model trained earlier? Grasp has tools to explain how & why it makes decisions:
print(explain(vec('She hates dogs.'), m)) # why so negative?
In the returned dict, the model's explanation is: “you wrote hat + ate (hate)”.