Golang Detection from Stack Overflow Questions
Building an NLP pipeline that provides semantic value for an English homonym: the word 'go'.
- Project At A Glance
- Initialization
- Dataset Features
- Primitive Function for 'go'
- spaCy Injection
- Text Relationships and Dependencies
- Data Manipulation
- Splitting and Benchmarking
- Using our model on the detectable class
- Metrics
Project At A Glance
Objective
: Off the top of my head, the word 'go' can be a verb, a noun, or just a substring of a longer word. The goal here is to build an intelligent solution that analyzes textual relationships and locates every instance of 'golang' in programming questions.
Data
: StackSample: 10% of Stack Overflow Q&A. [Download]
Implementation
: spaCy's en_core_web Model, Part of Speech, Sentence Dependencies, Rule-Based Matching, Tagging
Results
:
-
The small (_sm) model performs best under the following conditions:
[i] Question Tag == 'go'
[ii] Part of Speech != 'VERB'.
-
In the above case, the model is able to collect all instances of golang.
- Therefore, Recall = (1.00) while Precision = Accuracy = (0.891)
- The food for thought from this project led to LangWhich: a much more concise CLI implementation that works for all programming languages. [View LangWhich]
Deployment
: View this project on GitHub.
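Before walking through the notebook, the winning rule from the results above can be sketched as a plain predicate. The function name and signature are my own for illustration; the tag set and POS label are assumed to come from the Tags dataset and spaCy, respectively:

```python
def is_golang(token_text, token_pos, question_tags):
    """Best rule found: the token reads 'go'/'golang', the question
    carries the 'go' tag, and spaCy does not parse the token as a verb."""
    return (
        token_text.lower() in ('go', 'golang')
        and 'go' in question_tags
        and token_pos != 'VERB'
    )

print(is_golang('Go', 'PROPN', {'go', 'concurrency'}))  # True
print(is_golang('go', 'VERB', {'go'}))                  # False: verb usage
```

The rest of the notebook builds up to exactly this combination of tag filtering and part-of-speech filtering.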
import pandas as pd
df = (pd.read_csv('Questions.csv', nrows=1_000_000, usecols=['Title', 'Id'], encoding='ISO-8859-1'))
titles = df['Title'].tolist()
df.info()
import random
random.choices(titles, k=20)
def has_golang(text):
    return 'go' in text # basic string-matching ~ unsatisfactory output
g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]
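The substring check above is unsatisfactory because it also fires on words that merely contain 'go'. A minimal sketch of a stricter baseline (not part of the original pipeline) uses a word-boundary regex; the example titles are made up:

```python
import re

# Word-boundary match: only whole-word 'go'/'golang', case-insensitive.
GO_PATTERN = re.compile(r'\b(go|golang)\b', re.IGNORECASE)

def has_golang_word(text):
    """True only when 'go' or 'golang' appears as a whole word."""
    return bool(GO_PATTERN.search(text))

print(has_golang_word('How do I use goroutines in Go?'))  # True
print(has_golang_word('Django template inheritance'))     # False
print('go' in 'Django template inheritance'.lower())      # True: naive check over-matches
```

Even this cannot separate the verb 'go' from the language Go, which is why the next step brings in spaCy's part-of-speech tagging.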
import spacy
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
[t for t in nlp('Go is both a verb and a programming language.')]
doc = nlp('Go is both a verb and a programming language.')
t = doc[0]
type(t)
from spacy import displacy
displacy.render(doc) # token-relationships
spacy.explain('det')
for t in doc:
    print(t, t.pos_, t.dep_) # pos = part of speech, dep = dependency
df = (pd.read_csv('Questions.csv', nrows=2_000_000, usecols=['Title', 'Id'], encoding='ISO-8859-1'))
titles = df.loc[lambda d: d['Title'].str.lower().str.contains('go')]['Title'].tolist()
random.choices(titles, k=7)
nlp = spacy.load('en_core_web_sm', disable=['ner'])
%%time
def has_golang(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ == 'NOUN':
                return True
    return False
# Collecting data that has 'go'/'golang' where pos = Noun
g = (doc for doc in nlp.pipe(titles) if has_golang(doc)) # nlp.pipe() streams the texts in batches for speed, yielding Doc objects
[next(g) for i in range(15)]
df_tags = pd.read_csv('Tags.csv')
go_ids = df_tags.loc[lambda d: d['Tag'] == 'go']['Id']
# Collecting data from the Tags dataset with ID = 'go'
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != 'VERB':
                return True
    return False
# Collecting data with 'go'/'golang' where pos != VERB
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]
non_detectable = (df
.loc[lambda d: ~d['Id'].isin(go_ids)]
.loc[lambda d: d['Title'].str.lower().str.contains('go')]
['Title']
.tolist())
non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]
len(all_go_sentences), len(detectable), len(non_detectable)
# all_go_sentences = question is tagged 'go'
# detectable = tagged 'go' AND the rule fires on the title
# non_detectable = 'go' in the title but the question lacks the 'go' tag
# all_go_sentences - detectable = tagged 'go' but the rule misses the title
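The relationships between these three groups can be illustrated with toy data (all ids and titles below are made up, not drawn from the dataset):

```python
go_ids = {1, 2, 3}  # questions carrying the 'go' tag
titles = {
    1: 'Concurrency in Go',           # tagged 'go', 'Go' in title
    2: 'Goroutine deadlock',          # tagged 'go', no whole-word 'go' in title
    4: 'Where did my goroutine go?',  # not tagged, but 'go' in title
}

# Titles whose question is tagged 'go' (analog of all_go_sentences)
all_go_sentences = [t for i, t in titles.items() if i in go_ids]

# 'go' in the title but no 'go' tag (analog of the non_detectable pool)
non_detectable_pool = [t for i, t in titles.items()
                       if i not in go_ids and 'go' in t.lower()]

print(all_go_sentences)     # ['Concurrency in Go', 'Goroutine deadlock']
print(non_detectable_pool)  # ['Where did my goroutine go?']
```

Title 2 shows why `all_go_sentences` can exceed `detectable`: the question is tagged 'go', but no token in the title reads 'go' or 'golang', so the rule never fires.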
# en_core_web_sm proved optimal after logging precision, recall, and accuracy across models on the prepared data.
model_name = 'en_core_web_sm'
model = spacy.load(model_name, disable=['ner'])
method = 'not-verb-but-pobj'
correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct + wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
f"{precision},{recall},{accuracy},{model_name},{method}" # custom-log
print(precision, recall, accuracy)
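To sanity-check the three formulas above, here they are with hypothetical counts (not the project's real numbers):

```python
# Hypothetical counts, only to illustrate the metric formulas.
detectable_n, non_detectable_n = 100, 50
correct, wrong = 100, 12  # rule hits in each group

precision = correct / (correct + wrong)                     # 100/112
recall = correct / detectable_n                             # 100/100
accuracy = (correct + non_detectable_n - wrong) / (detectable_n + non_detectable_n)

print(round(precision, 3), recall, round(accuracy, 3))  # 0.893 1.0 0.92
```

As in the project's results, every detectable title being caught drives recall to 1.00, while false positives from the non-detectable pool pull precision and accuracy below it.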