Golang Detection from Stack Overflow Questions
Building an NLP pipeline that provides semantic value for an English homonym: the word 'go'.
- Project At A Glance
- Initialization
- Dataset Features
- Primitive Function for 'go'
- spaCy Injection
- Text Relationships and Dependencies
- Data Manipulation
- Splitting and Benchmarking
- Using our model on the detectable class
- Metrics
Project At A Glance
Objective
: Off the top of my head, the word 'go' can be a verb, a noun, or just a substring of a longer word. The goal here is to build an intelligent solution that analyzes textual relationships and locates every instance of 'golang' in programming questions.
Data
: StackSample: 10% of Stack Overflow Q&A. [Download]
Implementation
: spaCy's en_core_web Model, Part of Speech, Sentence Dependencies, Rule-Based Matching, Tagging
Results
:
-
The small (_sm) model performs best under the following conditions:
[i] Question Tag == 'go'
[ii] Part of Speech != 'VERB'.
-
In the above case, the model is able to collect all instances of golang.
- Therefore, Recall = (1.00) while Precision = Accuracy = (0.891)
- The food for thought from this project led to LangWhich: a much more concise CLI implementation that works for all programming languages. [View LangWhich]
Deployment
: View this project on GitHub.
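Before walking through the notebook, the winning rule from the results above can be sketched as a plain predicate. The function name and signature are my own for illustration; the tag set and POS label are assumed to come from the Tags dataset and spaCy, respectively:

```python
def is_golang(token_text, token_pos, question_tags):
    """Best rule found: the token reads 'go'/'golang', the question
    carries the 'go' tag, and spaCy does not parse the token as a verb."""
    return (
        token_text.lower() in ('go', 'golang')
        and 'go' in question_tags
        and token_pos != 'VERB'
    )

print(is_golang('Go', 'PROPN', {'go', 'concurrency'}))  # True
print(is_golang('go', 'VERB', {'go'}))                  # False: verb usage
```

The rest of the notebook builds up to exactly this combination of tag filtering and part-of-speech filtering.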
import pandas as pd
df = (pd.read_csv('Questions.csv', nrows=1_000_000, usecols=['Title', 'Id'], encoding='ISO-8859-1'))
titles = df['Title'].tolist()
df.info()
import random
random.choices(titles, k=20)
def has_golang(text):
    return 'go' in text # basic string-matching ~ unsatisfactory output
g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]
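The substring check above is unsatisfactory because it also fires on words that merely contain 'go'. A minimal sketch of a stricter baseline (not part of the original pipeline) uses a word-boundary regex; the example titles are made up:

```python
import re

# Word-boundary match: only whole-word 'go'/'golang', case-insensitive.
GO_PATTERN = re.compile(r'\b(go|golang)\b', re.IGNORECASE)

def has_golang_word(text):
    """True only when 'go' or 'golang' appears as a whole word."""
    return bool(GO_PATTERN.search(text))

print(has_golang_word('How do I use goroutines in Go?'))  # True
print(has_golang_word('Django template inheritance'))     # False
print('go' in 'Django template inheritance'.lower())      # True: naive check over-matches
```

Even this cannot separate the verb 'go' from the language Go, which is why the next step brings in spaCy's part-of-speech tagging.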
import spacy
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
[t for t in nlp('Go is both a verb and a programming language.')]
doc = nlp('Go is both a verb and a programming language.')
t = doc[0]
type(t)
from spacy import displacy
displacy.render(doc) # token-relationships
spacy.explain('det')
for t in doc:
    print(t, t.pos_, t.dep_) # pos = part of speech, dep = dependency
df = (pd.read_csv('Questions.csv', nrows=2_000_000, usecols=['Title', 'Id'], encoding='ISO-8859-1'))
titles = df.loc[lambda d: d['Title'].str.lower().str.contains('go')]['Title'].tolist()
random.choices(titles, k=7)
nlp = spacy.load('en_core_web_sm', disable=['ner'])
%%time
def has_golang(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ == 'NOUN':
                return True
    return False
# Collecting data that has 'go'/'golang' where pos = Noun
g = (doc for doc in nlp.pipe(titles) if has_golang(doc)) # nlp.pipe() streams the texts in batches for speed, yielding Doc objects
[next(g) for i in range(15)]
df_tags = pd.read_csv('Tags.csv')
go_ids = df_tags.loc[lambda d: d['Tag'] == 'go']['Id']
# Collecting data from the Tags dataset with ID = 'go'
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != 'VERB':
                return True
    return False
# Collecting data with 'go'/'golang' where pos != VERB
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]
non_detectable = (df
.loc[lambda d: ~d['Id'].isin(go_ids)]
.loc[lambda d: d['Title'].str.lower().str.contains('go')]
['Title']
.tolist())
non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]
len(all_go_sentences), len(detectable), len(non_detectable)
# all_go_sentences = question is tagged 'go'
# detectable = tagged 'go' AND the rule fires on the title
# non_detectable = 'go' in the title but the question lacks the 'go' tag
# all_go_sentences - detectable = tagged 'go' but the rule misses the title
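The relationships between these three groups can be illustrated with toy data (all ids and titles below are made up, not drawn from the dataset):

```python
go_ids = {1, 2, 3}  # questions carrying the 'go' tag
titles = {
    1: 'Concurrency in Go',           # tagged 'go', 'Go' in title
    2: 'Goroutine deadlock',          # tagged 'go', no whole-word 'go' in title
    4: 'Where did my goroutine go?',  # not tagged, but 'go' in title
}

# Titles whose question is tagged 'go' (analog of all_go_sentences)
all_go_sentences = [t for i, t in titles.items() if i in go_ids]

# 'go' in the title but no 'go' tag (analog of the non_detectable pool)
non_detectable_pool = [t for i, t in titles.items()
                       if i not in go_ids and 'go' in t.lower()]

print(all_go_sentences)     # ['Concurrency in Go', 'Goroutine deadlock']
print(non_detectable_pool)  # ['Where did my goroutine go?']
```

Title 2 shows why `all_go_sentences` can exceed `detectable`: the question is tagged 'go', but no token in the title reads 'go' or 'golang', so the rule never fires.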
# en_core_web_sm proved optimal after logging precision, recall, and accuracy across models on the prepared data.
model_name = 'en_core_web_sm'
model = spacy.load(model_name, disable=['ner'])
method = 'not-verb-but-pobj'
correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct + wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))
f"{precision},{recall},{accuracy},{model_name},{method}" # custom-log
print(precision, recall, accuracy)
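To sanity-check the three formulas above, here they are with hypothetical counts (not the project's real numbers):

```python
# Hypothetical counts, only to illustrate the metric formulas.
detectable_n, non_detectable_n = 100, 50
correct, wrong = 100, 12  # rule hits in each group

precision = correct / (correct + wrong)                     # 100/112
recall = correct / detectable_n                             # 100/100
accuracy = (correct + non_detectable_n - wrong) / (detectable_n + non_detectable_n)

print(round(precision, 3), recall, round(accuracy, 3))  # 0.893 1.0 0.92
```

As in the project's results, every detectable title being caught drives recall to 1.00, while false positives from the non-detectable pool pull precision and accuracy below it.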