22 April 2018
People understand text from writings and conversations, but this can be difficult for computers.
This explains how the technology of Natural Language Processing (NLP) works by building upon what you already know.
It assumes only that you are able to read an English language document.
This gets progressively deeper into useful methods (and ultimately deploying into production, for those so inclined). Go as deep as you like. Annotated references at the end are intended as a jumping-off point for learning more, doing more.
Following the overview: 1) General Concepts, 2) NLP in Theory, 3) NLP in Practice and 4) Annotated References.
TL;DR: Perusing just the Annotated References will help get you from zero to full deployment in production, yet reading the whole document will help you get the most out of that short list of NLP resources.
Natural Language Processing (NLP) lets computers facilitate communications among people by examining text such as what gets extracted from documents, web pages, blog posts and social media.
Rudimentary techniques allow the machine to mimic a primary school student performing sentence diagramming that identifies Parts of Speech (PoS) and sentence structure.
Concepts are introduced along the way, and jargon is identified by use of italics as in the preceding sentence.
No single document or book could possibly cover everything, so please understand that this is merely scratching the surface. The intent here is that you get through to the other side and be conversant in these concepts, perhaps for hiring the right person or going deeper yourself.
The field is vast and is the sole pursuit of Masters or PhD programs within university computer science departments. However, there is much that you can accomplish with just a little guidance, too!
This document ultimately bridges the huge gap between open jobs and potential talent. Businesses that lack the financial resources of major tech companies must resort to educating themselves along the way.
Hopefully, this helps you avoid getting lost in the wilderness.
Potential uses of Natural Language Processing include:
There are many more use cases, far too numerous to list here.
This is to get you acquainted with key concepts and specific uses of words, possibly with definitions that differ from those you already know.
A typical adult having only lived in their native country has likely forgotten the means by which they learned language.
Immigrants and travelers are slightly more aware of this process, as are parents of a child with learning disabilities.
Common expectations are that the learning process is natural and progressively building forward.
However, computers lack the cognitive machinery of the human brain, not yet having an equivalent of evolving over billions of years.
Software developers must start with first principles, and getting there involves a discovery process.
The field of this discovery is Computational Linguistics, whose practitioners have built models for understanding those principles since before modern computing devices.
These models involve many facets. Terminology used will include algorithms, heuristics, statistics plus more familiar concepts such as synonyms and word similarity.
Those terms and more will be introduced gradually and gently.
People understand synonyms in conversation.
This is like that.
While the statement above is an analogy due to use of the word like, the concept conveyed here expresses a synonym relationship.
A more specific example:
A chair is a seat.
While this statement could be understood as simple metaphor, it conveys a synonym relationship between chair and seat.
We understand this relationship because a person can sit on either, whichever one their cat isn’t already occupying. (Bear with me on the cat thing…)
A chair is like a table.
This one is a different type of synonym relationship.
Both a physical chair and table are of a similar structure, as each has legs and some kind of platform. For the chair, the platform is the horizontal part where your body is in direct contact while seated. For the table, the platform is also a horizontal component and is where a cat is likely walking across because she’s supervising whatever it is you’re considering doing next.
A car is like a chair.
Yet a different kind of synonym, a car and a chair have a particular type of relationship. A car contains a seat, and as we’ve already established earlier, chair is synonymous with seat.
This time, it’s a dog in the passenger seat because he loves being your copilot. (Again, there's a point coming with regard to these animal references…)
Each of the types of synonyms given above is described in [Witzig03], and the basic identification of each is as follows. Technical terms below indicate how comprehensive the field of Linguistics can be– which is separate from but related to Computational Linguistics.
There are three types of hypernym plus three more for the reflexive (or inverse) relationship, meronym.
There are nearly 20 relationship classifications used by [WordNet].
Strictly speaking, however, WordNet offers synsets which are synonym relationships.
Of those relationships in the field of Linguistics and in WordNet’s implementation, some are hierarchical and specify direction.
Others are lateral.
More on the cat and dog in the next section.
Seemingly irrelevant references to a cat and dog in the preceding section may have confused (or annoyed) you mildly, but they served to illustrate a point.
Much of what gives human dialog and prose little flourishes here and there– some might even say making language beautiful– becomes challenging for Natural Language Processing.
Among these challenges are synonyms.
One tool for computers identifying synonyms already mentioned is WordNet. It contains on the order of 100,000 words and just under 500,000 synonym relationships (synsets) of different types.
While that may seem like a lot and may be sufficient for some uses, a robust system would need on the order of millions for coping with what might be found on the Internet today for a single natural language.
For computers to have better accuracy, it’s a numbers game. The greater the number, the better everything works here.
Perspective from Director of Research at Google since their early days (circa 2001) of doing primarily Internet search:
More data beats better algorithms.
See [Halevy09] for full context of Peter Norvig’s statement.
This idea indicates that with enough volume of data, simple correlations can produce results rich enough to be meaningful for the person making a query.
However, that point of view was also from an earlier era, a decade ago.
Today, people are increasingly dissatisfied with Internet search results.
Results from the biggest Search engines seem to pale in comparison to perceived quality from earlier years when there was obviously less on the Internet.
We have crossed a threshold between the straight-forward approach of yesteryear and people’s expectations of meaningful results compounded by other parties attempting to game the system.
(Without going off-topic, a huge factor in the perceived loss of quality in search results is the moving target that is search engine optimization and the arms race between Internet search engines and those looking to make a quick buck by having you land on their page first and thus see their ads.)
Perhaps more importantly, people’s expectations have increased as we are now in the third decade of the World Wide Web.
Ultimately, the problem of synonyms for machines stems from lack of context.
When you enter a search query, there is little or no context beyond the words that you have entered.
Some context may be derived from the few previous search queries you just made in the last ten seconds (due to poor quality results of those searches as measured by clicking back quickly). However, that is one possible heuristic that the machine can use and is not always reliable.
For instance, the fact that you declined to click through to any of the results from your previous search query does not strictly mean that those were bad results. Perhaps you attained your goal for that search by simply reading an excerpt of a web page within the results.
People do that all the time to quickly confirm spelling of, say, dessert versus desert (one S or two?) when their device is handy.
Back to synonyms.
Searching for dessert setting could mean setting or dressing a dinner table for dessert or possibly part of a procedure for assembling the structural foundation of a fancy cake.
Or was it a typo or stuck key while typing a query for the barren land referenced in a piece of classic literature?
Without context, it’s difficult knowing what kind of synonyms to apply.
When speaking to another person, we generally use whole sentences.
Should the other person not understand, we have additional cues such as their confused facial expression.
Failing that, the other can always ask for clarity.
Commodity machines, however, have been without that additional exchange until the relatively recent introduction of intelligent personal assistants.
Instead, computer-based algorithms are matching patterns, and the humans who programmed the computer apply heuristics to decide which algorithms to use and when.
This combination of algorithms and heuristics will be addressed in depth later but may be understood as how you operate a car or bicycle versus why you might decide to use one or the other of those two vehicles.
Closely related to synonyms is the notion of word similarity.
Consider the word, cancer.
It could be the Latin term for the sideways walking sea creature that some consider a seafood delicacy: “crab” in English.
In astronomy, it’s a constellation of stars. To astrologers, this is one of the twelve Signs. In health, it’s an unwanted effect of abnormal cellular growth, where it may be said to have gone “sideways” like the walk of the sea creature and is from where the medical term originated.
Used as analogy or metaphor, there are generally negative connotations (which also gets into semantic analysis, and that’s coming up soon).
When other words surrounding this one are examined, we can use that context to determine additional words that might be similar.
For instance, without understanding the medical meaning of “cancer” we can deduce that “melanoma” is somehow related.
This deduction process is very important.
Children learn this way.
Adults continue to learn this way.
Machines do too as described in the next section.
One packaged tool used for identifying word similarity is [Gensim].
The process for learning new relationships begins with a relatively small vocabulary. It then expands the vocabulary based upon identifying context of words we know and matching it to words that have never before been encountered.
This technique is called modeling.
There are various levels to it, but consider the human-based approach.
If the exact same words that surround a familiar word are also found around a new word, we can safely conclude that these are related.
Applying mathematics of a model, the degree of accuracy for this conclusion is a statistical weight.
Weighting is generally expressed as a number between 0.0 and 1.0, and zero represents completely unrelated while 1.0 would be used for 100% exact match.
It’s rare to encounter either 0 or 100%.
Rather than recording a 0.0 value, conventional practice would simply omit the entry from ever being recorded.
Likewise, only if you have literally encountered the same word would you deem it to be a 100% match. Even then, we would want to play it safe and account for the possibility of a dual meaning. Again, in practice you are extremely unlikely to see 1.0 (100%).
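As a rough illustration of the human-based approach described above, the sketch below gathers the words that surround two terms and scores how much those surroundings overlap, giving a number between 0.0 and 1.0. It uses plain Jaccard overlap as a simple stand-in; this is not necessarily what any packaged tool computes. The sample sentences, the window size and the function names `context_words` and `context_similarity` are all invented for illustration.

```python
def context_words(sentences, target, window=2):
    """Collect the words found within `window` positions of `target`."""
    found = set()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == target:
                found.update(words[max(0, i - window):i])   # words before
                found.update(words[i + 1:i + 1 + window])   # words after
    return found


def context_similarity(sentences, known, new, window=2):
    """Jaccard overlap of the two context sets: 0.0 (nothing shared) to 1.0."""
    a = context_words(sentences, known, window)
    b = context_words(sentences, new, window)
    if not a or not b:
        return 0.0  # at least one term was never seen, so nothing to compare
    return len(a & b) / len(a | b)


# Made-up sample text, purely for illustration.
SENTENCES = [
    "the doctor treated the cancer early",
    "the doctor treated the melanoma quickly",
    "the cat sat on the table",
]

related = context_similarity(SENTENCES, "cancer", "melanoma")
unrelated = context_similarity(SENTENCES, "cancer", "table")
```

With this toy data the familiar word and the new word share most of their surroundings, so `related` comes out higher than `unrelated`. Common words such as "the" inflate every score, which is one reason real systems do considerably more than this.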
One important detail for these weights is that they involve a particular ratio. Comparing the number of occurrences for a specific term versus grand total of all terms combined is the most basic measurement for our purposes. In practice, there will be other factors beyond scope here.
Such weights may be considered statistically significant when above a certain threshold. (See proposed new definitions in [Benjamin17] for significance versus merely suggestive that may also apply to your work.)
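The most basic ratio mentioned above, occurrences of one term divided by the grand total of all terms, can be sketched in a few lines. The function name `term_weights` and the sample sentence are assumptions made for illustration, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter


def term_weights(text):
    """Map each term to (its number of occurrences) / (total number of terms)."""
    terms = text.lower().split()
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


# "the" appears 2 times among 6 terms; every other term appears once.
weights = term_weights("the cat sat on the mat")
```

Because each weight is a share of the same total, the weights for one text always add up to 1.0.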
While discussing weights, another type of value is relevant.
Synonym hierarchies and word similarities mentioned in prior sections are associated with distance.
An example using a cat is: tabby -> feline -> mammal.
Counting each term along the path, including both ends, the distance between tabby and mammal is 3, but between tabby and feline is only 2. (In practice, however, calculations are more involved.)
When comparing distance between two synonyms, it may be advantageous to prune results and only include those within a range or of the least distance.
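A minimal sketch of distance and pruning over a toy hierarchy follows. It counts every term on the path, including both ends, so that the numbers agree with the tabby example above; many tools count the links between terms instead, which gives one less. The `PARENT` table and the function names are hypothetical and not taken from WordNet or any other library.

```python
# Toy hierarchy: each term points to its more general parent term.
PARENT = {
    "tabby": "feline",
    "lion": "feline",
    "feline": "mammal",
    "canine": "mammal",
}


def path_to_root(term):
    """Return the term followed by each more general term above it."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance(specific, general):
    """Count the terms from `specific` up to `general`, inclusive.

    Returns None when `general` is not found above `specific`.
    """
    path = path_to_root(specific)
    if general not in path:
        return None
    return path.index(general) + 1


def prune(term, max_distance):
    """Keep only the more general terms within `max_distance` of `term`."""
    return [t for t in path_to_root(term)[1:] if distance(term, t) <= max_distance]


d_mammal = distance("tabby", "mammal")
d_feline = distance("tabby", "feline")
nearby = prune("tabby", 2)
```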
Along the lines of synonyms and word similarity– yet distinctly different– is the concept of semantic relatedness.
For a given word or phrase, find others that have been used in a similar context.
Demonstrations might be better here than descriptions, especially since a tool is publicly available.
Note that the demo dataset used a single year of Reddit as their source material, so beware and/or have fun!
Searching for “natural language processing” gives the following as its top results:
Trying “silicon valley” yields:
Clicking on “Vancouver” from full results of the previous query, gives:
The percentage value following each term within all of those lists indicates a relative weight.
These weights may be used for indicating significance such that you can prune results by only examining matched terms above a useful threshold. (It’s similar to what was mentioned in the section on Modeling, Weights & Distance, above.)
Use of analogy is problematic for computer systems, but at least there’s a clue through use of specific words. The words like and as indicate a proper simile.
Metaphor is problematic due to liberty of expressiveness granted through poetic license.
Such license permits a parent to discuss delicate matters with another adult without disturbing a young child by simply substituting another term or perhaps not naming the subject directly.
For those concerned with certain unnamed organizations collecting their every word, metaphor makes for effective means of discretion between two parties.
Read that last paragraph again.
Note that it omits naming its subject directly.
It also is free from naming any specific person or organization, yet unless you’ve been living under a rock since one year into the 21st Century– or roughly since September of that year– you know exactly to whom it refers.
There it is again.
Without giving a specific date, it cites a major event in modern history that had become a turning-point for so much of the Western world.
While reading this, you are 99% certain of its meaning.
However, it raises the level of complexity beyond the threshold of what is practical or even plausible to pursue for computer-based algorithms for the foreseeable future.
The general category abstracting both analogy and metaphor is a trope.
As example of a trope, the title for this section references a famous line from one of the first color movies, The Wizard of Oz (1939). The original line was, “Lions and tigers and bears! Oh, my!”
While some people may have missed that classic reference, others may have been inclined to read too much into its use here. (It was selected due to the broad, world-wide reach of the film and its accessibility to a non-native English-reading audience.) Both of those points highlight potential problems with tropes for people– let alone machine algorithms.
While very smart people are exploring ways to handle various kinds of tropes, they aren’t quite there for practical purposes yet.
Jokes are another source of trouble for machines.
What’s the difference between theory and practice?
In theory, it’s all the same.
–Unknown (one of many variations)
A related field addressing all these is Natural Language Understanding (NLU), beyond scope here other than brief points below.
The reason that people can understand the preceding section is due to semantics.
As demonstrated by reading this far into the document, you have sufficient grasp of the English language to handle its grammar and extract meaning from the words used here.
In linguistics, grammar largely falls under syntax, and meanings of words fall under semantics.
The title of a classic book on grammar is Eats, Shoots & Leaves [Truss03], with wonderful cover art of one panda bear on a ladder painting-out the comma and another panda strolling by, holding a pistol.
Their cover gives all the context you need to appreciate the significance of that comma.
Thus, syntax is important.
We’ve already established the significance of semantics in the previous section by using just the right combination of words to invoke a precise meaning. That’s the essence of semantics.
For more on both, see another classic rooted in ancient Greek educational traditions, The Trivium [Joseph37].
When a sentence is being parsed, grammar makes it more efficient to read.
Early use of punctuation gave an indication of when a speaker should take a short breath (comma) or long one (period, full-stop). There’s far more to grammar than just those two symbols that assist with conveying meaning, such as quotes.
Algorithms involved with parsing refer to tokens as the combined collection of all symbols of punctuation plus all words found while traversing a sentence or phrase.
So then, a token may be a word or an element of punctuation.
That distinction becomes important when learning, using or implementing the different algorithms, below.
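To make the idea of tokens concrete, here is a small sketch that splits text into words and individual punctuation marks using a regular expression. The pattern and the name `tokenize` are illustrative assumptions; real NLP tools apply far more elaborate rules.

```python
import re

# A word (optionally with an internal apostrophe, as in "don't"),
# or else any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Return words and punctuation marks as separate tokens, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Eats, shoots & leaves.")
# tokens == ["Eats", ",", "shoots", "&", "leaves", "."]
```

Each comma, ampersand and period comes out as its own token alongside the words, which is the distinction the later algorithms rely on.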
Sentiment is different from semantics but related.
Sentiment is often bucketed using broad labels of positive, negative and neutral with different degrees or weights of each.
A single text or statement might contain mixed sentiment:
I really enjoyed eating at this restaurant but am not sure it was worth the wait.
This type of analysis was explored in [NetflixPrize06] plus the body of research and exploration that followed the competition.
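The mixed-sentiment example above can be sketched with a deliberately crude, lexicon-based scorer. Everything here is a toy assumption: the word weights in `LEXICON`, the negation rule, and the idea of splitting the statement at "but". Production systems generally rely on trained models rather than hand-written rules like these.

```python
import re

# Hypothetical word weights, invented for this example only.
LEXICON = {"enjoyed": 0.8, "worth": 0.5, "terrible": -0.9}
NEGATORS = {"not", "never", "no"}


def clause_score(clause):
    """Sum word weights; once a negator appears, flip the sign of what follows."""
    score, sign = 0.0, 1.0
    for word in re.findall(r"[a-z']+", clause.lower()):
        if word in NEGATORS:
            sign = -1.0
        else:
            score += sign * LEXICON.get(word, 0.0)
    return score


def label(score):
    """Bucket a numeric score into one of the three broad labels."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def mixed_sentiment(text):
    """Score each side of 'but' separately so mixed sentiment stays visible."""
    clauses = re.split(r"\bbut\b", text, flags=re.IGNORECASE)
    return [(clause.strip(), label(clause_score(clause))) for clause in clauses]


TEXT = ("I really enjoyed eating at this restaurant "
        "but am not sure it was worth the wait.")
labels = [lbl for _, lbl in mixed_sentiment(TEXT)]
```

Scoring the two halves separately yields one positive and one negative label, instead of averaging them into a misleading "neutral".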
There are trade shows and conferences; see References at end of this document for more information on the topic.
While Part 1 covered key concepts and introduced a few terms and definitions, Part 2 goes deeper.
This is for those who have a specific NLP task to solve such as sentiment analysis or enriching search results through the use of synonyms.
The following few sections will help as you begin to evaluate different NLP tools, libraries, frameworks and whole systems.
Basic parts of speech (PoS) include nouns, verbs, adjectives, adverbs, etc.
This is useful information for applying word similarity or synonyms when additional context of surrounding words may be ambiguous, especially due to some writing styles.
For instance, is “MARK” a verb, noun or proper noun of a person’s name?
When parsing tokens from text, we can get PoS tags associated with each word and with each bit of punctuation.
Some tools will identify each punctuation symbol with its own identity, while others may group everything simply as the canonical, “PUNCTUATION”, with perhaps only sentence-ending as distinct “FULL-STOP”.
Associating each token with its PoS tag may be considered metadata.
There is a bit of healthy debate among practitioners regarding what a regular form for words or sentences might be.
Some NLP practitioners believe that using the root word or stem being free of plurals, free of possessive forms, etc., is most appropriate for this.
In this camp, the practical approach to this is called stemming.
For example, reading would become read, and running would be run.
Potential problems arise when verb tenses conflict.
Because of this, a competing camp prefers to use the lemma form.
The lemma of to be is be.
The lemma of are is also be.
Using the earlier example, both running and ran would be run.
This is useful for comparing verb phrases regardless of tense used in text or query.
The action of converting a word to its lemma form is called lemmatization.
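The contrast between the two camps can be sketched as follows. The stemmer is a toy that strips a couple of suffixes by rule, and the lemmatizer is just a small lookup table; both are assumptions made for illustration and are far simpler than the stemmers and lemmatizers that ship with NLP libraries.

```python
def crude_stem(word):
    """Strip a few common suffixes by rule; a toy, not a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, so "runn" becomes "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lookup table; a real lemmatizer draws on a full dictionary.
LEMMAS = {
    "are": "be", "is": "be", "was": "be",
    "ran": "run", "running": "run", "reading": "read",
}


def lemma(word):
    """Look up the dictionary form, falling back to the word itself."""
    return LEMMAS.get(word, word)


stems = [crude_stem(w) for w in ("reading", "running", "ran")]
lemmas = [lemma(w) for w in ("reading", "running", "ran")]
```

Rule-based stemming leaves "ran" untouched, so it no longer matches "run", whereas the lemma lookup maps both "running" and "ran" to "run".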
You will find some NLP researchers and NLP software developers have very strong opinions on this subject.
Concepts of algorithms and heuristics were introduced in Part 1 using metaphor involving a car and bicycle.
Going a bit deeper, an algorithm may be understood as a conceptual machine.
Like a traffic light at an intersection, it follows a pattern. Some intersections are simple and work with basic timing to alternate the flow of traffic.
Others are very complex, involving triggers for less used turning lanes and buttons for pedestrian crossing signals.
Deciding which type of signaling system to deploy for each particular intersection would be an example of a heuristic.
For the next couple of sections, each touches upon concepts for which there are many algorithms and heuristics.
A parser is a type of algorithm.
For purposes here, imagine a primary school student diagramming a sentence. The rules exercised by that student represent an algorithm for a parser.
There are many varieties of parsers. (See References.)
With one noteworthy exception, parsers are considered to have terrible performance, and common advice has been to avoid them in production.
That exception is [spaCy], which is built for high performance. (This is due to being written in Cython which tastes very much like Python but compiles to object code like C with all of the speed advantages.)
As an alternative to parsing, machine learning systems have provided statistical models that are very fast for robust production use. (These get their own section, below.)
In NLP, sentence boundary detection (SBD) is the formal name for finding sentence endings.
This is a notoriously difficult problem in computer science for languages such as English.
Pattern-matching such as using Regular Expressions fails in a big way here, unless you are intimately familiar with the corpus and are unlikely to ever use the same pattern on any other.
Despite intricate syntax of a Regular Expression pattern, it’s insufficient for this particular problem because you forfeit granular control over state within the pattern matcher.
Naïvely, one might consider identifying periods (full-stop) followed by two or more spaces. However, formal use of two space characters following period after sentence ending has been steadily declining as more people post to social media and other very informal uses of writing.
So that won’t work.
You would need to track significant amounts of state information that becomes cumbersome for these patterns.
Consider a sentence ending with the pronoun for self, “I”. How might you distinguish such a sentence ending with that word from its use as an initial? Then anticipate the first word of the next sentence possibly being “Mark” or “Pat” which could be a verb, regular noun or proper noun. There are too many scenarios to mention here.
Therefore, that category is problematic as well.
Likewise for use of “etc.” for et cetera, which may or may not end a sentence.
That segues to the problem of abbreviations at large.
While you may think that an exhaustive list will do, how deep do you go? Are you aware of multiple abbreviations for avenue as “ave.” and as suggested by some local post offices, “av.”? Have you considered abbreviations for military rank? Other titles of respect such as senator, representative, reverend? One title for a woman, “Ms.”, is less used today but may still be encountered. And so on.
What about use of ellipses (three dots) in the middle of a sentence? These may be spaced, maybe not when encountered in actual use.
How about a quoted statement with multiple sentences? This also gets into stylistic territory.
Potential issues go on and on.
Rather than produce an exhaustive list here, “Good luck with that!” is all that will be said on the matter.
If attempting to parse sentence endings yourself, see published examples of challenging sentences such as from Grammarly.
Instead, consider using [spaCy] which discovers end of sentences while performing other operations such as Parts of Speech (PoS) tagging.
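To see how quickly the naive approach described in this section breaks down, consider this sketch of a splitter that treats any period, exclamation mark or question mark followed by whitespace as a sentence ending. The function name and the sample text are invented for illustration.

```python
import re


def naive_sentences(text):
    """Split wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


TEXT = "I met Dr. Smith on Elm Ave. yesterday. He waved."
pieces = naive_sentences(TEXT)
# The text holds two sentences, yet the abbreviations "Dr." and "Ave."
# each trigger a false split, producing four pieces.
```

Patching this with a list of known abbreviations only moves the problem, for the reasons given above.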
A canonical form simply means the “proper” format based upon what might be appropriate for the task at hand.
When a software pipeline is involved, consider making all conversions at once in the beginning.
(This word is more commonly encountered in terms of a canon such as The Bible being the principal text for Christians or a “story bible” for staff script-writers of an on-going television or movie series like the Star Trek franchise.)
Across various documents, there may be multiple symbols indicating similar punctuation.
For instance, there are many marks within Unicode that could indicate quotation, the least of which would be the double-quote (0x22) for both opening and closing quotes.
Some Parts of Speech tagging systems while producing the canonical form of a given text, may split words in seemingly strange ways.
For instance, you may find the contraction, can’t, becomes two tokens: “CAN”, “N’T”.
That representation might be the canonical form for one NLP system.
The term, canonical form, is not widely used in NLP literature but is important for Part 3.
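As one small example of settling on a canonical form for punctuation, the sketch below maps several Unicode quotation marks onto the plain double-quote (0x22). The particular set of marks in `QUOTE_MAP` is an assumption; which marks belong there depends on the documents you expect to see.

```python
# Map a handful of Unicode quotation marks to the plain double-quote (0x22).
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u00ab": '"',  # left-pointing double angle quotation mark
    "\u00bb": '"',  # right-pointing double angle quotation mark
})


def canonical_quotes(text):
    """Replace every mapped quotation mark with the plain double-quote."""
    return text.translate(QUOTE_MAP)


cleaned = canonical_quotes("\u201cLet it crash!\u201d")
```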
When constructing or populating a model, the “sample” of documents used is called a corpus (or when plural, corpora).
It’s intended to be a representative subset of the documents that you are expecting to see in the wild.
There are well-known corpora such as the “Brown corpus” or “Penn Treebank / Wall Street Journal” corpus. Various examples are available with [NLTK].
Best ones available usually require a fee on the order of USD $10k or $100k, and terms of the contract compel keeping it private and protected. Much effort and expense goes into preparing these, so such licensing terms help cover their costs.
For instance, Penn Treebank is considered a gold-standard using segmentation and tokenization of the US newspaper, The Wall Street Journal.
That is, existing articles were later annotated manually with parts of speech (PoS) tags.
That exercise permitted models to be trained.
Not all collections of such text are considered high quality or meet the level of gold standard.
Also, be mindful that just because a particular corpus is labeled “gold standard”, it won’t necessarily be suitable for all purposes.
If you are dealing with, say, mechanical or civil engineering documents as your primary focus, the models that were trained using a mainstream newspaper would be ill-suited for your particular needs.
While there are some niche or industry-specific corpora, the next direction to consider is unsupervised learning.
In lieu of a gold standard corpus, an unsupervised learning approach is likely to use a much bigger volume of data during the training phase.
Its corpus might be 100x or 1000x in size compared to using a gold standard.
This direction comes with its own challenges and should be considered an advanced NLP topic.
For instance, if considering a Wikipedia archival dump due to its massive volume of text for merely your “cost of storage and bandwidth” (plus time and effort!), beware that it is considered “dirty data” due to intermingling of administrative pages along with its primary encyclopedia pages. That one issue would be the least of your adventure there but beyond scope here.
Along the lines of “use the right tool for the right job”, consider:
Use the right corpus to train the right model for the right NLP job.
Using an appropriate corpus (see above) helps with picking the right model for your application.
There are different types, so here’s a quick overview of the landscape:
Whenever you encounter “predicting” with respect to certain topics such as NLP, you can be sure that a statistical model is close at hand.
The basic idea of a statistical model uses numeric weighting as the basis of making decisions.
Those decisions are often considered optimizations such as finding the maximum or minimum value, isolating a target range of values, and so on.
There are potential issues to accommodate such as avoiding local maximums and local minimums, but those are beyond scope here.
Calculations within a model can range from simple arithmetic to very long polynomial equations with dozens of terms.
Often, however, models are populated through the use of other techniques such as machine learning. (See next section.)
Models are commonly used for Parts of Speech tagging, parsing words/tokens and detecting sentence boundaries.
For the intended audience of this document, it may be sufficient to treat models as a black box. Use what has been provided by the various tools, at least as your starting point.
These can come in different sizes, so be aware of this before downloading or automating within your build system. When some of the authors indicate that their models are large, they likely mean in the range of one to two gigabytes (1-2 GiB) per model.
Buzzwords flying around mainstream news media include all of these phrases:
For purposes here, these may be treated as synonymous but each builds upon the next.
Taking that list in reverse order, a neural network (NN) is essentially the foundation. (Formally, it’s called an “artificial neural network” or ANN. Different academic literature uses either the formal or informal name.)
It’s a type of statistical model (see above) that gets populated using a structure inspired by cellular structures within the brain of a mammal. This mimicking of neurons in biology is from where the name is derived.
While that was the original inspiration, practical use expanded when liberated beyond design of a synthetic synapse.
There are a few varieties of how neural nets “learn” and a few variants including whether additional storage beyond internal weights is involved or not. One of the oldest and perhaps most widely studied by university undergraduates since the late 1970s is the backpropagation algorithm.
Other types widely discussed in the literature include recurrent neural networks (RNN).
Google’s TensorFlow system contains a particular variant of RNN within [SyntaxNet].
SyntaxNet– due to being a machine learning approach– facilitates a step beyond NLP called Natural Language Understanding (NLU).
NLU offers additional solutions such as resolving pronouns despite complicated context. Consider the statement:
Mom talked to her sister yesterday.
She said that she’s fine.
A successful NLU approach would identify who is “fine” or at least give a probabilistic weight for each person, “mom” versus “her sister”.
As with so much of this material, there is vastly more to it. This is just getting you acquainted with the terminology and hallmarks of the ideas behind it.
You’ve become acquainted with key concepts and NLP jargon from Part 1.
You’ve grasped a bit of theory in Part 2 that may be taken for granted in white-papers, blogs and websites of NLP practitioners.
Now, you’re ready to begin planning a software development project, building upon those earlier parts in concrete ways.
Practical guidance is given for software development generalists and dev-ops staff using NLP for a specific task.
This is the part that goes beyond “assuming only that you read English”, far beyond.
Essentially, this walks through various stages of a production pipeline that uses NLP effectively.
The most basic idea of how software works:
Something goes in, something comes out, and something is done in the middle.
That is a software pipeline, one which is linear and comprises only three stages.
(Side note: it’s also a directed acyclic graph in terms of Graph Theory, which lends relevant ideas to NLP practitioners, from dev-ops staff to PhD candidates.)
Be mindful of your pipeline while designing your system and its workflow. This will pay dividends later.
Questions to ask about the nature of your work-flow:
While a particular library or tool may be ideal to use today, there could be a more appropriate one tomorrow, and this is a very real prospect. Anticipate these substitutions, because NLP is a fast moving field despite relatively crawling until just a few years ago.
Maybe a Unix shell script is sufficient for the pipeline:
That offers a very flexible pipeline in about 100 lines of Bash script, or go as fancy and full-featured as you’d like. Erlang is an excellent choice of programming language here because this piece is essentially an exercise in Graph Theory principles.
Having built many types of pipelines, some pragmatic guidance:
Let data flow through the pipeline along with each request.
Then, each component only has to communicate with the pipeline (rather than each component having to know about a majority of other subsystems). It’s more than merely being a microservices architecture.
That approach also helps scaling the people involved in your project, because new staff only have to learn the pipeline beyond the one component that they are working on.
It’s also useful for handling confidential data such as personal health information because data only rests (persists) at either end of the pipeline. Data protections within the pipeline itself are reused by all requests, so there’s less testing and less auditing required.
(Numerous additional factors are beyond scope here, such as accommodating multi-tenancy and/or multiple priorities. Be clear on the distinction between events and messages. Beware of data loss within application-layer and OS buffers when a subsystem crashes; hence, simply begin that request again! Anticipate that the system will fail, and “Let it crash!” There’s enough to fill a thick technical book.)
Following the guidance above, your top-level code will look a bit like a weekend errand checklist: first pick up the original document, then get plain-text, identify sentences & PoS tags, enrich with word similarities, persist to primary storage, and finally notify the customer that they may query against it.
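As a minimal sketch of that checklist style, the Python below passes one request object through a list of stages, so that data flows along with the request and each stage only talks to the pipeline. The names Request, run_pipeline and the two placeholder stages are hypothetical, invented for illustration; a real system would substitute its own stages and add error handling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    """One unit of work; its data travels with it through every stage."""
    source: str                                   # e.g. path or URL of the original
    text: str = ""                                # plain-text extracted so far
    metadata: Dict[str, object] = field(default_factory=dict)


Stage = Callable[[Request], Request]


def extract_plain_text(req: Request) -> Request:
    # Placeholder stage: a real one would parse PDF, HTML, etc.
    req.text = str(req.metadata.pop("raw", ""))
    return req


def split_sentences(req: Request) -> Request:
    # Placeholder stage: delegate to a real sentence boundary detector in practice.
    req.metadata["sentences"] = [s.strip() for s in req.text.split(".") if s.strip()]
    return req


def run_pipeline(req: Request, stages: List[Stage]) -> Request:
    """Run stages in order; each stage sees only the request, never its peers."""
    completed: List[str] = []
    for stage in stages:
        req = stage(req)
        completed.append(stage.__name__)
    req.metadata["completed"] = completed
    return req


# The top-level "errand checklist": reorder or swap stages here.
STAGES: List[Stage] = [extract_plain_text, split_sentences]

if __name__ == "__main__":
    raw = "Mom talked to her sister yesterday. She said that she's fine."
    done = run_pipeline(Request(source="example.txt", metadata={"raw": raw}), STAGES)
    print(done.metadata["sentences"])
```

Because the list of stages is plain data, substituting tomorrow's better library means replacing one entry rather than rewiring the components around it.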
Put text to be processed into your own canonical form.
Again, this simply means that you define an internal convention for the structure of a text and its metadata. For instance, maybe you convert all words to lower-case, discard all punctuation, and convert all sentence endings to your own token such as “FULL-STOP” even if it is a question.
Your canonical form is for your own internal use only.
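To make that convention concrete, here is a small sketch of the example just described: lower-case every word, discard punctuation, and replace every sentence ending with one internal token. The function name to_canonical and the regular expression are assumptions for illustration only, and the sentence-ending rule is deliberately naive (an abbreviation or a decimal number would also yield FULL-STOP).

```python
import re
from typing import List

SENTENCE_END = "FULL-STOP"   # internal token; used even when the sentence is a question

# Either a word (letters/digits, optionally joined by apostrophes)
# or a run of sentence-ending marks. Everything else is discarded.
_TOKEN = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*|[.!?]+")


def to_canonical(text: str) -> List[str]:
    """Return lower-cased words with every sentence ending mapped to one token."""
    tokens: List[str] = []
    for match in _TOKEN.finditer(text):
        piece = match.group()
        if piece[0] in ".!?":
            tokens.append(SENTENCE_END)
        else:
            tokens.append(piece.lower())
    return tokens


if __name__ == "__main__":
    print(to_canonical("Is Mom fine? She said so, yesterday."))
```

Whatever convention you choose, keeping it in one function like this gives later stages a single definition to rely on.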
It’s important to note that contemporary tools should accommodate Unicode and likely use UTF-8 encoding during their own internal processing.
Many libraries and frameworks may accommodate this conversion, but be explicit about it. Own the conversion process to eliminate that as a source of potential anomalies later.
At minimum, ensure UTF-8 encoding before feeding text to an NLP system. (Then you have control over exceptions such as unrecognized code-points.)
Also consider enriching each text especially if working with HTML files. (See next section on HTML entities.)
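One hedged way to own the conversion is to decode incoming bytes yourself, trying strict UTF-8 first and recording what happened when that fails. The function name decode_text and the choice of Windows-1252 as the fallback are assumptions for this sketch; pick the fallback that matches your own sources.

```python
from typing import Tuple


def decode_text(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes to text and report which encoding was actually used."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: legacy documents are most likely Windows-1252.
        # Undecodable bytes become U+FFFD rather than aborting the request.
        return raw.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    text, used = decode_text("Çatalhöyük".encode("cp1252"))
    print(used, text)
    utf8_bytes = text.encode("utf-8")   # what gets handed to the NLP system
    print(len(utf8_bytes))
```

Returning the encoding alongside the text lets the pipeline log or flag documents that needed the fallback, instead of discovering anomalies later.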
Extract plain-text from various document types such as PDF, older PostScript, various generations of Microsoft Word, OpenOffice/LibreOffice, XML, SVG, etc.
For those spidering the web to fetch documents, are you aware that very early versions of HTML used the paragraph tag only for separation rather than as XML-style containers? (That is, no leading <P> tag, only in between.)
Being in the third decade of the Web, you must contend with using a full-featured headless browser (e.g., variations of Chrome or Firefox without a graphical user interface) for properly rendering the HTML DOM, because simple bots of the past, based only on cURL, are insufficient for getting core text from an increasing number of websites.
Consider your requirements further down the pipeline.
If considering use of Google’s [SyntaxNet], it requires being fed one sentence per line. That is, one sentence followed by one Newline character.
As mentioned much earlier, sentence boundary detection (SBD) is itself a notoriously difficult problem within NLP.
So then, consider using [spaCy] (or even [NLTK]) to get your source text into that form.
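The sketch below shows only the formatting half of that requirement: given sentences from any detector, emit exactly one per line. The splitter naive_sentences is a deliberately weak stand-in written for this example (it is not spaCy or NLTK code); in practice you would pass in sentences from a proper tool, for instance the text of each item in spaCy's doc.sents.

```python
import re
from typing import Iterable, List

# Naive rule: a sentence ends at . ! or ? when followed by whitespace and
# then an uppercase letter or an opening quote. This is the "80% solution".
_NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'])")


def naive_sentences(text: str) -> List[str]:
    """Stand-in splitter for illustration; replace with a real detector."""
    return [s for s in _NAIVE_BOUNDARY.split(text.strip()) if s]


def one_sentence_per_line(sentences: Iterable[str]) -> str:
    """Collapse whitespace inside each sentence and end each with one newline."""
    lines = (" ".join(s.split()) for s in sentences)
    return "".join(line + "\n" for line in lines if line)


if __name__ == "__main__":
    text = "Mom talked to her sister\nyesterday. She said that she's fine."
    print(one_sentence_per_line(naive_sentences(text)), end="")
```

The naive rule splits "Dr. Smith arrived." into two pieces after "Dr.", which is exactly the kind of case that makes a dedicated library worthwhile.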
Summarizing all of those factors into a hypothetical canonical form, your transformation stack may look like:
At this point, you may have your canonical form. (See Part 2 above.)
Documents and web pages most commonly use HTML, the HyperText Markup Language.
Within that standard, an HTML Entity is an encoded representation of a symbol that may be difficult to enter from a conventional keyboard or perhaps is a character infrequently used.
To display “Çatalhöyük” (name of a UNESCO World Heritage Site in Turkey), this might appear within an HTML page as:
Another example is the HTML Entity for Copyright symbol, “©”, which is coded as:
Some document-generation tools make liberal use of HTML entities for visual effect, such as ensuring their preferred rendering of opening versus closing quotes rather than letting a web browser decide.
It may be worthwhile to perform HTML entity substitutions prior to PoS tagging so that the syntax of coding each entity doesn’t become translated as punctuation.
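A brief sketch of that substitution, using html.unescape from the Python standard library. The tokenizer rough_tokens is a crude stand-in invented for this example, only to show how entity syntax would otherwise leak through as punctuation; the sample string is one possible encoding of the examples above.

```python
import html
import re
from typing import List


def decode_entities(text: str) -> str:
    """Replace named and numeric HTML entities with the characters they encode."""
    return html.unescape(text)


def rough_tokens(text: str) -> List[str]:
    """Crude tokenizer for illustration only: words, or single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    encoded = "&Ccedil;atalh&ouml;y&uuml;k &copy; 2018"
    # Without substitution, '&' and ';' show up as stray punctuation tokens.
    print(rough_tokens(encoded))
    # With substitution, the name is one word and the symbol is one token.
    print(rough_tokens(decode_entities(encoded)))
```

Running this step after plain-text extraction and before tagging keeps the tagger from ever seeing the entity syntax.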
If you’ve read through this far, numerous instances of potential problems described in this document have the same proscription (don’t do it that way) and same prescription (use spaCy.io instead).
They are focused on getting realistic NLP work done in a realistic way.
They focus on performance.
They accommodate just about all of the preliminary research you might want to do such as providing performance benchmarks compared against popular candidates.
From experience, spaCy.io will more than simply facilitate:
All of that is available by using the few lines of code in their spaCy 101 document, and everything you need from the above list is just a few more lines of code also on that page.
Since you are likely curious about the funky capitalization of their name, it’s a reference to Cython (usually pronounced “sigh-thon”) as a nod to the high performance derivative of Python with speed comparable to C.
Because spaCy has been implemented in Cython, they can also release the Python runtime Global Interpreter Lock (GIL) for certain tasks. This gives better throughput for some multi-threaded operations.
Best of all for most users, you can code your app in regular Python. (Yes, both Python 2 and Python 3 are still accommodated as of spaCy version 2.0.x.)
Gensim has many features and use cases on its own, yet only one is explored for our purposes here.
Their word similarity feature
by one of the spaCy leads, particularly for
This is a huge model and may be understood as a sparse matrix.
When an operations person sees the words “huge” and “matrix” in the same sentence, they should think about deploying a dedicated hardware GPU with several gigabytes of RAM.
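To give a feel for what "word similarity over a sparse matrix" means, here is a toy sketch that stores each word's row as a dictionary of its non-zero entries and compares rows by cosine similarity. This is not Gensim's API, and the numbers in TOY_MODEL are made up for illustration; a real model is trained from a large corpus and is vastly bigger.

```python
import math
from typing import Dict

SparseVector = Dict[str, float]   # only non-zero entries are stored


def cosine_similarity(a: SparseVector, b: SparseVector) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical rows of the matrix: word -> weights over context features.
TOY_MODEL: Dict[str, SparseVector] = {
    "mom":    {"family": 3.0, "talk": 1.0},
    "sister": {"family": 2.0, "talk": 1.0},
    "matrix": {"math": 4.0},
}

if __name__ == "__main__":
    print(cosine_similarity(TOY_MODEL["mom"], TOY_MODEL["sister"]))
    print(cosine_similarity(TOY_MODEL["mom"], TOY_MODEL["matrix"]))
```

Words that share context score close to 1, while words with no shared context score 0; scaling this idea to a full vocabulary is what drives the memory requirements mentioned above.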
For the dev-ops perspective:
Running spaCy within Docker or another container (such as plain LXC on Linux) works well, due to being stateless at runtime.
However, if using an optional model for English, its size will require advanced configuration of Docker.
Specifically, their “large” core English model for spaCy v2.0 is over 800 MiB in size. Simply downloading and installing it can be problematic for default builds of Docker due to caps on namespaces pertaining to the file system.
Conversely, the default core English model is small enough that building spaCy within Docker for a developer’s laptop works beautifully (circa Ubuntu 17.04 and 17.10).
Because spaCy is implemented in Cython, it can disable the Python Global Interpreter Lock (GIL) for attaining better concurrency.
Some of their code does this already, but it may be worth pursuing for your own top-level code too, if running spaCy as a network service.
(If new to Cython, much of the top-level code has strong resemblance to regular Python but with performance close to the C language since it compiles to object code.)
For one particular project, Snagz.net, results from combining items within the preceding sections were used for our canonical form.
We kept original tokens as-is and added the lower-case form of each word as metadata.
For named entities and noun phrases, some of these may span multiple words. For tracking this, think in terms of trees (again, Graph Theory).
Each sentence was stored as a vector/array. Each vector element represented one token: a word or a punctuation identifier such as the FULL-STOP keyword.
(When persisted to durable storage, the entire text was stored in the same way but as a single vector of tokens, not grouped by sentences.)
Associated with each token was a vector of primary data, which began with the as-is token (may be a word that is capitalized or the original punctuation symbol) plus a tree of metadata.
Metadata included PoS tag, lemma, any named entities or noun phrases, plus instances of word similarity.
For named entities and noun phrases which may span multiple words, a nested associative-list was used indicating its span; e.g., 2 or 3 as the value may involve the next two or three neighboring words from the original sequence.
The same approach also applied to word similarity, because the span of original tokens may differ in length from the span of tokens that could replace them.
This kept our canonical form for each token plus its metadata as an ordinary tree with structure resembling the Greek character, Lambda (λ).
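A rough Python approximation of that shape follows: each token is a pair of the as-is text and a dictionary acting as the metadata tree. The key names, the tag values and the helper functions are assumptions made for this sketch, not the project's actual code, and the annotations are written by hand rather than produced by a tool. Here "span" is taken to mean the number of following tokens included.

```python
from typing import Any, Dict, List, Tuple

Metadata = Dict[str, Any]
Token = Tuple[str, Metadata]          # (as-is token, tree of metadata)


def make_token(as_is: str, pos: str, lemma: str, **extra: Any) -> Token:
    """Keep the original token untouched; everything derived goes in metadata."""
    metadata: Metadata = {"lower": as_is.lower(), "pos": pos, "lemma": lemma}
    metadata.update(extra)
    return (as_is, metadata)


def covered_text(tokens: List[Token], index: int, key: str) -> str:
    """Join the token at index with the neighbours named by its span."""
    span = tokens[index][1][key]["span"]      # following tokens included
    return " ".join(as_is for as_is, _ in tokens[index:index + 1 + span])


# One sentence as a vector of tokens (hand-annotated, hypothetical values).
SENTENCE: List[Token] = [
    make_token("Mom", "NOUN", "mom"),
    make_token("visited", "VERB", "visit"),
    make_token("New", "PROPN", "New", entity={"label": "GPE", "span": 1}),
    make_token("York", "PROPN", "York"),
    make_token(".", "PUNCT", ".", canonical="FULL-STOP"),
]

if __name__ == "__main__":
    print(covered_text(SENTENCE, 2, "entity"))
```

Storing the span on the first token of a phrase keeps the sentence a flat vector while still letting multi-word items be recovered on demand.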
Be sure to see the original, classic cover art on Wikipedia for Eats, Shoots & Leaves.
[Truss03] Truss, Lynne; Eats, Shoots & Leaves, Profile Books, November 2003.
A classic Greek education would be built on three pillars of the trivium (followed by quadrivium).
Those three are:
While earlier generations, born up to the end of the 19th Century, who had “only a 4th grade education” would have learned the trivium, this version is a university-level text.
[Joseph37] Joseph, Sister Miriam; edited by McGlinn, Marguerite; The Trivium, Paul Dry Books, 1937, 1940, 1948, reissued 2002.
There is much that you can accomplish without Natural Language Processing, per se.
Searching and sorting (including use of index and inverted-index mechanisms) may sometimes be sufficient, so use the right tool for the right job.
See the timeless classic Art of Computer Programming series by Donald Knuth, specifically:
Knuth, Donald E.; Art of Computer Programming, The, Volume 3: Sorting and Searching, 2nd Edition, Addison-Wesley, Reading, MA US; 1973, second edition 1998.
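For a sense of how far plain searching can go, here is a minimal inverted-index sketch: it maps each word to the set of documents containing it, then answers a query by intersecting those sets. The function names and the tiny sample corpus are invented for this example, and the word normalization is intentionally simplistic.

```python
from collections import defaultdict
from typing import Dict, Set

_PUNCTUATION = ".,;:!?\"'()"


def build_inverted_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased word to the ids of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            word = word.strip(_PUNCTUATION)
            if word:
                index[word].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {"a": "Mom talked to her sister.", "b": "Her sister is fine."}
    idx = build_inverted_index(docs)
    print(sorted(search(idx, "her sister")))
```

No linguistic knowledge is involved here at all, which is the point: if this answers your users' questions, an NLP stack may be more than the job requires.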
Even if the programming language Prolog means nothing to you or you think it doesn’t apply, Sarah Witzig’s paper provides an excellent overview of different types of synonym relationships. The various hierarchies, both parent/child relationships and lateral ones, are described.
[Witzig03] Witzig, Sarah; Accessing WordNet from Prolog; Artificial Intelligence Center, The University of Georgia, Athens, GA US; April 2003.
An early library for Computational Linguistics from the 1980s and 90s is WordNet. Their contribution to the field demands respect for the rigor given to this for its time. However, as we approach the third decade of the 21st Century, this library is less useful in practice. Contemporary computational linguists have strong opinions on various topics, and word similarity is often favored over synonyms. However, for learning the field, it’s worth grasping concepts from WordNet, as you’ll find echoes of it throughout the field, such as within [NLTK], below.
Source code and data:
[WordNet] WordNet: A Lexical Database for English; Princeton University, WordNet License, 1995, v3.0, 2006.
[WordNet98] Edited by Fellbaum, Christiane; WordNet: An Electronic Lexical Database; MIT Press, Cambridge, MA; May 1998.
You might also find WordNet::Similarity useful, a module “that implements a variety of semantic similarity and relatedness measures based on information found in the lexical database WordNet”:
A related resource falls within the different category of sentiment analysis, which is barely discussed here. For more, see:
Also, SemLink “to link together different lexical resources via set of mappings” may be of use to you:
The following technical article is sometimes paraphrased as “more data beats better algorithms” and often attributed directly to Peter Norvig. What it means is that given sufficient volume and diversity of data, searching through that collection will yield results that appear meaningful to a person. Of course, there are additional challenges with large datasets such as efficiently navigating it all, so it may be said that Google’s true genius is less about Search or PageRank than BigTable for storing and accessing it all at scale!
[Halevy09] Halevy, Alon; Norvig, Peter; Pereira, Fernando; The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, Vol. 24 Issue 2; 2009, pp 8-12.
Learn the Python programming language while also learning about Natural Language Processing.
This book is a practical introduction to both NLP and Python, meaning it’s very much hands-on.
[NLTK-book] Bird, Steven; Klein, Ewan; Loper, Edward; Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; O'Reilly, 2009; Updated for NLTK 3.0, 2014.
Note that the toolkit itself, while wonderful for learning and experimenting, was never intended for high-performance or large-scale deployment in production.
“Contrasting with these goals are three non-requirements — potentially
useful qualities that we have deliberately avoided. First, while the toolkit
provides a wide range of functions, it is not encyclopedic; it is a toolkit,
not a system, and it will continue to evolve with the field of NLP. Second,
while the toolkit is efficient enough to support meaningful tasks, it is not
highly optimized for runtime performance; such optimizations often involve
more complex algorithms, or implementations in lower-level programming
languages such as C or C++. This would make the software less readable and
more difficult to install. Third, we have tried to avoid clever programming
tricks, since we believe that clear implementations are preferable to
ingenious yet indecipherable ones.”
[NLTK] Natural Language Toolkit; NLTK Project; Apache License 2.0; since 2001.
SyntaxNet is one model bundled with Google’s TensorFlow.
A huge potential pitfall of SyntaxNet is that you must feed it one sentence per line, and detecting sentence boundaries is itself a notoriously hard problem in NLP. While you may write your own and get an 80% or 90% solution to sentence-ending, attaining those last few percentage points requires a non-trivial approach.
Consider using [spaCy] to generate one sentence per line within your pipeline.
[SyntaxNet] SyntaxNet model; TensorFlow: An open-source machine learning framework for everyone; Google LLC, Mountain View, CA US; Apache License 2.0, open source since 2015.
To simply browse or see the relevant README:
spaCy is an NLP system with a robust parser that works at speed competitive with the best models like SyntaxNet. Their website gives details, so facts won’t be repeated here.
If you use only one NLP tool, pick spaCy.io.
Starting your NLP implementation here could spare you a year or two of self-education, research and logistics.
While their people express strong opinions on certain nuances of NLP terrain, you would do well to heed their advice and follow their recommendations. When they advocate using [Gensim] for word similarity, do that rather than trying to make WordNet bend to your will. At least, do so during your early stages.
Then, when you’re generating revenue from your system built on the shoulders of these giants and understand more, and only then, go back and revisit that foundation for your v2.0, but not before.
[spaCy.io] Honnibal, Matthew; Montani, Ines; contributions from open source community; spaCy; Explosion AI, Berlin, Germany, MIT License, since 2015.
Gensim is a topic modeling system that improves usability compared to others.
This is a statistical model represented as a sparse matrix.
Depending upon how large a training dataset you have, that matrix may be populated more fully.
When you see “model” and “matrix” in the same paragraph with NLP things, consider deploying using a hardware GPU device with lots of RAM for best operational performance.
Unless you train it yourself, the conventional model for Gensim word similarity is on the order of 2 GiB in size.
[Gensim] Řehůřek, Radim; Sojka, Petr; “Software Framework for Topic Modelling with Large Corpora”, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta; 22 May 2010, pp.45-50.
This competition was to perform sentiment analysis on text components of movie reviews for recommending what a customer would appreciate watching next.
(Note that the proper term for this in academic literature is Recommender system, and only what it produces would be a “recommendation”.)
The competition spanned 2006 through 2009. A sequel competition for 2010 was canceled due to researchers de-anonymizing the dataset.
[NetflixPrize06] Netflix Prize, Netflix Inc., Los Gatos, CA US; 2006-2009
Speaking of movies, if going down this particular rabbit hole you may also be interested in tropes for catching subtle references used within individual reviews:
It’s called TVTropes but spans far more than just television.
Particularly useful for overcoming confused or conflated tropes: Laconical List of Subtle Trope Distinctions.
If approaching this subject as entry into data science, consider reading:
16 Useful Advices for Aspiring Data Scientists.
Depending upon your specific field, be mindful of proposed new definitions for statistically significant (versus suggestive) as of mid-2017:
[Benjamin17] Benjamin, Daniel J.; Berger, James; Johannesson, Magnus; Nosek, Brian A.; Wagenmakers, Eric-Jan; Berk, Richard; Bollen, Kenneth; et al; “Redefine Statistical Significance”, PsyArXiv; 22 July 2017