NLP

Text Magic in Java with GPT-2

Recently the OpenAI team made news again by releasing a 335-million-parameter pre-trained natural language model. This model, built with Python and TensorFlow, can generate text based on preceding text with such impressive capability that it can be used to translate and answer questions. The team actually has models several times larger, but has not yet released them due to the risk of abuse. Today I have released my small contribution to this awesome project: a deployable TensorFlow model and a Java-based reference implementation which uses only the core (i.e. …
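
To give a rough sense of what the Java side of such a deployment might look like, here is a minimal sketch using the TensorFlow Java bindings to load an exported SavedModel and run one inference step. The model path, tensor names, and prompt token ids are placeholders rather than the actual names used in this project, and a real client would first run the GPT-2 byte-pair tokenizer.

```java
import java.util.Arrays;

import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

public class Gpt2Demo {
  public static void main(String[] args) {
    // The export path and tensor names below are hypothetical placeholders.
    try (SavedModelBundle model = SavedModelBundle.load("/path/to/gpt2-savedmodel", "serve")) {
      // Token ids for a short prompt; a real client would run the BPE tokenizer first.
      long[][] promptTokens = {{464, 2068, 7586}};
      try (Tensor<?> input = Tensor.create(promptTokens);
           Tensor<?> logits = model.session().runner()
               .feed("input_ids", input)   // hypothetical input tensor name
               .fetch("logits")            // hypothetical output tensor name
               .run().get(0)) {
        System.out.println("Logits tensor shape: " + Arrays.toString(logits.shape()));
      }
    }
  }
}
```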

The Many Uses of Word2Vec

One artificial intelligence tool that I’ve been playing with lately is an algorithm called word2vec. The basic idea is that words are assigned positions in a high-dimensional space, and those positions are optimized so that the distance between two words reflects how often they are seen together. The resulting vectors can then be used in a variety of ways, from a simple word similarity search to recurrent neural networks. In this article I will outline some uses of this amazing approach, along with links to sample code and results.
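
As a concrete sketch of the simplest of those uses, the word similarity search, the code below ranks words by cosine similarity over a map of pre-trained vectors. The Map-based storage and the method names are assumptions for illustration and are not tied to any particular word2vec implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordSimilarity {

  /** Cosine similarity between two word vectors of equal length. */
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns the k words whose vectors are closest to the query word's vector. */
  static List<String> mostSimilar(Map<String, float[]> vectors, String query, int k) {
    float[] target = vectors.get(query);
    return vectors.entrySet().stream()
        .filter(e -> !e.getKey().equals(query))
        .sorted(Comparator.comparingDouble(
            (Map.Entry<String, float[]> e) -> -cosine(target, e.getValue())))
        .limit(k)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```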

Text Classification via PPM Compression and Decision Trees

Text classification is a common machine learning task, known in various contexts as sentiment analysis, language detection, and category tagging. Many standard AI tools can be applied to text given an appropriate feature extraction function, which essentially transforms the text into a high-dimensional vector. However, there are also techniques that work directly on the text, and this article is about a couple of those techniques, enabled and demonstrated by the new release of the CharTrie component of the SimiaCryptus utilities library.
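
To illustrate the compression-based side of this idea without depending on the CharTrie API itself, here is a minimal sketch that uses the JDK's Deflater as a stand-in compressor: a document is assigned to whichever class's training corpus compresses it with the fewest extra bytes. The PPM models described in the article would take the place of Deflater, and all class and method names here are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.Deflater;

public class CompressionClassifier {

  /** Deflate-compressed size of a string, in bytes. */
  static int compressedSize(String text) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
    deflater.finish();
    byte[] buffer = new byte[8192];
    int size = 0;
    while (!deflater.finished()) {
      size += deflater.deflate(buffer);
    }
    deflater.end();
    return size;
  }

  /**
   * Classifies a document by how few extra bytes it adds when compressed
   * together with each class's training corpus.
   */
  static String classify(Map<String, String> corpusByClass, String document) {
    String best = null;
    int bestCost = Integer.MAX_VALUE;
    for (Map.Entry<String, String> entry : corpusByClass.entrySet()) {
      int cost = compressedSize(entry.getValue() + document)
          - compressedSize(entry.getValue());
      if (cost < bestCost) {
        bestCost = cost;
        best = entry.getKey();
      }
    }
    return best;
  }
}
```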

Text Modeling, Compression, and Markov Strings

The recent wave of publishing and releases included a particularly interesting text analysis component that I’d like to talk about today. There are many possible uses, including text classification, clustering, compression, and generation. Most people would recognize this as the data structure behind Markov strings or full-text indexes. The new component is logically a Trie Map that counts n-grams. The idea is that we can break text down into a number of overlapping n-grams, i.e. N-character strings like the 4-grams “frog” or “n th”.
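
As a minimal sketch of the counting step, assuming nothing about the CharTrie API itself, the snippet below slides a window across a string and tallies every overlapping n-gram in a plain TreeMap; the library stores the same counts in a trie structure rather than a flat map.

```java
import java.util.Map;
import java.util.TreeMap;

public class NGramCounter {

  /** Counts every overlapping n-gram of length n in the given text. */
  static Map<String, Integer> countNGrams(String text, int n) {
    Map<String, Integer> counts = new TreeMap<>();
    for (int i = 0; i + n <= text.length(); i++) {
      String gram = text.substring(i, i + n);
      counts.merge(gram, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // The 4-grams of "the frog on the log" include "frog" and "n th".
    countNGrams("the frog on the log", 4)
        .forEach((gram, count) -> System.out.println("\"" + gram + "\" x" + count));
  }
}
```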

Announcing a new (old) project: Ubermarkov: Experiments with Markov tree serialization schemes

Happy Friday! I’ve just finished reviewing and updating the next project in my backlog of old research to publish. It is an experiment in how to efficiently serialize a Markov tree. I became interested in the idea while exploring some of the curious properties of a Markov tree, specifically one built from a fixed population of N-grams derived from a continuous string. It turns out that most of the data in a piece of text, if not all of it, can be absorbed into the Markov tree structure and then encoded in the tree’s serialized form more efficiently than the obvious encoding of the string itself!
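
For a sense of the baseline that such schemes try to improve on, here is a minimal sketch (not taken from the Ubermarkov code) of a character-level Markov tree node with the obvious depth-first serialization: each node writes its count, its number of children, and then each child's character followed by the child subtree.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

/** One node of a character-level Markov (n-gram) tree; a hypothetical stand-in. */
class MarkovNode {
  int count;
  final TreeMap<Character, MarkovNode> children = new TreeMap<>();

  /** Baseline depth-first serialization: count, child count, then each child. */
  void write(DataOutputStream out) throws IOException {
    out.writeInt(count);
    out.writeInt(children.size());
    for (Map.Entry<Character, MarkovNode> entry : children.entrySet()) {
      out.writeChar(entry.getKey());
      entry.getValue().write(out);
    }
  }
}
```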

Preview: Grammar Beans

A Java type can be translated into a grammar very simply:

Grammar grammar = GrammarBean.get(XmlTree.class);

This translation happens according to a number of rules that map various Java types onto grammar structures:

* __Terminal Classes__ – Java classes are converted into sequence elements, where each field in the class is an element in the sequence.
* __Super Classes__ – Java classes with the @Subclasses annotation become choice elements.
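
To make these rules concrete, here is a hypothetical pair of bean classes (not taken from the library) showing how they would map: the fields of each terminal class would become a sequence, and the @Subclasses-annotated base class would become a choice between its listed subclasses. The exact form of the @Subclasses annotation is assumed here.

```java
import java.util.List;

// Hypothetical bean classes, not from the library, illustrating the mapping rules above.

// Super class: with the library's @Subclasses annotation (exact signature assumed),
// this would become a choice element over its listed subclasses.
// @Subclasses({XmlElement.class, XmlText.class})
abstract class XmlTree {
}

// Terminal class: each field becomes one element of a sequence (name, attributes, children).
class XmlElement extends XmlTree {
  String name;
  List<String> attributes;
  List<XmlTree> children;
}

// Terminal class with a single field: a sequence of one element.
class XmlText extends XmlTree {
  String text;
}
```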