
Dichotomous Classification using Transitive Learning

Christopher Brown
Department of Linguistics
University of Texas at Austin
1 University Station B5100
Austin, TX 78712-0198 USA
chbrown@mail.utexas.edu

Abstract

The classifier implementation outlined in this document is intended to take two corpora that markedly differ from each other in some general way, learn the parameters that best distinguish the two corpora, and then apply those parameters efficiently to a new set of data.

In this paper I will implement a classifier to distinguish between formal and colloquial language use, trained on colloquial data from blogs, Simple English Wikipedia, and children’s literature, and on formal data from the N.Y. Times, the WSJ, and legal/political/academic texts. The intent of the classifier is then to classify admissions essays on a scale of formality, which will presumably serve as some indicator of academic success/potential.

1 Introduction

Usually, we’d have a little bit of labeled data with very stark, clear, decisive labels, and a lot of mushy, who-knows-what unlabeled data that is pretty much the same kind of document as the labeled data.

In my case, I have a ton of very thinly labeled data that isn’t much like what I want to end up classifying. The labeling is fuzzy – i.e. it’s very noisy – but my unlabeled data is beneficial because it is very similar to the data I want to classify.

It is difficult to label certain lexical traits, such as linguistic register, in a general, low-level way. Spelling correction is sub-optimal in many of its modern implementations. While some text editing programs allow personal dictionaries, none that I know of track personal mistake tendencies. Two treatments of spelling correction algorithms, (Jurafsky and Martin, 2008) and (Kukich, 1992), describe various complications of and solutions to the problem (the latter in much more depth), but neither considers the continuous process of user input in the production and correction of errors. Their spellchecking algorithms assume a completed document and a cessation of user input.
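
Both cited treatments build on minimum edit distance; as a concrete reference point, here is a minimal sketch of the standard dynamic program (unit costs by default; the function and parameter names are mine, not from either source):

```python
def min_edit_distance(source, target, ins=1, dele=1, sub=1):
    """Standard dynamic-programming minimum edit distance
    (Levenshtein distance under the default unit costs)."""
    m, n = len(source), len(target)
    # d[i][j] = cheapest way to turn source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(d[i - 1][j] + dele,                      # delete a source char
                          d[i][j - 1] + ins,                       # insert a target char
                          d[i - 1][j - 1] + (0 if same else sub))  # substitute (or match)
    return d[m][n]

print(min_edit_distance("aling", "align"))    # 2: plain Levenshtein has no transposition op
print(min_edit_distance("desing", "dewing"))  # 1: substitute 's' -> 'w'
```

A user-adaptive engine would make these costs a function of the user’s history rather than constants.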

I propose a more intelligent text entry engine, in which the errors that contribute to easily corrected misspellings (corrected by minimum edit distance methods like the one above, for instance) are incorporated into the model of edit distances. For example, if a user mistypes “aling,” the engine will quickly autocorrect it to “align,” but if the user then mistypes “desing,” the engine will remember the “ng” → “gn” edit it performed before and suggest “design” instead of “dewing” (my naive iPad suggests “dewing” even though I make that typo all the time). This might be more like active learning, but as I understand it, the difference between active learning and semi-supervised learning over a non-finite (continuous?) data set is that active learning involves the human user telling the machine what to learn, while semi-supervised learning entails the machine inferring what to learn from the user’s normal behavior (i.e. unlabeled data); my implementation would be a combination of the two.
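
A minimal sketch of that idea follows. It is an illustration, not a finished engine: the tiny VOCAB, the discount factor, and the use of difflib opcodes as a crude stand-in for learned per-operation edit costs are all assumptions of mine.

```python
from collections import Counter
import difflib

# Hypothetical toy dictionary; a real engine would use a full lexicon.
VOCAB = {"align", "design", "dewing"}

class PersonalizedCorrector:
    """Corrector that discounts edit operations the user has confirmed before."""

    def __init__(self, discount=0.25):
        self.seen_edits = Counter()  # (src, dst) substring pair -> times confirmed
        self.discount = discount     # cost multiplier for familiar edits

    def _edits(self, typo, word):
        """Non-matching (src, dst) substring pairs that turn typo into word."""
        sm = difflib.SequenceMatcher(a=typo, b=word)
        return [(typo[i1:i2], word[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

    def _cost(self, typo, word):
        """Characters changed, discounted for edits seen before."""
        return sum(max(len(s), len(d)) *
                   (self.discount if self.seen_edits[(s, d)] else 1.0)
                   for s, d in self._edits(typo, word))

    def suggest(self, typo):
        """Return the vocabulary word with the cheapest personalized edit cost."""
        return min(VOCAB, key=lambda w: self._cost(typo, w))

    def confirm(self, typo, word):
        """The user accepted `word` for `typo`: remember those edits."""
        for op in self._edits(typo, word):
            self.seen_edits[op] += 1

corrector = PersonalizedCorrector()
print(corrector.suggest("aling"))    # -> "align"
corrector.confirm("aling", "align")  # remember the edits behind this correction
print(corrector.suggest("desing"))   # -> "design", no longer "dewing"
```

Before the confirmation, this scoring would actually prefer “dewing” for “desing” (one unseen substitution is cheaper than two unseen insert/delete operations), which is essentially the naive behavior described above; the confirmed edits are what flip the ranking.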

2 Colloquial/formal classification

This classifier will train on very large corpora that are labeled at the global level; I will consider the LDC’s Switchboard corpus or Simple English Wikipedia as “colloquial,” and New York Times or Wall Street Journal articles as “formal.” These will be very noisy data sets, but I believe some traits of formality or colloquialism can be teased out of the documents. Given those two poles of corpora, I will calculate the distance between my unlabeled, to-be-classified documents and documents in those corpora. Do and Ng (2006) present an improved algorithm for the “multiclass text classification task” that seems suited to this kind of classification between such corpora.¹
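
The transfer-learning machinery of Do and Ng (2006) is more involved than a quick sketch allows, but the underlying distance idea can be illustrated with a deliberately crude proxy: build a smoothed unigram language model over each pole and score a document by its mean per-token log-likelihood ratio. Everything below (toy documents, function names, the whitespace tokenizer) is a hypothetical stand-in for the real corpora and pipeline.

```python
import math
from collections import Counter

def unigram_model(docs):
    """Add-one-smoothed unigram probabilities over whitespace tokens."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 so unseen tokens get nonzero probability
    return lambda tok: (counts[tok] + 1) / (total + vocab)

def formality_score(doc, p_formal, p_colloquial):
    """Mean per-token log-likelihood ratio: > 0 leans formal, < 0 colloquial."""
    toks = doc.lower().split()
    return sum(math.log(p_formal(t) / p_colloquial(t)) for t in toks) / max(len(toks), 1)

# Hypothetical stand-ins for the formal (NYT/WSJ) and colloquial
# (Switchboard / Simple English) corpora.
formal_docs = ["the committee shall convene pursuant to the statute",
               "the acquisition was finalized subject to regulatory approval"]
colloquial_docs = ["so yeah we just kinda hung out and stuff",
                   "gonna grab some food you want anything"]

p_formal = unigram_model(formal_docs)
p_colloquial = unigram_model(colloquial_docs)

print(formality_score("the board shall review the statute", p_formal, p_colloquial))  # positive
print(formality_score("we kinda just hung out", p_formal, p_colloquial))              # negative
```

The sign and magnitude of the score already give the graded formality scale that the admissions-essay application calls for; the Do and Ng algorithm would replace this proxy with properly transferred parameters.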

References
Chuong Do and Andrew Ng. 2006. Transfer learning
for text classification. In Y. Weiss, B. Schölkopf,
and J. Platt, editors, Advances in Neural Information
Processing Systems 18, pages 299–306. MIT Press,
Cambridge, MA.
Daniel Jurafsky and James H. Martin. 2008. Speech
and Language Processing, 2nd edition, pages 72–
79. Prentice Hall.
Karen Kukich. 1992. Techniques for automatically
correcting words in text. ACM Comput. Surv.,
24(4):377–439.

¹ The results section is somewhat opaque to me, but overall it seems like a reasonable place to start; it was not so easy finding relevant articles for this idea.
