
Research Notes

My field is applied linguistics and foreign language education. I am particularly interested in the teaching and assessment of writing.

Summary of The Cambridge Handbook of Learner Corpus Research

The following book was published at the end of last year.

 

The Cambridge Handbook of Learner Corpus Research (Cambridge Handbooks in Language and Linguistics)


  • Editors: Sylviane Granger, Gaëtanelle Gilquin, Fanny Meunier
  • Publisher: Cambridge University Press
  • Release date: 2015/10/01
  • Format: Hardcover
 

 

I had an opportunity to summarize Chapter 1 and Chapter 25 of this book, so I am reproducing the summaries below.

 

Chapter 1

Granger, S., Gilquin, G., & Meunier, F. (2015). Introduction: learner corpus research - past, present and future. In S. Granger, G. Gilquin & F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (pp. 1-5). Cambridge University Press.

 

Chapter 25

Leacock, C., Chodorow, M., & Tetreault, J. (2015). Automatic grammar- and spell-checking for language learners. In S. Granger, G. Gilquin & F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (pp. 567-587). Cambridge University Press.

 

----------------------------------------------------------------------------------------------------------------------

  1. Introduction: learner corpus research - past, present and future

 

Learner corpus research (LCR) emerged in the late 1980s.

 

There are two advantages in access to electronic collections of L2 data.

・They are more representative than smaller data samples.

・The data can be analyzed with a whole battery of software tools.

Cf. POS taggers and concordance programs

 

The field of learner corpus research has undergone remarkable developments:

・137 learner corpora (Learner corpora around the world)

82 (60%) target L2 English; the rest focus on other languages.

The dominant focus is on writing (essay writing)

 

・Research design (longitudinal data)

 

・Individual variability

 

<See also>

Paquot, M., & Plonsky, L. (2015). Quantitative research methods and study quality in learner corpus research. LCR 2015. https://twitter.com/mrkm_a/status/642802550928998400

 

Ishii (2014): analyzed 184 English corpus studies published between 1994 and 2013. Fewer than 10% of them used corpora of Japanese learners of English.

 

The handbook is subdivided into five main parts:

  1. Learner corpus design and methodology
  2. Analysis of learner language
  3. LCR and SLA
  4. LCR and language teaching
  5. LCR and NLP

 

Chapter format

 

Introduction

A number of issues

Representative studies

Critical look

Recommended key readings

 

----------------------------------------------------------------------------------------------------------------------

  25. Automatic grammar- and spell-checking for language learners

 

1 Introduction

Granger and Meunier (1994): grammar- and spell-checking as a promising application for learner corpus research.

 

There is a complex relationship between automated error-correction systems and learner corpora.

 

・Some systems require large amounts of error-annotated learner writing.

・Reliable annotation

 

2 Core issues
2.1 Brief background on grammatical error correction

Published research first appeared in the 1980s.

Cf. Grammar Writer’s Workbench

→rule-based approaches

 

The approach began to shift from rule-based to statistical in the mid-1990s.

⇔Almost all error-correction systems make use of at least some rules.

※These developments are closely tied to the intellectual history of natural language processing (see, e.g., Tsujii (2012)).

 

2.2 Brief background on spelling-error correction

Kukich (1992) identified three strands of research.

(1) non-word error detection

(2) isolated-word error correction

(3) context-dependent error correction
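
As a rough illustration of strands (1) and (2), the sketch below flags tokens that are missing from a word list and proposes similar words for them. The tiny LEXICON and the helper names find_non_words and suggest are hypothetical, and the similarity ratio from Python's difflib is only a stand-in for whatever ranking a real spell-checker uses; none of this is taken from Kukich (1992).

```python
import difflib

# Hypothetical word list; a real system would use a full dictionary.
LEXICON = {"i", "will", "receive", "their", "there", "letter", "the", "tomorrow"}


def find_non_words(tokens, lexicon):
    """Strand (1): return the tokens that do not appear in the lexicon."""
    return [token for token in tokens if token.lower() not in lexicon]


def suggest(word, lexicon, n=3):
    """Strand (2): propose up to n lexicon words similar to the non-word,
    looking at the word in isolation (no context is used)."""
    return difflib.get_close_matches(word.lower(), sorted(lexicon), n=n, cutoff=0.6)


tokens = "I will recieve thier letter tomorrow".split()
for non_word in find_non_words(tokens, LEXICON):
    print(non_word, "->", suggest(non_word, LEXICON))
```

A real-word confusion such as "there letter" passes the lexicon lookup untouched, which is why strand (3) has to look at the surrounding context.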

 

Cf. Edit distance is "a measure of the similarity (or dissimilarity) of two strings, obtained by computing, as a distance, how much one string has to be edited to produce the other" (Tono & Mochizuki, 2013, p. 74)
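
The definition can be made concrete with a minimal sketch of the standard Levenshtein distance, one common form of edit distance, assuming a cost of 1 for each insertion, deletion and substitution. The function name edit_distance is a hypothetical choice, and the tool described in the quoted source may compute the distance differently.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn source into target."""
    # previous[j] is the distance between the part of source processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("recieve", "receive"))  # 2: the swapped "ie" counts as two substitutions
print(edit_distance("kitten", "sitting"))   # 3
```

Under this definition a transposition costs 2; variants that count a swap of adjacent letters as a single edit exist, but they are not shown here.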

 

2.3 The needs of L2 learners

The basis for identifying learners' needs has shifted from researchers' pedagogical experience to learner corpora such as the Cambridge Learner Corpus.

→The most common error is content word choice.

 

Rimrott and Heift (2008) evaluated the helpfulness of generic spell-checkers for L2 learners.

 

The spelling errors were classified as lexical, morphological and phonological.

 

For 62% of the learners’ errors, the intended word was among the suggested corrections provided by Microsoft Word.

 

2.4 The importance and design of learner corpora
2.4.1 Annotation of grammatical errors in learner corpora

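Precision is "the proportion of the results returned by the system that are actually correct. In document retrieval, it is the proportion of the retrieved documents that were retrieved correctly. It is an indicator of exactness."

Recall is "the proportion of the items that should have been returned (articles or documents) that were actually returned. It is an indicator of coverage."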
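
The two measures can be worked through on a small case. The sketch below assumes that errors are identified by token position; the positions in gold_errors and system_flags are invented for illustration, and precision_recall is a hypothetical helper.

```python
def precision_recall(flagged: set, gold: set) -> tuple:
    """Return (precision, recall) for a set of flagged items against gold items."""
    true_positives = len(flagged & gold)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: annotators marked five errors; the system flagged four
# positions, three of which are real errors.
gold_errors = {2, 7, 11, 15, 20}
system_flags = {2, 7, 11, 30}

precision, recall = precision_recall(system_flags, gold_errors)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

A system that flags very little can reach high precision with low recall, which is the pattern in the preposition studies summarized in section 3.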

 

Gamon's (2010) research

Errors are often ambiguous.

→researchers have often used learner text that is annotated for only a single targeted type of error.

 

The cost of developing the corpus was quite high.

→A cheaper alternative is to have the error-detection system output the errors it has found in learner text and then ask one or more annotators to verify the output.

⇔Whenever the system is modified, its output is likely to change, so the verification has to be repeated.

⇔It cannot be used for calculating recall, because errors the system missed are never shown to the annotators.

 

Judgments of usage errors are not as clear-cut as those of grammatical errors.

→Using crowdsourcing to annotate learner errors.

 

Errors often appear in ‘noisy’, error-ridden contexts.

→measuring the edit distance

 

2.4.2 Annotation of spelling errors in learner corpora

Bestgen and Granger (2011): identifying the categories of errors that affect essay scores.

Flor and Futagi (2012, 2013); Flor (2012): developing algorithms for spelling correction.

 

2.4.3 Error-annotated learner corpora freely available to the NLP community

  1. Helping Our Own 1 (HOO-1)
  2. Helping Our Own 2 (HOO-2)
  3. 2013 Conference on Computational Natural Language Learning (CoNLL 2013)
  4. 2014 Conference on Computational Natural Language Learning (CoNLL 2014)

Cf. EDCW (Error Detection and Correction Workshop) 2012

 

3 Representative studies

A brief overview of two commonly used techniques: machine-learning (ML) statistical classifiers and language models.

 

machine-learning (ML) statistical classifiers: supervised learning

Example: the nearest-neighbor method (Ishii, 2015)
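
A minimal sketch of a nearest-neighbor classifier applied to preposition choice, assuming each context is reduced to just the word before and the word after the preposition. The training examples, the feature set and the function names are all hypothetical; this is not the setup of Ishii (2015) or of any study summarized here.

```python
from collections import Counter

# Hypothetical labeled examples: (word before, word after) -> correct preposition.
TRAINING = [
    (("arrive", "airport"), "at"),
    (("arrive", "station"), "at"),
    (("arrive", "Tokyo"), "in"),
    (("live", "Tokyo"), "in"),
    (("live", "London"), "in"),
    (("depend", "weather"), "on"),
    (("rely", "friends"), "on"),
]


def mismatches(a, b):
    """Distance between two contexts: the number of features that differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_preposition(context, k=3):
    """Majority vote among the k training contexts closest to the given one."""
    nearest = sorted(TRAINING, key=lambda example: mismatches(context, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# A learner wrote "live at Osaka"; flag it if the classifier prefers another preposition.
learner_context, learner_choice = ("live", "Osaka"), "at"
predicted = predict_preposition(learner_context)
if predicted != learner_choice:
    print(f"possible error: '{learner_choice}' (classifier prefers '{predicted}')")
```

The labels in TRAINING are what make this supervised: every training context comes with the preposition that is known to be correct.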

language models: unsupervised learning
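
A minimal sketch of how a language model can rank alternative wordings, assuming a bigram model with add-one smoothing trained on a handful of invented sentences. CORPUS, train_bigram_model and log_probability are hypothetical names, and real systems use far larger corpora and better smoothing.

```python
import math
from collections import Counter

# Invented well-formed sentences; no error annotation is needed.
CORPUS = [
    "she arrived at the airport",
    "he arrived at the station",
    "they live in the city",
    "we live in the country",
    "it depends on the weather",
]


def train_bigram_model(sentences):
    """Count bigrams and their left-hand contexts, and record the vocabulary size."""
    context_counts, bigram_counts, vocabulary = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def log_probability(sentence, model):
    """Log probability of the sentence under the add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


model = train_bigram_model(CORPUS)
candidates = ["they live in the city", "they live on the city", "they live at the city"]
print(max(candidates, key=lambda s: log_probability(s, model)))
```

The model only needs raw text, with no error labels, which is the sense in which it is unsupervised.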

 

3.1 Tetreault and Chodorow (2008)

TASK: 34 most frequent prepositions

Training data: about 7 million preposition instances from the Lexile corpus (fiction, non-fiction and textbooks).

RESULTS: 84% precision, almost 19% recall.

 

3.2 Han, Tetreault, Lee and Ha (2010)

TASK: preposition-error identification and correction

Data: error-tagged corpus of essays written by EFL (English as a foreign language) students in South Korea (111,000 essays)

Training data: about 1 million cases of preposition usage from the data.

RESULTS: 93% precision, 15% recall.


3.3 Rozovskaya and Roth (2010)

Developed four methods for artificially introducing article errors into training data.

Cf. GenERRate (http://www.computing.dcu.ie/~jfoster/resources/genERRate.html)
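
The general idea of artificial error generation can be sketched as follows: take well-formed text and corrupt some of its articles, keeping a record of the original form as the label. This is a generic illustration with a single uniform error_rate, and it should not be read as any of the four methods in Rozovskaya and Roth (2010) or as the behavior of GenERRate; inject_article_errors and its parameters are hypothetical.

```python
import random

INDEFINITE = {"a", "an"}
ARTICLES = INDEFINITE | {"the"}


def inject_article_errors(tokens, error_rate, rng):
    """Corrupt articles with probability error_rate.

    Returns the corrupted tokens and a list of edits
    (index in the original, original article, replacement),
    where "" as the replacement means the article was dropped.
    """
    corrupted, edits = [], []
    for index, token in enumerate(tokens):
        article = token.lower()
        if article in ARTICLES and rng.random() < error_rate:
            # Swap definite and indefinite, or drop the article.
            # Simplification: a/an agreement with the next word is not handled.
            alternatives = ["the", ""] if article in INDEFINITE else ["a", ""]
            replacement = rng.choice(alternatives)
            edits.append((index, token, replacement))
            if replacement:
                corrupted.append(replacement)
        else:
            corrupted.append(token)
    return corrupted, edits


rng = random.Random(0)  # fixed seed so that a run can be repeated
sentence = "I saw a dog in the park near the station".split()
noisy, edits = inject_article_errors(sentence, error_rate=0.5, rng=rng)
print(" ".join(noisy))
print(edits)
```

The edits list plays the role of error annotation: it gives a classifier the erroneous form and the correction without any manual tagging.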


3.4 Mitton and Okada (2007)

TASK: Developed an algorithm for a spell-checker

RESULTS: The intended word was the top suggestion more often (from 61.2% to 65.8%), and it appeared more often among the top three suggestions (73.3% to 78.7%) and among the top six suggestions (77.9% to 83.5%).
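
Figures of this kind can be computed as top-k accuracy: the share of misspellings whose intended word appears among the first k suggestions. A minimal sketch with invented suggestion lists and a hypothetical helper top_k_accuracy; it does not reproduce Mitton and Okada's data or algorithm.

```python
def top_k_accuracy(suggestion_lists, intended_words, k):
    """Share of cases in which the intended word is among the first k suggestions."""
    hits = sum(
        1
        for suggestions, intended in zip(suggestion_lists, intended_words)
        if intended in suggestions[:k]
    )
    return hits / len(intended_words)


# Invented spell-checker output: one ranked suggestion list per misspelling.
suggestion_lists = [
    ["receive", "recipe", "relieve"],
    ["there", "their", "three"],
    ["definite", "definitely", "defiantly"],
    ["nectar", "necessity"],
]
intended_words = ["receive", "their", "definitely", "necessary"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(suggestion_lists, intended_words, k))
```

Accuracy can only stay the same or rise as k grows, which matches the pattern of the reported figures for the top one, three and six suggestions.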

 

4 Critical assessment and future directions

There has been an immense amount of research into the development of grammatical error correction systems.

 

・There is a need for efficient and reliable annotation of learner corpora for system training and evaluation.

・There is also a need to develop error-correction resources for learners of other languages.

・Error-detection systems need to be tailored to the native language of the writer.

・Research has mainly focused on developing error-specific modules, one for each error type.

 

What is needed by the NLP research community is learner corpora that identify the range of error types and corrections for each error.

 

References

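Ishii, T. (2014). 日本の英語コーパス言語学の研究課題・手法の変遷:『英語コーパス研究』掲載論文を用いた基礎的検討 [Changes in research topics and methods in English corpus linguistics in Japan: A preliminary study using articles published in English Corpus Studies]. Handout presented at the first 2014 meeting of the Methodology SIG, Kansai Chapter, Japan Association for Language Education and Technology (LET).

Ishii, Y. (2015). データマイニングの手法を用いた英語ライティング研究―プロセスとプロダクトの観点から― [Research on English writing using data mining methods: From the perspectives of process and product]. Handout presented at the Kumamoto conference of the Japan Society of English Language Education.

Tono, Y., & Mochizuki, H. (2012). 編集距離を用いた英文自動エラータグ付与ツールの開発と評価 [Development and evaluation of an automatic error-tagging tool for English text using edit distance]. コーパスに基づく言語学教育研究報告 [Working Papers in Corpus-based Linguistics and Language Education], 9, 71-92.

Tsujii, J. (2012). 合理主義と経験主義のはざまで―内的な処理の計算モデル― [Between rationalism and empiricism: Computational models of internal processing]. 人工知能学会誌 [Journal of the Japanese Society for Artificial Intelligence], 27(3), 273-283.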