Corpus Linguistics & Statistics @ UoBham, 11/02/16 by John Williams

https://padlet.com/johnxwilliams1/y7v657kn6p3o

Michaela Mahlberg: Intro

Language as a social phenomenon

Meaning and form are associated (lexico-grammar)

CL prioritizes lexis

- in texts and between texts

Meaning is based on evidence of interaction: the selection of linguistically relevant patterns depends on the research question (RQ)

"Scholars don't pay enough attention to what non-scholars think about the world." (Proctor 2012)

Simon Preston: Corpus Analysis from Math perspective

Use old existing (easier?) solutions
Corpus analysis in terms of input (math representation of corpus X) & output - f(X)
Deciding on X (e.g. a 'bag of words' representation --> matrix) forces us to decide what we retain and what we discard
Try to represent it as a relationship between simpler matrices ('matrix factorization')
'bag of words' representation discards information about the order of words
get round that with co-occurrence matrix --> network visualization
Challenges:
- How to analyse time-structured corpora & co-occurrence networks (or combinations thereof - 'time-dependent networks')
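A minimal sketch of the 'bag of words' representation in Python, standard library only. The two toy documents and the function name `bag_of_words_matrix` are made up for illustration and are not from the talk. It builds X as a document-by-term count matrix and shows that two documents with the same words in a different order get identical rows, i.e. word order is discarded.

```python
from collections import Counter

def bag_of_words_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    matrix = []
    for toks in tokenised:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical toy corpus: same words, different order.
docs = ["dog bites man", "man bites dog"]
vocab, X = bag_of_words_matrix(docs)
print(vocab)  # ['bites', 'dog', 'man']
print(X)      # [[1, 1, 1], [1, 1, 1]]
```

Whitespace tokenisation is an assumption made here to keep the sketch short; a real corpus would need a proper tokeniser.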

Thompson, Murakami, Hunston: Topic Modelling

'Bottom-up' approach, no prior model of corpus
'Topic' defined in terms of probability distribution of fixed vocabulary
We can model a document in terms of rolling a 'die' to choose a topic, then rolling a second die to choose a word within that topic. 'Topic modelling' in effect reverses that process and tries to recover the (irregular) shape of these dice.
Tested on research papers in the environmental domain, with a roster of 60 topics ('documents' may be component parts of papers, 'text chunks' --> topics may be recurring at characteristic points in research papers)
Topics may be prominent at particular time periods
Topics can be envisaged in terms of individual words, co-occurrence of individual words, or n-grams
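The 'two dice' picture can be sketched in Python as a forward (generative) process. This is only an illustration of the idea in these notes, not the speakers' method: the topic names, the weights and the function `generate_document` are all hypothetical.

```python
import random

def generate_document(doc_topic_weights, topic_word_weights, length, rng):
    """Generate `length` words: pick a topic, then pick a word from that topic."""
    topics = list(doc_topic_weights)
    words = []
    for _ in range(length):
        # First 'die': choose a topic according to the document's topic mix.
        topic = rng.choices(topics, weights=[doc_topic_weights[t] for t in topics])[0]
        # Second 'die': choose a word according to that topic's word distribution.
        vocab = list(topic_word_weights[topic])
        word = rng.choices(vocab, weights=[topic_word_weights[topic][w] for w in vocab])[0]
        words.append(word)
    return words

# Hypothetical toy topics: each is a probability distribution over the same fixed vocabulary.
topic_word_weights = {
    "climate": {"carbon": 0.5, "emissions": 0.4, "species": 0.1},
    "ecology": {"carbon": 0.1, "emissions": 0.1, "species": 0.8},
}
doc_topic_weights = {"climate": 0.7, "ecology": 0.3}

rng = random.Random(0)  # seeded so the run is repeatable
doc = generate_document(doc_topic_weights, topic_word_weights, 10, rng)
print(doc)
```

Topic modelling runs the other way: given only the observed documents, it estimates both sets of weights. The notes do not record which algorithm or software was used for that step.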

Viola Wiegand: Identifying surveillance discourses

The journal 'Surveillance & Society' mined for key words in the domain

Looked for common features and patterns across the 13 volumes of the journal

Used 'key keywords', 'lockwords' & co-occurrence

'KKs' are keywords that occur across a large number of texts in the corpus

'Lockwords' - words that are stable in frequency across texts

Found 28 items that were both KK and lockwords --> looked at co-occurrent pairs, these can be mapped ('co-occurrence networks')

Linguistic interpretation still necessary - "No purely statistical analysis of language can reveal meaning".
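A rough Python sketch of how the two word lists might be intersected. Everything here is an assumption for illustration: the per-text keyword lists are taken as already computed (keyness itself needs a reference corpus and is not shown), 'stable in frequency' is approximated by a simple max/min ratio test, and the thresholds, data and function names are invented.

```python
from collections import Counter

def key_keywords(keywords_per_text, min_texts):
    """Words that are keywords in at least `min_texts` texts."""
    counts = Counter(word for kws in keywords_per_text for word in set(kws))
    return {word for word, n in counts.items() if n >= min_texts}

def stable_words(rel_freqs_per_text, max_ratio):
    """Words present in every text whose highest relative frequency is at most
    `max_ratio` times their lowest (an assumed stand-in for 'lockwords')."""
    shared = set.intersection(*(set(freqs) for freqs in rel_freqs_per_text))
    stable = set()
    for word in shared:
        values = [freqs[word] for freqs in rel_freqs_per_text]
        if max(values) <= max_ratio * min(values):
            stable.add(word)
    return stable

# Hypothetical inputs: keyword lists and per-1,000-word frequencies for three texts.
keywords_per_text = [
    ["surveillance", "privacy", "camera"],
    ["surveillance", "privacy", "data"],
    ["surveillance", "data", "border"],
]
rel_freqs_per_text = [
    {"surveillance": 9.0, "privacy": 4.0, "data": 1.0},
    {"surveillance": 10.0, "privacy": 1.0, "data": 5.0},
    {"surveillance": 11.0, "privacy": 3.0, "data": 6.0},
]

kks = key_keywords(keywords_per_text, min_texts=2)
locks = stable_words(rel_freqs_per_text, max_ratio=1.5)
print(sorted(kks))          # ['data', 'privacy', 'surveillance']
print(sorted(kks & locks))  # ['surveillance']
```

The items in the final intersection would then be the starting point for looking at co-occurrent pairs, with linguistic interpretation still needed at that stage.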

Yves van Gennip: Graphical representations of a corpus, clustering

Co-occurrence: need to make decisions about window-size, directionality, weighting - before making a graph

Graph = network of nodes/vertices and edges/links

Thickness of edge can indicate weightedness

Using a block matrix to indicate strength of co-occurrence
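A small Python sketch of the decisions listed above (window size, directionality, weighting) when turning a token sequence into weighted edges. The function name `cooccurrence_edges`, its parameters and the example sentence are hypothetical, and the 1/distance weighting is just one possible choice.

```python
from collections import defaultdict

def cooccurrence_edges(tokens, window=2, directed=False, distance_weighted=False):
    """Return {(word_a, word_b): weight} for words within `window` tokens of each other."""
    edges = defaultdict(float)
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            right = tokens[j]
            if left == right:
                continue  # ignore self-loops in this sketch
            # Weighting decision: closer pairs count more if distance_weighted is set.
            weight = 1.0 / (j - i) if distance_weighted else 1.0
            # Directionality decision: keep text order, or merge both orders into one edge.
            pair = (left, right) if directed else tuple(sorted((left, right)))
            edges[pair] += weight
    return dict(edges)

tokens = "the cat sat on the mat".split()
undirected = cooccurrence_edges(tokens, window=2)
directed = cooccurrence_edges(tokens, window=2, directed=True)
print(undirected[("sat", "the")])                       # 2.0
print(directed[("the", "sat")], directed[("sat", "the")])  # 1.0 1.0
```

Each key is an edge between two nodes and each value is its weight, which could be drawn as edge thickness or laid out as a word-by-word matrix.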

Hennessy: Time-dependency in corpora

'binning' = grouping documents along an axis of your matrix (e.g. corresponding to time periods), then making simpler matrices based on each bin

width of bins depends on RQ & data

bins can overlap

'kernel' - a kind of scaled bin to give a more 'realistic' view of effect size

this is a form of statistical smoothing

possible kernels come in a set of classic mathematical shapes
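A minimal Python sketch contrasting hard-edged bins with a kernel-weighted count. The publication years are invented, and the Gaussian is used only as one familiar kernel shape; the notes do not record which kernel or bandwidth the speaker used.

```python
import math

def bin_counts(years, start, width, n_bins):
    """Count documents per fixed-width, non-overlapping bin."""
    counts = [0] * n_bins
    for year in years:
        index = (year - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

def gaussian_kernel_count(years, centre, bandwidth):
    """Weighted document count around `centre`; nearer documents weigh more."""
    return sum(math.exp(-0.5 * ((y - centre) / bandwidth) ** 2) for y in years)

# Hypothetical publication years for a small time-stamped corpus.
years = [2002, 2003, 2003, 2004, 2008, 2009, 2009, 2009]

# Hard-edged bins: 2002-2005 and 2006-2009.
print(bin_counts(years, start=2002, width=4, n_bins=2))  # [4, 4]

# Kernel view: a smooth curve evaluated at a few centre points.
smoothed = [gaussian_kernel_count(years, c, bandwidth=1.5) for c in (2003, 2006, 2009)]
print([round(value, 2) for value in smoothed])
```

The two bins look identical, while the kernel-weighted counts show the dip between the two clusters of documents, which is the smoothing effect described above. The bandwidth plays the same role as bin width and would depend on the RQ and the data.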

Smyth, Bull: The right to read = The right to mine (Librarians)

Data-Asset-Method: Harnessing the Infinite Archive --> a set of protocols for storing corpora uniformly, available for networks of researchers
UK copyright legislation 2014: computational analysis and transformation of data does not infringe copyright ('the right to mine' - for non-commercial purposes; may not apply to some international collaboration)
A lot of publishers are making their content available digitally
CORE portal for accessing open-access articles
CROSSREF is making these articles available for data mining

Laurence Anthony: DIY corpus tools creations: pros and cons

Design of new corpus tools tends to be dominated by computer scientists
Corpus users tend to do research 'inside the box' --> AntConc downloads still going up --> Why aren't the new generation developing their own tools? Why are we not all programming?

LA presents both sides of the debate:
Pros:
- (Biber) if you learn programming, you can do what you want, you are in the driver's seat, it can be cheaper
- (Gries) "inflexible software creates inflexible research"
- (Davies) distinguishes 'corpus users' from 'corpus creators'
Advice:
- Pick a popular language, e.g. Python, Scratch (on Raspberry Pi), Java, R (very ugly)
- Read a programming book
- Join Stack Overflow

Against:
- Most corpus users can 'get by' with current tools
- Researchers in many fields do not develop their own tools: it tends to be the fruit of collaboration between researchers & engineers
- Programmers are a different world. DIY tools risk being less accurate, slower
Advice:
- Decide your research question before selecting your tool/method
- Learn to use a *good* text editor, e.g. Notepad++, TextWrangler
- Read the user guide
- Be proactive in contacting specialists
- Provide motivation for getting specialists involved, treat them as part of the team
- Understand the limitations of your tools and potential alternatives
"Life is short"

LA comes down on pro-programming side, but with teams integrated from the start. Even if users use only standard tools, they would benefit from a bit of programming nous.

Check out: wordwanderer.org

Programme

Accuracy of my notes not guaranteed, esp. where statistics is concerned! Apologies for any misrepresentation