Edward Gibson: Human Language, Psycholinguistics, Syntax, Grammar & LLMs | Lex Fridman Podcast #426

Introduction (00:00:00)

  • Edward Gibson (Ted) is a professor of Psycholinguistics at MIT.
  • He heads the MIT language lab that researches why human languages have their specific characteristics.
  • Ted investigates the relationship between culture, language, and how people process and learn language.
  • His book titled "Syntax: A Cognitive Approach" will be published by MIT Press in the coming fall.
  • The Pirahã language lacks words for exact counting.
  • There are no words for one, two, three, or four in Pirahã.
  • This concept often surprises people and challenges their understanding of language.
  • Without words for numbers, it's impossible to ask for specific quantities in Pirahã.
  • Psycholinguistics studies how people process and produce language.
  • It examines the mental processes involved in understanding, speaking, and acquiring language.
  • Psycholinguistics investigates how language is represented in the brain and how it interacts with other cognitive processes.
  • The field aims to understand the relationship between language and thought.

Human language (00:01:13)

  • Edward Gibson was fascinated by human language as a child, finding the process of structuring sentences in English grammar interesting and puzzling.
  • He approached language as a mathematical puzzle, leading him to pursue a master's program in computational linguistics at Cambridge despite not having any prior language classes.
  • Gibson's background was in mathematics and computer science, and he initially found natural language processing in AI unimpressive, lacking a solid theoretical foundation.
  • Gibson was less interested in the philosophical angle of logic, which focused on extracting underlying meaning from language and compressing it in a computer-representable way.
  • He found syntax, the forms of language, to be an easier problem to tackle compared to semantics, the meaning of language.
  • Gibson believes there is a significant gap between form and meaning, which is evident in the performance of large language models (LLMs).
  • LLMs excel at generating grammatically correct text (form) but struggle with understanding and conveying meaning.
  • Gibson suggests that studying form can provide insights into the structure of thought and meaning behind communication.

Generalizations in language (00:05:19)

  • Human languages exhibit remarkable generalizations across different languages.
  • Languages can be grouped by basic word order; the two most common orders are subject-verb-object (SVO) and subject-object-verb (SOV, verb-final).
  • SVO languages tend to have prepositions, while SOV languages tend to have postpositions.
  • Approximately 95% of the world's languages follow either SVO or SOV word order.
  • These word-order correlations, documented by the typologist Joseph Greenberg of Stanford, are consistent with a general tendency for languages to minimize the length of dependencies between words in a sentence.
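
The link between word order and adposition type can be illustrated with a toy calculation. The sketch below is not from the episode: it assumes a three-word phrase (a verb V, an adposition P, and a noun N), and it assumes an analysis in which the adposition depends on the verb and the noun depends on the adposition. The function name and the order labels are hypothetical.

```python
def total_dependency_length(heads):
    """Sum of distances between each word and its head.

    heads[i] is the position of the word that word i depends on,
    or None if word i is the root of the phrase.
    """
    return sum(abs(i - h) for i, h in enumerate(heads) if h is not None)

# Assumed analysis: P depends on V, and N depends on P.
ORDERS = {
    "V P N (verb first, preposition)": [None, 0, 1],
    "N P V (verb last, postposition)": [1, 2, None],
    "V N P (verb first, postposition)": [None, 2, 0],
    "P N V (verb last, preposition)": [2, 0, None],
}

lengths = {name: total_dependency_length(h) for name, h in ORDERS.items()}
for name, length in lengths.items():
    print(f"{name}: {length}")
```

Under these assumptions the two commonly attested pairings come out shorter (total length 2) than the two mixed pairings (total length 3), because the adposition sits next to the verb. With longer noun phrases the gap would grow.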

Dependency grammar (00:11:06)

  • Language consists of sounds, words, and combinations of words (grammar or syntax).
  • Sentences are tree structures in which every word except the root depends on exactly one other word, its head.
  • Dependency grammar constructs trees by connecting words with dependency relationships.
  • The root of a sentence tree is usually the verb, representing events or states.
  • Nouns refer to people, places, things, or events, and their category (part of speech) is determined by usage, not meaning.
  • Russian has freer word order and uses case markers on nouns, allowing for flexible sentence structures without changing meaning.
  • Linguistic terms like "agent" and "patient" describe word meanings, while "subject" and "object" describe their positions in a sentence.
  • Sentence tree diagrams can be automated by identifying morphemes (minimal meaning units) and words.
  • English has less inflectional morphology compared to languages like Russian, but nouns and verbs can be marked for singular/plural and past tense.
  • Regular English verbs form the past tense by adding "-ed," while irregular verbs have unique past tense forms.
  • High-frequency English words tend to be irregular, while low-frequency words tend to be regular.
  • Irregular forms in English may have evolved from slang that broke the rules and gained widespread acceptance.
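
A dependency tree like the ones described above can be stored as a list of head positions. The following is a minimal sketch, assuming a hand-written parse of the sentence "The dog chased the cat"; the parse, the variable names, and the helper functions are illustrative assumptions, not output from any parser.

```python
# heads[i] is the position of the word that word i depends on.
# None marks the root, which in this sentence is the verb.
words = ["The", "dog", "chased", "the", "cat"]
heads = [1, 2, None, 4, 2]

def root_index(heads):
    """Return the position of the single root, or raise if there is not exactly one."""
    roots = [i for i, h in enumerate(heads) if h is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]

def dependents(heads, head):
    """Return the positions of all words that depend directly on `head`."""
    return [i for i, h in enumerate(heads) if h == head]

root = root_index(heads)
print(words[root])                                   # chased
print([words[i] for i in dependents(heads, root)])   # ['dog', 'cat']
```

Each determiner attaches to its noun, and both nouns attach to the verb, so every word except "chased" has exactly one head.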

Morphology (00:21:05)

  • Morphemes are the smallest units of meaning in language and are connected to roots to form words.
  • Languages have different morphological structures, such as suffixes, prefixes, or infixes, and the number of morphemes per word can vary across languages.
  • The evolution of language is influenced by communication effectiveness, cultural and social interactions, and contact between different language groups.
  • Languages may have a limited number of words for certain concepts, such as colors, with different cultures having different numbers of basic color words.
  • The evolution of language is driven by the need to communicate, with color words emerging as a way to efficiently convey information about objects.

Evolution of languages (00:29:40)

  • There is limited data on the evolution of languages because most languages either have no writing system or acquired one only recently.
  • Mandarin Chinese and English have a lot of evidence of language evolution due to their long history of writing.
  • Rapid communication on platforms like Reddit can provide insights into language evolution through the creation of slang and deviations from standard language.
  • The Queen's English has changed over time, with her vowels shifting significantly during her reign.
  • The word order of English has also changed over time, evolving from a verb-final language with case marking to a verb-medial language with no case marking.
  • Language evolution can be slow over short periods like 20 years but significant changes can occur over 50 to 100 years.

Noam Chomsky (00:33:00)

  • Syntax refers to the rules governing sentence structures, while grammar encompasses the entire system of rules for a language, including syntax, morphology, and semantics.
  • Phrase structure grammar, introduced by Noam Chomsky, represents sentences as hierarchical structures with a root (S) expanding into noun phrases (NP) and verb phrases (VP).
  • Formal language theory studies languages using formal methods, applicable to both human and computer languages.
  • Dependency grammar represents sentences as networks of words connected by grammatical relations, with a root node and implicit rules.
  • Chomsky's theory of grammar involves the concept of movement, where words or phrases are moved from one structure to another, such as from declarative to interrogative sentences.
  • The lexical copying theory suggests that declarative sentences are the source, and interrogative sentences are copied from them, solving the learnability problem and predicting that not all auxiliaries can move.
  • Dependency grammar and phrase structure grammar are equivalent in the sentences they can generate, but dependency grammar makes the linear distance between dependent words explicit.
  • Context-free grammars are insufficient for human languages, requiring at least context-sensitive grammars to allow for long-distance dependencies and recursion.
  • Edward Gibson believes language is more complicated than Chomsky's theories suggest, as words and combinations of words are processed in the same brain areas, challenging Chomsky's distinction between them.
  • The primary difference between Gibson and Chomsky lies in their methodological approach to studying language, with Gibson emphasizing experiments and corpus analysis, while Chomsky relies on thought experiments and intuitions.
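
The phrase structure rules mentioned above (S expanding into NP and VP) can be sketched as a tiny rewrite system. This is a toy grammar with hypothetical rules and vocabulary, chosen only for illustration. It has no recursive rules, so the set of sentences is finite and can be listed in full.

```python
from itertools import product

# Toy phrase structure grammar: each symbol maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sleeps"], ["sees"]],
}

def expand(symbol):
    """Return every word sequence that `symbol` can derive.

    Symbols missing from GRAMMAR are treated as words. This only
    terminates because the toy grammar has no recursive rules.
    """
    if symbol not in GRAMMAR:
        return [[symbol]]
    results = []
    for rule in GRAMMAR[symbol]:
        parts = [expand(s) for s in rule]
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results

sentences = [" ".join(ws) for ws in expand("S")]
print(len(sentences))
print(sentences[0])
```

The grammar generates "the dog sees the cat" but also "the dog sleeps the cat", since nothing in these rules records which verbs take objects. Handling that kind of restriction is part of what fuller grammatical theories add.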

Thinking and language (01:17:06)

  • Language areas in the brain can be localized using methods like fMRI.
  • The language network in the brain is stable over time, but the exact point of stabilization during human development is still being studied.
  • Different thinking tasks activate different brain networks, with language tasks activating the language network.
  • The language network is activated when comprehending spoken or written language, but not when processing music or nonsense words.
  • The same language network is involved in both language production and comprehension.
  • Constructed languages like Klingon activate the language network because they share structural similarities with human languages.
  • The language network may be involved in translating thoughts into deeper concepts, but the relationship between thoughts and words is not fully understood.
  • Language comprehension appears to be separate from thinking, as the language network only lights up when processing words or word combinations.
  • Global aphasic patients, who have suffered a massive stroke on the left side of their brain, can perform various tasks but have lost their language abilities.
  • Symbolic processing, such as math, does not occur in the language area of the brain.
  • Language is not necessary for thinking, but it allows for a lot of expression.
  • Language is a separate system from thinking, which has implications for large language models.

LLMs (01:30:36)

  • Large language models (LLMs) are the current best theories of human language, but they are not perfect.
  • LLMs are black boxes that lack simplicity and may not use simple explanations of language like dependency grammar for meaning.
  • Construction-based theories of language, such as dependency grammar, are closest to LLMs and focus on form-meaning pairs and usage-based ideas.
  • LLMs are good at understanding the surface form of language but not the deeper meaning.
  • LLMs are easily fooled by changes in problem formulation and may be over-reliant on patterns in their training data, leading to errors.
  • Unlike humans, LLMs may not recognize and correct errors even with additional information.
  • LLMs can generate text that appears to demonstrate understanding but may not possess a deep understanding or the ability to reason abstractly.
  • Humans are less likely to make certain types of errors and can often recognize and correct them.
  • LLMs have impressive form in generating human-like text but struggle with meaning and context.
  • LLMs are not explicitly trained on "bad" sentences, leading to occasional nonsensical completions.
  • The form of LLM-generated text is usually correct. The content is often true as well, plausibly because true statements are overrepresented in internet training data, but it is not guaranteed to be true.
  • There is an ongoing debate about whether LLMs lack a fundamental "thinking" or reasoning capability compared to humans.
  • The limits of LLMs show up in their ability to complete center-embedded sentences and to generate perfectly human-like text.
  • LLMs have the potential to model the form and processing of human language, making them useful for studying language processing and generation.

Center embedding (01:43:35)

  • Dependency grammar reveals that the cognitive cost of processing long-distance connections between words increases with the distance between them.
  • Cognitive cost can be measured with methods such as acceptability ratings (how good a sentence sounds), reading-time analysis, and examination of brain activation patterns.
  • Legalese, characterized by nested structures and low-frequency words, deviates from the tendency of natural languages to keep dependencies local.
  • Legal English is complex, featuring center embedding, low-frequency words, and passive voice, with center embedding being the most significant comprehension barrier.
  • Lawyers prefer simpler writing styles but often use center embedding in legal contracts, possibly due to its performative meaning, suggesting legal validity.
  • The "magic spell hypothesis" proposes that center embedding in legal writing resembles a magic spell, conveying truth and certainty, which aligns with the goal of legal contracts.
  • The complexity of legal writing likely stems from systemic factors and incentives within the legal system rather than individual lawyers' intentions.
  • Center-embedded sentences can be simplified by moving definitions outside the subject-verb relationship.
  • Noisy channels, such as errors, background noise, and receiver-side issues, impact communication, influencing language evolution and word order optimization for effective transmission.
  • Claude Shannon's work on communication theory and information theory, particularly his study of noisy channels and communication systems, influenced the understanding of language as a communication system.
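
The effect of moving a definition out from between a subject and its verb can be shown by counting words. The sketch below uses two made-up contract sentences and a hypothetical helper; it measures distance in word positions, which is a simplification of the measures used in research.

```python
def subject_verb_distance(sentence, subject, verb):
    """Number of word positions between the first `subject` and the first `verb`."""
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    return tokens.index(verb) - tokens.index(subject)

# Definition embedded between the subject and its verb.
embedded = "The tenant, defined as the person who signs this lease, shall pay rent."
# Same obligation with the definition moved into a separate sentence.
rewritten = "The tenant shall pay rent. The tenant is the person who signs this lease."

print(subject_verb_distance(embedded, "tenant", "pay"))   # 10
print(subject_verb_distance(rewritten, "tenant", "pay"))  # 2
```

In the embedded version the reader has to hold "tenant" in memory across the whole definition before reaching "pay". In the rewritten version the subject and verb are nearly adjacent.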

Learning a new language (02:10:02)

  • There is no evidence that any human language is better than any other in terms of optimizing dependency lengths.
  • All languages have regularity in their rules, which suggests that learning is a factor in language development.
  • Languages with more rigid word order rules may be easier to learn than languages with free word order.
  • The difficulty of learning a second language depends on how close it is to the first language.
  • Babies do not seem to have any difficulty learning any human language.

Nature vs nurture (02:13:54)

  • Edward Gibson argues that much of language can be learned, and the brain's language-processing abilities can develop through learning, challenging Chomsky's theory of innate language structures.
  • Gibson suggests that modularization in the brain may result from learning rather than indicating innate structures and highlights cases where individuals with brain damage develop language abilities in other brain regions.
  • Gibson emphasizes the importance of studying natural experiments, such as individuals with brain injuries, to gain insights into language development and brain organization.
  • Gibson presents evidence from neuroimaging studies suggesting that language processing occurs in distinct brain networks separate from those involved in thought, challenging the traditional view that language underpins thought.
  • The findings are based on studies involving individuals with brain damage and are consistent across multiple subjects, suggesting robustness.

Culture and language (02:20:30)

  • The study of language should consider diverse cultures, including isolated language groups like the Tsimane' and the Pirahã in the Amazon rainforest, to understand the relationship between culture and language.
  • Language is invented based on the need to communicate and convey information.
  • The Pirahã people in the Amazon jungle don't have words for specific colors or exact counting, challenging the idea that all human languages have words for exact counting and that numbers are universally represented in language.
  • Pirahã speakers can perform exact matching tasks when the objects are in view but struggle with tasks that require remembering the size of a set, suggesting that language, specifically the lack of words for exact counts, is a limiting factor in their abilities.
  • The origin of counting systems is linked to farming for efficient livestock management, while hunter-gatherer societies may not require numbers for tracking their children due to the importance of individual identities.

Universal language (02:34:58)

  • Languages survive when they serve a function in a community.
  • Languages die when they lose value to the local people.
  • The popularity of a language is driven by economic factors.
  • There is a tension between the convenience of trade and the national identity associated with a language.
  • Dialects emerge as a form of identity for people.
  • Languages are dying all around the world due to lack of value in the local community.
  • An example is the Mosetén language, which is dying because Spanish offers more economic opportunities.
  • People learn languages primarily for communication and economic reasons, not for fun.
  • The motivation to learn a language is driven by its economic value and the number of speakers worldwide.

Language translation (02:39:21)

  • Machine translation can break down language barriers.
  • Translating from one language to another can be challenging due to:
    • Different concepts existing in one language and not the other.
    • Lack of words or ways to express certain concepts in a language.
  • Translating form (writing style) is difficult because:
    • Good writing involves more than just content, it includes rhythm and flow.
    • Different authors have distinct writing styles that can be challenging to capture in translation.
  • Hemingway's writing is notable for its short, simple sentences, resulting in a low average dependency length per sentence.

Animal communication (02:42:36)

  • Edward Gibson believes that animals may have complex communication systems similar to human language, and the argument for human language's uniqueness lacks evidence.
  • Gibson suggests that if we approach the possibility of communication with other living beings, including plants, with intellectual humility, there is potential for communication.
  • Gibson describes how the linguist Daniel Everett's exceptional language-learning abilities allowed him to surpass previous translators in understanding the Pirahã language, which Gibson attributes to Everett's social nature and willingness to engage with people from different backgrounds.
  • Gibson advises pursuing interesting opportunities and taking risks to achieve what others haven't for a fulfilling career or life.
