Toastmasters Scrapers Guild - AI Craftspeople Guild

Abstract

Large-scale training corpora for contemporary AI systems are assembled primarily through automated web scraping. The corpora inherit, and the resulting models subsequently reproduce, a specific class of artifact: meaning detected by humans in stimuli that contain no deliberate meaning. The canonical popular example is the perception of religious iconography in burnt toast - the "face on the toast" phenomenon. Such artifacts are pareidolic: the pattern is generated by the observer's perceptual system rather than by the stimulus itself. When scraped into training data at scale, pareidolic artifacts become indistinguishable from intentional content, and the resulting model treats them as meaningful. This paper defines the problem, proposes a professional framework for detecting and annotating pareidolic artifacts during scraping, and specifies a provenance protocol that allows downstream users to distinguish between intentional and emergent meaning in training data. We frame the work within the AI Craftspeople Guild manifesto commitments to quality, transparency, and informed consent.

Keywords: training data provenance, pareidolia, web scraping, data quality, AI Craftspeople Guild.

1. Introduction

In 2004 a grilled cheese sandwich reported to bear the image of a religious figure was sold at auction for $28,000. The sandwich's seller described the image as intentional communication; buyers and observers treated it as such. The sandwich itself was, by any physical account, a normal grilled cheese. The image was produced inside the observer, not inside the bread.

This is pareidolia: the perception of meaningful patterns - faces, words, symbols - in stimuli that contain no intentional structure. It is a well-documented property of the human visual and semantic systems and is not, in itself, a malfunction. It is how those systems work.

It becomes a problem when pareidolic artifacts are scraped into training corpora for AI systems. A scraper cannot distinguish between a photograph captioned "Yeshua appeared on my toast" and a photograph of a medieval mosaic captioned "Christ Pantocrator, 12th century mosaic." Both are ingested. Both are used to train the association between a visual pattern and a linguistic label. In aggregate, the resulting model's representations of religious iconography, faces in food, and countless other categories become a blend of deliberate art and perceptual artifact, with no internal marker distinguishing the two.

The AI Craftspeople Guild manifesto commits its signatories to "maintaining rigorous standards for verification, testing, and validation of AI-assisted outputs." We argue that validation of outputs without validation of inputs is incomplete, and that a specific class of input - pareidolic artifacts - has been systematically overlooked. This paper is a proposed remedy.

2. The Problem in Scope

2.1 Scale

Modern training corpora contain billions of image-text pairs and trillions of text tokens. At that scale, pareidolic artifacts are not rare outliers; they are a persistent background rate. Every category of human perception prone to pareidolia - faces, words, religious symbols, human forms in clouds and rock - is represented at scale in any corpus scraped from the open web.

2.2 Silent Incorporation

The artifacts are ingested without annotation. A conventional scraper preserves image, caption, source URL, and basic metadata. It does not preserve whether the caption describes an intentional feature of the image or a perceptual event in the captioner's visual cortex. Downstream models therefore treat both identically.

2.3 Downstream Consequences

The consequences are subtle and cumulative rather than dramatic. Models trained on corpora containing unmarked pareidolic artifacts will, when prompted, confidently reproduce the pattern: generating "religious figures in toast" on request, describing clouds as containing faces that are not there, and completing prompts like "the hidden figure in this rock formation is" with high confidence in nonexistent figures. The model is not malfunctioning; it is faithfully reproducing the statistical structure of its training data, which includes a systematic confusion between intentional and perceptual content.

2.4 Why This Matters to ACG

Three ACG manifesto commitments are directly implicated:

Demanding Quality. Unannotated pareidolic content is a known, characterizable quality defect in training corpora.
Respecting User Agency and Informed Consent. Downstream users of models have a right to know what fraction of the model's training data describes perceptual events versus intentional content.
Protecting Humans. Confident model reproduction of pareidolic artifacts can reinforce magical thinking or delusional framing, and can support commercial exploitation of both, such as the continuing auction market in food-based religious imagery.

3. Definitions

Intentional artifact. A stimulus in which a pattern was deliberately placed by an agent who intended it to be perceived as that pattern. Example: a Renaissance mosaic of a religious figure.

Pareidolic artifact. A stimulus in which a pattern is perceived by an observer but was not placed there by any agent. Example: a burnt toast surface perceived as a religious figure.

Caption contamination. A caption attached to a pareidolic artifact that describes the perceived pattern as if it were an intentional artifact. Example: captioning a photograph of burnt toast "Image of Yeshua."

Provenance marker. Metadata attached to a scraped item that records which category the item falls into and what evidence supports that classification.

4. Proposed Framework: The Toastmasters Scrapers Guild

We propose a voluntary professional guild, operating under the ACG manifesto, whose members commit to a shared protocol for detecting, annotating, and disclosing pareidolic artifacts in training corpora. The guild is named the Toastmasters Scrapers Guild (TSG) after the canonical example; the name is retained to keep the problem memorable rather than euphemized.

4.1 Guild Commitments

Members of the TSG commit to:

Applying a pareidolia detection pass to all scraped corpora prior to release.
Attaching provenance markers to each item classified as a candidate pareidolic artifact.
Publishing the detection methodology used.
Publishing the false-positive and false-negative rates of the methodology on a standard evaluation set.
Refusing to release corpora for which detection was not performed, or for which detection rates are not disclosed.

These commitments are modeled directly on the ACG manifesto's commitments to transparency, verification, and the right to refuse.
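The rate-publication commitment above can be illustrated with a minimal sketch. It assumes the evaluation set provides a ground-truth boolean per item (True for known pareidolic, False for known intentional) and that the detector emits a boolean flag per item; the function name error_rates is hypothetical and not part of any guild specification.

```python
def error_rates(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for a detector.

    predicted[i] is the detector's pareidolia_candidate flag for item i.
    actual[i] is the evaluation set's ground truth for item i.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    false_positives = sum(p and not a for p, a in zip(predicted, actual))
    false_negatives = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)  # known intentional items
    positives = sum(actual)                 # known pareidolic items

    if negatives == 0 or positives == 0:
        raise ValueError("evaluation set needs both intentional and pareidolic items")

    return false_positives / negatives, false_negatives / positives


# Example: four evaluation items, one false positive and one false negative.
fpr, fnr = error_rates([True, True, False, False], [True, False, True, False])
```

Both rates are reported separately because a detector can trade one against the other by moving its threshold; a single accuracy figure would hide that trade.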

4.2 Relationship to ACG

The TSG is not a competing organization. It is a specialist guild within the broader AI Craftspeople Guild, addressing a specific quality defect. Members of the TSG are necessarily members of ACG. The TSG's protocols implement ACG's manifesto in one narrow but significant domain.

5. Detection Methodology

We outline a three-pass detection protocol. The protocol is open and implementations may vary; the guild standard is the protocol, not any specific implementation.

5.1 Pass One: Caption-Image Mismatch

For image-caption pairs, we compare the caption's claimed content against automated object and scene detection on the image. High-confidence mismatches - captions that describe objects, people, or structures that automated detectors do not find - are flagged as candidates.

This is not a definitive test. Automated detectors miss real content. The mismatch flag is a candidate signal, not a verdict.
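A minimal sketch of one way Pass One could be scored, under stated assumptions: the object detector is treated as an external component that returns a mapping from label to confidence, caption terms are extracted with a crude tokenizer, and the names STOPWORDS, MIN_CONFIDENCE, caption_terms, and mismatch_score are all hypothetical. An actual implementation would likely use noun-phrase extraction and a label taxonomy rather than exact word matching.

```python
import re

# Hypothetical stop-word list; illustrative only.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "my", "and", "with", "is"}

# Assumed cutoff below which a detector label does not count as support.
MIN_CONFIDENCE = 0.5


def caption_terms(caption: str) -> set[str]:
    """Lowercase alphabetic words from the caption, minus stop words."""
    words = re.findall(r"[a-z]+", caption.lower())
    return {w for w in words if w not in STOPWORDS}


def mismatch_score(caption: str, detected_labels: dict[str, float]) -> float:
    """Fraction of caption terms with no confident detector support, in [0, 1]."""
    terms = caption_terms(caption)
    if not terms:
        return 0.0
    supported = {
        label.lower()
        for label, confidence in detected_labels.items()
        if confidence >= MIN_CONFIDENCE
    }
    unsupported = terms - supported
    return len(unsupported) / len(terms)


# The caption claims a face; the (assumed) detector output only supports toast.
score = mismatch_score("Face on my toast", {"toast": 0.95, "plate": 0.8})
```

A high score here only marks the pair as a candidate, consistent with the caveat above that detectors miss real content.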

5.2 Pass Two: Category Prior Analysis

Certain caption categories are empirically overrepresented in pareidolic artifacts: religious figures on food, faces in natural formations, human forms in clouds, hidden figures in geological features, words or symbols in static noise. Items whose captions fall into these categories and whose images fall outside the expected intentional-artifact distribution (for example, food photographs rather than iconographic art) are flagged as candidates.
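Pass Two could be sketched as a keyword lookup combined with a coarse image-type check. Everything specific here is an assumption for illustration: the keyword sets, the image-type labels (taken to come from an upstream classifier), the numeric scores 0.9 and 0.2, and the names PRONE_CATEGORIES, INTENTIONAL_IMAGE_TYPES, and category_prior_score.

```python
import re

# Hypothetical keyword sets for caption categories prone to pareidolia.
PRONE_CATEGORIES = {
    "religious_figure": {"jesus", "yeshua", "christ", "mary", "madonna"},
    "face": {"face", "faces"},
    "hidden_figure": {"figure", "silhouette"},
}

# Hypothetical image types in which such captions usually describe intent.
INTENTIONAL_IMAGE_TYPES = {"painting", "mosaic", "sculpture", "icon"}


def category_prior_score(caption: str, image_type: str) -> float:
    """Score in [0, 1]; higher means more likely a pareidolic artifact.

    image_type is assumed to be supplied by a separate image classifier.
    """
    words = set(re.findall(r"[a-z]+", caption.lower()))
    in_prone_category = any(words & keywords for keywords in PRONE_CATEGORIES.values())
    if not in_prone_category:
        return 0.0
    # Placeholder values: a prone caption on an iconographic image is
    # probably intentional; the same caption on anything else is suspect.
    return 0.2 if image_type in INTENTIONAL_IMAGE_TYPES else 0.9


toast = category_prior_score("Yeshua appeared on my toast", "food_photo")
mosaic = category_prior_score("Christ Pantocrator, 12th century mosaic", "mosaic")
```

The two example captions are the ones used in the Introduction; under these placeholder values the toast photograph scores high and the mosaic scores low.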

5.3 Pass Three: Source Context

Scraped items carry source URLs. Sources known to specialize in pareidolic content - news-of-the-weird aggregators, auction listings for "miraculous" objects, certain subreddits - receive a higher prior for pareidolic classification. Sources known to specialize in intentional artifacts - museum catalogs, art-history databases - receive a lower prior.
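The source-context pass can be sketched as a lookup from hostname to prior. The domains below use the reserved .example suffix and are placeholders, the prior values are invented for illustration, and SOURCE_PRIORS, NEUTRAL_PRIOR, and source_context_score are hypothetical names. A registry maintained by the guild would replace the hard-coded table.

```python
from urllib.parse import urlparse

# Placeholder registry: hostname suffix -> prior that an item is pareidolic.
SOURCE_PRIORS = {
    "weird-news.example": 0.9,  # news-of-the-weird aggregator
    "auctions.example": 0.8,    # listings for "miraculous" objects
    "museum.example": 0.1,      # museum catalog
}

# Assumed prior for sources not in the registry.
NEUTRAL_PRIOR = 0.5


def source_context_score(source_url: str) -> float:
    """Return the prior for the most specific registered suffix of the host."""
    host = (urlparse(source_url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host first, then strip leading labels one at a time,
    # so collections.museum.example matches museum.example.
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in SOURCE_PRIORS:
            return SOURCE_PRIORS[candidate]
    return NEUTRAL_PRIOR


prior = source_context_score("https://collections.museum.example/item/42")
```

Unknown sources fall back to a neutral prior rather than zero, so that an unregistered host neither raises nor lowers the composite score on its own.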

5.4 Combined Classification

The three passes produce a composite score. Items above a published threshold receive a pareidolia_candidate = true marker in provenance metadata, along with the component scores. Items below the threshold are released without the marker. In both cases, the item is preserved in the corpus; detection does not imply removal. The goal is annotation, not censorship.
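The text does not specify how the three scores are combined. The sketch below assumes a weighted average with a fixed threshold, purely as one plausible instance; the weights, the threshold value, and the names ComponentScores, composite_score, and classify are all hypothetical. Note that the item is annotated and returned either way, never dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    caption_image_mismatch_score: float
    category_prior_score: float
    source_context_score: float


# Assumed weights (summing to 1) and assumed published threshold.
WEIGHTS = (0.4, 0.4, 0.2)
THRESHOLD = 0.6


def composite_score(scores: ComponentScores) -> float:
    """Weighted average of the three pass scores."""
    return (
        WEIGHTS[0] * scores.caption_image_mismatch_score
        + WEIGHTS[1] * scores.category_prior_score
        + WEIGHTS[2] * scores.source_context_score
    )


def classify(scores: ComponentScores) -> dict:
    """Return the annotation fields; component scores are always kept."""
    total = composite_score(scores)
    return {
        "pareidolia_candidate": total > THRESHOLD,  # "above" the threshold
        "composite_score": total,
        "caption_image_mismatch_score": scores.caption_image_mismatch_score,
        "category_prior_score": scores.category_prior_score,
        "source_context_score": scores.source_context_score,
    }


annotation = classify(ComponentScores(0.5, 0.9, 0.9))
```

Keeping the component scores alongside the boolean lets a downstream user apply a different threshold or weighting without rerunning detection.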

6. Provenance Protocol

Each scraped item, after processing, carries a provenance record with at minimum:

{
  source_url: string,
  scraped_at: iso8601,
  caption_image_mismatch_score: float,
  category_prior_score: float,
  source_context_score: float,
  pareidolia_candidate: boolean,
  detector_version: string,
  detector_false_positive_rate: float,
  detector_false_negative_rate: float
}

The record is attached to the item in whatever format the corpus uses. It is not optional; an item released by a TSG-conformant scraper without this record is a protocol violation.
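As an illustration, the record above could be represented as a typed, validated structure. The field names follow the record as given; the validation rules are assumptions, in particular that the three component scores lie in [0, 1], which the protocol text does not state. The class name ProvenanceRecord and the to_json method are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProvenanceRecord:
    source_url: str
    scraped_at: str  # ISO 8601 timestamp
    caption_image_mismatch_score: float
    category_prior_score: float
    source_context_score: float
    pareidolia_candidate: bool
    detector_version: str
    detector_false_positive_rate: float
    detector_false_negative_rate: float

    def __post_init__(self) -> None:
        # Raises ValueError on an unparseable timestamp. Before Python 3.11,
        # fromisoformat accepts only a subset of ISO 8601 (no trailing "Z").
        datetime.fromisoformat(self.scraped_at)
        # Assumption: all scores and rates are expressed in [0, 1].
        for name in (
            "caption_image_mismatch_score",
            "category_prior_score",
            "source_context_score",
            "detector_false_positive_rate",
            "detector_false_negative_rate",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def to_json(self) -> str:
        """Serialize with stable key order for attachment to a corpus item."""
        return json.dumps(asdict(self), sort_keys=True)


record = ProvenanceRecord(
    source_url="https://weird-news.example/toast",
    scraped_at="2026-01-15T12:00:00+00:00",
    caption_image_mismatch_score=0.5,
    category_prior_score=0.9,
    source_context_score=0.9,
    pareidolia_candidate=True,
    detector_version="0.1.0",
    detector_false_positive_rate=0.05,
    detector_false_negative_rate=0.10,
)
```

All values in the example record are invented for illustration and do not describe any real detector.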

Downstream users of the corpus may then choose how to handle flagged items: exclude them, downweight them, train separate models with and without them, or use them directly with full knowledge of their classification. The point is that the choice is available.
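Three of the handling choices listed above (exclude, downweight, use directly) can be sketched as per-item sampling weights. This assumes each corpus item is a dictionary with a "provenance" entry holding the record fields; the names Policy and sample_weights, and the default downweight factor of 0.25, are hypothetical.

```python
from enum import Enum


class Policy(Enum):
    EXCLUDE = "exclude"        # flagged items get weight 0
    DOWNWEIGHT = "downweight"  # flagged items get a reduced weight
    KEEP = "keep"              # flagged items are used as-is


def sample_weights(items: list[dict], policy: Policy, downweight: float = 0.25) -> list[float]:
    """Return one training weight per item based on its provenance flag."""
    weights = []
    for item in items:
        flagged = item["provenance"]["pareidolia_candidate"]
        if not flagged or policy is Policy.KEEP:
            weights.append(1.0)
        elif policy is Policy.EXCLUDE:
            weights.append(0.0)
        else:
            weights.append(downweight)
    return weights


corpus = [
    {"caption": "Yeshua appeared on my toast", "provenance": {"pareidolia_candidate": True}},
    {"caption": "Christ Pantocrator, 12th century mosaic", "provenance": {"pareidolia_candidate": False}},
]
weights = sample_weights(corpus, Policy.DOWNWEIGHT)
```

The fourth option, training separate models with and without flagged items, amounts to running the same pipeline once under KEEP and once under EXCLUDE.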

7. What This Does Not Solve

We are explicit about scope limits.

Not all misleading content is pareidolic. Deliberate misinformation, hoaxes, and fabricated imagery are separate problems requiring separate treatment.
Detection is imperfect. The methodology will have non-zero false-positive and false-negative rates. Publishing those rates is part of the commitment; eliminating them is not currently feasible.
Cultural variation matters. What counts as pareidolic in one cultural context may count as intentional in another, for example religious visual traditions that explicitly value emergent imagery. Guild protocols should be culturally literate and should not flag intentional artifacts from traditions the detector was not trained to recognize.
This is a data quality framework, not an epistemological framework. We are not claiming to determine what is "real"; we are claiming to annotate a specific, characterizable category of training-data artifact.

8. Connection to Broader Data Quality Work

The TSG framework is one instance of a more general principle: training data should carry its own provenance, and provenance should include information about how the data came to be captioned, not only about what it depicts. Pareidolic artifacts are a compelling first case because the popular example, the face on the toast, is vivid and widely recognized. Other cases - optical illusions captioned as reality, artistic exercises captioned as observation, satirical content captioned as sincere - require similar frameworks.

We propose that the ACG community treat pareidolia detection as a pilot for a broader provenance protocol, and that the Toastmasters Scrapers Guild serve as the working group for the pilot.

9. Governance and Membership

Following ACG precedent, TSG membership is open to any ACG signatory who agrees to the guild commitments in section 4.1. The guild will maintain:

A public detection protocol specification, versioned.
A public evaluation set with known intentional and known pareidolic items.
A public registry of conformant scrapers and their latest detection rates.
A public tribunal for disputes over classification, modeled on the cross-guild arbitration mechanism used in other ACG-adjacent structures.

No single implementation is endorsed. The guild standard is the protocol and the reporting discipline.

10. Conclusion

The face on the toast is a useful example precisely because it is obvious. Nobody reading this paper believes that a religious figure literally appeared in a piece of bread. But the same perceptual mechanism that produced that image produced billions of similar artifacts that were scraped, captioned, and used to train models now deployed at scale. The problem is not that anyone was fooled; the problem is that nobody was asked.

The Toastmasters Scrapers Guild is a voluntary professional commitment to ask. It does not remove content, it does not judge content, and it does not replace any existing data-quality practice. It annotates. It discloses. It allows downstream users to make informed choices about training data that currently passes through unmarked.

This is what professional standards in scraping look like under the ACG manifesto. We invite critique, implementation, and extension.

References

Sagan, C. (1995). The demon-haunted world: Science as a candle in the dark. Random House.
Voss, J. L., & Federmeier, K. D. (2011). FN400 potentials are functionally identical to N400 potentials and reflect semantic processing during recognition testing. Psychophysiology, 48(4), 532-546.
Liu, J., Li, J., Feng, L., Li, L., Tian, J., & Lee, K. (2014). Seeing Jesus in toast: Neural and behavioral correlates of face pareidolia. Cortex, 53, 60-77.
AI Craftspeople Guild. (2026). ACG Manifesto. https://aicraftspeopleguild.github.io/aicraftspeopleguild-manifesto.html

Correspondence: draft circulated for ACG community review.

Conflicts of interest: None declared.

Funding: None.

Data availability: Protocol specification and evaluation set to be released concurrent with final version.
