Switchboard Corpus

Utilities for Processing the Switchboard Dialogue Act Corpus

From NathanDuran · Updated January 27, 2026

Utilities for processing the [Switchboard Dialogue Act Corpus](https://web.stanford.edu/~jurafsky/ws97/) for dialogue act (DA) classification. The data is split into the original [training](https://web.stanford.edu/~jurafsky/ws97/ws97-train-convs.list) and [test](https://web.stanford.edu/~jurafsky/ws97/ws97-test-convs.list) sets suggested by the authors (1115 training and 19 test dialogues). The remaining 21 dialogues are used as a validation set. The project is written primarily in Python, is distributed under the GNU General Public License v3.0, and was first published in 2018. Key topics include: corpus, corpus-data, corpus-processing, corpus-tools, dialogue.


Scripts

The swda_to_text.py script converts all dialogues to a plain text format. Individual dialogues are saved into directories corresponding
to the set they belong to (train, test, etc.). All utterances in a particular set are also saved to a single text file.

The utilities.py script contains various helper functions for loading/saving the data.

The process_transcript.py script includes functions for processing each dialogue.

The swda_metadata.py script generates various metadata from the processed dialogues and saves it as a dictionary in a pickle file.
The words, labels and frequencies are also saved as plain text files in the /metadata directory.
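
Reading the pickled dictionary back could look like the minimal sketch below. The function name `load_metadata` is hypothetical, and the exact file name written by swda_metadata.py is not stated here, so the path is left to the caller. If the dictionary holds pandas objects, pandas needs to be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path


def load_metadata(path):
    """Load a pickled metadata dictionary from disk.

    `path` is whatever file swda_metadata.py wrote; the name is not
    assumed here. Only unpickle files from a source you trust.
    """
    with Path(path).open("rb") as handle:
        metadata = pickle.load(handle)
    if not isinstance(metadata, dict):
        raise TypeError(f"expected a dict, got {type(metadata).__name__}")
    return metadata
```

Once loaded, values are read with ordinary dictionary access, for example `metadata["num_labels"]`.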

Thanks to Christopher Potts for providing the raw data in .csv format and the swda.py script for processing the .csv data, both of which can be found here

Data Format

Utterances are tagged with SWBD-DAMSL dialogue act (DA) labels.

By default:

  • Utterances are written one per line in the format Speaker | Utterance Text | Dialogue Act Tag.
  • Setting utterance_only_flag to True changes the output to the utterance text only, one per line, with no speaker or DA tag.
  • Utterances marked as non-verbal ('x' tags), e.g. 'Laughter' or 'Throat_clearing', are removed.
  • Utterances marked as interrupted ('+' tags) and continued later are concatenated to form uninterrupted sentences.
  • All disfluency annotations, e.g. '#', '<', '>', are removed.
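
The removal and concatenation rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in process_transcript.py: the function name `clean_dialogue`, the `(speaker, text, tag)` tuple layout, the marker set `#<>`, and the choice to attach a '+' utterance to the same speaker's most recent utterance are all assumptions.

```python
import re

# Assumed marker characters; the list above names '#', '<', '>' as examples only.
DISFLUENCY_CHARS = "#<>"
_STRIP_TABLE = str.maketrans("", "", DISFLUENCY_CHARS)


def clean_dialogue(utterances):
    """Apply the default rules to a list of (speaker, text, tag) tuples."""
    cleaned = []
    for speaker, text, tag in utterances:
        if tag == "x":
            # Non-verbal utterance: drop it entirely.
            continue
        # Remove marker characters, then collapse any leftover whitespace.
        text = re.sub(r"\s+", " ", text.translate(_STRIP_TABLE)).strip()
        if tag == "+":
            # Continuation: append to this speaker's most recent utterance.
            for i in range(len(cleaned) - 1, -1, -1):
                if cleaned[i][0] == speaker:
                    prev_speaker, prev_text, prev_tag = cleaned[i]
                    cleaned[i] = (prev_speaker, f"{prev_text} {text}", prev_tag)
                    break
            else:
                # Nothing earlier to attach to: keep it unchanged (an assumption).
                cleaned.append((speaker, text, tag))
            continue
        cleaned.append((speaker, text, tag))
    return cleaned
```

The merged utterance keeps the tag of the utterance it continues, which is one plausible reading of "concatenated"; the actual scripts may resolve this differently.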

Example Utterances

A|What is the nature of your company's business?|qw

B|Well, it's actually, uh,|^h

B|we do oil well services.|sd
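
A line in this format can be split back into its three fields with a short parser. The sketch below is illustrative; the names `Utterance` and `parse_utterance` are hypothetical and not part of the repository. It takes the speaker from the left and the tag from the right, so a '|' inside the utterance text would not break the split.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    speaker: str
    text: str
    da_tag: str


def parse_utterance(line):
    """Split 'Speaker|Utterance Text|Dialogue Act Tag' into its three fields."""
    line = line.rstrip("\n")
    # Split once from each end so the middle field may itself contain '|'.
    speaker, rest = line.split("|", 1)
    text, da_tag = rest.rsplit("|", 1)
    return Utterance(speaker.strip(), text.strip(), da_tag.strip())
```

A line with fewer than two '|' separators raises a ValueError, which makes malformed input easy to spot.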

Dialogue Acts

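| Dialogue Act | Labels | Count | % | Train Count | Train % | Test Count | Test % | Val Count | Val % |
|---|---|---|---|---|---|---|---|---|---|
| Statement-non-opinion | sd | 75136 | 37.62 | 72549 | 37.71 | 1317 | 32.30 | 1270 | 38.81 |
| Acknowledge (Backchannel) | b | 38281 | 19.17 | 36950 | 19.21 | 764 | 18.73 | 567 | 17.33 |
| Statement-opinion | sv | 26421 | 13.23 | 25087 | 13.04 | 718 | 17.61 | 616 | 18.83 |
| Uninterpretable | % | 15195 | 7.61 | 14597 | 7.59 | 349 | 8.56 | 249 | 7.61 |
| Agree/Accept | aa | 11123 | 5.57 | 10770 | 5.60 | 207 | 5.08 | 146 | 4.46 |
| Appreciation | ba | 4757 | 2.38 | 4619 | 2.40 | 76 | 1.86 | 62 | 1.89 |
| Yes-No-Question | qy | 4725 | 2.37 | 4594 | 2.39 | 84 | 2.06 | 47 | 1.44 |
| Yes Answers | ny | 3030 | 1.52 | 2918 | 1.52 | 73 | 1.79 | 39 | 1.19 |
| Conventional-closing | fc | 2581 | 1.29 | 2480 | 1.29 | 81 | 1.99 | 20 | 0.61 |
| Wh-Question | qw | 1976 | 0.99 | 1896 | 0.99 | 55 | 1.35 | 25 | 0.76 |
| No Answers | nn | 1374 | 0.69 | 1334 | 0.69 | 26 | 0.64 | 14 | 0.43 |
| Response Acknowledgement | bk | 1306 | 0.65 | 1271 | 0.66 | 28 | 0.69 | 7 | 0.21 |
| Hedge | h | 1226 | 0.61 | 1181 | 0.61 | 23 | 0.56 | 22 | 0.67 |
| Declarative Yes-No-Question | qy^d | 1218 | 0.61 | 1167 | 0.61 | 36 | 0.88 | 15 | 0.46 |
| Backchannel in Question Form | bh | 1053 | 0.53 | 1015 | 0.53 | 21 | 0.51 | 17 | 0.52 |
| Quotation | ^q | 983 | 0.49 | 931 | 0.48 | 17 | 0.42 | 35 | 1.07 |
| Summarize/Reformulate | bf | 952 | 0.48 | 905 | 0.47 | 23 | 0.56 | 24 | 0.73 |
| Other | fo_o_fw_"_by_bc | 879 | 0.44 | 857 | 0.45 | 15 | 0.37 | 7 | 0.21 |
| Affirmative Non-yes Answers | na | 847 | 0.42 | 831 | 0.43 | 10 | 0.25 | 6 | 0.18 |
| Action-directive | ad | 745 | 0.37 | 712 | 0.37 | 27 | 0.66 | 6 | 0.18 |
| Collaborative Completion | ^2 | 723 | 0.36 | 690 | 0.36 | 19 | 0.47 | 14 | 0.43 |
| Repeat-phrase | b^m | 687 | 0.34 | 655 | 0.34 | 21 | 0.51 | 11 | 0.34 |
| Open-Question | qo | 656 | 0.33 | 631 | 0.33 | 16 | 0.39 | 9 | 0.28 |
| Rhetorical-Question | qh | 575 | 0.29 | 554 | 0.29 | 12 | 0.29 | 9 | 0.28 |
| Hold Before Answer/Agreement | ^h | 556 | 0.28 | 539 | 0.28 | 7 | 0.17 | 10 | 0.31 |
| Reject | ar | 344 | 0.17 | 337 | 0.18 | 3 | 0.07 | 4 | 0.12 |
| Negative Non-no Answers | ng | 302 | 0.15 | 290 | 0.15 | 6 | 0.15 | 6 | 0.18 |
| Signal-non-understanding | br | 298 | 0.15 | 286 | 0.15 | 9 | 0.22 | 3 | 0.09 |
| Other Answers | no | 284 | 0.14 | 277 | 0.14 | 6 | 0.15 | 1 | 0.03 |
| Conventional-opening | fp | 225 | 0.11 | 220 | 0.11 | 5 | 0.12 | 0 | 0.00 |
| Or-Clause | qrr | 209 | 0.10 | 206 | 0.11 | 2 | 0.05 | 1 | 0.03 |
| Dispreferred Answers | arp_nd | 207 | 0.10 | 204 | 0.11 | 3 | 0.07 | 0 | 0.00 |
| 3rd-party-talk | t3 | 117 | 0.06 | 115 | 0.06 | 0 | 0.00 | 2 | 0.06 |
| Offers, Options Commits | oo_co_cc | 110 | 0.06 | 109 | 0.06 | 0 | 0.00 | 1 | 0.03 |
| Maybe/Accept-part | aap_am | 104 | 0.05 | 97 | 0.05 | 7 | 0.17 | 0 | 0.00 |
| Downplayer | t1 | 103 | 0.05 | 102 | 0.05 | 1 | 0.02 | 0 | 0.00 |
| Self-talk | bd | 103 | 0.05 | 100 | 0.05 | 1 | 0.02 | 2 | 0.06 |
| Tag-Question | ^g | 92 | 0.05 | 92 | 0.05 | 0 | 0.00 | 0 | 0.00 |
| Declarative Wh-Question | qw^d | 80 | 0.04 | 79 | 0.04 | 1 | 0.02 | 0 | 0.00 |
| Apology | fa | 79 | 0.04 | 76 | 0.04 | 2 | 0.05 | 1 | 0.03 |
| Thanking | ft | 78 | 0.04 | 67 | 0.03 | 7 | 0.17 | 4 | 0.12 |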
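
The Count and % columns are simple frequency statistics over the DA tags. A minimal sketch of how such rows could be derived from a flat list of tags is shown below; `label_frequencies` is a hypothetical helper, and rounding to two decimal places is an assumption chosen to match the table.

```python
from collections import Counter


def label_frequencies(labels):
    """Return (label, count, percent) rows sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        # Percent of all utterances carrying this label, to two decimals.
        (label, count, round(100 * count / total, 2))
        for label, count in counts.most_common()
    ]
```

As a check against the table, 75136 of 199740 utterances are tagged sd, and 100 * 75136 / 199740 rounds to 37.62.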

Metadata

  • Total number of utterances: 199740
  • Max utterance length: 132
  • Mean utterance length: 9.62
  • Total number of dialogues: 1155
  • Max dialogue length: 457
  • Mean dialogue length: 172.94
  • Vocabulary size: 22302
  • Number of labels: 41
  • Number of speakers: 2

Train set

  • Number of dialogues: 1115
  • Max dialogue length: 457
  • Mean dialogue length: 172.55
  • Number of utterances: 192390

Test set

  • Number of dialogues: 19
  • Max dialogue length: 330
  • Mean dialogue length: 214.63
  • Number of utterances: 4078

Val set

  • Number of dialogues: 21
  • Max dialogue length: 299
  • Mean dialogue length: 155.81
  • Number of utterances: 3272

Keys and values for the metadata dictionary

  • num_utterances = Total number of utterances in the full corpus.
  • max_utterance_len = Number of words in the longest utterance in the corpus.
  • mean_utterance_len = Average number of words in utterances.
  • num_dialogues = Total number of dialogues in the corpus.
  • max_dialogues_len = Number of utterances in the longest dialogue in the corpus.
  • mean_dialogues_len = Average number of utterances in dialogues.
  • word_freq = DataFrame with Word and Count columns.
  • vocabulary = List of all words in vocabulary.
  • vocabulary_size = Number of words in the vocabulary.
  • label_freq = DataFrame containing all data in the Dialogue Acts table above.
  • labels = List of all DA labels.
  • num_labels = Number of labels used from the Switchboard data.
  • speakers = List of all speakers.
  • num_speakers = Number of speakers in the Switchboard data.

Each data set also has:

  • <setname>_num_utterances = Number of utterances in the set.
  • <setname>_num_dialogues = Number of dialogues in the set.
  • <setname>_max_dialogue_len = Length of the longest dialogue in the set.
  • <setname>_mean_dialogue_len = Mean length of dialogues in the set.
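
The per-set keys follow directly from the dialogue lengths. The sketch below shows one way they could be computed; `set_metadata` is a hypothetical helper, each dialogue is assumed to be a list of utterances, and rounding the mean to two decimals is an assumption based on the figures reported above.

```python
def set_metadata(set_name, dialogues):
    """Summarise one data set; `dialogues` is a list of lists of utterances."""
    lengths = [len(dialogue) for dialogue in dialogues]
    num_utterances = sum(lengths)
    return {
        f"{set_name}_num_utterances": num_utterances,
        f"{set_name}_num_dialogues": len(dialogues),
        # Dialogue length is measured in utterances.
        f"{set_name}_max_dialogue_len": max(lengths, default=0),
        f"{set_name}_mean_dialogue_len": (
            round(num_utterances / len(dialogues), 2) if dialogues else 0.0
        ),
    }
```

For the validation set, 3272 utterances over 21 dialogues gives a mean of 155.81, which agrees with the figure listed under "Val set".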

This article is auto-generated from NathanDuran/Switchboard-Corpus via the GitHub API. Last fetched: 6/18/2026