Large pretrained language models like GPT-3 have acquired a surprising ability to perform zero-shot classification (ZSC). For example, to classify review sentiments, we can "prompt" the language model with the review and the question "Is the review positive?" as the context, and ask it to predict whether the next word is "Yes" or "No". However, these models are not specialized for answering such prompts. To address this weakness, we propose meta-tuning, which trains the model to specialize in answering prompts while still generalizing to unseen tasks. To create the training data, we aggregated 43 existing datasets, annotated 441 label descriptions in total, and unified them into the above question answering (QA) format. After meta-tuning, our model outperforms a same-sized QA model for most labels on unseen tasks, and we forecast that the performance would improve for even larger models. Therefore, measuring ZSC performance on non-specialized language models might underestimate their true capability, and community-wide efforts on aggregating datasets and unifying their formats can help build models that understand prompts better.
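The zero-shot classification format described above can be made concrete with a minimal sketch (not from the paper): the review and a label question are concatenated into a prompt, and the model's next-word preference for "Yes" versus "No" gives the prediction. The model choice, prompt template, and helper function below are illustrative assumptions.

```python
# Illustrative sketch of ZSC via next-word prediction; model name, prompt
# wording, and answer tokens are assumptions, not the paper's exact setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def zero_shot_classify(review: str, question: str = "Is the review positive?") -> bool:
    # Build the QA-style prompt: the review as context, then the label question.
    prompt = f"{review}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # scores for the next token
    # Compare the model's scores for " Yes" vs. " No" as the next word.
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    return logits[yes_id].item() > logits[no_id].item()

print(zero_shot_classify("The plot was gripping and the acting superb."))
```

Meta-tuning, as described above, fine-tunes the model on many such (prompt, Yes/No answer) pairs drawn from the aggregated datasets, so that the same interface transfers to label descriptions it has never seen.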