Introduction: Studies involving patients with gender identity disorder (GID) are inconsistent with regard to outcomes and are often difficult to compare because of vague descriptions of the diagnostic process. A multisite study is needed to scrutinize the utility and generality of different aspects of the diagnostic criteria for GID. Aim: To investigate how the diagnosis-specific Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text Revision criteria for GID were used to reach a psychiatric diagnosis in four European countries: the Netherlands (Amsterdam), Norway (Oslo), Germany (Hamburg), and Belgium (Ghent). The main goal was to compare item (symptom) characteristics across countries. Methods: The current study included all new applicants to the four GID clinics who were seen between January 2007 and March 2009, were at least 16 years of age at their first visit, and had completed the diagnostic assessment (N = 214, mean age = 32 ± 12.2 years). Mokken scale analysis, a form of nonparametric item response theory (NIRT), was performed. Main Outcome Measures: Operationalization and quantification of the core criteria A and B resulted in a 23-item score sheet that was filled out by the participating clinicians after they had made a diagnosis. Results: When the 23 items were ordered according to their means for each country separately, the rank ordering was similar among the four countries for 21 of the items. Furthermore, when all data were analyzed together, only one scale emerged, which combined criteria A and B. Conclusions: Our results indicate that patients' symptoms were interpreted in a similar fashion in all four countries. However, we did not find support for the treatment of A and B as two separate criteria. We recommend the use of NIRT in future studies, especially in studies with small sample sizes and/or with data that show a poor fit to parametric IRT models.
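As an illustration of the kind of quantity Mokken scale analysis relies on, the following minimal sketch computes Loevinger's scalability coefficient H for a set of items and orders the items by their means. It is not the authors' analysis: it assumes dichotomous (0/1) item scores, whereas the abstract does not state how the 23 items were scored, and the function names `loevinger_h` and `item_order_by_mean` as well as the toy data are hypothetical.

```python
import numpy as np


def loevinger_h(scores):
    """Loevinger's H for a respondents-by-items matrix of 0/1 scores.

    Assumption: items are dichotomous. H is the sum of observed
    inter-item covariances divided by the sum of the maximum
    covariances attainable given the item means.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    p = x.mean(axis=0)                        # item means (proportion scored 1)
    joint = x.T @ x / n                       # proportion scoring 1 on both items
    cov = joint - np.outer(p, p)              # observed covariances
    cov_max = np.minimum.outer(p, p) - np.outer(p, p)  # maximum given the means
    upper = np.triu_indices(k, 1)             # each item pair counted once
    denom = cov_max[upper].sum()
    if denom == 0:
        raise ValueError("H is undefined when no item pair can covary")
    return cov[upper].sum() / denom


def item_order_by_mean(scores):
    """Item indices sorted from highest to lowest mean (ties keep input order)."""
    p = np.asarray(scores, dtype=float).mean(axis=0)
    return np.argsort(-p, kind="stable")


# Hypothetical toy data: a perfect Guttman pattern for three items.
toy = [[0, 0, 0],
       [1, 0, 0],
       [1, 1, 0],
       [1, 1, 1]]
h = loevinger_h(toy)
order = item_order_by_mean(toy)
```

In a multisite comparison, `item_order_by_mean` could be applied to each site's data separately to see whether the items fall in a similar order, and `loevinger_h` to the pooled data to gauge whether the items form a single scale. Polytomous items would need the generalized form of H, which this sketch does not cover.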