This study analyzed item- and test-level data from the Grade 8 Biology Test of the Ethiopian Third National Learning Assessment (ETNLA). A total of 10,795 students sat for the biology test in 2007; of these, 9,552 were used for the study. The test was originally prepared in English and was then translated into three language versions (Afan Oromo, Somali, and Tigrigna). The main purpose was to see how the items worked across language groups. A Two-Parameter Logistic Model (2PLM) based on Item Response Theory (IRT) was used to investigate latent traits, and the main statistics generated were IRT ability scores and IRT parameter estimates (difficulty level and discrimination index). Item Characteristic Curves (ICCs) and Item-Person Dual Plots were generated for all 40 items by language group. Based on the IRT ability scores, language groups were compared using one-way ANOVA and recursive partitioning analysis. Item and test statistics were also computed following the Classical Test Theory (CTT) model, and the results were compared with those of IRT. The Item Characteristic Curves differed from the expected ogive shape and varied across language groups. The Test Information Function (TIF) also varied across language groups, indicating that the test as a whole, and its items in particular, did not work the same way for the subgroups. A recursive partitioning analysis based on IRT ability scores showed that 20% (R² = 0.20, F(3, 9518), p < .001) of the variation in achievement scores was accounted for by differences in language of instruction. The variance explained using the CTT procedure was 13.4% (R² = 0.134, F(3, 9548), p < .001). The numbers of problem items (items that were too difficult and/or had very low discrimination power) by language group based on CTT were: Somali (19), Afan Oromo (12), English (10), and Tigrigna (8). The highest test score (20) was for Tigrigna, followed by Afan Oromo (18).
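The 2PLM and TIF quantities named above follow standard IRT definitions. As a minimal sketch (the function names and NumPy usage are illustrative, not from the study), the probability of a correct response and an item's information under the 2PL model are:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Item Characteristic Curve under the 2PL model:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is the discrimination index and b the difficulty level."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Fisher information one item contributes at ability theta:
    I(theta) = a^2 * P * (1 - P).
    Summing this over all 40 items would give the Test
    Information Function (TIF) discussed in the study."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)
```

At theta equal to the item's difficulty b, the ICC passes through 0.5 and the item's information peaks, which is why a well-behaved ICC has the ogive shape the study expected.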
The English language group students scored the lowest (15). The performance of the Somali language group students was about equal to that of the English group. The findings show that a number of items did not work the same way across the four language groups, making them language Differential Item Functioning (DIF) suspects. Based on the findings, it is recommended that in the future, detailed item and test analysis following the IRT model be employed across subgroups on the pilot as well as on the operational tests. This will help to further explore DIF in future administrations of the test in order to determine whether these patterns represent real differences in achievement levels or a systematic bias that inappropriately affects the scores of particular student groups.
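The CTT flagging of "problem items" described above rests on two standard statistics: item difficulty (the proportion of examinees answering correctly) and item discrimination (the corrected item-total correlation). A minimal sketch follows; the function name is illustrative, and the study's own cutoff values for "too difficult" or "very low discrimination" are not given in the abstract, so none are hard-coded here:

```python
import numpy as np

def ctt_item_stats(scores):
    """Compute CTT item statistics from a 0/1 score matrix
    of shape (n_students, n_items).

    Returns:
      difficulty      - proportion correct per item (higher = easier)
      discrimination  - point-biserial correlation of each item with
                        the rest-score (total minus that item)
    """
    scores = np.asarray(scores, dtype=float)
    difficulty = scores.mean(axis=0)
    total = scores.sum(axis=1)
    n_items = scores.shape[1]
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - scores[:, j]  # corrected item-total score
        discrimination[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return difficulty, discrimination
```

Running such statistics separately for each language group, as the study did, makes it possible to see that an item can be acceptable in one language version yet flagged in another, which is exactly the pattern that motivates a follow-up DIF analysis.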