The arXiv is a repository of over 1 million preprints. It is truly open access, and excellent for testing language modelling / machine learning prototypes.
Computers consist of on/off switches and process meaningless symbols. So how is it that we can hope that machines learn meaning of words and documents?
If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the mass of robot-generated pages.