LibGuides: Find data for research: Open & licensed data sets

Data and statistics

Abacus (Restricted to UVic Users)

Abacus holds UVic Libraries' collection of licensed datasets, including public use microdata files (PUMFs) from StatCan censuses, other social and health surveys, public opinion polls, and spatial data for GIS. Access is restricted to UVic users.

Datasets in Abacus include:

Statistics Canada's Data Liberation Initiative (DLI)
Statistics Canada Open Data, including Public Use Microdata Files (PUMFs)
Spatial data from DMTISpatial
Inter-University Consortium for Political and Social Research (ICPSR)
Postal Code Conversion Files

Open research data

Borealis (UVic Dataverse Collection)

Borealis, the Canadian Dataverse Repository, is a national repository for research data. The service, supported by UVic Libraries, lets UVic researchers deposit their datasets free of charge; deposited datasets are registered with DOIs and stored in a secure environment on Canadian servers. Researchers can choose to make their datasets available to the public, share them with specific individuals, or keep them private.

With Borealis, researchers can search across research data from over 65 Canadian universities.

Lunaris

Lunaris provides a single point of search for research data held in Canadian data repositories, including academic institutions, departments at all levels of government, and research organizations. There are over 80,000 datasets from over 100 Canadian repositories and data collections currently indexed by Lunaris.

Confidential data

UVic Research Data Centre

For access to confidential microdata from Statistics Canada censuses and surveys, contact the UVic Research Data Centre.

The UVic Research Data Centre is located on the basement floor of the McPherson Library and offers 6 workstations.
Visit the Statistics Canada Microdata Access Portal to apply for access.
Search available data sources via the Canadian Research Data Centre Network website.

The UVic RDC provides access, for approved projects, to a growing variety of Statistics Canada confidential microdata household, population and workplace files. The microdata used by researchers come primarily from Statistics Canada Survey Master files. Increasingly, the Research Data Centres (RDCs) are repositories of administrative records from a variety of sources including tax, employment insurance, social assistance, and hospitalization records.

Finding datasets in our catalogue

UVic Libraries catalogues or provides access to thousands of data sources. To search specifically for datasets in "Library Search" (Primo):

1. Enter a search term in the search box as you would for any other resource.

2. On the results page, use the "Refine Results" filter on the left-hand side of the page. Go to "Content Type" and then click "Show More".

3. Choose "Datasets" and then click "Apply Filters" (green button).

4. You will now see the datasets that have been added to our catalogue (note: not all datasets are catalogued).

Web data

UVic Libraries collects hundreds of websites as part of its web archiving efforts using Archive-It. UVic Libraries can help researchers access a variety of data related to these collections via Archive-It's Research Services, including:

WARC

WARC files, and their predecessor ARC files, are the files in which data crawled using Archive-It is stored. Each file may contain multiple digital objects, including HTML, images, and videos. (Note that a collection's data can consist of both WARC and ARC files, depending on when it was archived through our service. Throughout these guides, the term “WARC files” refers to both WARC and ARC files.)

WAT

WAT (Web Archive Transformation) files contain key metadata such as provenance and capture information, essential text and link data, and other information. They are extracted from WARCs for every resource; because WAT files map one-to-one to WARC files, each of a collection's WARC files has a corresponding WAT file. WAT files format metadata as JavaScript Object Notation (JSON). The benefit is size: WATs are around 5%-20% of the size of the corresponding WARCs.

LGA

Longitudinal Graph Analysis files are archival web graph files that include a complete list of which URIs link to which URIs, along with a timestamp, from a collection’s origin through the present. They are ~1% of the size of a collection's aggregate WARC files, and are delivered as a ZIP container of two files:

ID-Map:

Contains one line per URL in a collection and assigns a UID (unique identifier) to each URL.
Each line contains a JSON object with three fields: the URL's UID ("id"), the URL ("url"), and the URL in SURT form ("surt_url").

ID-Graph:

Each line contains a JSON object with three fields: the URL's UID ("id"), the timestamp associated with the capture of this URL ("timestamp"), and the set of UIDs of the URLs linked to by this URL at that timestamp ("outlink_ids").
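
As an illustration of how the two LGA files fit together, the following Python sketch joins ID-Graph records to ID-Map records to produce readable source-to-target links. It relies only on the field names described above ("id", "url", "surt_url", "timestamp", "outlink_ids"). The function names, the sample records, and the value formats used for the UIDs and timestamps are assumptions made for the example, not taken from Archive-It documentation.

```python
import json


def load_id_map(lines):
    """Build a UID -> URL lookup from ID-Map lines (one JSON object per line)."""
    id_to_url = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        id_to_url[record["id"]] = record["url"]
    return id_to_url


def resolve_links(graph_lines, id_to_url):
    """Yield (timestamp, source_url, target_url) for each outlink in ID-Graph lines."""
    for line in graph_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        source_url = id_to_url.get(record["id"])
        for target_id in record["outlink_ids"]:
            # .get() returns None if a UID is missing from the ID-Map
            yield record["timestamp"], source_url, id_to_url.get(target_id)


# Hypothetical sample records; real files would be read from the unzipped LGA container.
sample_id_map = [
    '{"id": 1, "url": "http://example.org/", "surt_url": "org,example)/"}',
    '{"id": 2, "url": "http://example.org/about", "surt_url": "org,example)/about"}',
]
sample_id_graph = [
    '{"id": 1, "timestamp": "20200101000000", "outlink_ids": [2]}',
]

id_to_url = load_id_map(sample_id_map)
edges = list(resolve_links(sample_id_graph, id_to_url))
for timestamp, source, target in edges:
    print(timestamp, source, "->", target)
```

With real data, the same functions could be passed open file handles instead of lists, since both iterate line by line.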

WANE

Web Archive Named Entities are files that use named-entity recognition tools to generate a list of all the people, places, and organizations mentioned in each URI in a web archive, with a timestamp of when the URI was captured. The purpose is to link people, places, and organizations to time. A WANE dataset is generated using the Stanford Named Entity Recognizer software (http://nlp.stanford.edu/software/CRF-NER.shtml) to extract named entities from each textual resource in a collection. The analyzer uses an English model 3-class classifier to extract names that correspond to recognized Persons, Organizations, and Locations. WANE files are less than 1% the size of their corresponding WARC files, and are structured as a JSON object per line: URL ("url"), timestamp ("timestamp"), content digest ("digest") and the named entities ("named_entities") containing data arrays of "persons", "organizations", and "locations".
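
To show how the WANE structure described above might be used, here is a minimal Python sketch that tallies how often each person, organization, and location appears across a set of WANE lines. The field names come from the description above; the function name and the sample records (URLs, timestamps, digests, and entity names) are invented for illustration.

```python
import json
from collections import Counter


def count_entities(lines):
    """Count named entities by type across WANE lines (one JSON object per line)."""
    counts = {
        "persons": Counter(),
        "organizations": Counter(),
        "locations": Counter(),
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entities = record.get("named_entities", {})
        for kind, counter in counts.items():
            counter.update(entities.get(kind, []))
    return counts


# Hypothetical sample records; the values are made up for this example.
sample_wane = [
    '{"url": "http://example.org/a", "timestamp": "20200101000000", "digest": "sha1:AAA", '
    '"named_entities": {"persons": ["Jane Doe"], "organizations": ["UVic"], "locations": ["Victoria"]}}',
    '{"url": "http://example.org/b", "timestamp": "20200102000000", "digest": "sha1:BBB", '
    '"named_entities": {"persons": ["Jane Doe"], "organizations": [], "locations": ["Victoria", "Vancouver"]}}',
]

counts = count_entities(sample_wane)
print(counts["persons"].most_common(5))
print(counts["locations"].most_common(5))
```

Because each record also carries a timestamp, the same loop could be extended to group counts by capture date, which is the linking of entities to time that WANE files are intended to support.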

Please contact Corey Davis for more information.