URI Resolver Service

cpLogD dataset

Study Abstract

Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models for prediction of chemical compound water-octanol distribution coefficient (logD), aiding drug discovery projects. Models are created and evaluated by Support Vector Machines (SVM) with a linear kernel using conformal prediction methodology, outputting prediction intervals at a specified confidence level. The models are based on ACD/logD data for 1.6 million compounds from the ChEMBL database and show a predictive ability of Q² = 0.973; the best-performing nonconformity measure gives a median prediction interval width of ±0.39 log units at 80 % confidence and ±0.60 log units at 90 % confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predicted values at the 90 % confidence level for 91 M PubChem structures in RDF format, both for download and as a URI resolver service.
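To illustrate how conformal prediction turns a point prediction into an interval at a chosen confidence level, here is a minimal sketch of split (inductive) conformal regression. It assumes the simplest nonconformity measure, the absolute residual on a held-out calibration set, which gives every compound the same interval width; the study's own nonconformity measures are not described on this page and may differ. The function names and the residual values are hypothetical.

```python
import math


def conformal_half_width(calibration_residuals, confidence):
    """Half-width of the prediction interval at the given confidence level.

    Assumption: nonconformity score = absolute residual (observed - predicted)
    on a calibration set that was not used for training.
    """
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Take the ceil((n + 1) * confidence)-th smallest score (1-based rank).
    k = math.ceil((n + 1) * confidence)
    if k > n:
        # Too few calibration points to guarantee this confidence level.
        return math.inf
    return scores[k - 1]


def predict_interval(point_prediction, half_width):
    """Symmetric interval around the model's point prediction."""
    return point_prediction - half_width, point_prediction + half_width


# Hypothetical calibration residuals in log units, for illustration only.
residuals = [0.1, -0.2, 0.3, -0.4, 0.5, 0.6, -0.7, 0.8, 0.9]
width_80 = conformal_half_width(residuals, 0.80)
width_90 = conformal_half_width(residuals, 0.90)
print(width_80, width_90)
print(predict_interval(-3.7, width_90))
```

A higher confidence level selects a larger calibration score and therefore a wider interval, which matches the pattern in the abstract (±0.39 at 80 % versus ±0.60 at 90 %).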

Notes on the data format

Compound URIs end in [number], where [number] is the PubChem compound ID (CID) of the compound. The lower, mid and upper values of the prediction interval are attached to each compound via dedicated predicates (hasLowerPoint, hasMidPoint and hasUpperPoint in the code example below); each of these points to a value point, from which the actual numerical value is linked via the has-value predicate. Each compound is also linked to its counterpart in PubChem's RDF service, and includes its molecular structure in SMILES format. See the code example below for more details.
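The linking scheme described above can be sketched as a small generator that renders one compound and its three value points as Turtle text. This is only an illustration, not the tool used to produce the dataset: the function name is hypothetical, the prefixes (c:, w:, p:, i:, s:, x:) are assumed to be declared as in the code example on this page, and the numeric literal is written in quoted form as Turtle syntax requires for typed literals.

```python
def compound_to_turtle(cid: int, smiles: str,
                       lower: float, mid: float, upper: float) -> str:
    """Render one compound and its 90 % value points as Turtle text.

    Assumption: the naming pattern (Compound<CID>, C<CID>LowerPoint0p90, ...)
    follows the code example on this page.
    """
    # Escape backslashes and quotes, which can occur in SMILES strings.
    safe_smiles = smiles.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"c:Compound{cid} a c:Compound ;",
        f"    w:sameAs p:CID{cid} ;",
        f'    i:CHEMINF_000376 "{safe_smiles}" ;',
        f"    c:hasLowerPoint c:C{cid}LowerPoint0p90 ;",
        f"    c:hasMidPoint c:C{cid}MidPoint0p90 ;",
        f"    c:hasUpperPoint c:C{cid}UpperPoint0p90 .",
    ]
    # One value point per interval bound, each carrying its numerical value.
    for kind, value in (("Lower", lower), ("Mid", mid), ("Upper", upper)):
        lines.append("")
        lines.append(f"c:C{cid}{kind}Point0p90 a c:{kind}Point0p90 ;")
        lines.append(f'    s:has-value "{value}"^^x:float .')
    return "\n".join(lines)


turtle_text = compound_to_turtle(
    1, "CC(=O)OC(CC(=O)O)C[N+](C)(C)C", -4.331, -3.741, -3.151
)
print(turtle_text)
```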

Code example

A small code example in Turtle format, demonstrating the data structure of the dataset:

@prefix r: <> .
@prefix w: <> .
@prefix o: <> .
@prefix s: <> .
@prefix p: <> .
@prefix i: <> .
@prefix c: <> .
@prefix x: <> .

# Properties and classes used by the dataset
s:has-unit a w:annotationProperty .
s:has-value a w:annotationProperty .
c:hasConfidence a w:annotationProperty .
c:Compound a w:Class .
c:Confidence a w:Class .

# The 90 % confidence level
c:Confidence0p90 a c:Confidence ;
    s:has-unit o:UO_0000190 ;
    s:has-value "0.9"^^x:float .

# Value points, specialised for the 90 % confidence level
c:ValuePoint a w:Class ;
    s:has-unit o:UO_0000190 .
c:0p90ConfidenceValuePoint a c:ValuePoint ;
    c:hasConfidence c:Confidence0p90 .

c:LowerPoint0p90 a c:0p90ConfidenceValuePoint .
c:MidPoint0p90 a c:0p90ConfidenceValuePoint .
c:UpperPoint0p90 a c:0p90ConfidenceValuePoint .

# One compound (PubChem CID 1), linked to PubChem, its SMILES and its three value points
c:Compound1 a c:Compound ;
    w:sameAs p:CID1 ;
    i:CHEMINF_000376 "CC(=O)OC(CC(=O)O)C[N+](C)(C)C" ;
    c:hasLowerPoint c:C1LowerPoint0p90 ;
    c:hasMidPoint c:C1MidPoint0p90 ;
    c:hasUpperPoint c:C1UpperPoint0p90 .

c:C1LowerPoint0p90 a c:LowerPoint0p90 ;
    s:has-value "-4.331"^^x:float .

c:C1MidPoint0p90 a c:MidPoint0p90 ;
    s:has-value "-3.741"^^x:float .

c:C1UpperPoint0p90 a c:UpperPoint0p90 ;
    s:has-value "-3.151"^^x:float .

Download this dataset

This dataset can be downloaded via DOI:10.5281/zenodo.1091111